LLM API Cost: A Practical Breakdown
LLM API cost is measured in tokens, billed separately for input and output, and varies by a factor of 90x or more between the cheapest and most expensive current models. For a production workload processing 1,000 documents per day at 10,000 input tokens each, that spread translates to the difference between a $42/month bill and a $3,900/month bill, for the same task. The single biggest lever you have is choosing the right model tier for the job, not tweaking prompts.
For the broader picture of how to evaluate, build, and govern AI features in a startup context, see my guide on AI development for startups.
How LLM API Pricing Actually Works
Every major LLM provider prices on tokens, not characters, words, or API calls. A token is roughly 0.75 words in English. You pay one rate for input tokens (the prompt, system instructions, conversation history, retrieved context) and a different, higher rate for output tokens (the model’s generated response).
This distinction matters more than most teams realize. Output tokens cost 4 to 8 times more than input tokens across all major providers, with Anthropic’s ratio sitting consistently at 5x output versus input across all current Claude models. If you are building a feature that generates long completions, like report drafting or code generation, your cost model looks completely different from a classification or extraction task that returns a short structured answer.
Current Model Pricing Table (Mid-2026)
Prices shown are standard on-demand rates per million tokens. Batch and cached rates are covered in the next section.
| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 |
| Anthropic | Claude Opus 4.7 | $5.00 | $25.00 |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 |
| Gemini 3.1 Pro | $2.00 | $12.00 | |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 |
The range from Gemini 2.5 Flash-Lite input ($0.10/M) to Claude Opus 4.7 output ($25.00/M) is a 250x spread. Picking “the best model” for every task in your pipeline is not a quality decision. It is a financial one with significant consequences at scale.
Worked Monthly Cost Example
Let me walk through a real scenario so the math is concrete before you build anything.
Workload: A document processing pipeline that summarizes customer support tickets. You process 1,000 tickets per day, each with 10,000 input tokens (the ticket thread, context, and instructions) and a 500-token output (the summary).
Daily token volume:
- Input: 1,000 x 10,000 = 10,000,000 input tokens
- Output: 1,000 x 500 = 500,000 output tokens
Monthly cost by model tier (30 days):
| Model | Monthly Input Cost | Monthly Output Cost | Total/Month |
|---|---|---|---|
| Claude Opus 4.7 | $1,500 | $375 | $1,875 |
| GPT-4o | $750 | $150 | $900 |
| Claude Sonnet 4.6 | $900 | $112.50 | $1,012.50 |
| Claude Haiku 4.5 | $300 | $37.50 | $337.50 |
| Gemini 3.1 Pro | $600 | $90 | $690 |
| Gemini 2.5 Flash-Lite | $30 | $3 | $33 |
This tracks with independent benchmarks showing GPT-5-class models running around $3,900/month vs Gemini Flash variants at around $42/month for this type of workload. For summarization tasks where a capable mid-tier model gets the job done, you are looking at a 40 to 90x cost difference with no meaningful quality loss.
The question to ask for every node in your pipeline: does this task actually require frontier-model reasoning, or is it pattern-following that a smaller model handles just as well?
The Four Cost Levers That Actually Move the Needle
1. Model Tiering
This is the highest-leverage decision. Route tasks to the cheapest model that produces acceptable output. Use frontier models only for tasks that demonstrably require them: nuanced judgment calls, complex multi-step reasoning, ambiguous instructions.
In practice, most pipelines have two or three task types. Run evals on a sample of real production inputs using each model tier. Pick the cheapest one that clears your quality bar. I have never seen a team do this exercise and not find at least one task currently running on an over-specified model.
2. Prompt Caching
If your prompts include a large, stable system prompt (instructions, personas, documents, examples), provider-level prompt caching can eliminate most of that cost on repeated calls.
Anthropic’s prompt caching gives a 90% reduction on cached input reads, pricing cache reads at 10% of the standard input rate. Google offers context caching at similar discounts on Gemini models. The official Anthropic caching documentation covers the write vs. read pricing tiers and the 5-minute versus 1-hour TTL options in detail.
The requirement is a stable prefix. If your system prompt changes on every request, caching does not help. If you have a 5,000-token instruction block that does not change per user, caching cuts that chunk to 10% of its listed cost on every repeat call. At volume, this compounds significantly.
3. Batch API
OpenAI, Anthropic, and Google all offer a batch mode where you submit requests asynchronously and accept up to a 24-hour completion window in exchange for a 50% discount on all tokens. The batch API cuts both input and output costs by 50% across all eligible models.
The OpenAI Batch API documentation details the request format, eligible models, and the JSONL file structure you need to implement it. Anthropic and Google have analogous endpoints with the same discount structure.
Tasks that are batch-eligible: nightly data enrichment, bulk document classification, background summarization jobs, report generation that runs overnight, embeddings pipelines. Tasks that are not: real-time chat, anything with a user waiting synchronously for a response.
Combining caching plus batching on a stable-prompt workload gets your effective per-token cost to roughly 25% of standard on-demand rates for the input component.
4. Output Token Reduction
Because output tokens are 4 to 8x more expensive than input tokens, reducing average output length has an outsized effect on your bill. This does not mean truncating quality. It means being explicit in your prompts.
Tactics that work:
- Instruct the model to return structured JSON instead of prose explanations.
- Set a max token limit per call and size it to what you actually need.
- For classification tasks, ask for the label only, not a justification.
- Replace chain-of-thought reasoning in the output with chain-of-thought in the prompt (think first, then answer briefly).
A pipeline returning 2,000-token outputs that could return 300-token structured outputs without quality loss is paying 6x too much on the output side.
Why Your Cost Estimates Are Probably Already Stale
LLM inference prices are falling faster than almost any other technology in recent history. Epoch AI research found a median annual price decline of 50x, accelerating to 200x per year when measured from January 2024 onward. A model tier that was expensive 12 months ago may now be your cheapest option.
This creates two practical problems. First, cost projections made during initial architecture decisions go stale quickly. A model that cost $15/M tokens when you wrote your business case may cost $3/M today, which means that expensive mid-tier model you settled on might now be pointless compared to the next tier down. Second, the landscape of available models is shifting constantly, so the comparison you did six months ago needs to be redone.
I run a model cost audit every quarter on production pipelines. It takes half a day and has paid for itself every time.
The Spend Problem Getting Worse at the Enterprise Level
Despite prices falling, total AI API spend is rising fast because usage is growing faster than prices drop. Enterprise monthly AI API spend increased from $63,000/month in 2024 to $85,500/month in 2025, a 36% year-over-year rise, with 40% of companies now spending over $10 million per year on AI.
The structural issue is observability. Most teams do not track AI spend at the per-request or per-feature level. They see a monthly API bill and cannot trace it back to which pipeline, which model call, or which prompt change caused a spike. Without per-transaction cost tracking, you cannot optimize what you cannot see.
The fix is straightforward: log token usage per call, tag it by feature and model, and pipe it into whatever spend tracking tool you use. This is basic engineering hygiene that most teams skip because the first prototype does not need it and then it never gets added.
Frequently Asked Questions
How much does it cost to run 1 million tokens through GPT-4o?
GPT-4o is priced at $2.50 per million input tokens and $10.00 per million output tokens. So 1 million input tokens costs $2.50. If those tokens generate 1 million output tokens (which would imply very long completions), the output cost is $10.00. Most real workloads produce far fewer output tokens than input tokens, so the blended cost per million total tokens usually lands between $2.50 and $4.
What is the cheapest LLM API for high-volume production use?
Gemini 2.5 Flash-Lite is currently the cheapest capable model at $0.10 per million input tokens and $0.40 per million output tokens. For tasks that do not require frontier-level reasoning (classification, extraction, summarization, simple generation), it outperforms the cost-per-quality ratio of every major competitor by a substantial margin. Haiku 4.5 from Anthropic ($1.00/$5.00) is the runner-up if you need Claude’s instruction-following characteristics.
How does prompt caching actually reduce LLM API costs?
Prompt caching works by storing a prefix of your input (typically the system prompt and any static context) server-side so it does not have to be re-processed on repeated calls. Anthropic charges 10% of the standard input rate for cache reads, a 90% discount. The catch is the prefix must be stable: if you change the system prompt per user or per session, there is no cache hit and no savings. The Anthropic caching docs explain cache write pricing and TTL details.
What is the difference between input tokens and output tokens in API pricing?
Input tokens are everything you send to the model: the system prompt, the user message, conversation history, retrieved documents, and any examples. Output tokens are the tokens the model generates in its response. Output tokens cost 4 to 8 times more than input tokens across all major providers. This means generative tasks (long-form writing, code generation) are inherently more expensive than retrieval or classification tasks, even at the same total context length.
Is the OpenAI Batch API worth it, and what tasks qualify?
Yes, for any async workload. The Batch API gives a 50% discount on all input and output tokens in exchange for accepting up to a 24-hour completion window. Tasks that qualify include overnight data enrichment, bulk labeling, background summarization, embedding generation, and any pipeline that does not have a user waiting for a real-time response. The OpenAI Batch API documentation covers the implementation details. Anthropic and Google offer equivalent batch endpoints with the same discount structure.
How do I estimate my monthly LLM API bill before going to production?
Start with your expected daily request volume, then measure the average input and output token count per request on a sample of real data (not synthetic prompts). Multiply: (daily requests x avg input tokens / 1,000,000 x input rate) + (daily requests x avg output tokens / 1,000,000 x output rate), then multiply by 30. Add a 30 to 50% buffer for prompt engineering iterations and traffic variance. Build this calculation in a spreadsheet and run it for each model tier you are considering before committing to an architecture.
Which is cheaper: Claude, GPT-4o, or Gemini for the same workload?
It depends on the model tier within each family. At the frontier end, Claude Sonnet 4.6 at $3/$15 and GPT-4o at $2.50/$10 are in the same range. Gemini 3.1 Pro at $2/$12 is slightly cheaper. But within each provider family, the cheapest tier (Haiku, Flash-Lite) is dramatically less expensive than the flagship. The real comparison is not provider vs. provider at the same tier. It is finding which model across all providers clears your quality bar at the lowest cost, which requires running evals on your specific task and data.
How I Help Startups Build AI Pipelines That Do Not Bleed Money
I have seen this pattern at a dozen companies: the prototype runs on GPT-4 (or the current frontier model), it gets to production, and nobody ever questions the model choice again. Two years later the AI API line item is material and nobody knows why. The architecture that made sense for a 100-request demo is now processing 100,000 requests a day at a price point that made sense in 2023.
At Sparkable, my team builds AI features and pipelines with cost observability baked in from day one. We design model routing, implement caching where the architecture supports it, and set up the token-level logging that lets you actually see what your pipeline is spending and where. You own all the IP and infrastructure, and we stay involved as your usage grows.
If you are about to build an AI feature and want a sanity check on your model choices and cost projections before you commit to an architecture, book a free 30-minute consultation at sparkable.dev/consult. No pitch. Just the math.