Codex Hidden Cost: 2x Billing for Long Context Over 272K

How to Avoid the GPT-5.4 Token Surcharge Trap

Sk Jabedul Haque

Apr 22, 2026 • 5 min read • 226 views

OpenAI's GPT-5.4 and Codex offer a 1-million token context window, but using it carries a hidden cost. Once your input prompt exceeds 272,000 tokens, a long-context surcharge is triggered: you are billed 2x for all input tokens and 1.5x for all output tokens for that entire session. This guide explains why the threshold exists, how it inflates your API and Codex CLI bills, and how to trim your codebase context to stay under the 272K limit.

The 272K Token Trap: How the Surcharge Works

When OpenAI launched GPT-5.4 in March 2026, the headline feature was the 1-million token context window, enough to ingest entire codebases, large legal documents, and extensive data sets in a single request. Since then, developer forums and GitHub issues have filled with complaints about unexpectedly high API and Codex bills, arriving shortly after the wave of "Codex not supported" errors.

The culprit is a tiered pricing structure that activates at exactly 272,000 input tokens. Here is how the billing actually works:

Input Token Volume	Input Token Cost	Output Token Cost
Up to 272,000 tokens	Standard Rate (e.g., $2.50 / 1M)	Standard Rate (e.g., $10.00 / 1M)
272,001+ tokens	2x Multiplier ($5.00 / 1M)	1.5x Multiplier ($15.00 / 1M)

The critical catch: the multiplier is retroactive for the entire session. If you send 275,000 tokens, you do not pay double only for the 3,000 tokens over the limit; you pay double for all 275,000. At the example rates above, that is $1.375 of input instead of roughly $0.69. This can instantly double the cost of a long-running coding session in the Codex CLI.
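The tiered rule can be sketched as a small cost estimator. This is a minimal illustration, assuming the example rates from the table above ($2.50 and $10.00 per 1M tokens) rather than official pricing; the function name `request_cost` is hypothetical.

```python
# Example rates taken from the table above (USD per 1M tokens).
# They are illustrative, not official pricing.
THRESHOLD = 272_000
INPUT_RATE = 2.50
OUTPUT_RATE = 10.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD under the tiered rule described above."""
    over = input_tokens > THRESHOLD
    # Past the threshold, the multipliers apply to ALL tokens, not the overage.
    input_mult = 2.0 if over else 1.0
    output_mult = 1.5 if over else 1.0
    input_cost = input_tokens / 1_000_000 * INPUT_RATE * input_mult
    output_cost = output_tokens / 1_000_000 * OUTPUT_RATE * output_mult
    return input_cost + output_cost


if __name__ == "__main__":
    for tokens in (272_000, 275_000):
        print(tokens, round(request_cost(tokens, 0), 4))
```

Under these assumed rates, 272,000 input tokens come to about $0.68, while 275,000 come to about $1.38: 3,000 extra tokens roughly double the input bill.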

Why Exactly 272K?

The 272,000 figure seems arbitrary until you look at the underlying model architecture. GPT-5.4 has an internal memory optimization tiering system. Many variants (like GPT-5.4 mini) actually operate on a native 400K total context window, of which the model reserves a strict 128K tokens for maximum output generation.

400,000 (Total Context) - 128,000 (Reserved Output) = 272,000 (Max Efficient Input)

When you push past 272K input tokens, OpenAI has to shift the request to more expensive, un-compacted memory clusters to maintain attention coherence across the massive context. They pass this infrastructure cost directly to you via the 2x surcharge.
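The arithmetic above can be turned into a quick headroom check. This is a sketch using only the figures stated in this article; the helper name `input_headroom` is hypothetical.

```python
# Figures as stated in the article.
TOTAL_CONTEXT = 400_000
RESERVED_OUTPUT = 128_000
EFFICIENT_INPUT = TOTAL_CONTEXT - RESERVED_OUTPUT  # 272,000


def input_headroom(input_tokens: int) -> int:
    """Tokens left before the surcharge threshold; negative once past it."""
    return EFFICIENT_INPUT - input_tokens


if __name__ == "__main__":
    print(input_headroom(200_000))
    print(input_headroom(275_000))
```

A negative result means the request is already in the surcharged tier.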

How This Impacts Codex Developers

If you are using the Codex CLI for agentic coding, you are at high risk of hitting this limit silently.

Workspace Scanning: When you ask Codex to "refactor the auth system," it often auto-ingests your entire workspace (similar to how Claude 4.7 handles file-based memory). A mid-sized Next.js project can easily exceed 300K tokens even with `node_modules` ignored.
Context Accumulation: Even if your initial prompt is small, a long conversational session accumulates context. By message #15, your entire chat history is being sent back to the API, silently crossing the 272K line and doubling your costs for the rest of the session.
ChatGPT Business Seats: As of April 2026, ChatGPT Business workspaces can mix standard seats with usage-based Codex seats. If your team is using usage-based seats and routinely dumping giant codebases into the context, your monthly invoice will skyrocket.
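Context accumulation is easy to underestimate, so a rough simulation may help. The sketch below assumes a simplified model in which every turn resends the full history and each turn adds a fixed number of tokens; real sessions vary, and the function name and sample figures are hypothetical.

```python
THRESHOLD = 272_000


def first_turn_over_threshold(initial_tokens, tokens_per_turn, max_turns=100):
    """Return the 1-based turn whose input first exceeds THRESHOLD, or None.

    Simplified model: the whole history is resent each turn, and every
    turn (message, reply, tool output) adds `tokens_per_turn` to it.
    """
    context = initial_tokens
    for turn in range(1, max_turns + 1):
        if context > THRESHOLD:
            return turn
        context += tokens_per_turn
    return None


if __name__ == "__main__":
    # Hypothetical session: 50K tokens of initial context, 16K added per turn.
    print(first_turn_over_threshold(50_000, 16_000))
```

With these assumed figures the session crosses the line on turn 15, even though no single message is anywhere near 272K tokens.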

How to Avoid the 2x Surcharge

You can manage your context and protect your billing by implementing these best practices:

1. Do Not Opt In to the 1M Context Window Unless You Need It (API Users)

If you are hitting the API directly, the 1M context window is an opt-in feature. Make sure you are not requesting the extended window by accident: do not set `model_context_window` to 1M unless you actually need it.

2. Use .codexignore Aggressively

As with `.gitignore`, list the dense, token-heavy files that contribute nothing to the logic so that Codex does not read them. Always ignore:

package-lock.json and yarn.lock (can easily be 50K+ tokens)
Compiled assets, dist/, and build/ directories
Massive JSON mock data files
SVG icon libraries
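To decide what belongs in the ignore file, it can help to measure which files are heaviest. The following is a rough sketch, not part of Codex: it does not read `.codexignore`, the pattern list is a hypothetical mirror of the items above, and the 4-characters-per-token ratio is a crude heuristic rather than a real tokenizer.

```python
import fnmatch
import os

# Hypothetical patterns mirroring the list above; "dist/*" and "build/*"
# only match those directories at the top level of the scanned root.
IGNORE_PATTERNS = ["package-lock.json", "yarn.lock", "dist/*", "build/*", "*.svg"]
CHARS_PER_TOKEN = 4  # rough heuristic, not a real tokenizer


def is_ignored(rel_path: str) -> bool:
    """True if the relative path or its file name matches an ignore pattern."""
    rel_path = rel_path.replace(os.sep, "/")
    name = rel_path.rsplit("/", 1)[-1]
    return any(
        fnmatch.fnmatch(rel_path, pattern) or fnmatch.fnmatch(name, pattern)
        for pattern in IGNORE_PATTERNS
    )


def estimate_tokens(root: str) -> dict:
    """Map each non-ignored file under root to an estimated token count."""
    estimates = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            rel_path = os.path.relpath(full_path, root).replace(os.sep, "/")
            if is_ignored(rel_path):
                continue
            try:
                with open(full_path, encoding="utf-8", errors="ignore") as fh:
                    text = fh.read()
            except OSError:
                continue  # unreadable file: skip rather than fail the scan
            estimates[rel_path] = len(text) // CHARS_PER_TOKEN
    return estimates


if __name__ == "__main__":
    ranked = sorted(estimate_tokens(".").items(), key=lambda kv: kv[1], reverse=True)
    for path, tokens in ranked[:10]:
        print(f"{tokens:>10,}  {path}")
```

Running it from a project root prints the ten largest remaining files by estimated tokens, which are the first candidates for the ignore list.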

3. Restart Sessions Frequently

Don't use a single Codex CLI session for an entire week. The conversation history bloats the input token count. Use the clear command or start a new terminal session when shifting to a new task to reset the context window.

4. Switch to GPT-5.4 mini for Routine Tasks

OpenAI released GPT-5.4 mini alongside the flagship model. It is significantly cheaper, highly optimized for coding, and has a 400K context window. If you must process a 300K token codebase, doing it on the mini model (even with a multiplier) is vastly cheaper than doing it on the flagship model.

Conclusion

The 1-million token context window is a major technological achievement, but treating it as an infinite playground is a costly mistake. Developers using Codex and GPT-5.4 must treat the 272K token mark as a hard boundary. By curating your context, using ignore files, and understanding OpenAI's tiered billing, you can use agentic coding without billing surprises.

Frequently Asked Questions

What is the OpenAI long-context surcharge?

The long-context surcharge is a pricing tier for OpenAI models like GPT-5.4. When an API or Codex request exceeds 272,000 input tokens, all input tokens in that session are billed at 2x the standard rate and output tokens at 1.5x.

Why does the surcharge trigger at exactly 272K tokens?

It is based on the model's memory architecture. The base context window is 400,000 tokens, minus a reserved 128,000 tokens for maximum output generation. Exceeding the remaining 272,000 input tokens requires expensive un-compacted memory clusters.

Does the 2x multiplier apply to the whole prompt or just the overage?

It applies to the entire prompt session. If you send 275,000 tokens, you pay the 2x multiplied rate for all 275,000 tokens, not just the 3,000 tokens over the limit.

How can I stop Codex from exceeding the token limit?

Use a `.codexignore` file to block the AI from reading token-heavy files like package-lock.json and compiled assets. Additionally, restart your Codex CLI sessions frequently to prevent conversation history from bloating the context window.

Published: April 23, 2026 | Last Updated: April 23, 2026 | Author: SK Jabedul Haque
