The context window is the hard ceiling on how much the model can process in a single request. It’s measured in tokens — roughly 4 characters per token in English — and it covers both input and output combined.
Input vs output tokens
Input tokens come from everything you send: the messages array (full conversation history), the system prompt, and all tool definitions. Output tokens are what the model generates in response.
The max_tokens parameter only limits output. It does not constrain input. If your input is 190K tokens in a 200K context window, you have 10K tokens left for the response — but max_tokens doesn’t reduce your input.
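The arithmetic can be sketched in a few lines. This is an illustration with made-up numbers, not any SDK's API; the 200K window and the `max_output_budget` helper are assumptions for the example.

```python
# Illustrative only: assume a 200K-token context window.
CONTEXT_WINDOW = 200_000

def max_output_budget(input_tokens: int, context_window: int = CONTEXT_WINDOW) -> int:
    """Output can only use whatever the input leaves behind in the shared window."""
    return max(context_window - input_tokens, 0)

# 190K tokens of input leaves 10K for the response,
# regardless of how high max_tokens is set.
print(max_output_budget(190_000))  # 10000
```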
Why costs grow with conversation length
Because the API is stateless, every request includes the full conversation history. At turn 1, you send 1 message. At turn 50, you send 99 messages (50 user + 49 prior assistant). Each turn adds both the user’s new message and the model’s previous response to the history.
This means input tokens per request grow roughly linearly with conversation length, and the cumulative input tokens across the whole conversation grow roughly quadratically. A 50-turn conversation doesn’t cost 50× the first turn — it costs far more, because each successive request carries every prior turn as baggage.
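A toy model makes the growth concrete. Assume (purely for illustration) that every message is about 500 tokens; the `input_tokens_at_turn` helper is hypothetical.

```python
# Toy model: every user message and assistant reply is ~500 tokens.
TOKENS_PER_MESSAGE = 500

def input_tokens_at_turn(turn: int) -> int:
    # The request at turn n carries n user messages plus n-1 prior
    # assistant replies: (2n - 1) messages in total.
    return (2 * turn - 1) * TOKENS_PER_MESSAGE

print(input_tokens_at_turn(1))    # 500   -> per-request input grows linearly...
print(input_tokens_at_turn(50))   # 49500
total = sum(input_tokens_at_turn(t) for t in range(1, 51))
print(total)                      # 1250000 -> ...but the cumulative bill is
                                  # 2500x turn 1, not 50x: quadratic growth
```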
This is the fundamental reason long conversations need context management strategies: summarization, fact extraction, or sliding windows. Without them, costs and latency grow unsustainably.
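The simplest of those strategies, a sliding window, can be sketched as follows. Both helpers are hypothetical, and the token estimate uses the rough 4-characters-per-token heuristic from above; a real implementation would use an actual tokenizer.

```python
# Minimal sliding-window sketch: keep only the newest messages
# that fit within a token budget, dropping the oldest first.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token in English.
    return max(len(text) // 4, 1)

def sliding_window(messages: list[dict], budget: int) -> list[dict]:
    """Walk the history newest-first, keeping messages until the budget is spent."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = [
    {"role": "user", "content": "x" * 4000},       # ~1000 tokens (oldest)
    {"role": "assistant", "content": "y" * 4000},  # ~1000 tokens
    {"role": "user", "content": "z" * 2000},       # ~500 tokens (newest)
]
print(len(sliding_window(history, budget=1600)))  # 2 -> the oldest message is dropped
```

The trade-off: a sliding window is cheap and simple but forgets early context entirely, which is why summarization or fact extraction is often layered on top.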
The context window is shared
Input tokens and output tokens share the same context window. A 200K context window with 195K of input leaves only 5K for output. The model doesn’t get a separate output allocation — it all comes from one pool.
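A practical consequence: before sending a request, clamp the requested output so input plus output never exceeds the window. This `safe_max_tokens` helper and the 200K figure are illustrative assumptions, not part of any SDK.

```python
# Sketch: clamp the requested output budget so that
# input_tokens + max_tokens never exceeds the shared window.
CONTEXT_WINDOW = 200_000

def safe_max_tokens(input_tokens: int, desired: int) -> int:
    remaining = CONTEXT_WINDOW - input_tokens
    return max(min(desired, remaining), 0)

# With 195K of input, asking for 8K of output gets clamped to the 5K that's left.
print(safe_max_tokens(195_000, desired=8_000))  # 5000
```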
One-liner: Input tokens grow with every turn because the full history is resent each time — max_tokens only limits output, and both share the same context window.