Understanding Context Windows: Finally

The Moment It Forgets You

You're deep into a conversation with an AI. You've been building context for twenty minutes -- explaining your project architecture, the bug you're hunting, the three approaches you've already tried. You type a follow-up question, and the model responds as if you've never spoken before. It suggests the exact approach you just told it didn't work. It asks you to clarify something you explained in detail ten messages ago.

It didn't crash. It didn't get distracted. It simply... ran out of room.

What you just bumped into was the context window -- the invisible boundary that defines everything a language model can see at any given moment. And if you've ever used ChatGPT, Claude, Gemini, or any other LLM and felt like the conversation "reset" on you, that's the context window at work.

But to understand what a context window really is, we first need to talk about something more fundamental: the pieces that fill it.

Tokens: The Atoms of Language

Here's a question that sounds simple but isn't: how does a computer read a sentence?

You and I see words. We see meaning, grammar, emotion. A language model sees none of that -- at least, not directly. What it sees are tokens: small numerical chunks that the model uses as its basic unit of processing. Tokens are to an LLM what atoms are to chemistry. Everything is built from them.

A token might be a whole word, part of a word, a single character, or even a punctuation mark. The word "hello" is typically one token. The word "unbelievable" might be split into three: un, believ, able. A newline character? That's a token too.

Here's a rough rule of thumb for English text:

Metric                       | Approximate Value
Characters per token         | ~4
Tokens per word              | ~1.3
Words per 100 tokens         | ~75
Tokens per page (~500 words) | ~650-830

So when someone says a model has a "128K context window," that's 128,000 tokens -- roughly 96,000 English words, or about 190 pages of standard text. That's a decent novel.
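If you want to see these splits for yourself, most tokenizers ship as small libraries you can poke at. Here's a minimal sketch using OpenAI's tiktoken package and its cl100k_base encoding; exact splits and counts vary from tokenizer to tokenizer, so treat the output as illustrative rather than canonical.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era BPE tokenizer

# Inspect how individual words break into tokens.
for word in ["hello", "unbelievable", "tokenization"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")

# Sanity-check the ~4 characters-per-token rule of thumb on a short sentence.
text = "The quick brown fox jumps over the lazy dog."
print(round(len(text) / len(enc.encode(text)), 1), "characters per token")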

But here's where it gets interesting: these numbers are only reliable for English.

The Language Tax: When Not All Words Are Equal

Tokenizers -- the algorithms that split text into tokens -- are trained primarily on English-heavy datasets. This means English text gets tokenized efficiently: common words stay intact, frequent phrases get compressed. Other languages aren't so lucky.

Let's look at the same concept expressed across different languages and see how the token count shifts:

"The cat sat on the mat" (English)

  • Tokens: The, cat, sat, on, the, mat
  • Count: 6 tokens

"El gato se sentó en la alfombra" (Spanish)

  • Tokens: El, g, ato, se, sent, ó, en, la, alf, ombra
  • Count: ~10 tokens

"猫がマットの上に座った" (Japanese)

  • Each character can cost 1-3 tokens depending on the tokenizer
  • Count: ~12-15 tokens

"Die Katze saß auf der Matte" (German)

  • Tokens: Die, K, atze, sa, ß, auf, der, Matt, e
  • Count: ~9 tokens

"Кот сидел на коврике" (Russian)

  • Cyrillic characters often fragment into multiple tokens
  • Count: ~12 tokens

The same semantic content -- a cat on a mat -- costs wildly different amounts depending on the language you express it in. This is sometimes called the "language tax": speakers of non-English languages consume more tokens for the same information, which means they fill up the context window faster, pay more for API usage, and get less effective context.

For Chinese, Japanese, and Korean (CJK) text, the difference is particularly stark. Most BPE tokenizers assign roughly 1-3 tokens per character. Since a single Chinese character often carries the semantic weight of an entire English word, the same document in Chinese can consume 2-4x as many tokens as its English equivalent.

This isn't a theoretical concern. If your context window is 128,000 tokens and you're working in Japanese, your effective window might feel more like 40,000-60,000 English-equivalent words rather than the 96,000 an English speaker gets. Same window, less room.
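You can measure the language tax directly by running the same sentence through a single tokenizer. A minimal sketch, again assuming tiktoken's cl100k_base encoding; your counts will differ with other tokenizers, but the ranking tends to hold.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentences = {
    "English":  "The cat sat on the mat",
    "Spanish":  "El gato se sentó en la alfombra",
    "Japanese": "猫がマットの上に座った",
    "German":   "Die Katze saß auf der Matte",
    "Russian":  "Кот сидел на коврике",
}

# Same meaning, very different token bills.
for language, text in sentences.items():
    print(f"{language:8s} {len(enc.encode(text)):3d} tokens")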

How Tokenization Actually Works

So how does a tokenizer decide where to cut? The dominant approach today is called Byte Pair Encoding (BPE), and its origin story is delightfully unexpected.

BPE was invented in 1994 by Philip Gage as a data compression algorithm. The idea was simple: look at a stream of bytes, find the pair that occurs most frequently, replace it with a new symbol, and repeat. In 2015, researchers Sennrich, Haddow, and Birch adapted this technique for natural language processing, and it turned out to work beautifully.

Here's how it works, step by step:

Step 1: Start with characters. The algorithm begins with every individual character in the training corpus as its vocabulary.

Vocabulary: [a, b, c, d, e, f, g, h, i, ...]

Step 2: Count pairs. It scans the entire corpus and counts how often each pair of adjacent symbols appears. Maybe t + h appears 45 million times.

Step 3: Merge the most frequent pair. The pair t + h becomes a new token th. The vocabulary now includes th as a single unit.

Step 4: Repeat. Scan again with the updated vocabulary. Maybe now th + e is the most common pair. Merge it into the. Keep going until you reach a target vocabulary size (typically 50,000-100,000 tokens).

After thousands of merges, common words like "the", "and", and "is" become single tokens. Less common words get split into recognizable subword pieces. The word "tokenization" might become token + ization, while "antidisestablishmentarianism" gets broken into many small pieces.
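To make the merge loop concrete, here is a toy character-level BPE trainer. It is a teaching sketch, not how production tokenizers are built: real ones operate on bytes, handle pre-tokenization and special tokens, and train on terabytes of text.

from collections import Counter

def train_bpe(corpus, num_merges):
    # Step 1: start from individual characters.
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent pair of symbols in the corpus.
        pairs = Counter()
        for symbols in words:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a single new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Step 4: rewrite the corpus with the merged symbol and repeat.
        for i, symbols in enumerate(words):
            merged, j = [], 0
            while j < len(symbols):
                if j + 1 < len(symbols) and (symbols[j], symbols[j + 1]) == best:
                    merged.append(symbols[j] + symbols[j + 1])
                    j += 2
                else:
                    merged.append(symbols[j])
                    j += 1
            words[i] = merged
    return merges

print(train_bpe(["the", "then", "there", "that", "this"], num_merges=4))
# Expect merges like ('t', 'h') first, then ('th', 'e'), mirroring the steps above.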

Three major tokenization families are in use today:

Algorithm                | Used By                              | Key Feature
Byte Pair Encoding (BPE) | GPT-2, GPT-3, GPT-4, LLaMA           | Merges by frequency; byte-level variant handles any input
WordPiece                | BERT, DistilBERT                     | Merges by likelihood maximization; uses ## prefix for continuations
SentencePiece            | T5, ALBERT, many multilingual models | Language-agnostic; treats input as a raw Unicode stream with no whitespace assumptions

SentencePiece deserves special attention. Most tokenizers assume that spaces separate words -- which works fine for English, Spanish, or German. But Chinese, Japanese, and Thai don't use spaces between words. SentencePiece solves this by treating the input as a raw character stream and learning boundaries from the data itself, making it genuinely language-agnostic.

Now, Finally: What Is a Context Window?

With tokens understood, we can define the context window precisely:

A context window is the maximum number of tokens that a language model can process in a single interaction. It includes everything: the system prompt, the conversation history, any documents or code you've attached, and the model's response.

That last part is important. The context window isn't just for input. If a model has a 128,000-token window and your input uses 100,000 tokens, only 28,000 tokens remain for the response. Stuff the window too full, and you leave the model with barely enough room to answer.

Here's a simplified picture of what's inside a context window at any given moment:

┌──────────────────────────────────────────────────────┐
│                  CONTEXT WINDOW                      │
│                                                      │
│  ┌─────────────────────────────────────────────────┐ │
│  │ System Prompt                                   │ │
│  │ "You are a helpful coding assistant..."         │ │
│  └─────────────────────────────────────────────────┘ │
│  ┌─────────────────────────────────────────────────┐ │
│  │ Conversation History                            │ │
│  │ User: "Here's my project structure..."          │ │
│  │ Assistant: "I see, let me analyze..."           │ │
│  │ User: "Now look at this error..."               │ │
│  │ Assistant: "The issue is in line 42..."         │ │
│  │ User: "Can you also check the tests?"           │ │
│  └─────────────────────────────────────────────────┘ │
│  ┌─────────────────────────────────────────────────┐ │
│  │ Attached Context                                │ │
│  │ [Your uploaded files, RAG results, etc.]        │ │
│  └─────────────────────────────────────────────────┘ │
│  ┌─────────────────────────────────────────────────┐ │
│  │ Model's Response (being generated)              │ │
│  │ "Looking at your test file, I notice..."        │ │
│  └─────────────────────────────────────────────────┘ │
│                                                      │
└──────────────────────────────────────────────────────┘
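The arithmetic behind that picture is simple but easy to forget. Here's a rough budgeting sketch; it uses tiktoken for counting, so substitute whatever tokenizer actually matches your model, and treat the reserved-reply figure as an arbitrary example.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text):
    return len(enc.encode(text))

CONTEXT_WINDOW = 128_000     # the model's advertised window
RESERVED_FOR_REPLY = 4_000   # room we want to keep for the response

def remaining_budget(system_prompt, history, attachments):
    used = count_tokens(system_prompt)
    used += sum(count_tokens(message) for message in history)
    used += sum(count_tokens(doc) for doc in attachments)
    # Whatever is left is all the room available for new context.
    return CONTEXT_WINDOW - RESERVED_FOR_REPLY - used

print(remaining_budget(
    "You are a helpful coding assistant.",
    ["Here's my project structure...", "I see, let me analyze..."],
    attachments=[],
))
# If this ever goes negative, something has to be trimmed, summarized, or dropped.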

One crucial thing to understand: LLMs are stateless. They don't actually "remember" previous messages. Every time you send a new message in a conversation, the client software (ChatGPT, Claude, your API wrapper) bundles the entire conversation history and sends it all at once. The model processes the whole thing from scratch each time.

That's why conversations feel seamless -- until they don't. The client manages the illusion of memory by replaying history. But when that history exceeds the context window, something has to give. Older messages get trimmed, summarized, or dropped entirely. And the model has no idea it's lost anything.
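In code, the "illusion of memory" is nothing more than a growing list that gets re-sent on every turn. A minimal sketch of the role/content message pattern most chat APIs follow; call_model below is a hypothetical stand-in for whatever client library you actually use.

def call_model(messages):
    # Hypothetical stand-in for a real API call. A production client would send
    # `messages` to the provider's chat endpoint and return the generated text.
    return f"(reply generated from {len(messages)} messages of history)"

conversation = [
    {"role": "system", "content": "You are a helpful coding assistant."},
]

def send_turn(user_message):
    conversation.append({"role": "user", "content": user_message})
    # The entire history goes out with every request; the model itself
    # remembers nothing between calls.
    reply = call_model(conversation)
    conversation.append({"role": "assistant", "content": reply})
    return reply

send_turn("Here's my project structure...")
send_turn("Now look at this error...")  # re-sends the three earlier messages plus the new one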

A Brief History of Growing Windows

To appreciate where we are, it helps to see where we've been.

When ChatGPT launched in November 2022, its context window was 4,096 tokens -- roughly 3,000 words. You could fit maybe five or six pages of text. For casual conversation, that was fine. For serious work? It was like trying to write a novel on a Post-it note.

The growth since then has been nothing short of extraordinary:

Year | Model           | Context Window | Rough Equivalent
2022 | GPT-3.5         | 4K tokens      | ~6 pages
2023 | GPT-4           | 8K-32K tokens  | ~12-48 pages
2023 | Claude 2        | 100K tokens    | ~150 pages
2024 | GPT-4 Turbo     | 128K tokens    | ~190 pages
2024 | Gemini 1.5 Pro  | 1M tokens      | ~1,500 pages
2025 | Claude Opus 4   | 200K tokens    | ~300 pages
2025 | GPT-4.1         | 1M tokens      | ~1,500 pages
2025 | Gemini 2.5 Pro  | 1M tokens      | ~1,500 pages
2025 | Llama 4 Scout   | 10M tokens     | ~15,000 pages
2026 | Claude Opus 4.6 | 1M tokens      | ~1,500 pages

From 6 pages to 15,000 pages in roughly three years. That's a 2,500x increase.

But bigger doesn't always mean better. And this is where things get really interesting.

The Fine Print: When Bigger Isn't Better

The Advertised vs. Effective Window

Here's something the marketing materials don't emphasize: a model's stated context window and its effective context window are not the same thing.

Research using the RULER benchmark and other evaluation frameworks has consistently found that models claiming 200K tokens of context begin to degrade well before that limit. A model advertising 1 million tokens might sustain high-quality reasoning across only 50-60% of that window. Performance doesn't degrade gradually -- it can drop off suddenly and dramatically.

Think of it like a highway's speed limit versus actual traffic flow. The sign says 120 km/h, but during rush hour you're doing 40.

Lost in the Middle

In 2023, researchers from Stanford and the University of Washington published a landmark paper called "Lost in the Middle: How Language Models Use Long Contexts." Their finding was striking and has held up remarkably well:

LLMs exhibit a U-shaped performance curve. They're excellent at using information placed at the beginning of the context and the end of the context. But information buried in the middle? Performance drops by more than 30%.

  Performance
  ▲
  │
  █                                           █
  █ █                                       █ █
  █ █ █                                   █ █ █
  █ █ █ █                               █ █ █ █
  █ █ █ █ █                           █ █ █ █ █
  █ █ █ █ █ █ █                   █ █ █ █ █ █ █
  █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █
  └──────────────────────────────────────────────▶
  Beginning           Middle              End
            Position of relevant information

Two mechanisms drive this:

  1. Causal attention bias. In decoder-only transformers, each token can only attend to the tokens that came before it, so the earliest tokens are visible to every later position and accumulate disproportionately high attention scores -- a "primacy" effect similar to how we remember the first items in a list more easily.

  2. Positional encoding decay. Most modern models use Rotary Positional Embeddings (RoPE), which introduce a natural decay in attention scores as the distance between tokens increases. Combined with the primacy effect, this creates a systematic blind spot in the middle.

The practical takeaway? If you're building systems that feed documents to an LLM (like RAG pipelines), put the most important information at the beginning or end of the prompt. Don't bury critical details in the middle of a 50-page context dump.
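One cheap way to act on this, if you already have documents ranked by relevance (from a retriever, say), is to reorder them so the strongest material sits at the edges of the prompt and the weakest sinks toward the middle. A small illustrative sketch; the function name and the ranking are assumptions, not a standard API.

def order_for_long_context(docs_by_relevance):
    # Alternate documents between the front and the back of the prompt so the
    # most relevant ones land at the edges and the least relevant in the middle.
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

docs = ["doc1 (most relevant)", "doc2", "doc3", "doc4", "doc5 (least relevant)"]
print(order_for_long_context(docs))
# ['doc1 (most relevant)', 'doc3', 'doc5 (least relevant)', 'doc4', 'doc2']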

Context Rot

Research from Chroma introduced the concept of "context rot": even when relevant information is present and technically within the window, the sheer volume of surrounding context introduces noise. More tokens don't just mean more room -- they also mean more distractions. The model has to pick signal from an increasingly noisy haystack.

The Quadratic Wall

There's a deep technical reason why context windows have limits, and it comes down to how the transformer architecture works.

The self-attention mechanism -- the core innovation that makes transformers powerful -- computes relationships between every pair of tokens in the context. This means computation scales quadratically with context length. Double the context, and you quadruple the computation. A 100K-token context requires roughly 10,000x more compute than a 1K-token context.

Context Size | Relative Compute (Attention)
1K tokens    | 1x
10K tokens   | 100x
100K tokens  | 10,000x
1M tokens    | 1,000,000x
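The table is just n-squared arithmetic, which you can verify in a few lines; the memory note at the end assumes a naive implementation that materializes the full attention matrix.

# Attention computes one score for every pair of tokens, so cost grows with
# the square of the context length.
for n_tokens in [1_000, 10_000, 100_000, 1_000_000]:
    pair_count = n_tokens ** 2
    relative = pair_count // (1_000 ** 2)
    print(f"{n_tokens:>9,} tokens -> {relative:>12,}x the attention compute of 1K tokens")

# At 1M tokens, even storing one fp16 score per pair would take
# 1_000_000**2 * 2 bytes = 2 TB per attention matrix in a naive implementation.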

This is why larger context windows cost more, run slower, and consume enormous amounts of GPU memory. It's also why researchers are actively exploring alternatives to full attention:

  • FlashAttention optimizes GPU memory access patterns to compute exact attention more efficiently.
  • Sparse attention (Longformer, BigBird) restricts each token to attend to only a subset of other tokens.
  • Linear attention (Linformer, Performer) approximates attention with O(n) scaling instead of O(n^2).
  • State-space models (Mamba, RWKV) take a fundamentally different approach, processing tokens sequentially with a carried-forward state, achieving linear scaling by design.

The race is on to make long context both possible and practical.

Why It All Matters

Understanding context windows isn't just academic trivia. It changes how you use AI tools, how you design systems around them, and what you can realistically expect.

For everyday users: If your conversation with an AI starts feeling "off" after a while -- the model repeating itself, losing track of details, contradicting earlier statements -- you're probably approaching the context limit. Starting a fresh conversation (with a concise summary of prior context) often works better than pushing an old one.

For developers building with LLMs: Context management is a critical design skill. You need to decide what goes into the window and what gets left out. RAG systems, summarization chains, and sliding-window strategies are all techniques for making the most of finite context. The days of "just dump everything in and let the model figure it out" are behind us -- or should be.

For multilingual applications: If you're building products for non-English markets, budget for higher token consumption. A feature that works beautifully with 50K tokens of English input might hit the context limit when handling the same semantic content in Chinese or Arabic.

For cost-conscious teams: API pricing is per-token. Every token in your prompt is a token you pay for. Understanding tokenization helps you write more efficient prompts, strip unnecessary boilerplate, and choose the right model size for the task.
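To make the context-management point for developers concrete, here is a minimal sliding-window sketch: pin the system prompt, walk backwards from the newest message, and drop whatever no longer fits. The count_tokens argument is a stand-in for your model's real tokenizer, and the crude 4-characters-per-token estimate below is just for the demo.

def trim_history(system_prompt, history, budget, count_tokens):
    # Always keep the system prompt; spend the rest of the budget on the
    # newest messages first, dropping the oldest once the budget runs out.
    used = count_tokens(system_prompt)
    kept = []
    for message in reversed(history):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # back to chronological order

# Crude stand-in tokenizer: ~4 characters per token for English prose.
approx = lambda text: max(1, len(text) // 4)
history = ["Here's my project structure...", "Now look at this error...",
           "Can you also check the tests?"]
print(trim_history("You are a helpful assistant.", history,
                   budget=20, count_tokens=approx))
# The oldest message falls off first; the system prompt and newest turns survive.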

Looking Forward

Context windows have grown roughly 2,500x in three years. But the deeper story isn't just about making windows bigger -- it's about making them smarter.

The "lost in the middle" problem tells us that raw size isn't enough. The industry is moving toward:

  • Better positional encodings that let models attend to information equally regardless of where it appears in the context.
  • Hybrid architectures that combine efficient sequential processing (like state-space models) with attention for the parts that truly need it.
  • Smarter context management at the application layer -- better chunking, re-ranking, and strategic placement of information.
  • Fairer tokenization across languages, with language-agnostic tools like SentencePiece and ongoing research into learned tokenizers working to close the language-tax gap.

The context window isn't going away. Even as models get larger and more capable, the boundary between what the model can see and what it can't will continue to shape every interaction. Understanding that boundary -- and working skillfully within it -- is one of the most practical things you can learn about working with AI today.

The next time a model forgets something you told it, you'll know exactly why. And more importantly, you'll know what to do about it.

References