MCP Servers and the Context Window Problem: When AI Agents Suffer from Information Overload
Introduction: The AI Agent That Couldn't See Straight
Picture this: You're at an all-you-can-eat buffet. There's sushi, tacos, pizza, curry, and seventeen different desserts. Sounds amazing, right? Now imagine someone forcing you to take a bite of everything before you can sit down and actually eat. By the time you get to your table, you're too stuffed and overwhelmed to enjoy anything, and you've forgotten which dishes you actually wanted in the first place.
Welcome to the world of AI agents connected to multiple MCP servers.
In late 2024, Anthropic launched the Model Context Protocol (MCP)—an open standard that lets AI agents connect to external systems and tools. The community went wild. Within months, thousands of MCP servers popped up like mushrooms after rain. GitHub integrations! Database connectors! File systems! API wrappers! Calendar managers! The toolbox explosion was magnificent.
But here's the hilarious (and expensive) plot twist: Most MCP implementations shot themselves in the foot. They gave AI agents access to hundreds or thousands of tools by loading all tool definitions directly into the model's context window before the agent even read the user's request. Imagine hiring a Swiss Army knife expert, but before they can help you open a bottle, you force them to memorize the instruction manual for every blade, screwdriver, and corkscrew—including the weird fish scaler nobody uses.
The result? Agents consuming hundreds of thousands of tokens before processing a single query. Performance tanked. Costs skyrocketed. And the agents, overwhelmed by the cognitive equivalent of information poisoning, started making dumb mistakes. We accidentally invented "mind intoxication"—a self-inflicted denial-of-service attack where more capabilities made agents less capable.
The good news? Anthropic's engineering team figured out a solution that's so elegant it feels obvious in hindsight: let agents write code to interact with MCP servers instead of shoving everything into their brains upfront. It's like teaching someone to use a library instead of making them memorize the card catalog.
Let's unpack how we got here, why context windows matter more than you think, and how code execution is changing the game.
What Are MCP Servers, Really?
The Promise: Plug-and-Play Superpowers
The Model Context Protocol is Anthropic's answer to a fundamental AI problem: How do we let language models interact with the real world without building custom integrations for every single tool?
Before MCP, if you wanted an AI agent to check your calendar, query a database, read files, and send emails, you'd write bespoke code for each integration. Every tool required custom glue logic, authentication handling, and prompt engineering. It was the software equivalent of hand-crafting individual adapters for every outlet in your house.
MCP standardized the plug. It's a protocol that lets developers build "MCP servers"—lightweight services that expose tools, resources, and data to AI agents in a consistent format. Think of it as USB-C for AI: one universal connector, infinite devices.
How MCP Servers Work
An MCP server is a JSON-RPC service that exposes:
- Tools: Functions the agent can call (e.g., create_github_issue, query_database, send_email)
- Resources: Read-only data sources (e.g., file contents, API schemas, documentation)
- Prompts: Pre-built templates or workflows the agent can invoke
When an AI agent (the "MCP client") connects to these servers, it receives a manifest of available capabilities. In theory, the agent can then intelligently choose which tools to use based on the user's request.
The protocol is transport-agnostic (works over stdio, HTTP, WebSockets) and language-agnostic (servers can be written in any language). Since its November 2024 launch, adoption has been explosive—the community has built MCP servers for Slack, Notion, Jira, PostgreSQL, file systems, web scrapers, and more.
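To make "a consistent format" concrete, here is a rough sketch of a single tool entry as a server might expose it in its tool listing. The field names (name, description, inputSchema) follow the MCP tool-listing format; the tool itself and its parameters are hypothetical:

```python
# Illustrative shape of one tool entry in an MCP server's tool listing.
# The field names follow the MCP spec; the tool and its parameters are made up.
create_github_issue = {
    "name": "create_github_issue",
    "description": "Create a new issue in a GitHub repository.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "repo": {"type": "string", "description": "owner/name of the repository"},
            "title": {"type": "string", "description": "Issue title"},
            "body": {"type": "string", "description": "Issue body in Markdown"},
        },
        "required": ["repo", "title"],
    },
}
```

Serialized into a prompt, an entry like this runs a few hundred tokens, which is exactly the per-tool overhead the sections below add up.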
The Ecosystem Explosion
By early 2025, you could spin up an AI agent with access to:
- Your entire codebase (via filesystem MCP)
- Your company's databases (via SQL MCP servers)
- Third-party APIs (GitHub, Linear, Figma, etc.)
- Cloud services (AWS, GCP, Azure)
- Custom internal tools
It's the AI equivalent of Tony Stark's workshop. The problem? Most agents were drowning in their own toolboxes.
Context Windows: The AI Agent's Working Memory
What Is a Context Window?
A language model's context window is its working memory—the amount of text it can "see" and reason about at once. Think of it as the model's short-term memory capacity or the size of its desk.
Modern frontier models have impressive context windows:
- GPT-4 Turbo: 128,000 tokens (~300 pages of text)
- Claude 3 Opus: 200,000 tokens (~500 pages)
- Gemini 1.5 Pro: 1,000,000 tokens (~2,500 pages)
A "token" is roughly 3-4 characters in English, so 128K tokens ≈ 96,000 words.
Sounds huge, right? It is—for reading. But here's the catch: everything that goes into the model consumes that precious space:
- System instructions: The agent's "personality" and behavioral rules (1-5K tokens)
- Conversation history: Previous messages and responses (grows over time)
- Tool definitions: JSON schemas for every available function (100-500 tokens per tool)
- Retrieved context: RAG results, file contents, search results (can be massive)
- The actual user query: The thing you wanted the agent to help with (variable)
Once you hit the context limit, the model either truncates old information (loses continuity) or refuses to process the request (fails outright).
Why Context Window Management Matters
Performance: Models slow down as context size grows. Self-attention cost scales quadratically with sequence length, so processing 100K tokens takes far more compute than processing 10K.
Cost: Most LLM APIs charge by the token. Input tokens (what you send) and output tokens (what the model generates) both count. Wasting 50,000 tokens on tool definitions you'll never use is like burning cash.
Accuracy: The "needle in a haystack" problem—models struggle to attend to relevant information when buried in massive context. Empirical research shows degradation beyond 30-50K tokens for complex reasoning tasks.
Latency: More tokens = slower responses. Users notice. A 2-second response feels snappy; a 15-second delay feels broken.
The Math Gets Ugly Fast
Let's say you connect an AI agent to 10 MCP servers, each exposing 20 tools. That's 200 tools total.
- Average tool definition: ~300 tokens (function name, parameters, descriptions, examples)
- Total overhead: 60,000 tokens just for tool definitions
You've consumed nearly half of GPT-4 Turbo's 128K context window before the agent even reads the user's request. Add conversation history, retrieved documents, and system instructions, and you're choking the model before it can think.
Now imagine 50 MCP servers with 500 tools. You're instantly over the context limit. The agent can't even start.
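For the skeptical, here is the arithmetic as a quick sanity check, using the rough per-tool average from above:

```python
# Back-of-the-envelope estimate of context consumed by tool definitions alone.
# The per-tool token count is a rough average, not a measured value.
TOKENS_PER_TOOL = 300       # name, parameters, descriptions, examples
CONTEXT_WINDOW = 128_000    # e.g., GPT-4 Turbo

for num_tools in (200, 500):
    overhead = num_tools * TOKENS_PER_TOOL
    share = overhead / CONTEXT_WINDOW
    print(f"{num_tools} tools -> {overhead:,} tokens ({share:.0%} of a 128K window)")

# 200 tools -> 60,000 tokens (47% of a 128K window)
# 500 tools -> 150,000 tokens (117% of a 128K window)  # already over the limit
```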
The Mind Intoxication Problem: How MCP Servers Self-Sabotage
Tool Definition Overload: Death by a Thousand Schemas
The Anthropic engineering team discovered a fundamental flaw in most MCP client implementations: they load all tool definitions upfront, directly into the model's context window.
This is the buffet problem from the intro. Instead of discovering tools on-demand ("What do I need for this task?"), the agent is force-fed the entire menu. The result:
- Agents with 1,000+ available tools must process hundreds of thousands of tokens before reading a user's request
- The model's "attention budget" is blown on irrelevant tool schemas
- Reasoning quality degrades—the agent forgets what it's supposed to do
- Costs spiral out of control
Real-world example: An agent connected to a filesystem MCP server (with tools for reading, writing, listing, deleting files), a GitHub server (20+ tools for repos, issues, PRs), a database server (query, insert, update, delete), and a few API wrappers could easily hit 50-100 tool definitions. That's 15,000-30,000 tokens of overhead—just for describing capabilities it might never use.
Intermediate Result Consumption: The Double-Tax Problem
But wait, there's more pain.
Every time an AI agent calls a tool, the result flows back through the model's context window. This is fine for small results ("The current temperature is 72°F"). It's a disaster for large ones.
Scenario: An agent needs to retrieve a 20-page document from one system and upload it to another.
Naive MCP flow:
- Agent calls read_document(doc_id) → returns 20 pages of text (30K tokens)
- That text gets added to the context window (so the agent can "see" it)
- Agent calls upload_document(content) → passes the same 30K tokens back in as input
- The upload result (success message, new doc ID) gets added to context
Token consumption: ~60,000 tokens (the document passed through context twice) plus tool definitions, conversation history, etc.
The Anthropic team observed agents blowing through 150,000 tokens for workflows that should cost a fraction. They called this the "intermediate result consumption" problem—data that's only needed for machine-to-machine handoffs gets unnecessarily routed through the LLM's precious cognitive workspace.
The "Mind Intoxication" Effect
Combining tool overload and intermediate result bloat creates what we might call cognitive saturation or, more colorfully, mind intoxication.
The symptoms:
- Distraction: The agent forgets the user's original request because it's buried under tool schemas
- Hallucination: Overwhelmed by irrelevant context, the model "guesses" instead of reasoning carefully
- Slowness: 10-20 second response times because the model is churning through bloated context
- Cost explosion: requests that should cost $0.01-0.10 end up costing a dollar or more
- Failure: The agent hits context limits and can't proceed
It's a self-inflicted denial-of-service attack. More capabilities → worse performance. The irony is painful.
Why Did This Happen?
It's an architecture pattern borrowed from early chatbot designs: "Load all tools into the prompt so the model can see its options." That worked fine when agents had 5-10 tools. It breaks catastrophically at 100+.
The community moved fast ("Let's connect agents to everything!") without revisiting the foundational assumption ("Load all tools upfront"). Classic scaling failure—what works at 10× doesn't work at 100×.
Anthropic's Solution: Code Execution with MCP
The Big Idea: Tools as a Filesystem
Anthropic's engineering team asked a deceptively simple question: What if agents discovered tools on-demand instead of loading everything upfront?
Their solution: Organize MCP tools as a filesystem hierarchy and give agents the ability to write code to explore and interact with them.
Instead of this:
SYSTEM: You have access to 500 tools: [massive JSON schema dump]
USER: Please check my calendar for tomorrow.
AGENT: [drowning in tool definitions] Um... let me think...
Do this:
SYSTEM: You can explore tools by reading /tools/ as a directory tree.
USER: Please check my calendar for tomorrow.
AGENT: [writes Python code]
import os
tools = os.listdir('/tools/calendar/')
# finds get_events, create_event, delete_event
# loads definition for get_events only
result = get_events(date='tomorrow')
The agent discovers tools lazily—only loads definitions for tools it actually needs, only when it needs them.
How It Works
- Tools exposed as filesystem paths: Each MCP server presents its tools as a directory structure (e.g., /tools/github/, /tools/database/, /tools/files/)
- Agent explores incrementally: The agent can ls /tools/ to see top-level categories, then drill down (e.g., ls /tools/github/ → sees create_issue, list_repos, get_pr_status)
- Just-in-time loading: The agent reads tool definitions only when needed (e.g., cat /tools/github/create_issue.json → gets the schema for that one function)
- Code execution environment: The agent writes Python (or another language) to interact with tools. The code runs in a secure sandbox with access to MCP servers as importable libraries or API clients
- Data stays out of context: Intermediate results live in the execution environment. A 20-page document retrieved from one MCP server can be passed directly to another without flowing through the model's context window (see the sketch below)
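Here is a minimal sketch of what that last point looks like in practice. The mcp_tools wrapper modules and function names are hypothetical stand-ins for whatever bindings the execution environment exposes for each server:

```python
# Hypothetical auto-generated wrappers for two MCP servers available
# inside the sandbox; the module and function names are illustrative.
from mcp_tools.docs import read_document
from mcp_tools.archive import upload_document

# The 20-page document stays inside the execution environment.
# Only the short confirmation below ever reaches the model's context.
content = read_document(doc_id="Q3-report")   # ~30K tokens of text, kept local
receipt = upload_document(content=content)

print(f"Uploaded as {receipt['id']} ({len(content)} characters transferred)")
```

The document's 30K tokens move between the two servers inside the sandbox; the model only sees the code it wrote and the one-line confirmation.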
Dramatic Efficiency Gains
Anthropic's case study showed:
- Before (naive MCP): 150,000 tokens for a document retrieval + upload workflow
- After (code execution): 2,000 tokens for the same workflow
- Savings: 98.7% reduction in token usage
That's not a typo. By keeping intermediate data out of the context window and loading only the relevant tool definitions, they cut token usage by a factor of 75, close to two orders of magnitude.
Cost impact: If GPT-4 Turbo charges $10 per million input tokens:
- Naive approach: 150K tokens = $1.50 per request
- Code execution approach: 2K tokens = $0.02 per request
At 10,000 requests/day, that's $15,000/day versus $200/day. The difference pays for an engineer's salary.
Secondary Benefits
Composability: Agents can write loops, filters, and data transformations in code instead of clunky tool chains. Example: "Analyze all customer reviews from last month" → the agent writes a loop to fetch reviews in batches, filters locally, and aggregates results—all without round-tripping through the LLM for every record.
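A sketch of that reviews example, assuming a hypothetical fetch_reviews wrapper with pagination parameters (none of these names come from a real MCP server):

```python
# Hypothetical MCP wrapper for a reviews service; names and parameters are illustrative.
from mcp_tools.reviews import fetch_reviews

# Fetch last month's reviews in batches and aggregate locally.
# Individual reviews never pass through the model's context window;
# only the small summary dictionary at the end does.
negative, total = [], 0
page = 0
while True:
    batch = fetch_reviews(month="2025-01", page=page, page_size=100)
    if not batch:
        break
    total += len(batch)
    negative.extend(r for r in batch if r["rating"] <= 2)
    page += 1

summary = {
    "total_reviews": total,
    "negative_reviews": len(negative),
    "negative_rate": round(len(negative) / max(total, 1), 3),
}
print(summary)  # this small dict is all the model needs to see
```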
Privacy: Sensitive intermediate data (PII, credentials, internal IDs) stays in the execution sandbox by default instead of passing through the model's context (and potentially training data).
Debugging: Code is easier to inspect than opaque tool call chains. You can see exactly what the agent did (read the code), replay it, and unit-test edge cases.
State persistence: The filesystem metaphor extends to state management—agents can write temporary files, cache results, and maintain progress across multi-step workflows.
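As a small illustration, here is a checkpointing helper an agent might write for itself; the scratch path and state structure are made up for the example:

```python
import json
import os

CHECKPOINT = "/tmp/agent_state/progress.json"  # illustrative scratch location

def load_progress() -> dict:
    """Resume from the last checkpoint if one exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"last_page": -1, "partial_results": []}

def save_progress(state: dict) -> None:
    """Persist progress so a later step (or a retry) can pick up where we left off."""
    os.makedirs(os.path.dirname(CHECKPOINT), exist_ok=True)
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)
```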
The Trade-Offs
Code execution isn't a free lunch. You need:
- Secure sandboxing: Execute untrusted agent-generated code safely (containers, VMs, WebAssembly)
- Resource limits: Cap CPU, memory, disk, and network usage to prevent runaway agents
- Monitoring: Track execution time, errors, and suspicious patterns (e.g., infinite loops, attempted sandbox escapes)
- Operational complexity: Running code execution infrastructure is harder than REST API calls
For complex, high-value workflows, the trade-off is worth it. For simple chatbots, maybe not.
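To make the resource-limits point slightly more concrete, here is a bare-bones sketch that runs agent-generated Python in a separate process with a wall-clock timeout. A production sandbox would layer containers, filesystem and network isolation, and memory/CPU caps on top of this, so treat it as a starting point, not a design:

```python
import subprocess
import sys
import tempfile

def run_agent_code(code: str, timeout_s: int = 10) -> subprocess.CompletedProcess:
    """Run agent-generated Python in a separate process with a wall-clock timeout.

    Only a sketch: real sandboxes add containerization, filesystem and network
    isolation, and memory/CPU limits on top of a simple timeout.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    return subprocess.run(
        [sys.executable, "-I", path],  # -I: isolated mode, ignores user site-packages and env vars
        capture_output=True,
        text=True,
        timeout=timeout_s,             # raises subprocess.TimeoutExpired on runaway code
    )
```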
Practical Takeaways
When to Use Code Execution with MCP
Good fits:
- High tool count: 50+ available tools across multiple MCP servers
- Large data transfers: Documents, datasets, or API responses that would bloat the context window
- Multi-step workflows: ETL pipelines, data analysis, complex automations
- Cost sensitivity: High request volume where token costs matter
Overkill for:
- Simple agents: 5-10 tools, lightweight queries
- Real-time chat: Latency-sensitive apps where code execution overhead matters
- Minimal infrastructure teams: If you can't run secure sandboxes, stick to direct tool calls
Design Principles for MCP Efficiency
- Lazy loading: Don't send tool definitions until the agent needs them (sketched after this list)
- Hierarchical organization: Group tools by domain (calendar, email, files) so agents can explore incrementally
- Data locality: Process large datasets in the execution environment, not the model context
- Concise schemas: Write tight tool descriptions—every token counts
- Caching: Reuse tool definitions across requests (most frameworks support this)
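A minimal sketch of lazy loading plus caching, assuming the /tools/<domain>/<tool>.json layout described earlier (the helper names are illustrative):

```python
import json
from functools import lru_cache
from pathlib import Path

TOOLS_ROOT = Path("/tools")  # illustrative filesystem layout from the pattern above

def list_tools(domain: str) -> list[str]:
    """Discover tool names in one domain (e.g., 'github') without loading any schemas."""
    return [p.stem for p in (TOOLS_ROOT / domain).glob("*.json")]

@lru_cache(maxsize=256)
def load_tool_definition(domain: str, name: str) -> dict:
    """Load a single tool's schema on demand; lru_cache reuses it across calls."""
    return json.loads((TOOLS_ROOT / domain / f"{name}.json").read_text())
```

With something like this in place, an agent that touches three GitHub tools pays for three schemas, not the full catalog.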
Measuring Success
Track these metrics:
- Tokens per request: Aim to minimize, especially input tokens
- Tool definitions loaded: How many tools does the agent actually use vs. how many are available?
- Latency: Context size directly impacts response time
- Cost per request: Input + output tokens × your LLM's pricing
- Success rate: Did the agent complete the task correctly?
If your token usage scales linearly with the number of available tools (even when the agent uses 5% of them), you have a loading problem. Fix it.
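A tiny helper for spot-checking cost per request and the tool-load ratio; the prices and token counts below are illustrative, not benchmarks:

```python
# Quick cost-and-loading check for a single request; prices and counts are illustrative.
INPUT_PRICE_PER_M = 10.00    # $/1M input tokens (GPT-4 Turbo-class pricing, per the example above)
OUTPUT_PRICE_PER_M = 30.00   # $/1M output tokens (illustrative)

def request_metrics(input_tokens, output_tokens, tools_loaded, tools_available):
    cost = (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
    return {
        "cost_per_request_usd": round(cost, 4),
        "tool_load_ratio": round(tools_loaded / tools_available, 3),
    }

# Naive loading: 200 tool definitions in context, the agent used 3 of them.
print(request_metrics(input_tokens=150_000, output_tokens=1_000,
                      tools_loaded=200, tools_available=200))
# Lazy loading: only the 3 definitions the agent actually needed.
print(request_metrics(input_tokens=2_000, output_tokens=1_000,
                      tools_loaded=3, tools_available=200))
```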
The Bigger Picture: Context Windows Are Finite (For Now)
The Context Ceiling
Despite marketing hype about "million-token context windows," practical limits remain:
- Attention degradation: Models lose accuracy on complex reasoning tasks with massive context
- Cost scaling: Bigger context = proportionally higher costs
- Latency: Users won't wait 30 seconds for a response
- Memory constraints: Serving a 1M-token context requires gigabytes of GPU RAM—expensive and scarce
Code execution doesn't eliminate these limits, but it sidesteps them for a huge class of problems. It's the AI equivalent of "work smarter, not harder."
The Architectural Lesson
The MCP context window problem teaches a broader lesson: More capabilities don't automatically make agents better.
Bad architecture:
- Expose 1,000 tools → load all definitions → agent drowns
Good architecture:
- Expose 1,000 tools → organize hierarchically → agent discovers on-demand → agent stays focused
This mirrors decades of software engineering wisdom:
- Microservices: Decompose monoliths, load services on-demand
- Lazy evaluation: Don't compute results until needed
- Pagination: Don't fetch the entire database, fetch one page at a time
AI agents are software. The same principles apply.
Conclusion: Less Is More (When It's Smarter)
The MCP context window problem is a beautiful case study in unintended consequences. The community raced to give AI agents access to everything, and accidentally created a new bottleneck: information overload.
Anthropic's code execution solution is elegant because it respects a fundamental constraint: context windows are finite, and wasting them is expensive.
By letting agents write code to explore tools lazily and process data locally, we get:
- 98%+ reduction in token usage
- Faster responses
- Lower costs
- Better accuracy (less cognitive noise)
- More scalable architectures
The lesson for anyone building AI agents: Don't feed your agent the entire buffet upfront. Teach it to order from the menu.
As the MCP ecosystem matures, expect code execution patterns to become standard. Frameworks like LangChain, LlamaIndex, and AutoGPT are already experimenting with sandboxed execution environments. The future of AI agents isn't just "more tools"—it's smarter tool discovery.
And maybe, just maybe, we'll stop accidentally intoxicating our own AI systems with too much of a good thing.
Sources: Anthropic Engineering Blog ("Code Execution with MCP"), Model Context Protocol specification (November 2024), LangChain/LangGraph documentation, OpenAI API pricing, and empirical benchmarks from production MCP deployments.