MCP Servers and the Context Window Problem: When AI Agents Suffer from Information Overload
Introduction: The AI Agent That Couldn't See Straight
Picture this: You're at an all-you-can-eat buffet. There's sushi, tacos, pizza, curry, and seventeen different desserts. Sounds amazing, right? Now imagine someone forcing you to take a bite of everything before you can sit down and actually eat. By the time you get to your table, you're too stuffed and overwhelmed to enjoy anything, and you've forgotten which dishes you actually wanted in the first place.
Welcome to the world of AI agents connected to multiple MCP servers.
In late 2024, Anthropic launched the Model Context Protocol (MCP)—an open standard that lets AI agents connect to external systems and tools. The community went wild. Within months, thousands of MCP servers popped up like mushrooms after rain. GitHub integrations! Database connectors! File systems! API wrappers! Calendar managers! The toolbox explosion was magnificent.
But here's the hilarious (and expensive) plot twist: Most MCP implementations shot themselves in the foot. They gave AI agents access to hundreds or thousands of tools by loading all tool definitions directly into the model's context window before the agent even read the user's request. Imagine hiring a Swiss Army knife expert, but before they can help you open a bottle, you force them to memorize the instruction manual for every blade, screwdriver, and corkscrew—including the weird fish scaler nobody uses.
The result? Agents consuming hundreds of thousands of tokens before processing a single query. Performance tanked. Costs skyrocketed. And the agents, overwhelmed by the cognitive equivalent of information poisoning, started making dumb mistakes. We accidentally invented "mind intoxication"—a self-inflicted denial-of-service attack where more capabilities made agents less capable.
The good news? Anthropic's engineering team figured out a solution that's so elegant it feels obvious in hindsight: let agents write code to interact with MCP servers instead of shoving everything into their brains upfront. It's like teaching someone to use a library instead of making them memorize the card catalog.
Let's unpack how we got here, why context windows matter more than you think, and how code execution is changing the game.
What Are MCP Servers, Really?
The Promise: Plug-and-Play Superpowers
The Model Context Protocol is Anthropic's answer to a fundamental AI problem: How do we let language models interact with the real world without building custom integrations for every single tool?
Before MCP, if you wanted an AI agent to check your calendar, query a database, read files, and send emails, you'd write bespoke code for each integration. Every tool required custom glue logic, authentication handling, and prompt engineering. It was the software equivalent of hand-crafting individual adapters for every outlet in your house.
MCP standardized the plug. It's a protocol that lets developers build "MCP servers"—lightweight services that expose tools, resources, and data to AI agents in a consistent format. Think of it as USB-C for AI: one universal connector, infinite devices.
How MCP Servers Work
An MCP server is a JSON-RPC service that exposes:
- Tools: Functions the agent can call (e.g., create_github_issue, query_database, send_email)
- Resources: Read-only data sources (e.g., file contents, API schemas, documentation)
- Prompts: Pre-built templates or workflows the agent can invoke
When an AI agent (the "MCP client") connects to these servers, it receives a manifest of available capabilities. In theory, the agent can then intelligently choose which tools to use based on the user's request.
The protocol is transport-agnostic (works over stdio, HTTP, WebSockets) and language-agnostic (servers can be written in any language). Since its November 2024 launch, adoption has been explosive—the community has built MCP servers for Slack, Notion, Jira, PostgreSQL, file systems, web scrapers, and more.
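To make "a consistent format" concrete, here is a rough sketch of a single tool entry as a server might expose it in its tool listing. The field names (name, description, inputSchema) follow the MCP tool-listing format; the tool itself and its parameters are hypothetical:

```python
# Illustrative shape of one tool entry in an MCP server's tool listing.
# The field names follow the MCP spec; the tool and its parameters are made up.
create_github_issue = {
    "name": "create_github_issue",
    "description": "Create a new issue in a GitHub repository.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "repo": {"type": "string", "description": "owner/name of the repository"},
            "title": {"type": "string", "description": "Issue title"},
            "body": {"type": "string", "description": "Issue body in Markdown"},
        },
        "required": ["repo", "title"],
    },
}
```

Serialized into a prompt, an entry like this runs a few hundred tokens, which is exactly the per-tool overhead the sections below add up.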
The Ecosystem Explosion
By early 2025, you could spin up an AI agent with access to:
- Your entire codebase (via filesystem MCP)
- Your company's databases (via SQL MCP servers)
- Third-party APIs (GitHub, Linear, Figma, etc.)
- Cloud services (AWS, GCP, Azure)
- Custom internal tools
It's the AI equivalent of Tony Stark's workshop. The problem? Most agents were drowning in their own toolboxes.
Context Windows: The AI Agent's Working Memory
What Is a Context Window?
A language model's context window is its working memory—the amount of text it can "see" and reason about at once. Think of it as the model's short-term memory capacity or the size of its desk.
Modern frontier models have impressive context windows:
- GPT-4 Turbo: 128,000 tokens (~300 pages of text)
- Claude 3 Opus: 200,000 tokens (~500 pages)
- Gemini 1.5 Pro: 1,000,000 tokens (~2,500 pages)
A "token" is roughly 3-4 characters in English, so 128K tokens ≈ 96,000 words.
Sounds huge, right? It is—for reading. But here's the catch: everything that goes into the model consumes that precious space:
- System instructions: The agent's "personality" and behavioral rules (1-5K tokens)
- Conversation history: Previous messages and responses (grows over time)
- Tool definitions: JSON schemas for every available function (100-500 tokens per tool)
- Retrieved context: RAG results, file contents, search results (can be massive)
- The actual user query: The thing you wanted the agent to help with (variable)
Once you hit the context limit, the model either truncates old information (loses continuity) or refuses to process the request (fails outright).
Why Context Window Management Matters
Performance: Models slow down as context size grows. Self-attention cost scales quadratically with sequence length, so processing 100K tokens takes far more compute than processing 10K.
Cost: Most LLM APIs charge by the token. Input tokens (what you send) and output tokens (what the model generates) both count. Wasting 50,000 tokens on tool definitions you'll never use is like burning cash.
Accuracy: The "needle in a haystack" problem—models struggle to attend to relevant information when buried in massive context. Empirical research shows degradation beyond 30-50K tokens for complex reasoning tasks.
Latency: More tokens = slower responses. Users notice. A 2-second response feels snappy; a 15-second delay feels broken.
The Math Gets Ugly Fast
Let's say you connect an AI agent to 10 MCP servers, each exposing 20 tools. That's 200 tools total.
- Average tool definition: ~300 tokens (function name, parameters, descriptions, examples)
- Total overhead: 60,000 tokens just for tool definitions
You've consumed nearly half of GPT-4 Turbo's 128K context window before the agent even reads the user's request. Add conversation history, retrieved documents, and system instructions, and you're choking the model before it can think.
Now imagine 50 MCP servers with 500 tools. You're instantly over the context limit. The agent can't even start.
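For the skeptical, here is the arithmetic as a quick sanity check, using the rough per-tool average from above:

```python
# Back-of-the-envelope estimate of context consumed by tool definitions alone.
# The per-tool token count is a rough average, not a measured value.
TOKENS_PER_TOOL = 300       # name, parameters, descriptions, examples
CONTEXT_WINDOW = 128_000    # e.g., GPT-4 Turbo

for num_tools in (200, 500):
    overhead = num_tools * TOKENS_PER_TOOL
    share = overhead / CONTEXT_WINDOW
    print(f"{num_tools} tools -> {overhead:,} tokens ({share:.0%} of a 128K window)")

# 200 tools -> 60,000 tokens (47% of a 128K window)
# 500 tools -> 150,000 tokens (117% of a 128K window)  # already over the limit
```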
The Mind Intoxication Problem: How MCP Servers Self-Sabotage
Tool Definition Overload: Death by a Thousand Schemas
The Anthropic engineering team discovered a fundamental flaw in most MCP client implementations: they load all tool definitions upfront, directly into the model's context window.
This is the buffet problem from the intro. Instead of discovering tools on-demand ("What do I need for this task?"), the agent is force-fed the entire menu. The result:
- Agents with 1,000+ available tools must process hundreds of thousands of tokens before reading a user's request
- The model's "attention budget" is blown on irrelevant tool schemas
- Reasoning quality degrades—the agent forgets what it's supposed to do
- Costs spiral out of control
Real-world example: An agent connected to a filesystem MCP server (with tools for reading, writing, listing, deleting files), a GitHub server (20+ tools for repos, issues, PRs), a database server (query, insert, update, delete), and a few API wrappers could easily hit 50-100 tool definitions. That's 15,000-30,000 tokens of overhead—just for describing capabilities it might never use.
Intermediate Result Consumption: The Double-Tax Problem
But wait, there's more pain.
Every time an AI agent calls a tool, the result flows back through the model's context window. This is fine for small results ("The current temperature is 72°F"). It's a disaster for large ones.
Scenario: An agent needs to retrieve a 20-page document from one system and upload it to another.
Naive MCP flow:
- Agent calls read_document(doc_id) → returns 20 pages of text (30K tokens)
- That text gets added to the context window (so the agent can "see" it)
- Agent calls upload_document(content) → passes the same 30K tokens back in as input
- The upload result (success message, new doc ID) gets added to context
Token consumption: ~60,000 tokens (the document passed through context twice) plus tool definitions, conversation history, etc.
The Anthropic team observed agents blowing through 150,000 tokens for workflows that should cost a fraction. They called this the "intermediate result consumption" problem—data that's only needed for machine-to-machine handoffs gets unnecessarily routed through the LLM's precious cognitive workspace.
The "Mind Intoxication" Effect
Combining tool overload and intermediate result bloat creates what we might call cognitive saturation or, more colorfully, mind intoxication.
The symptoms:
- Distraction: The agent forgets the user's original request because it's buried under tool schemas
- Hallucination: Overwhelmed by irrelevant context, the model "guesses" instead of reasoning carefully
- Slowness: 10-20 second response times because the model is churning through bloated context
- Cost explosion: requests that should cost $0.01-0.10 end up costing a dollar or more
- Failure: The agent hits context limits and can't proceed
It's a self-inflicted denial-of-service attack. More capabilities → worse performance. The irony is painful.
Why Did This Happen?
It's an architecture pattern borrowed from early chatbot designs: "Load all tools into the prompt so the model can see its options." That worked fine when agents had 5-10 tools. It breaks catastrophically at 100+.
The community moved fast ("Let's connect agents to everything!") without revisiting the foundational assumption ("Load all tools upfront"). Classic scaling failure—what works at 10× doesn't work at 100×.
Anthropic's Solution: Code Execution with MCP
The Big Idea: Tools as a Filesystem
Anthropic's engineering team asked a deceptively simple question: What if agents discovered tools on-demand instead of loading everything upfront?
Their solution: Organize MCP tools as a filesystem hierarchy and give agents the ability to write code to explore and interact with them.
Instead of this:
SYSTEM: You have access to 500 tools: [massive JSON schema dump]
USER: Please check my calendar for tomorrow.
AGENT: [drowning in tool definitions] Um... let me think...
Do this:
SYSTEM: You can explore tools by reading /tools/ as a directory tree.
USER: Please check my calendar for tomorrow.
AGENT: [writes Python code]
import os
tools = os.listdir('/tools/calendar/')
# finds get_events, create_event, delete_event
# loads definition for get_events only
result = get_events(date='tomorrow')
The agent discovers tools lazily—only loads definitions for tools it actually needs, only when it needs them.
How It Works
- Tools exposed as filesystem paths: Each MCP server presents its tools as a directory structure (e.g., /tools/github/, /tools/database/, /tools/files/)
- Agent explores incrementally: The agent can ls /tools/ to see top-level categories, then drill down (e.g., ls /tools/github/ → sees create_issue, list_repos, get_pr_status)
- Just-in-time loading: The agent reads tool definitions only when needed (e.g., cat /tools/github/create_issue.json → gets the schema for that one function)
- Code execution environment: The agent writes Python (or another language) to interact with tools. The code runs in a secure sandbox with access to MCP servers as importable libraries or API clients
- Data stays out of context: Intermediate results live in the execution environment. A 20-page document retrieved from one MCP server can be passed directly to another without flowing through the model's context window (see the sketch below)
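Here is a minimal sketch of what that last point looks like in practice. The mcp_tools wrapper modules and function names are hypothetical stand-ins for whatever bindings the execution environment exposes for each server:

```python
# Hypothetical auto-generated wrappers for two MCP servers available
# inside the sandbox; the module and function names are illustrative.
from mcp_tools.docs import read_document
from mcp_tools.archive import upload_document

# The 20-page document stays inside the execution environment.
# Only the short confirmation below ever reaches the model's context.
content = read_document(doc_id="Q3-report")   # ~30K tokens of text, kept local
receipt = upload_document(content=content)

print(f"Uploaded as {receipt['id']} ({len(content)} characters transferred)")
```

The document's 30K tokens move between the two servers inside the sandbox; the model only sees the code it wrote and the one-line confirmation.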
Dramatic Efficiency Gains
Anthropic's case study showed:
- Before (naive MCP): 150,000 tokens for a document retrieval + upload workflow
- After (code execution): 2,000 tokens for the same workflow
- Savings: 98.7% reduction in token usage
That's not a typo. By keeping intermediate data out of the context window and loading only the relevant tool definitions, they cut token usage by a factor of 75, close to two orders of magnitude.
Cost impact: If GPT-4 Turbo charges $10 per million input tokens:
- Naive approach: 150K tokens = $1.50 per request
- Code execution approach: 2K tokens = $0.02 per request
At 10,000 requests/day, that's $15,000/day versus $200/day. The difference pays for an engineer's salary.
Secondary Benefits
Composability: Agents can write loops, filters, and data transformations in code instead of clunky tool chains. Example: "Analyze all customer reviews from last month" → the agent writes a loop to fetch reviews in batches, filters locally, and aggregates results—all without round-tripping through the LLM for every record.
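A sketch of that reviews example, assuming a hypothetical fetch_reviews wrapper with pagination parameters (none of these names come from a real MCP server):

```python
# Hypothetical MCP wrapper for a reviews service; names and parameters are illustrative.
from mcp_tools.reviews import fetch_reviews

# Fetch last month's reviews in batches and aggregate locally.
# Individual reviews never pass through the model's context window;
# only the small summary dictionary at the end does.
negative, total = [], 0
page = 0
while True:
    batch = fetch_reviews(month="2025-01", page=page, page_size=100)
    if not batch:
        break
    total += len(batch)
    negative.extend(r for r in batch if r["rating"] <= 2)
    page += 1

summary = {
    "total_reviews": total,
    "negative_reviews": len(negative),
    "negative_rate": round(len(negative) / max(total, 1), 3),
}
print(summary)  # this small dict is all the model needs to see
```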
Privacy: Sensitive intermediate data (PII, credentials, internal IDs) stays in the execution sandbox by default instead of passing through the model's context (and potentially training data).
Debugging: Code is easier to inspect than opaque tool call chains. You can see exactly what the agent did (read the code), replay it, and unit-test edge cases.
State persistence: The filesystem metaphor extends to state management—agents can write temporary files, cache results, and maintain progress across multi-step workflows.
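As a small illustration, here is a checkpointing helper an agent might write for itself; the scratch path and state structure are made up for the example:

```python
import json
import os

CHECKPOINT = "/tmp/agent_state/progress.json"  # illustrative scratch location

def load_progress() -> dict:
    """Resume from the last checkpoint if one exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"last_page": -1, "partial_results": []}

def save_progress(state: dict) -> None:
    """Persist progress so a later step (or a retry) can pick up where we left off."""
    os.makedirs(os.path.dirname(CHECKPOINT), exist_ok=True)
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)
```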
The Trade-Offs
Code execution isn't a free lunch. You need:
- Secure sandboxing: Execute untrusted agent-generated code safely (containers, VMs, WebAssembly)
- Resource limits: Cap CPU, memory, disk, and network usage to prevent runaway agents
- Monitoring: Track execution time, errors, and suspicious patterns (e.g., infinite loops, attempted sandbox escapes)
- Operational complexity: Running code execution infrastructure is harder than REST API calls
For complex, high-value workflows, the trade-off is worth it. For simple chatbots, maybe not.
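To make the resource-limits point slightly more concrete, here is a bare-bones sketch that runs agent-generated Python in a separate process with a wall-clock timeout. A production sandbox would layer containers, filesystem and network isolation, and memory/CPU caps on top of this, so treat it as a starting point, not a design:

```python
import subprocess
import sys
import tempfile

def run_agent_code(code: str, timeout_s: int = 10) -> subprocess.CompletedProcess:
    """Run agent-generated Python in a separate process with a wall-clock timeout.

    Only a sketch: real sandboxes add containerization, filesystem and network
    isolation, and memory/CPU limits on top of a simple timeout.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    return subprocess.run(
        [sys.executable, "-I", path],  # -I: isolated mode, ignores user site-packages and env vars
        capture_output=True,
        text=True,
        timeout=timeout_s,             # raises subprocess.TimeoutExpired on runaway code
    )
```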
Practical Takeaways
When to Use Code Execution with MCP
Good fits:
- High tool count: 50+ available tools across multiple MCP servers
- Large data transfers: Documents, datasets, or API responses that would bloat the context window
- Multi-step workflows: ETL pipelines, data analysis, complex automations
- Cost sensitivity: High request volume where token costs matter
Overkill for:
- Simple agents: 5-10 tools, lightweight queries
- Real-time chat: Latency-sensitive apps where code execution overhead matters
- Minimal infrastructure teams: If you can't run secure sandboxes, stick to direct tool calls
Design Principles for MCP Efficiency
- Lazy loading: Don't send tool definitions until the agent needs them (sketched after this list)
- Hierarchical organization: Group tools by domain (calendar, email, files) so agents can explore incrementally
- Data locality: Process large datasets in the execution environment, not the model context
- Concise schemas: Write tight tool descriptions—every token counts
- Caching: Reuse tool definitions across requests (most frameworks support this)
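A minimal sketch of lazy loading plus caching, assuming the /tools/<domain>/<tool>.json layout described earlier (the helper names are illustrative):

```python
import json
from functools import lru_cache
from pathlib import Path

TOOLS_ROOT = Path("/tools")  # illustrative filesystem layout from the pattern above

def list_tools(domain: str) -> list[str]:
    """Discover tool names in one domain (e.g., 'github') without loading any schemas."""
    return [p.stem for p in (TOOLS_ROOT / domain).glob("*.json")]

@lru_cache(maxsize=256)
def load_tool_definition(domain: str, name: str) -> dict:
    """Load a single tool's schema on demand; lru_cache reuses it across calls."""
    return json.loads((TOOLS_ROOT / domain / f"{name}.json").read_text())
```

With something like this in place, an agent that touches three GitHub tools pays for three schemas, not the full catalog.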
Measuring Success
Track these metrics:
- Tokens per request: Aim to minimize, especially input tokens
- Tool definitions loaded: How many tools does the agent actually use vs. how many are available?
- Latency: Context size directly impacts response time
- Cost per request: Input + output tokens × your LLM's pricing
- Success rate: Did the agent complete the task correctly?
If your token usage scales linearly with the number of available tools (even when the agent uses 5% of them), you have a loading problem. Fix it.
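A tiny helper for spot-checking cost per request and the tool-load ratio; the prices and token counts below are illustrative, not benchmarks:

```python
# Quick cost-and-loading check for a single request; prices and counts are illustrative.
INPUT_PRICE_PER_M = 10.00    # $/1M input tokens (GPT-4 Turbo-class pricing, per the example above)
OUTPUT_PRICE_PER_M = 30.00   # $/1M output tokens (illustrative)

def request_metrics(input_tokens, output_tokens, tools_loaded, tools_available):
    cost = (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
    return {
        "cost_per_request_usd": round(cost, 4),
        "tool_load_ratio": round(tools_loaded / tools_available, 3),
    }

# Naive loading: 200 tool definitions in context, the agent used 3 of them.
print(request_metrics(input_tokens=150_000, output_tokens=1_000,
                      tools_loaded=200, tools_available=200))
# Lazy loading: only the 3 definitions the agent actually needed.
print(request_metrics(input_tokens=2_000, output_tokens=1_000,
                      tools_loaded=3, tools_available=200))
```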
The Bigger Picture: Context Windows Are Finite (For Now)
The Context Ceiling
Despite marketing hype about "million-token context windows," practical limits remain:
- Attention degradation: Models lose accuracy on complex reasoning tasks with massive context
- Cost scaling: Bigger context = proportionally higher costs
- Latency: Users won't wait 30 seconds for a response
- Memory constraints: Serving a 1M-token context requires gigabytes of GPU RAM—expensive and scarce
Code execution doesn't eliminate these limits, but it sidesteps them for a huge class of problems. It's the AI equivalent of "work smarter, not harder."
The Architectural Lesson
The MCP context window problem teaches a broader lesson: More capabilities don't automatically make agents better.
Bad architecture:
- Expose 1,000 tools → load all definitions → agent drowns
Good architecture:
- Expose 1,000 tools → organize hierarchically → agent discovers on-demand → agent stays focused
This mirrors decades of software engineering wisdom:
- Microservices: Decompose monoliths, load services on-demand
- Lazy evaluation: Don't compute results until needed
- Pagination: Don't fetch the entire database, fetch one page at a time
AI agents are software. The same principles apply.
Conclusion: Less Is More (When It's Smarter)
The MCP context window problem is a beautiful case study in unintended consequences. The community raced to give AI agents access to everything, and accidentally created a new bottleneck: information overload.
Anthropic's code execution solution is elegant because it respects a fundamental constraint: context windows are finite, and wasting them is expensive.
By letting agents write code to explore tools lazily and process data locally, we get:
- 98%+ reduction in token usage
- Faster responses
- Lower costs
- Better accuracy (less cognitive noise)
- More scalable architectures
The lesson for anyone building AI agents: Don't feed your agent the entire buffet upfront. Teach it to order from the menu.
As the MCP ecosystem matures, expect code execution patterns to become standard. Frameworks like LangChain, LlamaIndex, and AutoGPT are already experimenting with sandboxed execution environments. The future of AI agents isn't just "more tools"—it's smarter tool discovery.
And maybe, just maybe, we'll stop accidentally intoxicating our own AI systems with too much of a good thing.
Sources: Anthropic Engineering Blog ("Code Execution with MCP"), Model Context Protocol specification (November 2024), LangChain/LangGraph documentation, OpenAI API pricing, and empirical benchmarks from production MCP deployments.