
Building MCP Servers That Models Can Actually Use

Model Context Protocol has moved fast, but implementation quality lags behind adoption. This post covers tool design and context quality: the two areas where most MCP servers fail.

Craig Tracey

Part 1 of 2. Part 2 covers security, authorization, and observability.

Model Context Protocol has moved fast. In the span of a few months it went from an Anthropic research artifact to the de facto standard for connecting AI agents to external systems. Adoption is accelerating. Implementation quality is not.

The most common failure is not technical. It is a design assumption: that building an MCP server is mostly a matter of wrapping existing infrastructure. You expose your APIs, define some schemas, write descriptions. The server works. The model cannot use it.

This post covers the two areas where that assumption breaks down hardest: tool design and context quality. Get these right and you have a server worth operating. Get them wrong and no amount of prompt engineering will fix it.

If you are new to agent architecture, Building AI Agents: The Fundamentals is a good starting point before diving into MCP-specific design.


Too many tools is a reliability problem, not a UX problem

There is a tempting pattern in early MCP development: expose everything. Every API endpoint becomes a tool. Every database table gets a query tool. The schema is comprehensive. The server looks complete.

The model cannot use it reliably.

We tested this directly with Boundary, an open-source eval harness for MCP. As we wrote in We Gave LLMs 150 Tools: Here's What Broke, the results were consistent across six models from three providers: accuracy degrades as tool count increases. Not linearly. It degrades unevenly, with certain tools becoming effectively invisible as the tool list grows. A tool at position 40 in a 50-tool server will underperform compared to the same tool at position 5 in a 10-tool server, even when it is the most relevant tool for the task.

The fix is not clever. Keep tool count low. If you have more than 15-20 tools in a single server, you have an architecture problem. Split by domain. Use dynamic tool registration where the context determines which tools are surfaced. Treat tool count as a first-class reliability metric.
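Dynamic tool surfacing can be sketched in a few lines. This is illustrative only: `TOOLS_BY_DOMAIN`, `surface_tools`, and the domain names are hypothetical, not part of any MCP SDK.

```python
# Sketch: surface only the tools relevant to the current task domain,
# keeping the per-session tool list small. All names are illustrative.
TOOLS_BY_DOMAIN = {
    "incidents": ["get_incident_context", "acknowledge_incident"],
    "deployments": ["get_deployment_history", "rollback_deployment"],
    "ownership": ["look_up_service_owner"],
}

MAX_TOOLS = 15  # treat tool count as a reliability budget, not a limit to hit

def surface_tools(active_domains):
    """Return the tool names to register for this session."""
    tools = [t for d in active_domains for t in TOOLS_BY_DOMAIN.get(d, [])]
    if len(tools) > MAX_TOOLS:
        raise ValueError(f"{len(tools)} tools exceeds budget of {MAX_TOOLS}")
    return tools
```

The point is the budget check: tool count becomes something you enforce, not something you discover in an eval.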

You cannot just map your API to tools

This is the most common reason tool count spirals, and it causes problems beyond just count.

REST APIs are built around a resource model. One endpoint, one entity, one operation. GET /user/{id}. PATCH /ticket/{id}. That design is deliberate. It keeps APIs composable and predictable for deterministic callers.

Agents are not deterministic callers, and they do not think in resources.

An agent resolving an incident is thinking about a situation: what broke, what changed, who owns it, what similar failures looked like. That maps to a dozen resources across at least as many endpoints. If your tools are scoped to single-resource operations, the agent has to make a dozen sequential calls, interpret each response, and mentally join the results. Each step is a decision point. Errors compound. Context gets lost across the chain.

APIs also encode assumptions that are invisible to the model. The developer knows that create_record requires a prior call to initialize_session, or that type=3 means "archived," or that filter_by_owner expects a user ID that comes from a different tool in a non-obvious way. The model does not know any of this. It will guess. It will guess confidently. It will often be wrong.

Batch tools are the right primitive

What agents actually need are tools that fetch across resources in a single call and return a coherent, pre-joined result. Not get_service, get_recent_deployments, get_open_alerts, but get_incident_context that returns all of it, structured for the task.
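The shape of a batch tool is easy to sketch. The `fetch_*` helpers below stand in for real data access and return canned values; everything here is illustrative, not a reference implementation.

```python
# Sketch of a batch tool: one call, one pre-joined result.
# The fetch_* helpers stand in for real data access (canned data here).
def fetch_service(service_id):
    return {"id": service_id, "owner_team": "payments"}

def fetch_deployments(service_id, limit=5):
    return [{"sha": "abc123", "deployed_at": "2025-01-10T12:00:00Z"}]

def fetch_alerts(service_id):
    return [{"alert": "error_rate_high", "since": "2025-01-10T12:05:00Z"}]

def get_incident_context(service_id):
    """One tool call returns everything the agent needs, already joined."""
    service = fetch_service(service_id)
    return {
        "service": service,
        "owner_team": service["owner_team"],
        "recent_deployments": fetch_deployments(service_id),
        "open_alerts": fetch_alerts(service_id),
    }
```

The joins happen in your code, where they are deterministic, instead of in the model's context window, where they are not.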

This is not a UX convenience. It is a reliability intervention. Fewer tool calls means fewer decision points, fewer places for the model to go wrong, and less compounding error across a task. A tool that does meaningful work in one call consistently outperforms three thin tools that together do the same work.

Building batch tools also forces the right design question: what does the agent actually need for this task? That question produces better tools than "what does our API expose?"

The corollary: even with a small tool count, if each tool only returns a single resource, the agent is still doing the joining work. That is where multi-step failures accumulate. Batch design and low tool count reinforce each other.

GraphQL and graph query models as the backend primitive

If you are designing the data layer that backs your MCP server, GraphQL is worth serious consideration, not because it is fashionable but because its model fits what agents need.

A GraphQL query is declarative. You describe the shape of data you want across types, relationships, and depth, and get back exactly that shape. For an agent that needs to understand a service, its owners, its recent changes, and its dependencies, a single well-formed query returns everything. No sequential calls. No client-side joins.
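As a sketch, one declarative query can replace the whole chain of single-resource calls. The schema here (`service`, `owners`, `deployments`, `dependsOn`) is hypothetical; only the request shape (query plus variables) follows the standard GraphQL convention.

```python
# One declarative query replaces a chain of single-resource REST calls.
# The schema fields below are illustrative, not a real API.
INCIDENT_CONTEXT_QUERY = """
query IncidentContext($serviceId: ID!) {
  service(id: $serviceId) {
    name
    owners { team oncall }
    deployments(last: 5) { sha deployedAt }
    dependsOn { name status }
  }
}
"""

def build_request(service_id):
    """Standard GraphQL POST body: the query plus its variables."""
    return {
        "query": INCIDENT_CONTEXT_QUERY,
        "variables": {"serviceId": service_id},
    }
```

The response mirrors the query's shape, so the tool implementation does no joining at all.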

The same logic applies to anything with a graph-aware query model: Datalog, SPARQL, property graph queries over something like Apache AGE. The specific technology matters less than the capability: express a multi-resource, relationship-aware query and get back a structured result the model can actually use.

Contrast this with REST, where getting the same data requires request chaining, and the joining logic lives in your tool implementation or, worse, gets delegated back to the model. That is work the model should not have to do.

Design tools around agent tasks. Choose a backend query model that can satisfy those tools without forcing artificial decomposition into single-resource hops. REST APIs will often be the system of record underneath. That is fine. Do not let the REST resource model dictate your tool shape. If your underlying APIs cannot satisfy batch tools directly, a graph query layer between your tools and APIs is worth the investment.

Naming is a form of prompting

Tool names and descriptions are not documentation. They are instructions. The model reads them to decide what to call. If your tool is named get_record and described as "retrieves a record," you have given the model nothing to work with. It will guess wrong.

Name tools for intent, not implementation. get_customer_by_id is a database operation. look_up_customer_account is something an agent does. The second version gives the model a semantic hook.

Descriptions should state what the tool does, what it needs, and when to use it. Keep them under 100 words. Anything longer is noise the model will partially ignore.
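Concretely, a description that follows those three rules might look like the sketch below. The tool and its fields are hypothetical, shown only to make the pattern visible.

```python
# Sketch: an intent-named tool whose description states what it does,
# what it needs, and when to use it. The tool itself is hypothetical.
LOOK_UP_CUSTOMER_TOOL = {
    "name": "look_up_customer_account",
    "description": (
        "Look up a customer's account, including plan, status, and "
        "recent support tickets. Accepts an email address or account ID. "
        "Use this before taking any billing or support action on a customer."
    ),
}

def description_word_count(tool):
    return len(tool["description"].split())
```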

Think of tool design as closer to designing a good CLI than a good API. Commands do meaningful work. They are named for intent. They accept human-readable inputs and return human-readable outputs. We explored this tension further in MCP vs CLI: You're Asking the Wrong Question.

Schema design determines whether context actually flows

Input schemas are where most MCP developers stop thinking about the model entirely. They write JSON Schema that accurately represents the data structure and call it done.

The problem: optional fields with no description, enum values with no context, required fields with names that only make sense if you already know the internal system. The model will fill these fields based on what the names suggest. If your field is named type and accepts values A, B, and C, the model will guess. It will guess confidently. It will often be wrong.

Every field needs a description. Every enum needs human-readable values or a description that maps them. Mark required fields as actually required. Do not use internal abbreviations or system codes as field values. The schema is a contract between your system and the model. Write it like the model has to read it cold, because it does.
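Here is a sketch of a schema written for a model reading it cold. The tool and field names are illustrative; the point is that every field and every enum value carries a description in plain language.

```python
# Sketch: an input schema the model can read cold. Every field and enum
# is described; values are human-readable words, not system codes.
CREATE_TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {
            "type": "string",
            "description": "One-line summary of the issue.",
        },
        "status": {
            "type": "string",
            "enum": ["open", "in_progress", "archived"],
            "description": (
                "Lifecycle state. Use 'archived' only for tickets "
                "that will not be worked on."
            ),
        },
    },
    "required": ["title", "status"],
}
```

Compare this with `"type": {"enum": [1, 2, 3]}` and no description: same data structure, zero usable signal for the model.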


Garbage in, garbage out applies harder here

MCP is a context delivery mechanism. What you put in determines what the model can reason about. Most implementations focus almost entirely on the tool layer and treat context as an afterthought.

The typical failure looks like this: a tool returns a blob of semi-structured text pulled from a database, the model extracts what it can, and the result is approximately correct. In demos, approximately correct looks like it is working. In production, approximately correct compounds across multi-step tasks until the final output is confidently wrong.

Return structured data. Use consistent field names across tools. Include metadata the model needs for reasoning: timestamps, confidence scores, relationship identifiers, status values. Do not return raw text when you can return typed fields.
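A structured return might look like the sketch below. The values are canned and the field names are hypothetical; what matters is that each piece of reasoning metadata is a typed field, not a sentence in a blob.

```python
# Sketch: return typed fields with reasoning metadata, not a text blob.
# Values are canned; field names are illustrative but consistent
# (service_id, status, as_of) across tools.
def get_service_status(service_id):
    return {
        "service_id": service_id,
        "status": "degraded",
        "as_of": "2025-01-10T12:07:00Z",    # timestamp for staleness checks
        "related_incident_id": "INC-4812",  # relationship identifier
        "confidence": 0.92,                 # how sure the upstream check is
    }
```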

Static context is usually wrong by the time it is used

A common pattern in early MCP implementations: a resource that returns a snapshot of organizational state, loaded at startup or cached for an hour. The model treats it as ground truth. It is not.

Context needs to reflect current state. Not real-time in every case, but current enough that the model is not reasoning from stale data. For anything involving people, teams, systems, or processes, the context should be live or near-live.
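One way to enforce "current enough" is to make freshness explicit on every result, as in this sketch. The names and the five-minute budget are illustrative assumptions, not a prescription.

```python
import time

# Sketch: every context fetch records when it was current, so staleness
# is checkable instead of silent. Names and the budget are illustrative.
MAX_AGE_SECONDS = 300  # "near-live" budget for organizational data

def with_freshness(fetch, now=time.time):
    """Wrap a fetch so the result carries its own as_of timestamp."""
    return {"data": fetch(), "as_of": now(), "max_age_seconds": MAX_AGE_SECONDS}

def is_stale(result, now=time.time):
    return now() - result["as_of"] > result["max_age_seconds"]
```

A tool that checks `is_stale` before answering can refetch, or at minimum tell the model the data's age, instead of presenting an hour-old snapshot as ground truth.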

This is harder than it sounds. It requires that the underlying data be organized in a way that supports fast, structured retrieval. Most organizational data is not. It lives in disconnected systems, inconsistently labeled, without the relational structure needed to answer agent queries accurately. That is a data problem that sits upstream of MCP. Solving it at the tool layer does not work. We explored this directly in Internal Developer Portals vs Context Layers. IDPs were built for humans browsing catalogs, not agents that need queryable, live relationships.

Context is not just retrieval

There is a tendency to equate context delivery with retrieval augmented generation. Load documents, chunk them, embed them, search. MCP can do this. It should not always do this.

As we argued in Your RAG Passed Every Test and Failed Every User, most RAG failures are not retrieval failures but substrate failures. The organizational data underneath the retrieval layer is flat, static, and disconnected. Fixing the retrieval pipeline does not fix that.

For structured queries about organizational state, retrieval is the wrong primitive entirely. You want a system that can answer "what services does team X own," "who was on call when incident Y happened," or "what changed in service Z in the last 48 hours" without embedding a blob of Markdown and hoping the right chunk comes back. Those queries need a structured context layer, not a vector database.
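The difference is easiest to see in miniature. The in-memory table below is a stand-in for a real context layer; the point is that an ownership question has an exact answer, not a nearest-neighbor one.

```python
# Sketch: a structured org query answered from relational data, not
# embeddings. The in-memory table is a stand-in for a real context layer.
SERVICES = [
    {"name": "checkout", "owner_team": "payments"},
    {"name": "ledger", "owner_team": "payments"},
    {"name": "search", "owner_team": "discovery"},
]

def services_owned_by(team):
    """Answer 'what services does team X own' exactly, not approximately."""
    return [s["name"] for s in SERVICES if s["owner_team"] == team]
```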

Know which problems retrieval solves and which problems it does not.


Build it right first

The tool design and context quality decisions you make early are hard to undo. A server built around API-shaped tools and unstructured context does not get fixed by tuning prompts or swapping models. The problems are structural.

The good news: the right patterns are not complicated. Fewer tools, shaped around tasks. Batch queries over graph-aware backends. Structured returns with consistent fields. Live context where it matters.

Get these right and you have a server that models can actually use. Part 2 covers what comes next: keeping it secure, understanding who has access to what, and building the observability to know when something is wrong before your users do.

We're onboarding design partners now.

Shape the product. Lock in early pricing. Direct founder access. Limited spots available.