sixdegree

Operating MCP Servers in Production

Getting an MCP server to work is not the hard part. The hard part is operating one: knowing who can access what, catching failures before users do, and building enough visibility to improve over time.

Craig Tracey
Operating MCP Servers in Production

Part 2 of 2. Part 1 covers tool design and context quality.

Getting an MCP server to work is not the hard part. The hard part is operating one: knowing who can access what, catching failures before users do, and building enough visibility to improve over time.

Most MCP implementations skip this entirely. They ship a server that passes the happy path and call it done. Security is an afterthought. Observability does not exist. The first sign something is wrong is a user reporting it.

This post covers the two areas that separate servers that work in demos from servers that hold up in production: security and authorization, and observability.


The trust model is not obvious and most implementations ignore it

MCP introduces a new trust surface. The server executes actions on behalf of the model, which is operating on behalf of the user. That chain has failure modes at every link.

Most MCP servers are built with implicit full trust. Any connected client can call any tool. Any tool can perform any action the underlying system allows. There is no verification that the model's intent matches what was actually requested. There is no audit trail.

This is fine for local development tools. It is not fine for anything that touches real data, real systems, or real users.

Before you expose any MCP server in production, answer these questions: What can each tool do at its maximum blast radius? Who can connect? What authorizes a connection? What audit log exists for tool calls? If you cannot answer them, you are not ready for production.

Batch tools break traditional authZ

Single-resource APIs have a clean authorization model. The caller requests a resource, the system checks whether that caller has permission to access it, and either returns it or returns a 403. Simple, auditable, composable.

Batch tools do not fit that model.

When a tool fetches across multiple resources in a single call (which is what you want, as we argued in Part 1), authorization is no longer a binary decision. A user might have access to some of the underlying data and not others. They might have read access to a service record but not the on-call roster attached to it. They might be permitted to see incident history but not the PII embedded in certain fields. The batch tool reaches across all of it in one query.

What do you do? Fail the entire tool call because one resource is out of scope? Return partial data with no indication of what was redacted? Return everything and assume the tool was scoped narrowly enough that this never matters?

Most implementations pick the third option implicitly, by not thinking about it at all.

The failure modes are non-obvious

Overpermission is the obvious risk. A batch tool that fetches more than the caller should see will return data the underlying API would have blocked at the endpoint level. The resource-level authZ never fires because the call never hits that endpoint directly. The data leaks through the tool.

Underpermission is subtler and arguably worse for reliability. A tool call that partially succeeds (fetching some resources and silently failing on others) returns incomplete context with no signal to the model that anything is missing. The model reasons from a partial picture as if it were complete. The output looks correct. It is not.

Neither failure is loud. Both compound silently in production.

RBAC needs to move down the stack

The root problem is that most RBAC systems are designed around API operations on individual resources. Role X can call endpoint Y on resource type Z. That model does not compose well when the tool abstracts the individual resource operations away.

Authorization needs to move down to the data layer, not sit at the API boundary. Field-level, row-level, and relationship-level access controls that apply regardless of how the data is queried. When a batch tool executes a graph traversal, the access check should fire on every node and edge in the result, not on the tool call itself.

This is not a new idea. Row-level security exists in most databases. Attribute-based access control frameworks have been around for years. But most systems do not implement them rigorously because single-resource APIs provide enough coarse-grained protection. Batch tools remove that buffer.

Agent identity complicates this further

Traditional RBAC assumes a human principal making a deliberate decision. Agents introduce a new layer: the agent is acting on behalf of a user, but the user did not explicitly authorize every individual action the agent takes.

This creates a question that existing authZ frameworks do not have a clean answer for: what is the effective permission set of an agent acting on behalf of a user?

The intuitive answer (intersect agent permissions with user permissions) is a reasonable starting point but does not cover dynamic delegation, time-bounded access, or context-specific scoping. If an agent is operating in incident response mode, it arguably needs different access than the same agent answering a general question from the same user.

The industry has not solved this yet. What is clear is that MCP implementations need to plan for it rather than assume existing RBAC models extend cleanly. They will not.

For now: implement data-layer access controls at the field and row level, make partial authorization failures explicit in tool output rather than silent, and treat agent identity as distinct from user identity even when they are closely linked. That is not a complete solution. It is a foundation that does not create new problems while the ecosystem figures out the rest.

Scope tools to minimum capability

Follow least privilege at the tool level, not just the infrastructure level. A tool that can read all customer records when it only needs to read one is a problem waiting to happen. A tool that can write when it should only read is a larger problem.

Design tools around the specific action the agent needs to take, not around the capabilities of the underlying system. If your tool wraps an API that supports 30 filter parameters but the agent only ever needs three, expose those three. The rest is attack surface and confusion surface.

Prompt injection is a real attack vector

If your MCP tools return data from external or user-controlled sources, that data can contain instructions. The model will sometimes follow them. This is not theoretical.

Sanitize tool output before it is returned to the model. Flag content that originates from user-controlled sources. Do not return arbitrary text from untrusted systems as if it were trusted context. If you are pulling data from emails, documents, tickets, or any surface where users write free text, you are already in the injection risk zone.


You cannot improve what you cannot see

Most MCP servers ship with zero observability. You know when the tool was called. You do not know whether the model chose the right tool, whether the schema was correctly interpreted, whether the returned context was actually useful, or whether the task succeeded.

Log tool calls with full input and output at the server level. Trace which tools were called in sequence for multi-step tasks. Measure tool call accuracy: when the model calls a tool, did it pass correct inputs? Did the task succeed?

Without this data you are flying blind. You will not know which tools are being misused, which descriptions are misleading the model, or which schema fields are being filled incorrectly.

Run evals before you ship changes

Tool changes break things in non-obvious ways. A description reword, a field rename, a new required parameter. Any of these can change model behavior. Without evals, you will not catch regressions until users do.

Build a small eval suite for every MCP server you operate. It does not need to be comprehensive. It needs to cover the core use cases and the edge cases that have burned you before. Run it before any change ships.

This is the most skipped step in MCP development. It is also one of the highest-leverage investments you can make in reliability. Boundary is designed for exactly this. It gives you a reproducible harness for testing tool selection accuracy, cross-service confusion, and how your server behaves as tool count scales.

Evals are not just for models

The instinct is to run evals to measure model performance. That is useful. But evals for MCP servers should also measure tool selection accuracy, schema interpretation, and context utilization. Is the model choosing the right tool? Is it passing the right inputs? Is it using the context it was given?

These are server-side and design-side failures, not model failures. Knowing the difference is what lets you fix the right thing. As the Boundary benchmark data shows, the most expensive model is not always the most accurate one, and the failures often trace back to tool design, not model capability.


The Bigger Picture

MCP is infrastructure. It has the same operational requirements as any other piece of infrastructure: reliability, observability, security, and the ability to evolve without breaking everything downstream.

The hardest problems here (batch authorization, agent identity, data-layer access control) do not have clean industry solutions yet. The a16z infrastructure thesis we wrote about in a16z Just Described What We've Been Building points in the right direction: canonical entities, identity resolution, and governance are foundational. But the specific authZ model for agents operating over batch context is still being worked out across the industry.

What is not being worked out is the failure mode. The MCP implementations that skip security and observability planning will hit these problems in production. The ones that planned for them will recover faster.

Build with that in mind.

We're onboarding design partners now.

Shape the product. Lock in early pricing. Direct founder access. Limited spots available.