Error handling in custom MCP servers

Kashish Hora

Co-founder of MCPcat

The Quick Answer

MCP servers must handle errors gracefully to maintain stability and provide meaningful feedback to AI clients. Return structured error responses using the isError flag:

@app.call_tool()
async def handle_tool(name: str, arguments: dict):
    try:
        result = await process_tool(name, arguments)
        return CallToolResult(content=[TextContent(type="text", text=str(result))])
    except Exception as e:
        return CallToolResult(
            isError=True,
            content=[TextContent(text=f"Error: {str(e)}")]
        )

This pattern prevents server crashes and helps LLMs understand failures, enabling them to retry operations or request user intervention when needed.

Prerequisites

  • Python 3.8+ or Node.js 18+
  • MCP SDK installed (pip install mcp or npm install @modelcontextprotocol/sdk)
  • Basic understanding of JSON-RPC protocol
  • Familiarity with async/await patterns

Understanding MCP Error Architecture

MCP servers operate on a three-tier error model, with each tier requiring a different handling approach:

1. Transport-Level Errors occur during connection establishment or data transmission. These include network timeouts, broken pipes, or authentication failures. The transport layer (stdio, HTTP, or SSE) handles these before MCP protocol engagement.

2. Protocol-Level Errors involve JSON-RPC 2.0 violations. When a client sends malformed JSON, calls non-existent methods, or provides invalid parameters, the server must respond with standardized error codes. These errors follow the JSON-RPC specification:

{
  "jsonrpc": "2.0",
  "id": "request-123",
  "error": {
    "code": -32601,
    "message": "Method not found",
    "data": "The method 'unknown_tool' does not exist"
  }
}

3. Application-Level Errors occur within your tool implementations. These include business logic failures, external API errors, or resource constraints. Unlike protocol errors, these use the isError flag in tool responses, allowing the LLM to understand and potentially recover from the failure.

Understanding this hierarchy helps you implement appropriate error handling at each level, ensuring robust server operation and meaningful client feedback.

JSON-RPC Error Codes

Use standardized error codes to help clients handle failures appropriately:

| Error Code | Name | When to Use | Client Action |
|------------|------|-------------|---------------|
| -32700 | Parse Error | Invalid JSON received | Fix request format |
| -32600 | Invalid Request | Missing required fields (jsonrpc, method, id) | Check request structure |
| -32601 | Method Not Found | Unknown method called | Use valid method names |
| -32602 | Invalid Params | Parameter validation failed | Correct parameters |
| -32603 | Internal Error | Server exception | Report bug, don't retry |
| -32800* | Request Cancelled | Client cancelled operation | No action needed |
| -32801* | Content Too Large | Payload exceeds limits | Reduce request size |
| -32802* | Resource Unavailable | Temporary resource failure | Retry with backoff |

*MCP-specific extensions

The error code signals whether failures are temporary (can retry) or permanent (need fixes). Clients use these codes to determine retry strategies and user notifications.
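If you build protocol-level responses by hand, a small helper keeps the codes consistent. The constant names below are illustrative (they are not SDK exports); the structure mirrors the JSON-RPC error object shown earlier:

# Illustrative constants mirroring the table above (names are not from the SDK)
PARSE_ERROR, INVALID_REQUEST, METHOD_NOT_FOUND = -32700, -32600, -32601
INVALID_PARAMS, INTERNAL_ERROR = -32602, -32603
REQUEST_CANCELLED, CONTENT_TOO_LARGE, RESOURCE_UNAVAILABLE = -32800, -32801, -32802

# Codes where retrying with backoff makes sense; the rest need a corrected request
RETRYABLE_CODES = {RESOURCE_UNAVAILABLE}

def make_jsonrpc_error(request_id, code, message, data=None):
    """Build a JSON-RPC 2.0 error response matching the structure shown above."""
    error = {"code": code, "message": message}
    if data is not None:
        error["data"] = data
    return {"jsonrpc": "2.0", "id": request_id, "error": error}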

Implementing Error Handlers

Effective error handling follows a layered approach: validate inputs, catch specific exceptions, then handle unexpected errors. Here's the essential pattern:

Python Implementation

@app.call_tool()
async def handle_tool(name: str, arguments: dict):
    try:
        # Validate inputs
        if not name:
            raise ValueError("Tool name is required")
        
        result = await execute_tool(name, arguments)
        return CallToolResult(content=[TextContent(type="text", text=str(result))])
        
    except ValueError as e:
        # Known validation errors
        return CallToolResult(
            isError=True,
            content=[TextContent(text=f"Invalid input: {e}")]
        )
    except Exception as e:
        # Log full error, return safe message
        logger.exception(f"Error in {name}")
        return CallToolResult(
            isError=True,
            content=[TextContent(text="Operation failed")]
        )

TypeScript Implementation

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  try {
    const result = await executeTool(request.params.name, request.params.arguments);
    return { content: [{ type: "text", text: JSON.stringify(result) }] };
  } catch (error) {
    console.error("Tool failed:", error);
    return {
      isError: true,
      content: [{ type: "text", text: error instanceof Error ? error.message : "Unknown error" }]
    };
  }
});

Key Principles

  • Validate Early: Check inputs before processing to provide clear error messages
  • Catch Specific First: Handle known exceptions with targeted responses
  • Log Internally: Record full error details for debugging without exposing them
  • Sanitize Responses: Return user-safe messages that don't leak system information
  • Maintain Type Safety: Use TypeScript's type system to catch compile-time errors

Error Recovery Strategies

Build resilient MCP servers by implementing smart recovery patterns that handle transient failures gracefully:

Retry Logic with Exponential Backoff

When external services experience temporary issues, intelligent retry strategies prevent overwhelming them while maximizing success rates:

import asyncio
import random

async def retry_with_backoff(operation, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return await operation()
        except Exception as e:
            if attempt == max_attempts - 1:
                raise
            delay = 2 ** attempt + random.uniform(0, 1)  # Add jitter
            await asyncio.sleep(delay)
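
Usage is a one-liner; wrap only operations that are safe to repeat (fetch_current_weather here is a hypothetical helper):

# Inside an async handler; the wrapped call is a hypothetical idempotent read
weather = await retry_with_backoff(lambda: fetch_current_weather("Berlin"))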

Key considerations:

  • Only retry idempotent operations (safe to repeat)
  • Add jitter to prevent synchronized retry storms
  • Set reasonable attempt limits (typically 3-5)
  • Return clear messages when all retries fail

Circuit Breaker Pattern

Circuit breakers prevent cascading failures by "opening" when a service repeatedly fails, allowing it time to recover:

States:

  • Closed: Normal operation, requests pass through
  • Open: Service failing, reject requests immediately
  • Half-Open: Testing recovery with limited requests

Implementation approach (a minimal sketch follows this list):

  • Track consecutive failures against a threshold
  • Open circuit after threshold exceeded
  • Periodically test with single requests
  • Reset on successful responses
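
A minimal sketch of those three states; the threshold and recovery timeout values are illustrative:

import time

class ThreeStateCircuitBreaker:
    """Closed -> open after repeated failures -> half-open probe -> closed again."""

    def __init__(self, threshold=5, recovery_timeout=30.0):
        self.threshold = threshold                 # consecutive failures before opening
        self.recovery_timeout = recovery_timeout   # seconds to wait before a half-open probe
        self.failures = 0
        self.opened_at = None                      # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                            # closed: pass requests through
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            return True                            # half-open: allow a single probe
        return False                               # open: fail fast

    def record_success(self):
        self.failures = 0
        self.opened_at = None                      # close the circuit again

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()      # open the circuit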

Graceful Degradation

When non-critical services fail, provide partial functionality rather than complete failure (see the sketch after this list):

  • Return cached data with freshness warnings
  • Offer limited features when dependencies unavailable
  • Provide helpful fallback responses
  • Clearly communicate degraded state to users
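
A sketch of the cached-data fallback, assuming a simple in-memory cache and a fetch_weather_live helper (both hypothetical):

import time

weather_cache = {}  # location -> {"data": ..., "fetched_at": ...}

async def get_weather_with_fallback(location: str) -> dict:
    try:
        data = await fetch_weather_live(location)  # hypothetical live-fetch helper
        weather_cache[location] = {"data": data, "fetched_at": time.time()}
        return data
    except Exception:
        cached = weather_cache.get(location)
        if cached is not None:
            age_min = (time.time() - cached["fetched_at"]) / 60
            return {
                **cached["data"],
                "warning": f"Live data unavailable; showing cached result from {age_min:.0f} minutes ago",
            }
        return {"error": "Weather service is unavailable and no cached data exists"}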

These patterns work together to create resilient servers that handle real-world failures gracefully while maintaining user trust.

Logging and Monitoring

Effective debugging requires structured logging that balances detail with security:

Structured Logging Principles

# Log with context, sanitize sensitive data
logger.info("Request received", extra={
    "method": method,
    "request_id": request_id,
    "params": sanitize(params)  # Remove passwords, tokens
})

What to log:

  • Request IDs for tracing
  • Error types and sanitized messages
  • Performance metrics (response times)
  • External service failures

What NOT to log:

  • Passwords, tokens, API keys
  • Personal identifiable information
  • Full request/response payloads
  • Stack traces in production

Debug Mode

Enable detailed error information during development while protecting production systems:

import os
import traceback

debug_mode = os.getenv("MCP_DEBUG", "false").lower() == "true"

# Inside a tool handler's except block:
if debug_mode:
    # Development: include stack traces
    return {"error": str(e), "trace": traceback.format_exc()}
else:
    # Production: safe messages only
    return {"error": "Operation failed"}

Monitoring Best Practices

  • Use correlation IDs: Track requests across distributed systems
  • Set up alerts: Monitor error rates and response times
  • Implement health checks: Expose /health endpoints for monitoring
  • Track metrics: Count errors by type and monitor retry rates (see the counter sketch below)
  • Regular log rotation: Prevent disk space issues
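
A minimal sketch of in-process error and retry counters; swap these for a real metrics backend such as Prometheus in production:

from collections import Counter

error_counts = Counter()   # (tool_name, error_type) -> count
retry_counts = Counter()   # tool_name -> count

def record_error(tool_name: str, error_type: str) -> None:
    error_counts[(tool_name, error_type)] += 1

def record_retry(tool_name: str) -> None:
    retry_counts[tool_name] += 1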

Proper logging and monitoring transform mysterious failures into actionable insights while maintaining security.

Transport-Specific Considerations

Different MCP transports require unique error handling approaches:

Stdio Transport

  • Errors appear in stderr stream
  • Process exit codes signal failures
  • No built-in retry mechanism
  • Authentication via process environment

HTTP/SSE Transport

  • Return appropriate HTTP status codes (400, 401, 500)
  • Include the JSON-RPC error in the response body (see the sketch after this list)
  • Support resumable streams with event IDs
  • Implement OAuth 2.0 for authentication
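
If you front the server with your own HTTP layer, pairing a status code with a JSON-RPC error body might look like this sketch (Starlette is used purely for illustration):

from starlette.responses import JSONResponse

def jsonrpc_http_error(request_id, code, message, status=400):
    # Pair an HTTP status code with a JSON-RPC error body
    return JSONResponse(
        status_code=status,
        content={
            "jsonrpc": "2.0",
            "id": request_id,
            "error": {"code": code, "message": message},
        },
    )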

Windows-Specific Issues

Windows command interpreters require special handling:

{
  "command": "cmd",
  "args": ["/c", "npx", "-y", "my-mcp-server"]
}

Or bypass the wrapper entirely:

{
  "command": "node",
  "args": ["path/to/server.js"]
}

Authentication Errors

  • Return 401 with WWW-Authenticate header for HTTP
  • Include clear error messages for missing/invalid tokens
  • Don't expose token validation details
  • Log authentication failures for security monitoring

Common Issues

Q: Why do I see "Method not found" errors during startup?

A: MCP clients probe for optional capabilities like prompts/list and resources/list. This is normal behavior. To silence these errors, implement stub handlers that return empty lists:

@app.list_prompts()
async def handle_list_prompts():
    return []  # No prompts provided

Q: How do I fix "Request timed out" errors?

A: Timeouts usually indicate blocking operations. Common causes:

  • Using synchronous I/O instead of async (time.sleep vs await asyncio.sleep)
  • CPU-intensive operations blocking the event loop
  • External API calls without timeout settings

Solution: Use async operations for I/O and run CPU-intensive tasks in thread pools.
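
For example, a sketch of offloading CPU-bound work to a thread (the fingerprint_payload function and its workload are illustrative; asyncio.to_thread requires Python 3.9+):

import asyncio
import hashlib

async def fingerprint_payload(data: bytes) -> str:
    # Run the CPU-bound hashing in a worker thread so the event loop stays responsive
    return await asyncio.to_thread(lambda: hashlib.sha256(data * 1_000).hexdigest())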

Q: Why does my server fail with "Client closed" on Windows?

A: Windows requires special handling for npx commands. Use one of these configurations:

// Option 1: Use cmd wrapper
{"command": "cmd", "args": ["/c", "npx", "-y", "server-name"]}

// Option 2: Call Node directly (faster)
{"command": "node", "args": ["path/to/server.js"]}

Q: How should I handle authentication errors?

A: Return appropriate error codes without exposing system details:

  • HTTP transport: Return 401 with WWW-Authenticate header
  • Include generic message: "Authentication required"
  • Log full details server-side for debugging
  • Never expose why authentication failed

Q: What's the best way to handle rate limits?

A: Implement gradual backoff (see the sketch after this list):

  • Return clear rate limit error messages
  • Include retry-after information when possible
  • Use circuit breakers to prevent overwhelming services
  • Consider caching frequent requests
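
A sketch of honoring the Retry-After header on 429 responses (assumed here to contain seconds; the endpoint and helper name are illustrative):

import asyncio
import httpx

async def get_with_rate_limit_handling(client: httpx.AsyncClient, url: str, max_attempts: int = 3):
    retry_after = 1.0
    for attempt in range(max_attempts):
        response = await client.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        # Fall back to exponential backoff when the header is missing
        retry_after = float(response.headers.get("Retry-After", 2 ** attempt))
        if attempt < max_attempts - 1:
            await asyncio.sleep(retry_after)
    return {"error": "Rate limit exceeded", "retry_after_seconds": retry_after}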

Complete Example

Here's a comprehensive example demonstrating all error handling patterns in a weather API server:

import os
import asyncio
import logging
from mcp.server import Server
from mcp.types import CallToolResult, TextContent
import httpx

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = Server("weather-service")

# Simple circuit breaker
class CircuitBreaker:
    def __init__(self, threshold=3):
        self.failure_count = 0
        self.threshold = threshold
        self.is_open = False
    
    async def call(self, func):
        if self.is_open:
            raise Exception("Service unavailable")
        try:
            result = await func()
            self.failure_count = 0  # Reset on success
            return result
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.threshold:
                self.is_open = True
            raise

# Shared circuit breaker so failures accumulate across requests
circuit = CircuitBreaker()

# API client with retry logic
async def fetch_weather(location: str, api_key: str):
    
    async def _fetch():
        async with httpx.AsyncClient(timeout=5.0) as client:
            response = await client.get(
                "https://api.weather.com/v1/current",
                params={"location": location, "key": api_key}
            )
            response.raise_for_status()
            return response.json()
    
    # Retry with exponential backoff
    for attempt in range(3):
        try:
            return await circuit.call(_fetch)
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 429:
                return {"error": "Rate limit exceeded"}
            raise
        except Exception as e:
            if attempt < 2:
                await asyncio.sleep(2 ** attempt)
            else:
                raise

@app.call_tool()
async def handle_weather(name: str, arguments: dict):
    """Main tool handler with comprehensive error handling"""
    request_id = arguments.get("request_id", "unknown")
    
    # Log sanitized request
    logger.info(f"Weather request {request_id} for tool: {name}")
    
    try:
        # Validate inputs
        if name != "get_weather":
            raise ValueError(f"Unknown tool: {name}")
        
        location = arguments.get("location")
        if not location:
            raise ValueError("Location is required")
        
        # Fetch with error handling
        api_key = os.getenv("WEATHER_API_KEY")
        if not api_key:
            raise EnvironmentError("API key not configured")
        
        result = await fetch_weather(location, api_key)
        
        return CallToolResult(
            content=[TextContent(type="text", text=str(result))]
        )
        
    except ValueError as e:
        # Client errors - helpful messages
        return CallToolResult(
            isError=True,
            content=[TextContent(type="text", text=str(e))]
        )
    except Exception as e:
        # Server errors - log details, return safe message
        logger.exception(f"Error in request {request_id}")
        return CallToolResult(
            isError=True,
            content=[TextContent(text="Service temporarily unavailable")]
        )

# Optional method stubs
@app.list_prompts()
async def list_prompts():
    return []

@app.list_resources()
async def list_resources():
    return []

This example demonstrates:

  • Input validation with clear error messages
  • External API integration with timeouts
  • Retry logic with exponential backoff
  • Circuit breaker pattern for failure protection
  • Structured logging with request IDs
  • Environment-based configuration
  • Graceful error responses

Best Practices Summary

  1. Use appropriate error codes - Choose JSON-RPC codes that accurately represent the error type
  2. Implement defense in depth - Validate inputs, check permissions, and limit resources
  3. Log comprehensively - Capture full error context server-side while protecting sensitive data
  4. Return helpful messages - Guide users toward solutions without exposing system internals
  5. Handle failures gracefully - Use circuit breakers and retries for external dependencies
  6. Test error scenarios - Simulate failures to verify error handling works correctly
  7. Monitor error rates - Track error patterns to identify systemic issues early

Robust error handling transforms brittle MCP servers into reliable production systems that gracefully handle the unexpected while maintaining security and performance.