Error handling in custom MCP servers

Kashish Hora

Co-founder of MCPcat

The Quick Answer

MCP servers must handle errors gracefully to maintain stability and provide meaningful feedback to AI clients. Return structured error responses using the isError flag:

@app.call_tool()
async def handle_tool(name: str, arguments: dict):
    try:
        result = await process_tool(name, arguments)
        return CallToolResult(content=[TextContent(type="text", text=str(result))])
    except Exception as e:
        return CallToolResult(
            isError=True,
            content=[TextContent(text=f"Error: {str(e)}")]
        )

This pattern prevents server crashes and helps LLMs understand failures, enabling them to retry operations or request user intervention when needed.

Prerequisites

  • Python 3.8+ or Node.js 18+
  • MCP SDK installed (pip install mcp or npm install @modelcontextprotocol/sdk)
  • Basic understanding of JSON-RPC protocol
  • Familiarity with async/await patterns

Understanding MCP Error Architecture

MCP servers operate on a three-tier error model, with each tier requiring a different handling approach:

1. Transport-Level Errors occur during connection establishment or data transmission. These include network timeouts, broken pipes, or authentication failures. The transport layer (stdio, HTTP, or SSE) handles these before MCP protocol engagement.

2. Protocol-Level Errors involve JSON-RPC 2.0 violations. When a client sends malformed JSON, calls non-existent methods, or provides invalid parameters, the server must respond with standardized error codes. These errors follow the JSON-RPC specification:

{
  "jsonrpc": "2.0",
  "id": "request-123",
  "error": {
    "code": -32601,
    "message": "Method not found",
    "data": "The method 'unknown_tool' does not exist"
  }
}

3. Application-Level Errors occur within your tool implementations. These include business logic failures, external API errors, or resource constraints. Unlike protocol errors, these use the isError flag in tool responses, allowing the LLM to understand and potentially recover from the failure.

Understanding this hierarchy helps you implement appropriate error handling at each level, ensuring robust server operation and meaningful client feedback.

JSON-RPC Error Codes

Use standardized error codes to help clients handle failures appropriately:

| Error Code | Name | When to Use | Client Action |
|------------|------|-------------|---------------|
| -32700 | Parse Error | Invalid JSON received | Fix request format |
| -32600 | Invalid Request | Missing required fields (jsonrpc, method, id) | Check request structure |
| -32601 | Method Not Found | Unknown method called | Use valid method names |
| -32602 | Invalid Params | Parameter validation failed | Correct parameters |
| -32603 | Internal Error | Server exception | Report bug, don't retry |
| -32800* | Request Cancelled | Client cancelled operation | No action needed |
| -32801* | Content Too Large | Payload exceeds limits | Reduce request size |
| -32802* | Resource Unavailable | Temporary resource failure | Retry with backoff |

*MCP-specific extensions

The error code signals whether failures are temporary (can retry) or permanent (need fixes). Clients use these codes to determine retry strategies and user notifications.
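If you build protocol-level responses by hand, a small helper keeps the codes consistent. The constant names below are illustrative (they are not SDK exports); the structure mirrors the JSON-RPC error object shown earlier:

# Illustrative constants mirroring the table above (names are not from the SDK)
PARSE_ERROR, INVALID_REQUEST, METHOD_NOT_FOUND = -32700, -32600, -32601
INVALID_PARAMS, INTERNAL_ERROR = -32602, -32603
REQUEST_CANCELLED, CONTENT_TOO_LARGE, RESOURCE_UNAVAILABLE = -32800, -32801, -32802

# Codes where retrying with backoff makes sense; the rest need a corrected request
RETRYABLE_CODES = {RESOURCE_UNAVAILABLE}

def make_jsonrpc_error(request_id, code, message, data=None):
    """Build a JSON-RPC 2.0 error response matching the structure shown above."""
    error = {"code": code, "message": message}
    if data is not None:
        error["data"] = data
    return {"jsonrpc": "2.0", "id": request_id, "error": error}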

Implementing Error Handlers

Effective error handling follows a layered approach: validate inputs, catch specific exceptions, then handle unexpected errors. Here's the essential pattern:

Python Implementation

@app.call_tool()
async def handle_tool(name: str, arguments: dict):
    try:
        # Validate inputs
        if not name:
            raise ValueError("Tool name is required")
        
        result = await execute_tool(name, arguments)
        return CallToolResult(content=[TextContent(type="text", text=str(result))])
        
    except ValueError as e:
        # Known validation errors
        return CallToolResult(
            isError=True,
            content=[TextContent(text=f"Invalid input: {e}")]
        )
    except Exception as e:
        # Log full error, return safe message
        logger.exception(f"Error in {name}")
        return CallToolResult(
            isError=True,
            content=[TextContent(text="Operation failed")]
        )

TypeScript Implementation

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  try {
    const result = await executeTool(request.params.name, request.params.arguments);
    return { content: [{ type: "text", text: JSON.stringify(result) }] };
  } catch (error) {
    console.error("Tool failed:", error);
    return {
      isError: true,
      content: [{ type: "text", text: error instanceof Error ? error.message : "Unknown error" }]
    };
  }
});

Key Principles

  • Validate Early: Check inputs before processing to provide clear error messages
  • Catch Specific First: Handle known exceptions with targeted responses
  • Log Internally: Record full error details for debugging without exposing them
  • Sanitize Responses: Return user-safe messages that don't leak system information
  • Maintain Type Safety: Use TypeScript's type system to catch compile-time errors

Error Recovery Strategies

Build resilient MCP servers by implementing smart recovery patterns that handle transient failures gracefully:

Retry Logic with Exponential Backoff

When external services experience temporary issues, intelligent retry strategies prevent overwhelming them while maximizing success rates:

import asyncio
import random

async def retry_with_backoff(operation, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return await operation()
        except Exception as e:
            if attempt == max_attempts - 1:
                raise
            delay = 2 ** attempt + random.uniform(0, 1)  # Add jitter
            await asyncio.sleep(delay)
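
Usage is a one-liner; wrap only operations that are safe to repeat (fetch_current_weather here is a hypothetical helper):

# Inside an async handler; the wrapped call is a hypothetical idempotent read
weather = await retry_with_backoff(lambda: fetch_current_weather("Berlin"))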

Key considerations:

  • Only retry idempotent operations (safe to repeat)
  • Add jitter to prevent synchronized retry storms
  • Set reasonable attempt limits (typically 3-5)
  • Return clear messages when all retries fail

Circuit Breaker Pattern

Circuit breakers prevent cascading failures by "opening" when a service repeatedly fails, allowing it time to recover:

States:

  • Closed: Normal operation, requests pass through
  • Open: Service failing, reject requests immediately
  • Half-Open: Testing recovery with limited requests

Implementation approach (a minimal sketch follows this list):

  • Track consecutive failures against a threshold
  • Open circuit after threshold exceeded
  • Periodically test with single requests
  • Reset on successful responses
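
A minimal sketch of those three states; the threshold and recovery timeout values are illustrative:

import time

class ThreeStateCircuitBreaker:
    """Closed -> open after repeated failures -> half-open probe -> closed again."""

    def __init__(self, threshold=5, recovery_timeout=30.0):
        self.threshold = threshold                 # consecutive failures before opening
        self.recovery_timeout = recovery_timeout   # seconds to wait before a half-open probe
        self.failures = 0
        self.opened_at = None                      # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                            # closed: pass requests through
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            return True                            # half-open: allow a single probe
        return False                               # open: fail fast

    def record_success(self):
        self.failures = 0
        self.opened_at = None                      # close the circuit again

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()      # open the circuit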

Graceful Degradation

When non-critical services fail, provide partial functionality rather than complete failure (see the sketch after this list):

  • Return cached data with freshness warnings
  • Offer limited features when dependencies unavailable
  • Provide helpful fallback responses
  • Clearly communicate degraded state to users
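
A sketch of the cached-data fallback, assuming a simple in-memory cache and a fetch_weather_live helper (both hypothetical):

import time

weather_cache = {}  # location -> {"data": ..., "fetched_at": ...}

async def get_weather_with_fallback(location: str) -> dict:
    try:
        data = await fetch_weather_live(location)  # hypothetical live-fetch helper
        weather_cache[location] = {"data": data, "fetched_at": time.time()}
        return data
    except Exception:
        cached = weather_cache.get(location)
        if cached is not None:
            age_min = (time.time() - cached["fetched_at"]) / 60
            return {
                **cached["data"],
                "warning": f"Live data unavailable; showing cached result from {age_min:.0f} minutes ago",
            }
        return {"error": "Weather service is unavailable and no cached data exists"}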

These patterns work together to create resilient servers that handle real-world failures gracefully while maintaining user trust.

Logging and Monitoring

Effective debugging requires structured logging that balances detail with security:

Structured Logging Principles

# Log with context, sanitize sensitive data
logger.info("Request received", extra={
    "method": method,
    "request_id": request_id,
    "params": sanitize(params)  # Remove passwords, tokens
})

What to log:

  • Request IDs for tracing
  • Error types and sanitized messages
  • Performance metrics (response times)
  • External service failures

What NOT to log:

  • Passwords, tokens, API keys
  • Personal identifiable information
  • Full request/response payloads
  • Stack traces in production

Debug Mode

Enable detailed error information during development while protecting production systems:

import os
import traceback

debug_mode = os.getenv("MCP_DEBUG", "false").lower() == "true"

# Inside a tool handler's except block:
if debug_mode:
    # Development: include stack traces
    return {"error": str(e), "trace": traceback.format_exc()}
else:
    # Production: safe messages only
    return {"error": "Operation failed"}

Monitoring Best Practices

  • Use correlation IDs: Track requests across distributed systems
  • Set up alerts: Monitor error rates and response times
  • Implement health checks: Expose /health endpoints for monitoring
  • Track metrics: Count errors by type and monitor retry rates (see the counter sketch below)
  • Regular log rotation: Prevent disk space issues
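
A minimal sketch of in-process error and retry counters; swap these for a real metrics backend such as Prometheus in production:

from collections import Counter

error_counts = Counter()   # (tool_name, error_type) -> count
retry_counts = Counter()   # tool_name -> count

def record_error(tool_name: str, error_type: str) -> None:
    error_counts[(tool_name, error_type)] += 1

def record_retry(tool_name: str) -> None:
    retry_counts[tool_name] += 1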

Proper logging and monitoring transform mysterious failures into actionable insights while maintaining security.

Transport-Specific Considerations

Different MCP transports require unique error handling approaches:

Stdio Transport

  • Errors appear in stderr stream
  • Process exit codes signal failures
  • No built-in retry mechanism
  • Authentication via process environment

HTTP/SSE Transport

  • Return appropriate HTTP status codes (400, 401, 500)
  • Include the JSON-RPC error in the response body (see the sketch after this list)
  • Support resumable streams with event IDs
  • Implement OAuth 2.0 for authentication
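
If you front the server with your own HTTP layer, pairing a status code with a JSON-RPC error body might look like this sketch (Starlette is used purely for illustration):

from starlette.responses import JSONResponse

def jsonrpc_http_error(request_id, code, message, status=400):
    # Pair an HTTP status code with a JSON-RPC error body
    return JSONResponse(
        status_code=status,
        content={
            "jsonrpc": "2.0",
            "id": request_id,
            "error": {"code": code, "message": message},
        },
    )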

Windows-Specific Issues

Windows command interpreters require special handling:

{
  "command": "cmd",
  "args": ["/c", "npx", "-y", "my-mcp-server"]
}

Or bypass the wrapper entirely:

{
  "command": "node",
  "args": ["path/to/server.js"]
}

Authentication Errors

  • Return 401 with WWW-Authenticate header for HTTP
  • Include clear error messages for missing/invalid tokens
  • Don't expose token validation details
  • Log authentication failures for security monitoring

Common Issues

Q: Why do I see "Method not found" errors during startup?

A: MCP clients probe for optional capabilities like prompts/list and resources/list. This is normal behavior. To silence these errors, implement stub handlers that return empty lists:

@app.list_prompts()
async def handle_list_prompts():
    return []  # No prompts provided

Q: How do I fix "Request timed out" errors?

A: Timeouts usually indicate blocking operations. Common causes:

  • Using synchronous I/O instead of async (time.sleep vs await asyncio.sleep)
  • CPU-intensive operations blocking the event loop
  • External API calls without timeout settings

Solution: Use async operations for I/O and run CPU-intensive tasks in thread pools.
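
For example, a sketch of offloading CPU-bound work to a thread (the fingerprint_payload function and its workload are illustrative; asyncio.to_thread requires Python 3.9+):

import asyncio
import hashlib

async def fingerprint_payload(data: bytes) -> str:
    # Run the CPU-bound hashing in a worker thread so the event loop stays responsive
    return await asyncio.to_thread(lambda: hashlib.sha256(data * 1_000).hexdigest())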

Q: Why does my server fail with "Client closed" on Windows?

A: Windows requires special handling for npx commands. Use one of these configurations:

// Option 1: Use cmd wrapper
{"command": "cmd", "args": ["/c", "npx", "-y", "server-name"]}

// Option 2: Call Node directly (faster)
{"command": "node", "args": ["path/to/server.js"]}

Q: How should I handle authentication errors?

A: Return appropriate error codes without exposing system details:

  • HTTP transport: Return 401 with WWW-Authenticate header
  • Include generic message: "Authentication required"
  • Log full details server-side for debugging
  • Never expose why authentication failed

Q: What's the best way to handle rate limits?

A: Implement gradual backoff (see the sketch after this list):

  • Return clear rate limit error messages
  • Include retry-after information when possible
  • Use circuit breakers to prevent overwhelming services
  • Consider caching frequent requests
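
A sketch of honoring the Retry-After header on 429 responses (assumed here to contain seconds; the endpoint and helper name are illustrative):

import asyncio
import httpx

async def get_with_rate_limit_handling(client: httpx.AsyncClient, url: str, max_attempts: int = 3):
    retry_after = 1.0
    for attempt in range(max_attempts):
        response = await client.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        # Fall back to exponential backoff when the header is missing
        retry_after = float(response.headers.get("Retry-After", 2 ** attempt))
        if attempt < max_attempts - 1:
            await asyncio.sleep(retry_after)
    return {"error": "Rate limit exceeded", "retry_after_seconds": retry_after}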

Complete Example

Here's a comprehensive example demonstrating all error handling patterns in a weather API server:

import os
import asyncio
import logging
from mcp.server import Server
from mcp.types import CallToolResult, TextContent
import httpx

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = Server("weather-service")

# Simple circuit breaker
class CircuitBreaker:
    def __init__(self, threshold=3):
        self.failure_count = 0
        self.threshold = threshold
        self.is_open = False
    
    async def call(self, func):
        if self.is_open:
            raise Exception("Service unavailable")
        try:
            result = await func()
            self.failure_count = 0  # Reset on success
            return result
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.threshold:
                self.is_open = True
            raise

# Shared circuit breaker so failures accumulate across requests
circuit = CircuitBreaker()

# API client with retry logic
async def fetch_weather(location: str, api_key: str):
    
    async def _fetch():
        async with httpx.AsyncClient(timeout=5.0) as client:
            response = await client.get(
                "https://api.weather.com/v1/current",
                params={"location": location, "key": api_key}
            )
            response.raise_for_status()
            return response.json()
    
    # Retry with exponential backoff
    for attempt in range(3):
        try:
            return await circuit.call(_fetch)
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 429:
                return {"error": "Rate limit exceeded"}
            raise
        except Exception as e:
            if attempt < 2:
                await asyncio.sleep(2 ** attempt)
            else:
                raise

@app.call_tool()
async def handle_weather(name: str, arguments: dict):
    """Main tool handler with comprehensive error handling"""
    request_id = arguments.get("request_id", "unknown")
    
    # Log sanitized request
    logger.info(f"Weather request {request_id} for tool: {name}")
    
    try:
        # Validate inputs
        if name != "get_weather":
            raise ValueError(f"Unknown tool: {name}")
        
        location = arguments.get("location")
        if not location:
            raise ValueError("Location is required")
        
        # Fetch with error handling
        api_key = os.getenv("WEATHER_API_KEY")
        if not api_key:
            raise EnvironmentError("API key not configured")
        
        result = await fetch_weather(location, api_key)
        
        return CallToolResult(
            content=[TextContent(type="text", text=str(result))]
        )
        
    except ValueError as e:
        # Client errors - helpful messages
        return CallToolResult(
            isError=True,
            content=[TextContent(type="text", text=str(e))]
        )
    except Exception as e:
        # Server errors - log details, return safe message
        logger.exception(f"Error in request {request_id}")
        return CallToolResult(
            isError=True,
            content=[TextContent(text="Service temporarily unavailable")]
        )

# Optional method stubs
@app.list_prompts()
async def list_prompts():
    return []

@app.list_resources()
async def list_resources():
    return []

This example demonstrates:

  • Input validation with clear error messages
  • External API integration with timeouts
  • Retry logic with exponential backoff
  • Circuit breaker pattern for failure protection
  • Structured logging with request IDs
  • Environment-based configuration
  • Graceful error responses

Best Practices Summary

  1. Use appropriate error codes - Choose JSON-RPC codes that accurately represent the error type
  2. Implement defense in depth - Validate inputs, check permissions, and limit resources
  3. Log comprehensively - Capture full error context server-side while protecting sensitive data
  4. Return helpful messages - Guide users toward solutions without exposing system internals
  5. Handle failures gracefully - Use circuit breakers and retries for external dependencies
  6. Test error scenarios - Simulate failures to verify error handling works correctly
  7. Monitor error rates - Track error patterns to identify systemic issues early

Robust error handling transforms brittle MCP servers into reliable production systems that gracefully handle the unexpected while maintaining security and performance.