Implementing connection health checks and monitoring

Kashish Hora

Co-founder of MCPcat

The Quick Answer

Implement MCP server health checks using the protocol's built-in ping utility and custom monitoring tools. Create a health_check tool that reports connection status, uptime, and resource usage:

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "health_check") {
    return {
      toolResult: {
        status: 'healthy',
        timestamp: new Date().toISOString(),
        uptime: Math.floor((Date.now() - startTime) / 1000),
        connections: activeConnections
      }
    };
  }
});

This enables proactive monitoring and automatic recovery when connections fail.

Prerequisites

  • Node.js 18+ or Python 3.10+ installed
  • MCP SDK for your chosen language (@modelcontextprotocol/sdk for TypeScript, mcp for Python)
  • Basic understanding of MCP server architecture and JSON-RPC protocol
  • Development environment with async/await support

Implementation Strategy

MCP servers require robust health monitoring to ensure reliable operation in production environments. Connection failures, timeouts, and resource exhaustion are common issues that can disrupt service availability. A comprehensive health check system helps detect problems early and enables automatic recovery.

The MCP protocol supports health monitoring through multiple mechanisms. The built-in ping utility provides basic connectivity testing, while custom health check tools offer detailed status information. Combining these approaches creates a resilient monitoring solution that tracks server health, connection stability, and resource utilization.
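For basic connectivity testing, the ping utility can be exercised from the client side. The sketch below is a minimal example using the TypeScript SDK; the server command and arguments are placeholders for however your monitored server is launched:

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Placeholder launch command -- point this at your own server binary
const transport = new StdioClientTransport({
  command: "node",
  args: ["./monitored-server.js"]
});

const client = new Client(
  { name: "health-probe", version: "1.0.0" },
  { capabilities: {} }
);
await client.connect(transport);

// ping() sends the protocol-level ping request and resolves on an empty result
try {
  await client.ping();
  console.log("Server is reachable");
} catch (error) {
  console.error("Ping failed:", error);
}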

Basic Health Check Implementation

Start with a simple health check tool that exposes server status through the MCP protocol. This approach integrates seamlessly with existing MCP clients and allows monitoring through the same connection channel used for regular operations.

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { 
  CallToolRequestSchema, 
  ListToolsRequestSchema 
} from "@modelcontextprotocol/sdk/types.js";

const startTime = Date.now();
let requestCount = 0;

const server = new Server({
  name: "monitored-server",
  version: "1.0.0",
}, {
  capabilities: { tools: {} }
});

// Register health check tool
server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [{
      name: "health_check",
      description: "Get server health status",
      inputSchema: {
        type: "object",
        properties: {},
        required: []
      }
    }]
  };
});

// Handle health check requests
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  requestCount++;
  
  if (request.params.name === "health_check") {
    const memUsage = process.memoryUsage();
    return {
      toolResult: {
        status: 'healthy',
        timestamp: new Date().toISOString(),
        uptime_seconds: Math.floor((Date.now() - startTime) / 1000),
        request_count: requestCount,
        memory_mb: Math.round(memUsage.heapUsed / 1024 / 1024)
      }
    };
  }

  throw new Error(`Unknown tool: ${request.params.name}`);
});

Python servers implement similar functionality using the MCP SDK. The async architecture allows non-blocking health checks that don't interfere with normal operations:

from mcp.server import Server
from mcp.types import Tool
import time
import psutil

app = Server("monitored-server")
start_time = time.time()
request_count = 0

@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="health_check",
            description="Get server health status",
            inputSchema={"type": "object", "properties": {}}
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> dict:
    global request_count
    request_count += 1
    
    if name == "health_check":
        process = psutil.Process()
        return {
            "status": "healthy",
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "uptime_seconds": int(time.time() - start_time),
            "request_count": request_count,
            "memory_mb": round(process.memory_info().rss / 1024 / 1024)
        }

    raise ValueError(f"Unknown tool: {name}")

Connection Monitoring

Tracking active connections provides visibility into server load and helps identify connection leaks. MCP servers should monitor both transport-level connections and logical client sessions to understand usage patterns and detect anomalies.

interface ConnectionInfo {
  id: string;
  connectedAt: Date;
  lastActivity: Date;
  requestCount: number;
  clientInfo?: any;
}

class ConnectionMonitor {
  private connections = new Map<string, ConnectionInfo>();
  private connectionIdCounter = 0;
  
  addConnection(transport: any): string {
    const id = `conn-${++this.connectionIdCounter}`;
    this.connections.set(id, {
      id,
      connectedAt: new Date(),
      lastActivity: new Date(),
      requestCount: 0
    });
    return id;
  }
  
  updateActivity(id: string): void {
    const conn = this.connections.get(id);
    if (conn) {
      conn.lastActivity = new Date();
      conn.requestCount++;
    }
  }
  
  removeConnection(id: string): void {
    this.connections.delete(id);
  }
  
  getMetrics(): object {
    const now = Date.now();
    const active = Array.from(this.connections.values());
    
    return {
      total_connections: active.length,
      oldest_connection_seconds: active.length > 0 
        ? Math.floor((now - Math.min(...active.map(c => c.connectedAt.getTime()))) / 1000)
        : 0,
      total_requests: active.reduce((sum, c) => sum + c.requestCount, 0),
      idle_connections: active.filter(c => 
        now - c.lastActivity.getTime() > 60000
      ).length
    };
  }
}

Implement connection monitoring in your server lifecycle hooks. Track connections from establishment through closure, updating metrics on each request:

const monitor = new ConnectionMonitor();

// Track new connections
transport.onconnect = () => {
  const connId = monitor.addConnection(transport);
  transport.connectionId = connId;
};

// Update on each request
server.setRequestHandler(CallToolRequestSchema, async (request, { transport }) => {
  if (transport.connectionId) {
    monitor.updateActivity(transport.connectionId);
  }
  // ... handle request
});

// Clean up on disconnect
transport.onclose = () => {
  if (transport.connectionId) {
    monitor.removeConnection(transport.connectionId);
  }
};

Timeout and Recovery Patterns

MCP connections can fail due to network issues, client crashes, or server overload. Implementing proper timeout handling and recovery mechanisms ensures your server remains responsive and can recover from transient failures.

The MCP specification recommends establishing timeouts for all requests. When a timeout occurs, issue a notifications/cancelled notification for the stalled request and clean up its resources:

class TimeoutManager {
  private pendingRequests = new Map<string, NodeJS.Timeout>();
  private defaultTimeout = 30000; // 30 seconds
  
  trackRequest(requestId: string, timeoutMs?: number): void {
    const timeout = setTimeout(() => {
      this.handleTimeout(requestId);
    }, timeoutMs || this.defaultTimeout);
    
    this.pendingRequests.set(requestId, timeout);
  }
  
  completeRequest(requestId: string): void {
    const timeout = this.pendingRequests.get(requestId);
    if (timeout) {
      clearTimeout(timeout);
      this.pendingRequests.delete(requestId);
    }
  }
  
  private async handleTimeout(requestId: string): Promise<void> {
    console.error(`Request ${requestId} timed out`);
    
    // Send cancellation notification (notifications/cancelled per the MCP spec)
    await server.notification({
      method: "notifications/cancelled",
      params: { requestId, reason: "timeout" }
    });
    
    // Clean up resources
    this.pendingRequests.delete(requestId);
  }
}

Python implementation using asyncio for timeout management:

import asyncio
from typing import Dict, Optional

class TimeoutManager:
    def __init__(self, default_timeout: float = 30.0):
        self.default_timeout = default_timeout
        self.pending_tasks: Dict[str, asyncio.Task] = {}
    
    async def with_timeout(self, request_id: str, coro, timeout: Optional[float] = None):
        """Execute coroutine with timeout"""
        timeout_value = timeout or self.default_timeout
        
        try:
            task = asyncio.create_task(coro)
            self.pending_tasks[request_id] = task
            
            result = await asyncio.wait_for(task, timeout=timeout_value)
            return result
            
        except asyncio.TimeoutError:
            # Send cancellation notification
            await self.send_cancel_notification(request_id)
            raise
            
        finally:
            self.pending_tasks.pop(request_id, None)
    
    async def send_cancel_notification(self, request_id: str):
        """Send a notifications/cancelled notification for the timed-out request"""
        await server.send_notification(
            method="notifications/cancelled",
            params={"requestId": request_id, "reason": "timeout"}
        )

Common Issues

Error: Connection closed unexpectedly

SSE (Server-Sent Events) connections may close after periods of inactivity. The root cause is often intermediate proxies or load balancers that terminate idle connections. Implement keep-alive messages to maintain the connection:

// Send a periodic MCP ping request to keep the connection alive
setInterval(async () => {
  try {
    await server.ping();
  } catch (error) {
    console.error('Keep-alive failed:', error);
    // Trigger reconnection logic
  }
}, 30000); // Every 30 seconds

Prevent idle timeouts by configuring your transport layer appropriately and sending periodic activity.
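If the server runs over HTTP/SSE, Node's own HTTP timeouts can also close long-lived streams. A minimal sketch of relaxing them on the underlying http.Server (the values are illustrative, and any proxy in front of the server needs matching idle-timeout settings):

import { createServer } from "node:http";

const httpServer = createServer(/* your SSE request handler */);

// Node closes idle or long-running requests by default; relax these for SSE streams
httpServer.requestTimeout = 0;        // disable the per-request timeout
httpServer.keepAliveTimeout = 75_000; // keep idle sockets open longer than typical proxy timeouts
httpServer.listen(3000);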

Error: Request timeout after 30 seconds

Long-running operations may exceed default timeout values. MCP servers should handle this gracefully by implementing progress notifications and chunked responses. For operations that legitimately take longer:

// Report progress to reset the client's timeout
async function longOperation(request) {
  const progressToken = request.params._meta?.progressToken;
  const steps = 10;
  for (let i = 0; i < steps; i++) {
    // Send progress notification (notifications/progress per the MCP spec)
    if (progressToken !== undefined) {
      await server.notification({
        method: "notifications/progress",
        params: { progressToken, progress: i, total: steps }
      });
    }

    // Do work...
    await processStep(i);
  }
}

Configure appropriate timeouts based on your use case and implement progress reporting for long operations.
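On the client side, the per-request timeout can be raised for tools that are known to run long. A minimal sketch, assuming a client set up like the ping example earlier; the tool name long_operation is a placeholder:

// Allow up to five minutes for a known long-running tool and log progress as it arrives
const result = await client.callTool(
  { name: "long_operation", arguments: {} },
  undefined,
  {
    timeout: 300_000, // per-request timeout in milliseconds
    onprogress: (progress) => {
      console.log(`Progress: ${progress.progress}/${progress.total ?? "?"}`);
    }
  }
);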

Error: Too many connections

Resource exhaustion occurs when servers accept unlimited connections. Implement connection limits and queueing to prevent overload:

class ConnectionLimiter {
  private maxConnections = 100;
  private activeConnections = 0;
  private queue: Array<() => void> = [];

  async acquireSlot(): Promise<void> {
    if (this.activeConnections >= this.maxConnections) {
      // At capacity: wait until a slot is released
      await new Promise<void>(resolve => this.queue.push(resolve));
    }
    this.activeConnections++;
  }

  releaseSlot(): void {
    this.activeConnections--;
    const next = this.queue.shift();
    if (next) next();
  }
}

Set reasonable limits based on your server capacity and implement graceful degradation when limits are reached.
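A minimal usage sketch, reusing the illustrative onconnect/onclose hooks from the connection monitoring section: acquire a slot before registering a connection and release it when the transport closes.

const limiter = new ConnectionLimiter();

// Acquire a slot before registering the connection (waits while at capacity)
transport.onconnect = async () => {
  await limiter.acquireSlot();
  const connId = monitor.addConnection(transport);
  transport.connectionId = connId;
};

// Release the slot on disconnect so a queued connection can proceed
transport.onclose = () => {
  if (transport.connectionId) {
    monitor.removeConnection(transport.connectionId);
  }
  limiter.releaseSlot();
};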

Examples

Production Health Monitoring System

This example demonstrates a complete health monitoring implementation for a production MCP server. It includes detailed metrics collection, alerting thresholds, and integration with monitoring systems:

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { EventEmitter } from 'events';

class HealthMonitor extends EventEmitter {
  private startTime = Date.now();
  private metrics = {
    requests: { total: 0, failed: 0, duration: [] as number[] },
    connections: { active: 0, total: 0 },
    resources: { cpu: 0, memory: 0 },
    errors: new Map<string, number>()
  };
  
  private thresholds = {
    errorRate: 0.05,      // 5% error rate
    responseTime: 1000,   // 1 second
    memory: 500,          // 500MB
    connections: 100      // Max connections
  };
  
  recordRequest(duration: number, success: boolean): void {
    this.metrics.requests.total++;
    if (!success) this.metrics.requests.failed++;
    
    this.metrics.requests.duration.push(duration);
    if (this.metrics.requests.duration.length > 100) {
      this.metrics.requests.duration.shift();
    }
    
    this.checkThresholds();
  }
  
  private checkThresholds(): void {
    const errorRate = this.metrics.requests.failed / this.metrics.requests.total;
    if (errorRate > this.thresholds.errorRate) {
      this.emit('alert', { 
        type: 'high_error_rate', 
        value: errorRate,
        threshold: this.thresholds.errorRate 
      });
    }
    
    const avgResponse = this.metrics.requests.duration.reduce((a, b) => a + b, 0) 
      / this.metrics.requests.duration.length;
    if (avgResponse > this.thresholds.responseTime) {
      this.emit('alert', { 
        type: 'slow_response', 
        value: avgResponse,
        threshold: this.thresholds.responseTime 
      });
    }
  }
  
  getHealthStatus(): object {
    const errorRate = this.metrics.requests.total > 0 
      ? this.metrics.requests.failed / this.metrics.requests.total 
      : 0;
      
    const avgResponse = this.metrics.requests.duration.length > 0
      ? this.metrics.requests.duration.reduce((a, b) => a + b, 0) / this.metrics.requests.duration.length
      : 0;
    
    const status = errorRate < 0.01 && avgResponse < 500 
      ? 'healthy' 
      : errorRate < 0.05 && avgResponse < 1000 
        ? 'degraded' 
        : 'unhealthy';
    
    return {
      status,
      metrics: {
        uptime_seconds: Math.floor((Date.now() - this.startTime) / 1000),
        error_rate: errorRate,
        avg_response_ms: Math.round(avgResponse),
        active_connections: this.metrics.connections.active,
        memory_mb: this.metrics.resources.memory
      },
      thresholds: this.thresholds
    };
  }
}

// ... Integration with MCP server ...

Production deployments benefit from comprehensive monitoring that tracks multiple health indicators. This implementation provides real-time alerts when thresholds are exceeded, enabling rapid response to issues. The health status categorization (healthy/degraded/unhealthy) helps operators quickly assess system state.
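A sketch of wiring this into the tool handler from the basic example: wrap each call with timing, feed the outcome into recordRequest, and forward alerts to whatever channel you use (console logging and the handleTool dispatch function are placeholders).

const healthMonitor = new HealthMonitor();

// Forward threshold alerts to your alerting channel (console used as a placeholder)
healthMonitor.on('alert', (alert) => {
  console.error(`[ALERT] ${alert.type}: ${alert.value} (threshold ${alert.threshold})`);
});

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const started = Date.now();
  try {
    const result = await handleTool(request); // hypothetical dispatch to your tool logic
    healthMonitor.recordRequest(Date.now() - started, true);
    return result;
  } catch (error) {
    healthMonitor.recordRequest(Date.now() - started, false);
    throw error;
  }
});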

Auto-Recovery Implementation

Automatic recovery mechanisms help MCP servers self-heal from transient failures. This example shows how to implement connection retry logic with exponential backoff:

import asyncio
import random
from typing import Optional

from mcp import ClientSession  # client session type from the MCP Python SDK

class ResilientMCPClient:
    def __init__(self, server_params):
        self.server_params = server_params
        self.session: Optional[ClientSession] = None
        self.reconnect_attempts = 0
        self.max_reconnect_attempts = 5
        self.base_delay = 1.0
        
    async def connect(self):
        """Connect with automatic retry on failure"""
        while self.reconnect_attempts < self.max_reconnect_attempts:
            try:
                # create_client_session is a placeholder for your own session setup
                # (e.g., stdio_client + ClientSession initialization)
                self.session = await create_client_session(self.server_params)
                self.reconnect_attempts = 0
                
                # Start health monitoring
                asyncio.create_task(self.monitor_health())
                return
                
            except Exception as e:
                self.reconnect_attempts += 1
                delay = self.calculate_backoff()
                
                print(f"Connection failed (attempt {self.reconnect_attempts}): {e}")
                print(f"Retrying in {delay:.1f} seconds...")
                
                await asyncio.sleep(delay)
        
        raise Exception("Max reconnection attempts exceeded")
    
    def calculate_backoff(self) -> float:
        """Calculate exponential backoff with jitter"""
        delay = self.base_delay * (2 ** (self.reconnect_attempts - 1))
        jitter = random.uniform(0, delay * 0.1)
        return min(delay + jitter, 60.0)  # Cap at 60 seconds
    
    async def monitor_health(self):
        """Continuously monitor connection health"""
        consecutive_failures = 0
        
        while self.session:
            try:
                # Perform health check
                result = await self.session.call_tool(
                    name="health_check", 
                    arguments={}
                )
                
                if result.get('status') != 'healthy':
                    consecutive_failures += 1
                else:
                    consecutive_failures = 0
                
                # Trigger reconnect if multiple failures
                if consecutive_failures >= 3:
                    print("Multiple health check failures, reconnecting...")
                    await self.reconnect()
                    
            except Exception as e:
                print(f"Health check error: {e}")
                await self.reconnect()
                
            await asyncio.sleep(30)
    
    async def reconnect(self):
        """Handle reconnection"""
        if self.session:
            await self.session.close()
            self.session = None
        
        await self.connect()

Production systems require resilient connection handling that can recover from network interruptions, server restarts, and transient failures. With a base delay of 1.0 second, successive retries wait roughly 1, 2, 4, 8, and 16 seconds (plus up to 10% jitter, capped at 60 seconds) before giving up. The exponential backoff strategy prevents overwhelming the server during recovery, while jitter helps avoid thundering herd problems when multiple clients reconnect simultaneously.

Distributed Health Aggregation

Large deployments often run multiple MCP server instances. This example shows how to aggregate health data across a server fleet:

interface ServerHealth {
  id: string;
  endpoint: string;
  status: 'healthy' | 'degraded' | 'unhealthy';
  lastCheck: Date;
  metrics: any;
}

class FleetHealthMonitor {
  private servers: Map<string, ServerHealth> = new Map();
  private checkInterval = 10000; // 10 seconds
  
  addServer(id: string, endpoint: string): void {
    this.servers.set(id, {
      id,
      endpoint,
      status: 'healthy',
      lastCheck: new Date(),
      metrics: {}
    });
    
    // Start monitoring
    this.monitorServer(id);
  }
  
  private async monitorServer(id: string): Promise<void> {
    const server = this.servers.get(id);
    if (!server) return;
    
    try {
      // Create a temporary connection for the health check
      // (createClient and client.call are placeholders for your own client factory)
      const client = await createClient({ endpoint: server.endpoint });
      const health = await client.call('health_check', {});
      
      server.status = health.status;
      server.lastCheck = new Date();
      server.metrics = health.metrics;
      
      await client.close();
      
    } catch (error) {
      server.status = 'unhealthy';
      server.lastCheck = new Date();
      console.error(`Health check failed for ${id}:`, error);
    }
    
    // Schedule next check
    setTimeout(() => this.monitorServer(id), this.checkInterval);
  }
  
  getFleetHealth(): object {
    const servers = Array.from(this.servers.values());
    const healthy = servers.filter(s => s.status === 'healthy').length;
    const degraded = servers.filter(s => s.status === 'degraded').length;
    const unhealthy = servers.filter(s => s.status === 'unhealthy').length;
    
    return {
      summary: {
        total: servers.length,
        healthy,
        degraded,
        unhealthy,
        health_percentage: servers.length > 0 ? (healthy / servers.length) * 100 : 0
      },
      servers: servers.map(s => ({
        id: s.id,
        status: s.status,
        last_check: s.lastCheck.toISOString(),
        response_time: s.metrics.avg_response_ms
      }))
    };
  }
}

// Usage
const fleet = new FleetHealthMonitor();
fleet.addServer('server-1', 'http://mcp1.internal:3000');
fleet.addServer('server-2', 'http://mcp2.internal:3000');

// Expose fleet health via an HTTP endpoint ("app" is assumed to be an Express instance)
app.get('/health/fleet', (req, res) => {
  res.json(fleet.getFleetHealth());
});

Distributed monitoring provides visibility across your entire MCP infrastructure. By aggregating health data from multiple servers, operators can identify patterns, detect partial outages, and make informed decisions about traffic routing and capacity planning.
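For example, a routing layer can prefer healthy instances and fall back to degraded ones. A minimal sketch over the ServerHealth records defined above; the selection policy and how the server list is exposed are illustrative:

// Illustrative routing policy: prefer healthy instances, fall back to degraded ones
function pickServer(servers: ServerHealth[]): ServerHealth | undefined {
  const healthy = servers.filter(s => s.status === 'healthy');
  const candidates = healthy.length > 0
    ? healthy
    : servers.filter(s => s.status === 'degraded');
  if (candidates.length === 0) return undefined; // nothing routable right now
  return candidates[Math.floor(Math.random() * candidates.length)];
}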