The Quick Answer
Implement MCP server health checks using the protocol's built-in ping utility and custom monitoring tools. Create a health check tool that reports connection status, uptime, and resource usage:
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === "health_check") {
return {
toolResult: {
status: 'healthy',
timestamp: new Date().toISOString(),
uptime: Math.floor((Date.now() - startTime) / 1000),
connections: activeConnections
}
};
}
});
This enables proactive monitoring and automatic recovery when connections fail.
Prerequisites
- Node.js 18+ or Python 3.10+ installed
- MCP SDK for your chosen language (@modelcontextprotocol/sdk for TypeScript, mcp for Python)
- Basic understanding of MCP server architecture and the JSON-RPC protocol
- Development environment with async/await support
Implementation Strategy
MCP servers require robust health monitoring to ensure reliable operation in production environments. Connection failures, timeouts, and resource exhaustion are common issues that can disrupt service availability. A comprehensive health check system helps detect problems early and enables automatic recovery.
The MCP protocol supports health monitoring through multiple mechanisms. The built-in ping utility provides basic connectivity testing, while custom health check tools offer detailed status information. Combining these approaches creates a resilient monitoring solution that tracks server health, connection stability, and resource utilization.
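For a quick connectivity probe, a client can issue the protocol's ping request before relying on the richer health check tool. A minimal sketch, assuming the TypeScript SDK's stdio client transport and its ping() helper; adjust the command and arguments to match how your server is launched:
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Returns true if the server answers the MCP ping request, false otherwise.
async function probeServer(command: string, args: string[]): Promise<boolean> {
  const client = new Client(
    { name: "health-prober", version: "1.0.0" },
    { capabilities: {} }
  );
  const transport = new StdioClientTransport({ command, args });
  try {
    await client.connect(transport);
    await client.ping(); // built-in MCP ping utility
    return true;
  } catch {
    return false;
  } finally {
    await client.close();
  }
}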
Basic Health Check Implementation
Start with a simple health check tool that exposes server status through the MCP protocol. This approach integrates seamlessly with existing MCP clients and allows monitoring through the same connection channel used for regular operations.
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
CallToolRequestSchema,
ListToolsRequestSchema
} from "@modelcontextprotocol/sdk/types.js";
const startTime = Date.now();
let requestCount = 0;
const server = new Server({
name: "monitored-server",
version: "1.0.0",
}, {
capabilities: { tools: {} }
});
// Register health check tool
server.setRequestHandler(ListToolsRequestSchema, async () => {
return {
tools: [{
name: "health_check",
description: "Get server health status",
inputSchema: {
type: "object",
properties: {},
required: []
}
}]
};
});
// Handle health check requests
server.setRequestHandler(CallToolRequestSchema, async (request) => {
requestCount++;
if (request.params.name === "health_check") {
const memUsage = process.memoryUsage();
return {
toolResult: {
status: 'healthy',
timestamp: new Date().toISOString(),
uptime_seconds: Math.floor((Date.now() - startTime) / 1000),
request_count: requestCount,
memory_mb: Math.round(memUsage.heapUsed / 1024 / 1024)
}
};
}
throw new Error(`Unknown tool: ${request.params.name}`);
});
Python servers implement similar functionality using the MCP SDK. The async architecture allows non-blocking health checks that don't interfere with normal operations:
from mcp.server import Server
from mcp.types import Tool
import time
import psutil
app = Server("monitored-server")
start_time = time.time()
request_count = 0
@app.list_tools()
async def list_tools() -> list[Tool]:
return [
Tool(
name="health_check",
description="Get server health status",
inputSchema={"type": "object", "properties": {}}
)
]
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> dict:
global request_count
request_count += 1
if name == "health_check":
process = psutil.Process()
return {
"status": "healthy",
"timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ"),
"uptime_seconds": int(time.time() - start_time),
"request_count": request_count,
"memory_mb": round(process.memory_info().rss / 1024 / 1024)
}
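To exercise this handler, the server needs to be wired to a transport. A minimal sketch, assuming the Python SDK's low-level Server API and its stdio transport:
import asyncio
from mcp.server.stdio import stdio_server

async def main():
    # Serve over stdio, the transport most MCP clients use to launch servers
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())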
Connection Monitoring
Tracking active connections provides visibility into server load and helps identify connection leaks. MCP servers should monitor both transport-level connections and logical client sessions to understand usage patterns and detect anomalies.
interface ConnectionInfo {
id: string;
connectedAt: Date;
lastActivity: Date;
requestCount: number;
clientInfo?: any;
}
class ConnectionMonitor {
private connections = new Map<string, ConnectionInfo>();
private connectionIdCounter = 0;
addConnection(transport: any): string {
const id = `conn-${++this.connectionIdCounter}`;
this.connections.set(id, {
id,
connectedAt: new Date(),
lastActivity: new Date(),
requestCount: 0
});
return id;
}
updateActivity(id: string): void {
const conn = this.connections.get(id);
if (conn) {
conn.lastActivity = new Date();
conn.requestCount++;
}
}
removeConnection(id: string): void {
this.connections.delete(id);
}
getMetrics(): object {
const now = Date.now();
const active = Array.from(this.connections.values());
return {
total_connections: active.length,
oldest_connection_seconds: active.length > 0
? Math.floor((now - Math.min(...active.map(c => c.connectedAt.getTime()))) / 1000)
: 0,
total_requests: active.reduce((sum, c) => sum + c.requestCount, 0),
idle_connections: active.filter(c =>
now - c.lastActivity.getTime() > 60000
).length
};
}
}
Implement connection monitoring in your server lifecycle hooks. The exact hook names depend on the transport you use; the sketch below assumes a transport that exposes connect and close callbacks. Track connections from establishment through closure, updating metrics on each request:
const monitor = new ConnectionMonitor();
// Track new connections
transport.onconnect = () => {
const connId = monitor.addConnection(transport);
transport.connectionId = connId;
};
// Update on each request
server.setRequestHandler(CallToolRequestSchema, async (request, { transport }) => {
if (transport.connectionId) {
monitor.updateActivity(transport.connectionId);
}
// ... handle request
});
// Clean up on disconnect
transport.onclose = () => {
if (transport.connectionId) {
monitor.removeConnection(transport.connectionId);
}
};
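With the monitor in place, the health_check branch from the first section can surface connection metrics alongside the basic status fields. A sketch of that branch (the surrounding handler stays as shown earlier):
if (request.params.name === "health_check") {
  return {
    toolResult: {
      status: "healthy",
      timestamp: new Date().toISOString(),
      uptime_seconds: Math.floor((Date.now() - startTime) / 1000),
      connections: monitor.getMetrics() // total, oldest, idle, and request counts
    }
  };
}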
Timeout and Recovery Patterns
MCP connections can fail due to network issues, client crashes, or server overload. Implementing proper timeout handling and recovery mechanisms ensures your server remains responsive and can recover from transient failures.
The MCP specification recommends that the issuer of a request establish a timeout for it. When a timeout fires, send a notifications/cancelled notification for that request and clean up resources:
class TimeoutManager {
private pendingRequests = new Map<string, NodeJS.Timeout>();
private defaultTimeout = 30000; // 30 seconds
trackRequest(requestId: string, timeoutMs?: number): void {
const timeout = setTimeout(() => {
this.handleTimeout(requestId);
}, timeoutMs || this.defaultTimeout);
this.pendingRequests.set(requestId, timeout);
}
completeRequest(requestId: string): void {
const timeout = this.pendingRequests.get(requestId);
if (timeout) {
clearTimeout(timeout);
this.pendingRequests.delete(requestId);
}
}
private async handleTimeout(requestId: string): Promise<void> {
console.error(`Request ${requestId} timed out`);
// Send MCP cancellation notification for the timed-out request
await server.notification({
method: "notifications/cancelled",
params: { requestId, reason: "Request timed out" }
});
// Clean up resources
this.pendingRequests.delete(requestId);
}
}
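One way to wire the manager into a tool handler, assuming a hypothetical dispatchToolCall helper for the actual work and a locally generated key for tracking each request:
import { randomUUID } from "crypto";

const timeouts = new TimeoutManager();

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const requestKey = randomUUID(); // local tracking key for this request
  timeouts.trackRequest(requestKey, 30000);
  try {
    return await dispatchToolCall(request); // hypothetical: your existing tool logic
  } finally {
    timeouts.completeRequest(requestKey); // always clear the pending timer
  }
});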
Python implementation using asyncio for timeout management:
import asyncio
from typing import Dict, Optional
class TimeoutManager:
def __init__(self, default_timeout: float = 30.0):
self.default_timeout = default_timeout
self.pending_tasks: Dict[str, asyncio.Task] = {}
async def with_timeout(self, request_id: str, coro, timeout: Optional[float] = None):
"""Execute coroutine with timeout"""
timeout_value = timeout or self.default_timeout
try:
task = asyncio.create_task(coro)
self.pending_tasks[request_id] = task
result = await asyncio.wait_for(task, timeout=timeout_value)
return result
except asyncio.TimeoutError:
# Send cancellation notification
await self.send_cancel_notification(request_id)
raise
finally:
self.pending_tasks.pop(request_id, None)
async def send_cancel_notification(self, request_id: str):
"""Send $/cancelRequest notification"""
await server.send_notification(
method="$/cancelRequest",
params={"id": request_id}
)
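A usage sketch, wrapping a slow operation so it is cancelled and reported when it exceeds its budget; slow_operation is a stand-in for your real work:
timeouts = TimeoutManager(default_timeout=10.0)

async def slow_operation() -> dict:
    await asyncio.sleep(2)  # stand-in for the real operation
    return {"status": "done"}

async def handle_request(request_id: str) -> dict:
    # Raises asyncio.TimeoutError (after sending the cancellation) if the work runs too long
    return await timeouts.with_timeout(request_id, slow_operation())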
Common Issues
Error: Connection closed unexpectedly
SSE (Server-Sent Events) connections may close after periods of inactivity. The root cause is often intermediate proxies or load balancers that terminate idle connections. Implement keep-alive messages to maintain the connection:
// Send periodic ping to keep connection alive
setInterval(async () => {
try {
await server.ping(); // built-in MCP ping request/response
} catch (error) {
console.error('Keep-alive failed:', error);
// Trigger reconnection logic
}
}, 30000); // Every 30 seconds
Prevent idle timeouts by configuring your transport layer appropriately and sending periodic activity.
Error: Request timeout after 30 seconds
Long-running operations may exceed default timeout values. MCP servers should handle this gracefully by implementing progress notifications and chunked responses. For operations that legitimately take longer:
// Report progress to reset client timeout
async function longOperation(request) {
const steps = 10;
for (let i = 0; i < steps; i++) {
// Send MCP progress notification
await server.notification({
method: "notifications/progress",
params: {
progressToken: request.params._meta?.progressToken,
progress: i,
total: steps
}
});
// Do work...
await processStep(i);
}
}
Configure appropriate timeouts based on your use case and implement progress reporting for long operations.
Error: Too many connections
Resource exhaustion occurs when servers accept unlimited connections. Implement connection limits and queueing to prevent overload:
class ConnectionLimiter {
private maxConnections = 100;
private activeConnections = 0;
private queue: Array<() => void> = [];
async acquireSlot(): Promise<void> {
if (this.activeConnections >= this.maxConnections) {
// Wait in the queue until a slot is released
await new Promise<void>(resolve => {
this.queue.push(resolve);
});
}
this.activeConnections++;
}
releaseSlot(): void {
this.activeConnections--;
const next = this.queue.shift();
if (next) next();
}
}
Set reasonable limits based on your server capacity and implement graceful degradation when limits are reached.
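A sketch of how the limiter might gate new sessions; handleSession is a hypothetical function that runs one client session to completion:
const limiter = new ConnectionLimiter();

async function acceptSession(handleSession: () => Promise<void>): Promise<void> {
  await limiter.acquireSlot(); // waits in the queue when at capacity
  try {
    await handleSession();
  } finally {
    limiter.releaseSlot(); // wake the next queued connection
  }
}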
Examples
Production Health Monitoring System
This example demonstrates a complete health monitoring implementation for a production MCP server. It includes detailed metrics collection, alerting thresholds, and integration with monitoring systems:
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { EventEmitter } from 'events';
class HealthMonitor extends EventEmitter {
private metrics = {
uptime: 0,
requests: { total: 0, failed: 0, duration: [] as number[] },
connections: { active: 0, total: 0 },
resources: { cpu: 0, memory: 0 },
errors: new Map<string, number>()
};
private thresholds = {
errorRate: 0.05, // 5% error rate
responseTime: 1000, // 1 second
memory: 500, // 500MB
connections: 100 // Max connections
};
recordRequest(duration: number, success: boolean): void {
this.metrics.requests.total++;
if (!success) this.metrics.requests.failed++;
this.metrics.requests.duration.push(duration);
if (this.metrics.requests.duration.length > 100) {
this.metrics.requests.duration.shift();
}
this.checkThresholds();
}
private checkThresholds(): void {
const errorRate = this.metrics.requests.failed / this.metrics.requests.total;
if (errorRate > this.thresholds.errorRate) {
this.emit('alert', {
type: 'high_error_rate',
value: errorRate,
threshold: this.thresholds.errorRate
});
}
const avgResponse = this.metrics.requests.duration.reduce((a, b) => a + b, 0)
/ this.metrics.requests.duration.length;
if (avgResponse > this.thresholds.responseTime) {
this.emit('alert', {
type: 'slow_response',
value: avgResponse,
threshold: this.thresholds.responseTime
});
}
}
getHealthStatus(): object {
const errorRate = this.metrics.requests.total > 0
? this.metrics.requests.failed / this.metrics.requests.total
: 0;
const avgResponse = this.metrics.requests.duration.length > 0
? this.metrics.requests.duration.reduce((a, b) => a + b, 0) / this.metrics.requests.duration.length
: 0;
const status = errorRate < 0.01 && avgResponse < 500
? 'healthy'
: errorRate < 0.05 && avgResponse < 1000
? 'degraded'
: 'unhealthy';
return {
status,
metrics: {
uptime_seconds: this.metrics.uptime,
error_rate: errorRate,
avg_response_ms: Math.round(avgResponse),
active_connections: this.metrics.connections.active,
memory_mb: this.metrics.resources.memory
},
thresholds: this.thresholds
};
}
}
// ... Integration with MCP server ...
Production deployments benefit from comprehensive monitoring that tracks multiple health indicators. This implementation provides real-time alerts when thresholds are exceeded, enabling rapid response to issues. The health status categorization (healthy/degraded/unhealthy) helps operators quickly assess system state.
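A sketch of that integration: record timing for every tool call, forward alerts to your logging or paging pipeline, and reuse getHealthStatus() as the health_check payload. dispatchToolCall again stands in for your existing tool logic:
const health = new HealthMonitor();

health.on("alert", (alert) => {
  // Replace with your pager or metrics pipeline
  console.error(`[ALERT] ${alert.type}: ${alert.value} (threshold ${alert.threshold})`);
});

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "health_check") {
    return { toolResult: health.getHealthStatus() };
  }
  const started = Date.now();
  try {
    const result = await dispatchToolCall(request); // hypothetical tool logic
    health.recordRequest(Date.now() - started, true);
    return result;
  } catch (err) {
    health.recordRequest(Date.now() - started, false);
    throw err;
  }
});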
Auto-Recovery Implementation
Automatic recovery mechanisms help MCP servers self-heal from transient failures. This example shows how to implement connection retry logic with exponential backoff:
import asyncio
import random
from typing import Optional

from mcp import ClientSession
class ResilientMCPClient:
def __init__(self, server_params):
self.server_params = server_params
self.session: Optional[ClientSession] = None
self.reconnect_attempts = 0
self.max_reconnect_attempts = 5
self.base_delay = 1.0
async def connect(self):
"""Connect with automatic retry on failure"""
while self.reconnect_attempts < self.max_reconnect_attempts:
try:
# create_client_session is a placeholder for your own connect helper
# (e.g. wrapping the SDK's stdio_client + ClientSession handshake)
self.session = await create_client_session(self.server_params)
self.reconnect_attempts = 0
# Start health monitoring
asyncio.create_task(self.monitor_health())
return
except Exception as e:
self.reconnect_attempts += 1
delay = self.calculate_backoff()
print(f"Connection failed (attempt {self.reconnect_attempts}): {e}")
print(f"Retrying in {delay:.1f} seconds...")
await asyncio.sleep(delay)
raise Exception("Max reconnection attempts exceeded")
def calculate_backoff(self) -> float:
"""Calculate exponential backoff with jitter"""
delay = self.base_delay * (2 ** (self.reconnect_attempts - 1))
jitter = random.uniform(0, delay * 0.1)
return min(delay + jitter, 60.0) # Cap at 60 seconds
async def monitor_health(self):
"""Continuously monitor connection health"""
consecutive_failures = 0
while self.session:
try:
# Perform health check
result = await self.session.call_tool(
name="health_check",
arguments={}
)
if result.get('status') != 'healthy':
consecutive_failures += 1
else:
consecutive_failures = 0
# Trigger reconnect if multiple failures
if consecutive_failures >= 3:
print("Multiple health check failures, reconnecting...")
await self.reconnect()
except Exception as e:
print(f"Health check error: {e}")
await self.reconnect()
await asyncio.sleep(30)
async def reconnect(self):
"""Handle reconnection"""
if self.session:
await self.session.close()
self.session = None
await self.connect()
Production systems require resilient connection handling that can recover from network interruptions, server restarts, and transient failures. The exponential backoff strategy prevents overwhelming the server during recovery while jitter helps avoid thundering herd problems when multiple clients reconnect simultaneously.
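A usage sketch; server_params is whatever your create_client_session helper expects (a command line, URL, or SDK parameter object):
async def main():
    client = ResilientMCPClient(server_params={"command": "node", "args": ["server.js"]})
    await client.connect()  # retries with backoff, then monitors health in the background
    # ... use client.session for normal tool calls ...

asyncio.run(main())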
Distributed Health Aggregation
Large deployments often run multiple MCP server instances. This example shows how to aggregate health data across a server fleet:
interface ServerHealth {
id: string;
endpoint: string;
status: 'healthy' | 'degraded' | 'unhealthy';
lastCheck: Date;
metrics: any;
}
class FleetHealthMonitor {
private servers: Map<string, ServerHealth> = new Map();
private checkInterval = 10000; // 10 seconds
addServer(id: string, endpoint: string): void {
this.servers.set(id, {
id,
endpoint,
status: 'healthy',
lastCheck: new Date(),
metrics: {}
});
// Start monitoring
this.monitorServer(id);
}
private async monitorServer(id: string): Promise<void> {
const server = this.servers.get(id);
if (!server) return;
try {
// Create a temporary connection for the health check; createClient is a
// placeholder for your own MCP client factory (HTTP/SSE transport)
const client = await createClient({ endpoint: server.endpoint });
const health = await client.call('health_check', {});
server.status = health.status;
server.lastCheck = new Date();
server.metrics = health.metrics;
await client.close();
} catch (error) {
server.status = 'unhealthy';
server.lastCheck = new Date();
console.error(`Health check failed for ${id}:`, error);
}
// Schedule next check
setTimeout(() => this.monitorServer(id), this.checkInterval);
}
getFleetHealth(): object {
const servers = Array.from(this.servers.values());
const healthy = servers.filter(s => s.status === 'healthy').length;
const degraded = servers.filter(s => s.status === 'degraded').length;
const unhealthy = servers.filter(s => s.status === 'unhealthy').length;
return {
summary: {
total: servers.length,
healthy,
degraded,
unhealthy,
health_percentage: servers.length > 0 ? (healthy / servers.length) * 100 : 0
},
servers: servers.map(s => ({
id: s.id,
status: s.status,
last_check: s.lastCheck.toISOString(),
response_time: s.metrics.avg_response_ms
}))
};
}
}
// Usage
const fleet = new FleetHealthMonitor();
fleet.addServer('server-1', 'http://mcp1.internal:3000');
fleet.addServer('server-2', 'http://mcp2.internal:3000');
// Expose fleet health via an HTTP endpoint (assuming an existing Express app)
app.get('/health/fleet', (req, res) => {
res.json(fleet.getFleetHealth());
});
Distributed monitoring provides visibility across your entire MCP infrastructure. By aggregating health data from multiple servers, operators can identify patterns, detect partial outages, and make informed decisions about traffic routing and capacity planning.
Related Guides
Fixing "MCP error -32000: Connection closed" errors
Resolve MCP error 32000 connection closed issues with platform-specific solutions and debugging steps.
Fixing "MCP error -32001: Request timed out" errors
Fix MCP error 32001 request timeouts with timeout configuration and performance optimization strategies.
Configuring MCP installations for production deployments
Configure MCP servers for production with security, monitoring, and deployment best practices.