Building a health check endpoint for your MCP server

Kashish Hora

Co-founder of MCPcat

The Quick Answer

Add health check endpoints to your MCP server for monitoring and automated recovery:

app.get('/health', (req, res) => {
  res.status(200).json({ status: 'healthy', server: 'mcp-server' });
});

app.get('/health/ready', async (req, res) => {
  const isReady = await checkDependencies();
  res.status(isReady ? 200 : 503).json({ ready: isReady });
});

Health checks validate server functionality, dependencies, and readiness. They enable load balancers to route traffic and orchestrators to restart unhealthy instances automatically.

Prerequisites

Node.js 18+ or Python 3.10+ installed
Basic understanding of MCP server architecture
Express.js (TypeScript) or FastAPI (Python) framework knowledge
Optional: Kubernetes or Docker for production deployments

Installation

Install the required dependencies for your chosen language:

# TypeScript/Node.js
npm install express @modelcontextprotocol/sdk

# Python
pip install fastapi uvicorn mcp

Configuration

Health check endpoints require careful configuration to balance responsiveness with system load. MCP servers can run over stdio or over HTTP-based transports (HTTP+SSE in earlier revisions of the specification, Streamable HTTP in later ones), and health checks are most relevant for HTTP-based deployments where external monitoring is possible.

Configure your health check endpoints with appropriate timeouts and response codes:

const HEALTH_CHECK_TIMEOUT = 5000; // 5 seconds
const DEPENDENCY_CHECK_INTERVAL = 30000; // 30 seconds

// Cache dependency status to avoid overloading external services
let lastDependencyCheck = { time: 0, status: true };

async function checkDependencies(): Promise<boolean> {
  const now = Date.now();
  if (now - lastDependencyCheck.time < DEPENDENCY_CHECK_INTERVAL) {
    return lastDependencyCheck.status;
  }
  
  // Perform actual checks
  const checks = await Promise.all([
    checkDatabase(),
    checkExternalAPI(),
    checkMCPServerInit()
  ]);
  
  lastDependencyCheck = { time: now, status: checks.every(c => c) };
  return lastDependencyCheck.status;
}

The caching mechanism prevents health check endpoints from overwhelming your dependencies. In production environments, adjust the DEPENDENCY_CHECK_INTERVAL based on your SLA requirements and dependency reliability.
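
The same caching idea can be sketched in Python. This is a minimal illustration under stated assumptions, not a drop-in implementation: `CachedCheck`, its `ttl` parameter, and the injectable `clock` are hypothetical names introduced here, and the lock is an addition so that concurrent probes trigger at most one real check.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


class CachedCheck:
    """Cache the result of an async health check for `ttl` seconds."""

    def __init__(self, check: Callable[[], Awaitable[bool]], ttl: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._check = check
        self._ttl = ttl
        self._clock = clock  # injectable so the expiry logic can be tested
        self._expires_at: Optional[float] = None  # None means nothing cached yet
        self._status = False
        self._lock = asyncio.Lock()

    async def status(self) -> bool:
        # Serialize refreshes so concurrent probes share one real check
        async with self._lock:
            now = self._clock()
            if self._expires_at is None or now >= self._expires_at:
                self._status = await self._check()
                self._expires_at = now + self._ttl
            return self._status
```

A shorter `ttl` detects failures sooner at the cost of more load on the dependency, which is the trade-off the interval setting controls.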

Usage

MCP servers need different types of health checks for various operational scenarios. The three primary patterns address different monitoring needs:

Basic Health Check

The simplest health check confirms the server process is running and can respond to HTTP requests:

app.get('/health', (req, res) => {
  const health = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    service: 'mcp-server',
    version: process.env.npm_package_version || '1.0.0',
    uptime: process.uptime()
  };
  res.status(200).json(health);
});

This endpoint serves as a liveness probe, indicating the server hasn't crashed. Load balancers typically check this endpoint every 5-10 seconds to detect unresponsive instances.

Readiness Check

Readiness checks verify the server can handle actual MCP requests by validating all dependencies:

app.get('/health/ready', async (req, res) => {
  try {
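    // Assumes your app tracks these itself; isInitialized and tools are not properties of the SDK server class
    const mcpReady = mcpServer.isInitialized && mcpServer.tools.length > 0;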
    const dbConnected = await checkDatabaseConnection();
    const apiAvailable = await checkExternalAPIHealth();
    
    const ready = mcpReady && dbConnected && apiAvailable;
    
    res.status(ready ? 200 : 503).json({
      ready,
      checks: {
        mcp: mcpReady,
        database: dbConnected,
        externalAPI: apiAvailable
      },
      timestamp: new Date().toISOString()
    });
  } catch (error) {
    res.status(503).json({
      ready: false,
      error: 'Health check failed',
      timestamp: new Date().toISOString()
    });
  }
});

Kubernetes uses readiness probes to determine when to route traffic to a pod. A failing readiness check removes the instance from the load balancer pool without restarting it.
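
The difference between the two probe types can be summarized as a small decision function. This is a simplified sketch, assuming a single probe result per instance; real orchestrators act only after a configurable number of consecutive failures, and `ProbeResult` and `orchestrator_action` are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    live: bool   # did the liveness endpoint return a success status?
    ready: bool  # did the readiness endpoint return a success status?


def orchestrator_action(probe: ProbeResult) -> str:
    """Return the action taken for one instance, given its latest probe results."""
    if not probe.live:
        return "restart"           # failed liveness: the instance is restarted
    if not probe.ready:
        return "remove-from-pool"  # failed readiness: stop routing, keep it running
    return "route-traffic"
```

A dependency outage should therefore surface through the readiness endpoint; reporting it through liveness would restart instances that are otherwise fine.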

Detailed Health Status

For comprehensive monitoring, implement a detailed health endpoint that provides granular status information:

interface HealthComponent {
  name: string;
  status: 'healthy' | 'degraded' | 'unhealthy';
  responseTime: number;
  message?: string;
}

app.get('/health/detailed', async (req, res) => {
  const components: HealthComponent[] = [];
  
  // Check MCP server components
  const mcpStart = Date.now();
  components.push({
    name: 'mcp-server',
    status: mcpServer.isInitialized ? 'healthy' : 'unhealthy',
    responseTime: Date.now() - mcpStart,
    message: `${mcpServer.tools.length} tools, ${mcpServer.resources.length} resources`
  });
  
  // Check each dependency with timeout
  const dbStart = Date.now();
  try {
    await Promise.race([
      checkDatabase(),
      new Promise((_, reject) => 
        setTimeout(() => reject(new Error('Timeout')), 3000)
      )
    ]);
    components.push({
      name: 'database',
      status: 'healthy',
      responseTime: Date.now() - dbStart
    });
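  } catch (error) {
    components.push({
      name: 'database',
      status: 'unhealthy',
      responseTime: Date.now() - dbStart,
      // `error` is typed `unknown` in strict TypeScript, so narrow it before reading .message
      message: error instanceof Error ? error.message : String(error)
    });
  }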
  
  const overallStatus = components.every(c => c.status === 'healthy') 
    ? 'healthy' 
    : components.some(c => c.status === 'unhealthy') 
      ? 'unhealthy' 
      : 'degraded';
  
  res.status(overallStatus === 'healthy' ? 200 : 503).json({
    status: overallStatus,
    components,
    timestamp: new Date().toISOString()
  });
});

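This pattern helps identify specific failure points during incidents. As written, the endpoint only ever reports components as healthy or unhealthy. To let monitoring systems alert on degraded states before complete failures occur, mark a component as degraded when, for example, its check succeeds but its response time exceeds a threshold.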
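
One way to produce a degraded state is to classify each component by both outcome and latency, then let the worst status win. The sketch below assumes a 500 ms slowness threshold; `Component`, `classify`, `overall`, and `slow_ms` are hypothetical names, not part of any MCP SDK.

```python
from typing import Iterable, Literal, NamedTuple

Status = Literal["healthy", "degraded", "unhealthy"]


class Component(NamedTuple):
    name: str
    ok: bool            # did the check succeed?
    response_ms: float  # how long the check took


def classify(component: Component, slow_ms: float = 500.0) -> Status:
    """A failed check is unhealthy; a slow but successful one is degraded."""
    if not component.ok:
        return "unhealthy"
    return "degraded" if component.response_ms > slow_ms else "healthy"


def overall(statuses: Iterable[Status]) -> Status:
    """Worst status wins: unhealthy > degraded > healthy."""
    seen = set(statuses)
    if "unhealthy" in seen:
        return "unhealthy"
    return "degraded" if "degraded" in seen else "healthy"
```

Whether a degraded result should map to HTTP 200 or 503 depends on whether you want the instance kept in rotation while it is slow.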

Common Issues

Error: Connection timeout during health checks

Health check timeouts typically occur when dependency checks take too long or when the server is under heavy load. The root cause often lies in synchronous blocking operations or missing timeout configurations.

// Problem: No timeout protection
async function checkDatabase() {
  const result = await db.query('SELECT 1'); // Can hang indefinitely
  return result.rows.length > 0;
}

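// Solution: Add timeout wrapper
async function checkDatabaseWithTimeout() {
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    const result = await Promise.race([
      db.query('SELECT 1'),
      // Promise<never> keeps the query's result type intact for the race
      new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error('Database timeout')), 2000);
      })
    ]);
    return result.rows.length > 0;
  } catch (error) {
    console.error('Database health check failed:', error);
    return false;
  } finally {
    clearTimeout(timer); // do not leave a pending timer once the race settles
  }
}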

Implement timeouts for all external calls and use connection pooling to prevent resource exhaustion. Consider implementing circuit breakers for frequently failing dependencies.
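
A circuit breaker for a failing dependency could look roughly like this. It is a minimal sketch, assuming the breaker opens after a fixed number of consecutive failures and allows one trial call after a cooldown; `CircuitBreaker`, `max_failures`, `cooldown`, and `clock` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Skip a dependency check for `cooldown` seconds after repeated failures."""

    def __init__(self, max_failures: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._max_failures = max_failures
        self._cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # set while the breaker is open

    def allow(self) -> bool:
        """Return True if the real check should run now."""
        if self._opened_at is None:
            return True
        # After the cooldown, let a trial call through (half-open state)
        return self._clock() - self._opened_at >= self._cooldown

    def record(self, success: bool) -> None:
        """Feed the outcome of a check back into the breaker."""
        if success:
            self._failures = 0
            self._opened_at = None
            return
        self._failures += 1
        if self._failures >= self._max_failures:
            # Open the breaker, or re-open it after a failed trial call
            self._opened_at = self._clock()
```

In a health endpoint, a check whose breaker does not allow a call would be reported as unhealthy immediately instead of waiting for another timeout.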

Error: Health check passes but server returns errors

This disconnect happens when health checks don't accurately reflect server capability. Often, basic health checks only verify the process is running without testing actual MCP functionality.

// Insufficient check - only tests HTTP server
app.get('/health', (req, res) => res.send('OK'));

// Comprehensive check - validates MCP capabilities
app.get('/health', async (req, res) => {
  try {
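    // Test actual MCP functionality (assumes `mcpServer.tools` is a registry your app
    // maintains, with a side-effect-free 'test-tool' registered for probing)
    const testTool = mcpServer.tools.find(t => t.name === 'test-tool');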
    if (!testTool) throw new Error('Test tool not found');
    
    // Verify tool execution capability
    const result = await testTool.handler({ test: true });
    if (!result) throw new Error('Tool execution failed');
    
    res.status(200).json({ 
      status: 'healthy',
      mcp: {
        tools: mcpServer.tools.length,
        resources: mcpServer.resources.length,
        capabilities: mcpServer.capabilities
      }
    });
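  } catch (error) {
    res.status(503).json({ 
      status: 'unhealthy',
      // `error` is typed `unknown` in strict TypeScript, so narrow it before reading .message
      error: error instanceof Error ? error.message : String(error)
    });
  }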
});

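Include at least one functional check that exercises a core MCP capability, and keep it cheap and free of side effects so that frequent probing does not add load. This way the health status reflects whether the server can actually serve requests, not just whether the process is running.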
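
A functional check of this kind might be sketched in Python as follows. It assumes the application keeps its own mapping from tool names to async handlers and registers a harmless probe tool; `functional_check`, the `tools` mapping, and the `health-probe` name are hypothetical and not provided by an MCP SDK.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

ToolHandler = Callable[[Dict[str, Any]], Awaitable[Any]]


async def functional_check(tools: Dict[str, ToolHandler],
                           probe_tool: str = "health-probe",
                           timeout: float = 2.0) -> Dict[str, Any]:
    """Invoke one designated tool and report whether it answered in time."""
    handler = tools.get(probe_tool)
    if handler is None:
        return {"status": "unhealthy", "error": f"{probe_tool} not registered"}
    try:
        await asyncio.wait_for(handler({"test": True}), timeout=timeout)
    except asyncio.TimeoutError:
        return {"status": "unhealthy", "error": "tool call timed out"}
    except Exception as exc:  # any handler failure counts as unhealthy
        return {"status": "unhealthy", "error": str(exc)}
    return {"status": "healthy", "tools": len(tools)}
```

The timeout matters as much as the call itself: a probe that hangs tells the orchestrator nothing until its own probe timeout fires.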

Error: Flapping health status (alternating healthy/unhealthy)

Flapping occurs when health checks are too sensitive to transient issues or when thresholds are poorly configured. This causes unnecessary service disruptions and alert fatigue.

// Implement a stability window to prevent flapping
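class HealthChecker {
  private history: boolean[] = [];
  private readonly windowSize = 5;
  private readonly healthyThreshold = 0.6;

  // The actual probe is injected, for example () => checkDependencies()
  constructor(private readonly performHealthCheck: () => Promise<boolean>) {}
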
  async checkHealth(): Promise<{ stable: boolean; healthy: boolean }> {
    const currentHealth = await this.performHealthCheck();
    this.history.push(currentHealth);
    
    if (this.history.length > this.windowSize) {
      this.history.shift();
    }
    
    const healthyCount = this.history.filter(h => h).length;
    const healthyRatio = healthyCount / this.history.length;
    
    return {
      stable: this.history.length >= this.windowSize,
      healthy: healthyRatio >= this.healthyThreshold
    };
  }
}

Use rolling windows and percentage-based thresholds instead of single-check failures. This approach tolerates temporary glitches while still detecting persistent issues.
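
The effect of the window is easiest to see by feeding it a sequence of results. The Python sketch below mirrors the class above with the same window size and threshold; `StabilityWindow` and `record` are hypothetical names, and results are passed in directly instead of being produced by a real probe.

```python
from collections import deque
from typing import Deque, Dict


class StabilityWindow:
    """Report health from the share of recent checks that passed."""

    def __init__(self, window_size: int = 5, healthy_threshold: float = 0.6):
        # deque(maxlen=...) drops the oldest result automatically
        self._history: Deque[bool] = deque(maxlen=window_size)
        self._window_size = window_size
        self._healthy_threshold = healthy_threshold

    def record(self, passed: bool) -> Dict[str, bool]:
        """Add one check result and return the smoothed status."""
        self._history.append(passed)
        ratio = sum(self._history) / len(self._history)
        return {
            "stable": len(self._history) == self._window_size,
            "healthy": ratio >= self._healthy_threshold,
        }
```

With these defaults, a full window of passing checks tolerates two failures before the reported status flips on the third, so a single dropped check does not take the instance out of rotation.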

Examples

Production-Ready TypeScript MCP Health Check Server

This example demonstrates a complete health check implementation for a TypeScript MCP server with multiple monitoring endpoints and dependency checks:

import express from 'express';
import { Server as McpServer } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { ListToolsRequestSchema } from '@modelcontextprotocol/sdk/types.js';

const app = express();

// Keep tool definitions in our own array so health checks can count them
const tools = [{
  name: 'query-database',
  description: 'Query the database',
  inputSchema: { type: 'object' as const, properties: { query: { type: 'string' } } }
}];

const mcpServer = new McpServer(
  { name: 'production-mcp-server', version: '1.0.0' },
  { capabilities: { tools: {} } } // declare tool support before registering tool handlers
);

// Health check state management
const healthState = {
  startTime: Date.now(),
  isReady: false
};

// Initialize MCP server with its tool list
async function initializeMCPServer() {
  // setRequestHandler takes a request schema, not a method-name string
  mcpServer.setRequestHandler(ListToolsRequestSchema, async () => ({ tools }));

  // Start MCP server on stdio transport
  const transport = new StdioServerTransport();
  await mcpServer.connect(transport);
  healthState.isReady = true;
}

// Kubernetes-compatible health endpoints
app.get('/health/startup', (req, res) => {
  const startupDuration = Date.now() - healthState.startTime;
  const maxStartupTime = 60000; // 60 seconds

  if (startupDuration > maxStartupTime && !healthState.isReady) {
    res.status(503).json({ 
      status: 'failed',
      message: 'Startup timeout exceeded'
    });
  } else if (healthState.isReady) {
    res.status(200).json({ status: 'started' });
  } else {
    res.status(503).json({ 
      status: 'starting',
      duration: startupDuration
    });
  }
});

app.get('/health/live', (req, res) => {
  // Simple liveness check - process is running
  res.status(200).json({
    status: 'alive',
    pid: process.pid,
    uptime: process.uptime(),
    memory: process.memoryUsage()
  });
});

app.get('/health/ready', async (req, res) => {
  if (!healthState.isReady) {
    return res.status(503).json({ ready: false, reason: 'Server initializing' });
  }

  // Check all critical dependencies.
  // checkDatabase, checkRedis, and checkS3 are application-specific helpers
  // that resolve to a boolean; they are not defined in this example.
  const checks = {
    mcp: tools.length > 0,
    database: await checkDatabase(),
    cache: await checkRedis(),
    storage: await checkS3()
  };

  const ready = Object.values(checks).every(check => check === true);
  res.status(ready ? 200 : 503).json({ ready, checks });
});

// Prometheus-compatible metrics endpoint
app.get('/metrics', (req, res) => {
  const metrics = [
    `# HELP mcp_server_up MCP server status`,
    `# TYPE mcp_server_up gauge`,
    `mcp_server_up{service="production-mcp-server"} ${healthState.isReady ? 1 : 0}`,
    `# HELP mcp_server_uptime_seconds MCP server uptime`,
    `# TYPE mcp_server_uptime_seconds counter`,
    `mcp_server_uptime_seconds ${process.uptime()}`,
    `# HELP mcp_tools_total Total number of MCP tools`,
    `# TYPE mcp_tools_total gauge`,
    `mcp_tools_total ${tools.length}`
  ];

  res.set('Content-Type', 'text/plain; version=0.0.4');
  // The text exposition format expects the last line to end with a newline
  res.send(metrics.join('\n') + '\n');
});

// Start servers
async function start() {
  // Listen first so the startup probe can observe the 'starting' state
  app.listen(8080, () => {
    // Log to stderr: stdout carries MCP protocol messages on the stdio transport
    console.error('Health check endpoints available on :8080');
  });
  await initializeMCPServer();
}

start().catch(console.error);

This implementation provides multiple health check endpoints suitable for different monitoring scenarios. The startup probe handles slow initialization, liveness confirms the process hasn't deadlocked, and readiness validates all dependencies before accepting traffic. The Prometheus metrics endpoint enables detailed monitoring and alerting based on custom thresholds.

Python FastAPI MCP Health Monitor

For Python-based MCP servers, this example shows how to implement comprehensive health monitoring with async support:

import asyncio
import time
from contextlib import asynccontextmanager
from datetime import datetime, timezone
from typing import Any, Dict

import aioredis  # uses the aioredis 1.x API
import asyncpg
from fastapi import FastAPI, Response
from mcp.server.fastmcp import FastMCP

# FastMCP provides the @tool() decorator and the stdio runner used below
mcp_server = FastMCP("python-mcp-server")

@asynccontextmanager
async def lifespan(app: FastAPI):
    # create_task needs a running event loop, so start the MCP server
    # from FastAPI's lifespan hook instead of at module level
    mcp_task = asyncio.create_task(start_mcp_server())
    yield
    mcp_task.cancel()

app = FastAPI(lifespan=lifespan)

# Health check configuration
HEALTH_CHECK_CACHE_TTL = 30  # seconds
DEPENDENCY_TIMEOUT = 3  # seconds

class HealthMonitor:
    def __init__(self):
        self.cache = {}
        self.server_ready = False
        self.start_time = time.time()

    async def check_dependency(self, name: str, check_func) -> Dict[str, Any]:
        """Check a dependency with timeout and caching"""
        cache_key = f"dep_{name}"
        cached = self.cache.get(cache_key)

        if cached and (time.time() - cached['timestamp']) < HEALTH_CHECK_CACHE_TTL:
            return cached['result']

        start = time.time()
        try:
            result = await asyncio.wait_for(
                check_func(),
                timeout=DEPENDENCY_TIMEOUT
            )
            status = 'healthy' if result else 'unhealthy'
        except asyncio.TimeoutError:
            status = 'timeout'
        except Exception:
            status = 'error'

        response_time = (time.time() - start) * 1000  # ms

        check_result = {
            'status': status,
            'responseTime': response_time,
            'timestamp': time.time()
        }

        self.cache[cache_key] = {
            'result': check_result,
            'timestamp': time.time()
        }

        return check_result

health_monitor = HealthMonitor()

# Dependency check functions
async def check_postgres() -> bool:
    """Verify PostgreSQL connection"""
    try:
        conn = await asyncpg.connect(
            'postgresql://user:pass@localhost/db',
            timeout=2
        )
        await conn.fetchval('SELECT 1')
        await conn.close()
        return True
    except Exception:
        return False

async def check_redis() -> bool:
    """Verify Redis connection"""
    try:
        redis = await aioredis.create_redis_pool('redis://localhost')
        await redis.ping()
        redis.close()
        await redis.wait_closed()
        return True
    except Exception:
        return False

# Health check endpoints
@app.get("/health")
async def health_check():
    """Basic health check endpoint"""
    return {
        "status": "healthy",
        "service": "python-mcp-server",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "uptime": time.time() - health_monitor.start_time
    }

@app.get("/health/ready")
async def readiness_check(response: Response):
    """Comprehensive readiness check"""
    if not health_monitor.server_ready:
        response.status_code = 503
        return {"ready": False, "reason": "Server still initializing"}

    # Check all dependencies in parallel
    checks = await asyncio.gather(
        health_monitor.check_dependency("postgres", check_postgres),
        health_monitor.check_dependency("redis", check_redis),
        return_exceptions=True
    )

    # Process results
    dependency_results = {}
    all_healthy = True

    for name, result in zip(["postgres", "redis"], checks):
        if isinstance(result, Exception):
            dependency_results[name] = {
                "status": "error",
                "message": str(result)
            }
            all_healthy = False
        else:
            dependency_results[name] = result
            if result['status'] != 'healthy':
                all_healthy = False

    # MCP server check: FastMCP lists registered tools and resources via async calls
    tools = await mcp_server.list_tools()
    resources = await mcp_server.list_resources()
    mcp_healthy = len(tools) > 0
    dependency_results['mcp'] = {
        'status': 'healthy' if mcp_healthy else 'unhealthy',
        'tools': len(tools),
        'resources': len(resources)
    }

    if not mcp_healthy:
        all_healthy = False

    response.status_code = 200 if all_healthy else 503

    return {
        "ready": all_healthy,
        "checks": dependency_results,
        "timestamp": datetime.now(timezone.utc).isoformat()
    }

# Register MCP tools
@mcp_server.tool()
async def query_data(query: str) -> str:
    """Example MCP tool"""
    return f"Processed query: {query}"

async def start_mcp_server():
    """Serve MCP over stdio until the transport closes"""
    # Flag readiness before serving: run_stdio_async() only returns on shutdown
    health_monitor.server_ready = True
    try:
        await mcp_server.run_stdio_async()
    finally:
        health_monitor.server_ready = False

# Run both servers
if __name__ == "__main__":
    import uvicorn

    # The MCP server is started by the lifespan hook above.
    # Access logs go to stdout by default, which the stdio transport
    # uses for protocol messages, so they are disabled here.
    uvicorn.run(app, host="0.0.0.0", port=8080, access_log=False)

This Python implementation leverages FastAPI's async capabilities for efficient health checking. The caching mechanism prevents overwhelming dependencies during high-frequency health checks. The parallel dependency checking ensures fast response times even with multiple external services. Integration with MCP server state provides accurate readiness signals for container orchestration platforms.
