Configuring MCP servers for multiple simultaneous connections

Kashish Hora

Co-founder of MCPcat

The Quick Answer

To handle multiple concurrent clients, MCP servers need connection pooling and session management. Here's a production-ready configuration:

const mcpServer = new Server({
  name: "multi-client-server",
  version: "1.0.0",
  transport: "streamable-http",
  options: {
    maxConnections: 50,        // Total concurrent clients
    connectionTimeout: 600000, // Keep connections alive for 10 minutes
    idleTimeout: 300000       // Clean up after 5 minutes of inactivity
  }
});

// Claude Desktop config with connection pooling
{
  "mcpServers": {
    "multi-server": {
      "command": "node",
      "args": ["./mcp-server.js"],
      "env": {
        "MCP_MAX_CONNECTIONS": "50",
        "MCP_CONNECTION_POOL_SIZE": "10"
      }
    }
  }
}

When to use this: Production deployments serving multiple AI agents or users simultaneously. Running the streamable HTTP transport over HTTP/2 enables efficient multiplexing (multiple request streams over a single TCP connection), reducing connection overhead by roughly 60% compared to traditional HTTP/1.1.

Expected performance: This configuration comfortably handles 50+ concurrent clients with sub-100ms response times on a 4-core server. Scale these numbers based on your hardware: roughly 10-15 connections per CPU core for optimal performance.

Prerequisites

  • Node.js 18+ or Python 3.8+ installed
  • MCP SDK (@modelcontextprotocol/sdk for TypeScript or mcp for Python)
  • HTTP/2 capable runtime for streamable transport
  • Redis (optional) for distributed session storage

Why Concurrent Connections Matter in MCP

Unlike traditional REST APIs that are stateless, MCP maintains conversational context across multiple interactions. Each AI agent or user needs their own isolated session to:

  1. Preserve conversation history: MCP tracks which tools were called, what data was accessed, and the context of previous interactions
  2. Maintain security boundaries: Different users shouldn't see each other's data or tool results
  3. Enable long-running workflows: AI agents often perform multi-step tasks requiring persistent state

The challenge is that a single MCP server might need to handle:

  • Multiple AI agents working on different tasks simultaneously
  • Teams collaborating through shared tools but with isolated contexts
  • Burst traffic when many users access the same resources
  • Failover scenarios where connections migrate between servers

Without proper concurrency handling, you'll face:

  • Session collision: Different users overwriting each other's context
  • Resource exhaustion: Unmanaged connections consuming all available memory
  • Performance degradation: Sequential processing causing unacceptable latency
  • Lost work: Connection drops losing hours of AI agent progress

Configuration

MCP servers handle concurrent connections through three key mechanisms:

1. Transport Protocol Selection

Your choice of transport directly impacts concurrency capabilities:

  • STDIO (Standard I/O): Best for local, single-user scenarios. Processes requests sequentially through stdin/stdout. Cannot handle true concurrent connections.
  • HTTP + SSE: Enables remote connections with persistent event streams. Supports true concurrency through connection pooling.
  • Streamable HTTP: The newest transport, offering stateless HTTP with optional SSE upgrade. Best for cloud deployments.
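As a rough sketch of how the Streamable HTTP transport serves many clients, the snippet below creates one transport (and therefore one session) per client and tracks them by session ID. It assumes the TypeScript SDK's StreamableHTTPServerTransport plus Express for the HTTP layer; exact option and header names may vary between SDK versions.

// Sketch: one Streamable HTTP transport (and session) per client, tracked by ID.
// Assumes the TypeScript MCP SDK and Express; names may vary by SDK version.
import express from "express";
import { randomUUID } from "node:crypto";
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";

const app = express();
app.use(express.json());

const transports = new Map<string, StreamableHTTPServerTransport>();

app.post("/mcp", async (req, res) => {
  const sessionId = req.headers["mcp-session-id"] as string | undefined;
  let transport = sessionId ? transports.get(sessionId) : undefined;

  if (!transport) {
    // New session: create a dedicated transport and connect it to a Server instance
    transport = new StreamableHTTPServerTransport({
      sessionIdGenerator: () => randomUUID()
    });
    const server = new Server(
      { name: "multi-client-server", version: "1.0.0" },
      { capabilities: {} }
    );
    await server.connect(transport);
  }

  await transport.handleRequest(req, res, req.body);

  // Register the transport once the SDK has assigned it a session ID
  if (transport.sessionId) transports.set(transport.sessionId, transport);
});

app.listen(8080);

Each transport instance owns exactly one session, so the map is what lets a single process serve many clients concurrently.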

2. Connection Lifecycle Management

Every MCP connection follows a predictable lifecycle that you must manage:

  1. Initialization: Client sends initialize request → Server generates unique session ID
  2. Active Use: Client makes tool/resource requests with session ID → Server maintains context
  3. Idle Period: No requests for X minutes → Server marks for cleanup
  4. Termination: Explicit close or timeout → Server releases all resources

The key is balancing resource usage with user experience: timeouts that are too short frustrate users, while timeouts that are too long exhaust server resources.

3. Resource Allocation Strategy

// Essential connection manager pattern
interface Connection {
  id: string;
  transport: any;
  lastActivity: number;
  sessionData: Record<string, unknown>;
}

class ConnectionManager {
  private connections = new Map<string, Connection>();
  private maxConnections = 50;

  async acceptConnection(clientId: string, transport: any): Promise<Connection> {
    // Reject (or queue) new clients once the configured limit is reached
    if (this.connections.size >= this.maxConnections) {
      throw new Error("Connection limit reached");
    }

    const connection: Connection = {
      id: clientId,
      transport,
      lastActivity: Date.now(),
      sessionData: {}
    };

    this.connections.set(clientId, connection);
    return connection;
  }
}

Key configuration decisions:

  • maxConnections: Set to 10-15 per CPU core. A 4-core server handles 40-60 connections comfortably.
  • idleTimeout: 5-10 minutes for interactive use, 30-60 minutes for long-running AI agents
  • connectionTimeout: Maximum session duration, typically 2-4 hours
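To actually enforce those timeouts, a background sweep can walk the connection map and evict anything idle past the limit. A minimal sketch, assuming access to the connection map of the ConnectionManager above (the sweep interval and close() call are illustrative):

// Sketch: periodic sweep that evicts connections idle longer than idleTimeout.
// Assumes the Connection map from the ConnectionManager above.
function startIdleSweep(
  connections: Map<string, Connection>,
  idleTimeoutMs = 5 * 60 * 1000 // 5 minutes of inactivity
) {
  setInterval(() => {
    const now = Date.now();
    for (const [id, conn] of connections) {
      if (now - conn.lastActivity > idleTimeoutMs) {
        conn.transport?.close?.(); // release the underlying transport if it supports close()
        connections.delete(id);    // free the slot for a new client
      }
    }
  }, 30_000); // check every 30 seconds
}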

For distributed deployments, store session state in Redis rather than memory. This enables:

  • Horizontal scaling: Add servers without losing sessions
  • Fault tolerance: Survive server restarts
  • Load balancing: Route requests to any server instance
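A minimal sketch of such a store, assuming ioredis and JSON-serialized sessions (the key prefix and TTL are illustrative):

// Sketch: Redis-backed session store so any server instance can serve any session.
// Assumes ioredis; key naming and TTL values are illustrative.
import Redis from "ioredis";

const redis = new Redis({ host: process.env.REDIS_HOST ?? "127.0.0.1" });

interface SessionData {
  authorizedTools: string[];
  history: unknown[];
}

async function saveSession(sessionId: string, data: SessionData): Promise<void> {
  // Expire automatically after 30 minutes without a write
  await redis.set(`mcp:session:${sessionId}`, JSON.stringify(data), "EX", 1800);
}

async function loadSession(sessionId: string): Promise<SessionData | null> {
  const raw = await redis.get(`mcp:session:${sessionId}`);
  return raw ? (JSON.parse(raw) as SessionData) : null;
}

Because every instance reads and writes the same keys, a load balancer can route a session's next request to any server without losing context.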

Usage

Session Isolation Strategy

The core challenge in concurrent MCP is maintaining isolated conversation contexts. Unlike traditional web APIs where each request is independent, MCP sessions accumulate state over time. This creates two critical requirements:

  1. State Isolation: Each session must have its own memory space for conversation history, tool permissions, and intermediate results
  2. Context Preservation: Sessions must survive between requests without mixing data between users

The most effective pattern uses a session manager that maps unique IDs to isolated state containers:

// Minimal session isolation pattern (assumes the TypeScript SDK's schema-based handlers)
import { ListToolsRequestSchema } from "@modelcontextprotocol/sdk/types.js";

server.setRequestHandler(ListToolsRequestSchema, async (request, extra) => {
  const sessionId = extra.sessionId ?? "anonymous"; // transport-assigned session ID
  const userSession = sessionManager.getSession(sessionId);
  
  // Return only the tools this specific user can access
  return {
    tools: userSession.authorizedTools
  };
});

This approach prevents the most common concurrency bug: tool results from one user appearing in another user's session. Without proper isolation, User A might see database query results intended for User B—a critical security failure.
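The sessionManager used above is not part of the SDK; a minimal in-memory version might look like the sketch below. In a distributed deployment you would back it with the Redis store described earlier instead of a local Map.

// Sketch: minimal in-memory session manager assumed by the handler above.
interface UserSession {
  authorizedTools: string[];
  history: unknown[];
  lastActivity: number;
}

class SessionManager {
  private sessions = new Map<string, UserSession>();

  getSession(sessionId: string): UserSession {
    let session = this.sessions.get(sessionId);
    if (!session) {
      // A new session starts with no authorized tools until auth assigns them
      session = { authorizedTools: [], history: [], lastActivity: Date.now() };
      this.sessions.set(sessionId, session);
    }
    session.lastActivity = Date.now();
    return session;
  }

  deleteSession(sessionId: string): void {
    this.sessions.delete(sessionId);
  }
}

const sessionManager = new SessionManager();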

Managing Shared Resources

When multiple sessions access the same underlying resources (databases, APIs, file systems), you need careful coordination to prevent conflicts:

Connection Pooling: Instead of each session creating its own database connection, use a shared pool that automatically manages connection lifecycle. This prevents the "too many connections" error that crashes databases.
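For example, a single shared pool in front of a Postgres backend (a sketch assuming node-postgres; the pool size is illustrative) keeps dozens of MCP sessions from each opening their own database connection:

// Sketch: one shared Postgres pool for all MCP sessions (assumes node-postgres).
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,                    // hard cap on backend connections, regardless of session count
  idleTimeoutMillis: 30_000   // return idle connections to the pool
});

// Every session's tool handler borrows from the same pool instead of opening its own connection
async function queryForSession(sql: string, params: unknown[] = []) {
  const result = await pool.query(sql, params);
  return result.rows;
}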

Request Queuing: For rate-limited external APIs, implement a queue that serializes requests across all sessions while maintaining fair access.
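One simple way to do this is a shared promise chain that runs one outbound request at a time; the sketch below is deliberately minimal and omits per-session fairness and backpressure:

// Sketch: serialize outbound calls to a rate-limited API across all sessions.
class RequestQueue {
  private tail: Promise<unknown> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Chain each task after the previous one so only one request runs at a time
    const result = this.tail.then(task, task);
    this.tail = result.catch(() => undefined); // keep the chain alive after failures
    return result;
  }
}

const externalApiQueue = new RequestQueue();

// Usage from any session's tool handler:
// const data = await externalApiQueue.enqueue(() => fetch(url).then(r => r.json()));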

File Locking: When sessions modify shared files, use advisory locks or atomic operations to prevent corruption.

The key insight is that MCP servers act as multiplexers—taking concurrent session requests and serializing access to underlying resources while maintaining the illusion of dedicated access for each client.

Transport-Specific Considerations

Your transport choice fundamentally affects concurrency handling:

HTTP + SSE: Best for high-concurrency scenarios. The persistent SSE connection enables real-time updates while HTTP/2 multiplexing allows hundreds of concurrent streams over a single TCP connection. Configure your reverse proxy (nginx, Caddy) to handle long-lived connections with appropriate timeouts.

Streamable HTTP: Ideal for serverless or auto-scaling environments. Each request is independent, allowing horizontal scaling without session affinity. Store session state in external storage (Redis, DynamoDB) for true stateless operation.

STDIO: Limited to single-user scenarios. While you can spawn multiple server processes, each handles only one connection. Use this for local development or dedicated single-user deployments.

Monitoring Concurrent Operations

Effective concurrency requires visibility into system behavior:

  • Active Sessions: Track count and age distribution
  • Request Latency: Monitor P50/P95/P99 by operation type
  • Resource Utilization: Database connections, memory per session
  • Error Rates: Particularly timeout and resource exhaustion errors

Set up alerts for anomalies like session count spikes or increased latency, which often indicate capacity issues before they cause failures.
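A minimal in-process collector for these signals might look like the following sketch; in practice most teams export the same numbers to Prometheus or a hosted APM rather than computing percentiles by hand:

// Sketch: in-process metrics for active sessions and request latency percentiles.
class Metrics {
  private latencies: number[] = [];
  activeSessions = 0;

  recordLatency(ms: number): void {
    this.latencies.push(ms);
    if (this.latencies.length > 10_000) this.latencies.shift(); // bounded sample window
  }

  percentile(p: number): number {
    if (this.latencies.length === 0) return 0;
    const sorted = [...this.latencies].sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
    return sorted[idx];
  }

  snapshot() {
    return {
      activeSessions: this.activeSessions,
      p50: this.percentile(50),
      p95: this.percentile(95),
      p99: this.percentile(99)
    };
  }
}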

Common Issues

Diagnosing Connection Problems

When dealing with concurrent connections, issues typically fall into three categories:

1. Connection Limit Reached

Symptoms: New clients receive immediate rejection, "Connection limit reached" errors

Root Causes:

  • Burst traffic exceeding configured limits
  • Clients not properly closing connections (connection leak)
  • Insufficient server resources for configured limits

Diagnosis Approach:

  1. Check current connection count vs. limit
  2. Identify connection age distribution (many old connections indicate leaks)
  3. Monitor server resource usage (CPU, memory)

Solutions:

  • Implement connection queueing for temporary bursts
  • Add automatic cleanup for zombie connections
  • Scale horizontally if at resource limits
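The queueing idea can be as simple as retrying the accept until a slot frees up or a deadline passes. A sketch, assuming the ConnectionManager from the Configuration section (the wait timeout and retry interval are illustrative):

// Sketch: queue new clients briefly instead of rejecting them during bursts.
async function acceptWithQueue(
  manager: ConnectionManager,
  clientId: string,
  transport: unknown,
  maxWaitMs = 10_000
) {
  const deadline = Date.now() + maxWaitMs;
  while (true) {
    try {
      return await manager.acceptConnection(clientId, transport);
    } catch {
      if (Date.now() > deadline) {
        throw new Error("Connection limit reached; queue wait timed out");
      }
      // Back off briefly and retry while the idle sweep frees slots
      await new Promise((resolve) => setTimeout(resolve, 250));
    }
  }
}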

2. Session State Inconsistency

Symptoms: Users report missing context, wrong data, or "session not found" errors

Root Causes:

  • In-memory sessions lost during server restart
  • Load balancer routing requests to different servers
  • Session timeout too aggressive

Diagnosis Approach:

  1. Check if issues correlate with deployments or server restarts
  2. Verify load balancer session affinity configuration
  3. Review session timeout vs. typical user interaction patterns

Solutions:

  • Implement persistent session storage (Redis, database)
  • Configure sticky sessions at load balancer
  • Adjust timeouts based on usage patterns

3. Performance Degradation Under Load

Symptoms: Increasing latency, timeouts during peak usage

Root Causes:

  • Insufficient connection pooling for backend resources
  • Synchronous operations blocking event loop
  • Memory leaks accumulating over time

Diagnosis Approach:

  1. Profile request latency by operation type
  2. Monitor memory usage trends
  3. Check for blocking operations in logs

Solutions:

  • Implement connection pooling for all external resources
  • Convert blocking operations to async
  • Add memory profiling and periodic restarts

Troubleshooting Workflow

When users report issues with concurrent connections:

  1. Gather Evidence

    • Error messages and timestamps
    • User count and activity patterns
    • Recent changes or deployments
  2. Check System Health

    # Quick health check commands
    $ curl http://localhost:8080/health
    $ ps aux | grep mcp-server
    $ netstat -an | grep :8080 | wc -l
  3. Review Logs

    • Look for patterns in error messages
    • Check for resource exhaustion warnings
    • Identify any crash/restart events
  4. Test Isolation

    • Can you reproduce with a single connection?
    • Does issue appear under specific load?
    • Is it affecting all users or specific subset?
  5. Implement Fix

    • Start with configuration changes (limits, timeouts)
    • Then code changes if needed
    • Always test under realistic load

Prevention Strategies

Build resilience into your concurrent connection handling:

  • Graceful Degradation: Queue excess connections rather than rejecting
  • Circuit Breakers: Temporarily disable features during overload
  • Health Endpoints: Enable proactive monitoring
  • Capacity Planning: Load test to find actual limits
  • Observability: Log enough detail to diagnose issues retroactively
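As an illustration of the circuit breaker idea, the sketch below stops calling a failing dependency for a cool-down period after repeated errors (the thresholds are illustrative):

// Sketch: circuit breaker that skips a failing dependency for a cool-down period.
class CircuitBreaker {
  private failures = 0;
  private openUntil = 0;

  constructor(private maxFailures = 5, private coolDownMs = 30_000) {}

  async call<T>(task: () => Promise<T>): Promise<T> {
    if (Date.now() < this.openUntil) {
      throw new Error("Circuit open: dependency temporarily disabled");
    }
    try {
      const result = await task();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      if (++this.failures >= this.maxFailures) {
        this.openUntil = Date.now() + this.coolDownMs; // trip the breaker
        this.failures = 0;
      }
      throw err;
    }
  }
}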

Example: Building a Scalable Multi-Tenant MCP Server

Let's walk through a real production scenario: building an MCP server that handles multiple organizations (tenants) with isolated data and rate limiting. This example illustrates the key concepts we've discussed.

Design Decisions

Before diving into code, consider the architecture choices:

  1. Why Multi-Tenant? Many organizations want to share MCP infrastructure while maintaining data isolation. Think of it like Slack—one platform, many isolated workspaces.

  2. Why Redis? In-memory session storage doesn't survive restarts and can't scale horizontally. Redis provides persistent, distributed session storage with sub-millisecond latency.

  3. Why Rate Limiting? Without limits, one tenant could consume all resources, degrading service for others. Per-tenant limits ensure fair resource allocation.

Core Implementation

// Assumed imports: the TypeScript MCP SDK, ioredis, and rate-limiter-flexible
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import Redis from "ioredis";
import { RateLimiterRedis } from "rate-limiter-flexible";

class MultiTenantMCPServer {
  private server: Server;
  private redis: Redis;
  private rateLimiter: RateLimiterRedis;
  
  constructor() {
    // Redis for distributed state
    this.redis = new Redis({ 
      host: process.env.REDIS_HOST,
      enableOfflineQueue: false // Fail fast if Redis is down
    });
    
    // Per-tenant rate limiting: 1000 requests/minute
    this.rateLimiter = new RateLimiterRedis({
      storeClient: this.redis,
      keyPrefix: 'mcp_rl',
      points: 1000,
      duration: 60
    });
    
    this.server = new Server({
      name: "multi-tenant-mcp",
      version: "1.0.0"
    });
  }
  
  // getTenantConfig and executeTool are implemented elsewhere (omitted for brevity)
  private async handleToolCall(request: any, context: { tenantId: string }) {
    const tenantId = context.tenantId;
    
    // 1. Check rate limit first (fail fast)
    try {
      await this.rateLimiter.consume(tenantId, 1);
    } catch (e) {
      throw new Error("Rate limit exceeded");
    }
    
    // 2. Verify tenant has access to requested tool
    const tenant = await this.getTenantConfig(tenantId);
    if (!tenant.allowedTools.includes(request.params.tool)) {
      throw new Error("Tool not authorized");
    }
    
    // 3. Execute with tenant-specific context
    return this.executeTool(request.params, tenantId);
  }
}

Key Patterns Illustrated

Tenant Isolation: Each request includes a tenant ID (extracted from auth token). All operations are scoped to this tenant—they can't access other tenants' data or exceed their resource quotas.

Fail-Fast Philosophy: Check rate limits before doing expensive operations. If a tenant is over quota, reject immediately without consuming server resources.

Distributed State: Using Redis means you can run multiple server instances. Sessions and rate limit counters are shared across all instances, enabling horizontal scaling.

Monitoring and Operations

The most critical aspect of production deployments is observability:

private async monitorHealth() {
  const metrics = {
    tenants: new Map(),
    totalConnections: 0,
    redisLatency: 0
  };
  
  // Track per-tenant metrics and the overall connection count
  for (const [tenantId, sessions] of this.tenantSessions) {
    metrics.totalConnections += sessions.size;
    metrics.tenants.set(tenantId, {
      activeSessions: sessions.size,
      requestRate: await this.getRequestRate(tenantId),
      errorRate: await this.getErrorRate(tenantId)
    });
  }
  
  // Alert on anomalies
  if (metrics.totalConnections > this.maxConnections * 0.9) {
    this.alerting.warn("Approaching connection limit");
  }
}

Scaling Strategy

This architecture scales in three dimensions:

  1. Vertical: Add CPU/memory to handle more connections per server
  2. Horizontal: Add server instances behind a load balancer
  3. Sharded: Partition tenants across server clusters for massive scale

Start with vertical scaling (it's simpler), then add horizontal scaling when you need high availability. Only consider sharding when you have hundreds of tenants with thousands of concurrent connections.

Lessons from Production

Real deployments have taught us:

  • Connection leaks are common: Implement aggressive timeout and cleanup policies
  • Rate limits need flexibility: Allow temporary bursts with token bucket algorithms
  • Monitoring is critical: You can't fix what you can't see
  • Plan for failure: Redis will go down, networks will partition, servers will crash

The complete implementation includes error handling, graceful shutdown, health checks, and comprehensive logging—all essential for production reliability but omitted here for clarity.