Configuring MCP servers for multiple simultaneous connections

Kashish Hora
Co-founder of MCPcat
The Quick Answer
To handle multiple concurrent clients, MCP servers need connection pooling and session management. Here's a production-ready configuration:
```typescript
const mcpServer = new Server({
  name: "multi-client-server",
  version: "1.0.0",
  transport: "streamable-http",
  options: {
    maxConnections: 50,        // Total concurrent clients
    connectionTimeout: 600000, // Keep connections alive for 10 minutes
    idleTimeout: 300000        // Clean up after 5 minutes of inactivity
  }
});
```

Claude Desktop config with connection pooling:

```json
{
  "mcpServers": {
    "multi-server": {
      "command": "node",
      "args": ["./mcp-server.js"],
      "env": {
        "MCP_MAX_CONNECTIONS": "50",
        "MCP_CONNECTION_POOL_SIZE": "10"
      }
    }
  }
}
```
When to use this: Production deployments serving multiple AI agents or users simultaneously. The HTTP/2 transport enables efficient multiplexing—multiple request streams over a single TCP connection—reducing overhead by 60% compared to traditional HTTP/1.1.
Expected performance: This configuration comfortably handles 50+ concurrent clients with sub-100ms response times on a 4-core server. Scale these numbers based on your hardware: roughly 10-15 connections per CPU core for optimal performance.
Prerequisites
- Node.js 18+ or Python 3.8+ installed
- MCP SDK (`@modelcontextprotocol/sdk` or `mcp-python`)
- HTTP/2-capable runtime for streamable transport
- Redis (optional) for distributed session storage
Why Concurrent Connections Matter in MCP
Unlike traditional REST APIs that are stateless, MCP maintains conversational context across multiple interactions. Each AI agent or user needs their own isolated session to:
- Preserve conversation history: MCP tracks which tools were called, what data was accessed, and the context of previous interactions
- Maintain security boundaries: Different users shouldn't see each other's data or tool results
- Enable long-running workflows: AI agents often perform multi-step tasks requiring persistent state
The challenge is that a single MCP server might need to handle:
- Multiple AI agents working on different tasks simultaneously
- Teams collaborating through shared tools but with isolated contexts
- Burst traffic when many users access the same resources
- Failover scenarios where connections migrate between servers
Without proper concurrency handling, you'll face:
- Session collision: Different users overwriting each other's context
- Resource exhaustion: Unmanaged connections consuming all available memory
- Performance degradation: Sequential processing causing unacceptable latency
- Lost work: Connection drops losing hours of AI agent progress
Configuration
MCP servers handle concurrent connections through three key mechanisms:
1. Transport Protocol Selection
Your choice of transport directly impacts concurrency capabilities:
- STDIO (Standard I/O): Best for local, single-user scenarios. Processes requests sequentially through stdin/stdout. Cannot handle true concurrent connections.
- HTTP + SSE: Enables remote connections with persistent event streams. Supports true concurrency through connection pooling.
- Streamable HTTP: The newest transport, offering stateless HTTP with optional SSE upgrade. Best for cloud deployments.
2. Connection Lifecycle Management
Every MCP connection follows a predictable lifecycle that you must manage:
- Initialization: Client sends an `initialize` request → Server generates a unique session ID
- Active Use: Client makes tool/resource requests with the session ID → Server maintains context
- Idle Period: No requests for X minutes → Server marks for cleanup
- Termination: Explicit close or timeout → Server releases all resources
The key is balancing resource usage with user experience: timeouts that are too short frustrate users, while timeouts that are too long exhaust server resources.
3. Resource Allocation Strategy
```typescript
// Essential connection manager pattern
interface Connection {
  id: string;
  transport: unknown;
  lastActivity: number;
  sessionData: Record<string, unknown>;
}

class ConnectionManager {
  private connections = new Map<string, Connection>();
  private maxConnections = 50;

  async acceptConnection(clientId: string, transport: unknown): Promise<Connection> {
    if (this.connections.size >= this.maxConnections) {
      throw new Error("Connection limit reached");
    }
    const connection: Connection = {
      id: clientId,
      transport,
      lastActivity: Date.now(),
      sessionData: {}
    };
    this.connections.set(clientId, connection);
    return connection;
  }
}
```
Key configuration decisions:
- `maxConnections`: Set to 10-15 per CPU core. A 4-core server handles 40-60 connections comfortably.
- `idleTimeout`: 5-10 minutes for interactive use, 30-60 minutes for long-running AI agents (see the cleanup sketch below)
- `connectionTimeout`: Maximum session duration, typically 2-4 hours
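These timeouts only matter if something enforces them. Here's a minimal sketch of a periodic cleanup sweep; the field names, interval, and `close()` call are illustrative rather than part of any SDK:

```typescript
// Hypothetical idle/age sweep; assumes each tracked connection records
// when it was created and when it last saw activity.
interface TrackedConnection {
  createdAt: number;
  lastActivity: number;
  close: () => void; // however your transport tears down a session
}

const IDLE_TIMEOUT_MS = 5 * 60 * 1000;     // 5 minutes of inactivity
const MAX_SESSION_MS = 2 * 60 * 60 * 1000; // 2-hour hard cap per session

function startCleanupSweep(connections: Map<string, TrackedConnection>) {
  return setInterval(() => {
    const now = Date.now();
    for (const [id, conn] of connections) {
      const idleFor = now - conn.lastActivity;
      const age = now - conn.createdAt;
      if (idleFor > IDLE_TIMEOUT_MS || age > MAX_SESSION_MS) {
        conn.close();           // release the underlying transport
        connections.delete(id); // free the slot for a new client
      }
    }
  }, 30_000); // sweep every 30 seconds
}
```

Start the sweep at server startup and clear the interval during graceful shutdown so in-flight requests can finish.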
For distributed deployments, store session state in Redis rather than memory. This enables:
- Horizontal scaling: Add servers without losing sessions
- Fault tolerance: Survive server restarts
- Load balancing: Route requests to any server instance
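As a sketch of what that externalized state can look like (assuming ioredis and JSON-serializable session data; the key prefix and TTL are illustrative):

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const SESSION_TTL_SECONDS = 60 * 60; // align with your connectionTimeout

// Hypothetical helpers; any server instance can load any session by ID.
async function saveSession(sessionId: string, data: Record<string, unknown>): Promise<void> {
  await redis.set(`mcp:session:${sessionId}`, JSON.stringify(data), "EX", SESSION_TTL_SECONDS);
}

async function loadSession(sessionId: string): Promise<Record<string, unknown> | null> {
  const raw = await redis.get(`mcp:session:${sessionId}`);
  return raw ? JSON.parse(raw) : null;
}
```

Because any instance can load any session, a load balancer can route requests to whichever server is least busy without sticky sessions.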
Usage
Session Isolation Strategy
The core challenge in concurrent MCP is maintaining isolated conversation contexts. Unlike traditional web APIs where each request is independent, MCP sessions accumulate state over time. This creates two critical requirements:
- State Isolation: Each session must have its own memory space for conversation history, tool permissions, and intermediate results
- Context Preservation: Sessions must survive between requests without mixing data between users
The most effective pattern uses a session manager that maps unique IDs to isolated state containers:
```typescript
// Minimal session isolation pattern
server.setRequestHandler("tools/list", async (request, context) => {
  const sessionId = context.sessionId;
  const userSession = sessionManager.getSession(sessionId);

  // Return only tools this specific user can access
  return {
    tools: userSession.authorizedTools
  };
});
```
This approach prevents the most common concurrency bug: tool results from one user appearing in another user's session. Without proper isolation, User A might see database query results intended for User B—a critical security failure.
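The snippet above assumes a `sessionManager` that the SDK does not provide for you. A minimal version might look like this sketch (the `UserSession` fields are illustrative):

```typescript
// Hypothetical session manager: one isolated state container per session ID.
interface UserSession {
  authorizedTools: string[];
  history: unknown[];
  createdAt: number;
}

class SessionManager {
  private sessions = new Map<string, UserSession>();

  getSession(sessionId: string): UserSession {
    let session = this.sessions.get(sessionId);
    if (!session) {
      // Lazily create an empty, isolated container for a new session.
      session = { authorizedTools: [], history: [], createdAt: Date.now() };
      this.sessions.set(sessionId, session);
    }
    return session;
  }

  endSession(sessionId: string): void {
    this.sessions.delete(sessionId);
  }
}

const sessionManager = new SessionManager();
```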
Managing Shared Resources
When multiple sessions access the same underlying resources (databases, APIs, file systems), you need careful coordination to prevent conflicts:
Connection Pooling: Instead of each session creating its own database connection, use a shared pool that automatically manages connection lifecycle. This prevents the "too many connections" error that crashes databases.
Request Queuing: For rate-limited external APIs, implement a queue that serializes requests across all sessions while maintaining fair access.
File Locking: When sessions modify shared files, use advisory locks or atomic operations to prevent corruption.
The key insight is that MCP servers act as multiplexers—taking concurrent session requests and serializing access to underlying resources while maintaining the illusion of dedicated access for each client.
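As an example of the request-queuing idea, a small shared queue can serialize calls to a rate-limited API across all sessions. This is a sketch; the class name and spacing delay are made up for illustration:

```typescript
// Hypothetical shared queue: every session enqueues work, and the server
// drains it one call at a time so concurrent sessions never hit a
// rate-limited API in parallel.
type Job<T> = () => Promise<T>;

class SerializedApiQueue {
  private chain: Promise<unknown> = Promise.resolve();

  constructor(private minDelayMs = 200) {}

  enqueue<T>(job: Job<T>): Promise<T> {
    const result = this.chain.then(async () => {
      const value = await job();
      // Leave a gap between calls so the upstream rate limit is respected.
      await new Promise((resolve) => setTimeout(resolve, this.minDelayMs));
      return value;
    });
    // Keep the chain alive even if one job rejects.
    this.chain = result.catch(() => undefined);
    return result;
  }
}

// Usage: all sessions share one queue per external API.
const externalApiQueue = new SerializedApiQueue(250);
// const data = await externalApiQueue.enqueue(() => callRateLimitedApi());
```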
Transport-Specific Considerations
Your transport choice fundamentally affects concurrency handling:
HTTP + SSE: Best for high-concurrency scenarios. The persistent SSE connection enables real-time updates while HTTP/2 multiplexing allows hundreds of concurrent streams over a single TCP connection. Configure your reverse proxy (nginx, Caddy) to handle long-lived connections with appropriate timeouts.
Streamable HTTP: Ideal for serverless or auto-scaling environments. Each request is independent, allowing horizontal scaling without session affinity. Store session state in external storage (Redis, DynamoDB) for true stateless operation.
STDIO: Limited to single-user scenarios. While you can spawn multiple server processes, each handles only one connection. Use this for local development or dedicated single-user deployments.
Monitoring Concurrent Operations
Effective concurrency requires visibility into system behavior:
- Active Sessions: Track count and age distribution
- Request Latency: Monitor P50/P95/P99 by operation type
- Resource Utilization: Database connections, memory per session
- Error Rates: Particularly timeout and resource exhaustion errors
Set up alerts for anomalies like session count spikes or increased latency, which often indicate capacity issues before they cause failures.
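If you don't already run a metrics stack, a rolling sample of request durations is enough to approximate P50/P95/P99 per operation type. The sketch below is illustrative; the window size and alert threshold are arbitrary:

```typescript
// Hypothetical rolling latency tracker, keyed by operation type.
class LatencyTracker {
  private samples = new Map<string, number[]>();

  constructor(private windowSize = 1000) {}

  record(operation: string, durationMs: number): void {
    const buf = this.samples.get(operation) ?? [];
    buf.push(durationMs);
    if (buf.length > this.windowSize) buf.shift(); // keep a bounded window
    this.samples.set(operation, buf);
  }

  percentile(operation: string, p: number): number {
    const sorted = [...(this.samples.get(operation) ?? [])].sort((a, b) => a - b);
    if (sorted.length === 0) return 0;
    const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
    return sorted[Math.max(0, idx)];
  }
}

// Usage: record around each tool call, then alert on the tail.
const latency = new LatencyTracker();
// latency.record("tools/call", elapsedMs);
// if (latency.percentile("tools/call", 99) > 1000) console.warn("P99 latency above 1s");
```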
Common Issues
Diagnosing Connection Problems
When dealing with concurrent connections, issues typically fall into three categories:
1. Connection Limit Reached
Symptoms: New clients receive immediate rejection, "Connection limit reached" errors
Root Causes:
- Burst traffic exceeding configured limits
- Clients not properly closing connections (connection leak)
- Insufficient server resources for configured limits
Diagnosis Approach:
- Check current connection count vs. limit
- Identify connection age distribution (many old connections indicate leaks)
- Monitor server resource usage (CPU, memory)
Solutions:
- Implement connection queueing for temporary bursts (see the sketch below)
- Add automatic cleanup for zombie connections
- Scale horizontally if at resource limits
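One way to implement that queueing is an admission gate that lets a bounded number of excess clients wait briefly for a slot instead of rejecting them outright. This is a sketch; the queue depth, wait time, and method names are illustrative:

```typescript
// Hypothetical admission gate: admit up to maxConnections immediately,
// queue a bounded number of extra clients, and shed the rest.
class AdmissionGate {
  private active = 0;
  private waiters: Array<(admitted: boolean) => void> = [];

  constructor(
    private maxConnections = 50,
    private maxQueued = 20,
    private waitMs = 5000
  ) {}

  async admit(): Promise<boolean> {
    if (this.active < this.maxConnections) {
      this.active++;
      return true;
    }
    if (this.waiters.length >= this.maxQueued) return false; // shed load

    return new Promise<boolean>((resolve) => {
      const timer = setTimeout(() => {
        this.waiters = this.waiters.filter((w) => w !== waiter);
        resolve(false); // waited too long; tell the client to retry later
      }, this.waitMs);
      const waiter = (admitted: boolean) => {
        clearTimeout(timer);
        resolve(admitted);
      };
      this.waiters.push(waiter);
    });
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(true); // hand the freed slot to the oldest waiter
    } else {
      this.active = Math.max(0, this.active - 1);
    }
  }
}
```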
2. Session State Inconsistency
Symptoms: Users report missing context, wrong data, or "session not found" errors
Root Causes:
- In-memory sessions lost during server restart
- Load balancer routing requests to different servers
- Session timeout too aggressive
Diagnosis Approach:
- Check if issues correlate with deployments or server restarts
- Verify load balancer session affinity configuration
- Review session timeout vs. typical user interaction patterns
Solutions:
- Implement persistent session storage (Redis, database)
- Configure sticky sessions at load balancer
- Adjust timeouts based on usage patterns
3. Performance Degradation Under Load
Symptoms: Increasing latency, timeouts during peak usage
Root Causes:
- Insufficient connection pooling for backend resources
- Synchronous operations blocking event loop
- Memory leaks accumulating over time
Diagnosis Approach:
- Profile request latency by operation type
- Monitor memory usage trends
- Check for blocking operations in logs
Solutions:
- Implement connection pooling for all external resources
- Convert blocking operations to async (see the example below)
- Add memory profiling and periodic restarts
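The blocking-to-async conversion is often the cheapest win. A synchronous file read stalls every connected client, while the async version only suspends the one request that needs it (illustrative example):

```typescript
import { readFileSync } from "node:fs";
import { readFile } from "node:fs/promises";

// Blocks the event loop: every other session stalls while this file is read.
function loadConfigBlocking(path: string): unknown {
  return JSON.parse(readFileSync(path, "utf8"));
}

// Non-blocking: other sessions keep being served while the read is in flight.
async function loadConfig(path: string): Promise<unknown> {
  return JSON.parse(await readFile(path, "utf8"));
}
```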
Troubleshooting Workflow
When users report issues with concurrent connections:
1. Gather Evidence
   - Error messages and timestamps
   - User count and activity patterns
   - Recent changes or deployments
2. Check System Health
   ```bash
   # Quick health check commands
   $ curl http://localhost:8080/health
   $ ps aux | grep mcp-server
   $ netstat -an | grep :8080 | wc -l
   ```
3. Review Logs
   - Look for patterns in error messages
   - Check for resource exhaustion warnings
   - Identify any crash/restart events
4. Test Isolation
   - Can you reproduce with a single connection?
   - Does the issue appear under a specific load?
   - Is it affecting all users or a specific subset?
5. Implement Fix
   - Start with configuration changes (limits, timeouts)
   - Then code changes if needed
   - Always test under realistic load
Prevention Strategies
Build resilience into your concurrent connection handling:
- Graceful Degradation: Queue excess connections rather than rejecting
- Circuit Breakers: Temporarily disable features during overload (see the sketch below)
- Health Endpoints: Enable proactive monitoring
- Capacity Planning: Load test to find actual limits
- Observability: Log enough detail to diagnose issues retroactively
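As an illustration of the circuit-breaker idea, the sketch below trips after a run of consecutive failures and short-circuits further calls for a cool-down period; the thresholds are made up for the example:

```typescript
// Hypothetical circuit breaker: after too many consecutive failures,
// reject calls immediately for a cool-down period instead of piling on.
class CircuitBreaker {
  private failures = 0;
  private openUntil = 0;

  constructor(private failureThreshold = 5, private coolDownMs = 30_000) {}

  async call<T>(operation: () => Promise<T>): Promise<T> {
    if (Date.now() < this.openUntil) {
      throw new Error("Circuit open: feature temporarily disabled");
    }
    try {
      const result = await operation();
      this.failures = 0; // success closes the circuit again
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.failureThreshold) {
        this.openUntil = Date.now() + this.coolDownMs; // trip the breaker
        this.failures = 0;
      }
      throw err;
    }
  }
}

// Usage: wrap an overload-prone dependency.
const databaseBreaker = new CircuitBreaker();
// const rows = await databaseBreaker.call(() => queryDatabase("SELECT 1"));
```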
Example: Building a Scalable Multi-Tenant MCP Server
Let's walk through a real production scenario: building an MCP server that handles multiple organizations (tenants) with isolated data and rate limiting. This example illustrates the key concepts we've discussed.
Design Decisions
Before diving into code, consider the architecture choices:
- Why Multi-Tenant? Many organizations want to share MCP infrastructure while maintaining data isolation. Think of it like Slack—one platform, many isolated workspaces.
- Why Redis? In-memory session storage doesn't survive restarts and can't scale horizontally. Redis provides persistent, distributed session storage with sub-millisecond latency.
- Why Rate Limiting? Without limits, one tenant could consume all resources, degrading service for others. Per-tenant limits ensure fair resource allocation.
Core Implementation
```typescript
// Assumes the TypeScript MCP SDK, ioredis, and rate-limiter-flexible
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import Redis from "ioredis";
import { RateLimiterRedis } from "rate-limiter-flexible";

class MultiTenantMCPServer {
  private server: Server;
  private redis: Redis;
  private rateLimiter: RateLimiterRedis;

  constructor() {
    // Redis for distributed state
    this.redis = new Redis({
      host: process.env.REDIS_HOST,
      enableOfflineQueue: false // Fail fast if Redis is down
    });

    // Per-tenant rate limiting: 1000 requests/minute
    this.rateLimiter = new RateLimiterRedis({
      storeClient: this.redis,
      keyPrefix: "mcp_rl",
      points: 1000,
      duration: 60
    });

    this.server = new Server({
      name: "multi-tenant-mcp",
      version: "1.0.0"
    });
  }

  private async handleToolCall(request, context) {
    const tenantId = context.tenantId;

    // 1. Check rate limit first (fail fast)
    try {
      await this.rateLimiter.consume(tenantId, 1);
    } catch (e) {
      throw new Error("Rate limit exceeded");
    }

    // 2. Verify tenant has access to requested tool
    const tenant = await this.getTenantConfig(tenantId);
    if (!tenant.allowedTools.includes(request.params.tool)) {
      throw new Error("Tool not authorized");
    }

    // 3. Execute with tenant-specific context
    return this.executeTool(request.params, tenantId);
  }
}
```
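The `getTenantConfig` helper used above is not shown in the excerpt. A minimal Redis-backed version might look like the following sketch; the key layout, cache TTL, and `TenantConfig` shape are assumptions for illustration:

```typescript
import Redis from "ioredis";

interface TenantConfig {
  allowedTools: string[];
  maxSessions: number;
}

const redis = new Redis({ host: process.env.REDIS_HOST ?? "127.0.0.1" });
const tenantCache = new Map<string, { config: TenantConfig; expires: number }>();

// Hypothetical lookup with a short in-process cache so hot tenants
// don't hit Redis on every request.
async function getTenantConfig(tenantId: string): Promise<TenantConfig> {
  const cached = tenantCache.get(tenantId);
  if (cached && cached.expires > Date.now()) return cached.config;

  const raw = await redis.get(`mcp:tenant:${tenantId}:config`);
  if (!raw) throw new Error(`Unknown tenant: ${tenantId}`);

  const config = JSON.parse(raw) as TenantConfig;
  tenantCache.set(tenantId, { config, expires: Date.now() + 60_000 }); // cache for one minute
  return config;
}
```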
Key Patterns Illustrated
Tenant Isolation: Each request includes a tenant ID (extracted from auth token). All operations are scoped to this tenant—they can't access other tenants' data or exceed their resource quotas.
Fail-Fast Philosophy: Check rate limits before doing expensive operations. If a tenant is over quota, reject immediately without consuming server resources.
Distributed State: Using Redis means you can run multiple server instances. Sessions and rate limit counters are shared across all instances, enabling horizontal scaling.
Monitoring and Operations
The most critical aspect of production deployments is observability:
```typescript
private async monitorHealth() {
  const metrics = {
    tenants: new Map(),
    totalConnections: 0,
    redisLatency: 0
  };

  // Track per-tenant metrics and sum connections across tenants
  for (const [tenantId, sessions] of this.tenantSessions) {
    metrics.totalConnections += sessions.size;
    metrics.tenants.set(tenantId, {
      activeSessions: sessions.size,
      requestRate: await this.getRequestRate(tenantId),
      errorRate: await this.getErrorRate(tenantId)
    });
  }

  // Alert on anomalies
  if (metrics.totalConnections > this.maxConnections * 0.9) {
    this.alerting.warn("Approaching connection limit");
  }
}
```
Scaling Strategy
This architecture scales in three dimensions:
- Vertical: Add CPU/memory to handle more connections per server
- Horizontal: Add server instances behind a load balancer
- Sharded: Partition tenants across server clusters for massive scale
Start with vertical scaling (it's simpler), then add horizontal scaling when you need high availability. Only consider sharding when you have hundreds of tenants with thousands of concurrent connections.
Lessons from Production
Real deployments have taught us:
- Connection leaks are common: Implement aggressive timeout and cleanup policies
- Rate limits need flexibility: Allow temporary bursts with token bucket algorithms
- Monitoring is critical: You can't fix what you can't see
- Plan for failure: Redis will go down, networks will partition, servers will crash
The complete implementation includes error handling, graceful shutdown, health checks, and comprehensive logging—all essential for production reliability but omitted here for clarity.
Related Guides
Comparing stdio vs. SSE vs. StreamableHTTP
Compare MCP transport protocols to choose between stdio, SSE, and StreamableHTTP for your use case.
Setting up StreamableHTTP for scalable deployments
Deploy scalable MCP servers with StreamableHTTP for high performance and horizontal scaling.
Implementing connection health checks and monitoring
Implement health checks and monitoring for MCP servers to ensure reliable production deployments.