The Quick Answer
Configure StreamableHTTP transport with a load balancer for horizontally scalable MCP deployments:
// server.ts
app.post('/mcp', async (req, res) => {
const acceptsSSE = req.headers.accept?.includes('text/event-stream');
const response = await mcpServer.handleRequest(req.body);
if (response.streaming && acceptsSSE) {
res.setHeader('Content-Type', 'text/event-stream');
for await (const message of response.stream) {
res.write(`data: ${JSON.stringify(message)}\n\n`);
    }
    res.end();
  } else {
res.json(response);
}
});
StreamableHTTP's single-endpoint architecture can deliver up to 200x better throughput under load than the deprecated SSE transport.
Prerequisites
- Node.js 18+ or Python 3.10+ runtime environment
- Load balancer supporting HTTP/2 (AWS ALB, nginx, HAProxy)
- Redis 7+ for distributed session management
- Docker and Kubernetes/ECS for container orchestration
Installation
Install the MCP SDK with StreamableHTTP support:
# TypeScript/JavaScript
$ npm install @modelcontextprotocol/sdk express

# Python
$ pip install fastmcp uvicorn redis
Configure your load balancer for sticky sessions:
# nginx.conf
upstream mcp_servers {
ip_hash; # Session affinity
server mcp1:8080;
server mcp2:8080;
server mcp3:8080;
}
Basic Implementation
StreamableHTTP consolidates all MCP communication through a single HTTP endpoint. Unlike the deprecated SSE transport that required separate endpoints for requests and responses, StreamableHTTP handles both request-response and streaming patterns through intelligent content negotiation. For a detailed comparison of transport options, see our guide on comparing stdio, SSE, and StreamableHTTP.
Create a basic StreamableHTTP server that can scale horizontally:
import express from 'express';
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { createClient } from 'redis';

const app = express();
app.use(express.json()); // parse JSON-RPC request bodies

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

const mcpServer = new Server({
  name: "scalable-mcp",
  version: "1.0.0"
});
// Session middleware
app.use(async (req, res, next) => {
const sessionId = req.headers['mcp-session-id'];
  if (sessionId) {
    const session = await redis.get(`session:${sessionId}`);
    req.session = session ? JSON.parse(session) : undefined;
  }
next();
});
app.post('/mcp', async (req, res) => {
try {
const response = await mcpServer.handleRequest(req.body);
// Handle streaming responses
if (response.streaming && req.accepts('text/event-stream')) {
res.setHeader('Content-Type', 'text/event-stream');
res.setHeader('Cache-Control', 'no-cache');
res.setHeader('X-Accel-Buffering', 'no');
for await (const message of response.stream) {
res.write(`data: ${JSON.stringify(message)}\n\n`);
// ... store last event ID for recovery ...
}
res.end();
} else {
res.json(response);
}
} catch (error) {
// ... error handling ...
}
});
// Health checks
app.get('/health', async (req, res) => {
const redisOk = await redis.ping() === 'PONG';
res.status(redisOk ? 200 : 503).json({
status: redisOk ? 'healthy' : 'unhealthy'
});
});
app.listen(8080);
The server shares session state through Redis, so any instance can serve requests for an established session; sticky routing becomes an optimization rather than a hard requirement. Health checks ensure the load balancer only routes traffic to healthy instances.
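To make rolling updates and scale-down safe, pair the health checks with graceful shutdown so the load balancer drains an instance before it exits. A minimal sketch, replacing the bare app.listen(8080) call above (the 10-second grace period is an assumption to tune per deployment):

// Graceful shutdown: drain in-flight requests before the process exits
let shuttingDown = false;
const server = app.listen(8080);

// The /health handler above should also return 503 while shuttingDown is true,
// so the load balancer stops routing new requests to this instance.
process.on('SIGTERM', () => {
  shuttingDown = true;
  // Stop accepting new connections; in-flight requests finish normally
  server.close(() => process.exit(0));
  // Safety net: force exit if draining exceeds the assumed 10s grace period
  setTimeout(() => process.exit(1), 10_000).unref();
});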
Container Deployment
Deploy StreamableHTTP servers in containers for consistent scaling across environments. The stateless nature of StreamableHTTP makes it ideal for container orchestration platforms. For additional Docker deployment patterns, see our guide on configuring MCP transport with Docker.
Create a production-ready Dockerfile:
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
FROM node:20-alpine
RUN apk add --no-cache tini
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
COPY . .
# Run as non-root user
USER node
# Use tini for proper signal handling
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "server.js"]
# Health check
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD node healthcheck.js || exit 1
Deploy with Kubernetes for automatic scaling:
apiVersion: apps/v1
kind: Deployment
metadata:
name: mcp-streamablehttp
spec:
replicas: 3
selector:
matchLabels:
app: mcp-server
template:
metadata:
labels:
app: mcp-server
spec:
containers:
- name: mcp
image: mcp-server:latest
ports:
- containerPort: 8080
env:
- name: REDIS_URL
valueFrom:
secretKeyRef:
name: redis-secret
key: url
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: mcp-service
spec:
selector:
app: mcp-server
ports:
- port: 80
targetPort: 8080
type: LoadBalancer
sessionAffinity: ClientIP
sessionAffinityConfig:
clientIP:
timeoutSeconds: 3600
The Kubernetes configuration ensures proper resource allocation, health monitoring, and session affinity. The horizontal pod autoscaler can dynamically adjust replicas based on CPU and memory usage.
Load Balancing Strategies
Effective load balancing is crucial for StreamableHTTP deployments. Different strategies suit different use cases, from simple round-robin to sophisticated session-aware routing.
Configure AWS Application Load Balancer for production:
// AWS CDK configuration
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
const loadBalancer = new elbv2.ApplicationLoadBalancer(this, 'MCPLoadBalancer', {
vpc,
internetFacing: true,
http2Enabled: true
});
const targetGroup = new elbv2.ApplicationTargetGroup(this, 'MCPTargetGroup', {
vpc,
port: 8080,
protocol: elbv2.ApplicationProtocol.HTTP,
targetType: elbv2.TargetType.IP,
healthCheck: {
path: '/health',
interval: cdk.Duration.seconds(30),
timeout: cdk.Duration.seconds(5),
healthyThresholdCount: 2,
unhealthyThresholdCount: 3
},
stickinessCookieDuration: cdk.Duration.hours(1),
stickinessCookieName: 'MCP-SESSION'
});
// Add targets from ECS service
targetGroup.addTarget(ecsService);
loadBalancer.addListener('MCPListener', {
port: 443,
certificates: [certificate],
defaultTargetGroups: [targetGroup]
});
Implement client-side connection management for failover:
class MCPClient {
private endpoints: string[];
private currentEndpoint: number = 0;
private sessionId?: string;
constructor(endpoints: string[]) {
this.endpoints = endpoints;
}
async request(method: string, params?: any): Promise<any> {
const maxRetries = this.endpoints.length;
let lastError: Error | null = null;
for (let i = 0; i < maxRetries; i++) {
try {
const endpoint = this.endpoints[this.currentEndpoint];
const response = await fetch(`${endpoint}/mcp`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Accept': 'application/json, text/event-stream',
...(this.sessionId && { 'Mcp-Session-Id': this.sessionId })
},
body: JSON.stringify({
jsonrpc: "2.0",
method,
params,
id: Date.now()
})
});
// Capture session ID from response
const newSessionId = response.headers.get('Mcp-Session-Id');
if (newSessionId) {
this.sessionId = newSessionId;
}
if (!response.ok) throw new Error(`HTTP ${response.status}`);
return await response.json();
} catch (error) {
lastError = error as Error;
// Try next endpoint
this.currentEndpoint = (this.currentEndpoint + 1) % this.endpoints.length;
}
}
throw lastError || new Error('All endpoints failed');
}
}
The client implementation provides automatic failover between multiple endpoints while maintaining session affinity through the Mcp-Session-Id header. This ensures continuous service even during partial outages.
Session Management
Distributed session management enables horizontal scaling while maintaining stateful connections. Redis provides the ideal backend for session storage with its low latency and built-in expiration.
Implement robust session handling:
from fastapi import FastAPI, Request, Response
from datetime import datetime
import redis.asyncio as redis
import json
import os
import uuid
app = FastAPI()
redis_client = redis.from_url("redis://localhost")
class SessionManager:
def __init__(self, redis_client):
self.redis = redis_client
self.session_ttl = 3600 # 1 hour
async def create_session(self, client_info: dict) -> str:
session_id = str(uuid.uuid4())
        session_data = {
            "id": session_id,
            "created_at": datetime.utcnow().isoformat(),
            "client_info": client_info,
            # Record which instance created the session for the metrics below
            "node_id": os.environ.get("NODE_ID", "unknown"),
            "last_activity": datetime.utcnow().isoformat()
        }
await self.redis.setex(
f"session:{session_id}",
self.session_ttl,
json.dumps(session_data)
)
return session_id
async def get_session(self, session_id: str) -> dict | None:
data = await self.redis.get(f"session:{session_id}")
if not data:
return None
session = json.loads(data)
# Update last activity and extend TTL
session["last_activity"] = datetime.utcnow().isoformat()
await self.redis.setex(
f"session:{session_id}",
self.session_ttl,
json.dumps(session)
)
return session
session_manager = SessionManager(redis_client)
@app.post("/mcp")
async def handle_mcp(request: Request, response: Response):
body = await request.json()
# Session handling
session_id = request.headers.get("mcp-session-id")
session = await session_manager.get_session(session_id) if session_id else None
if not session:
# Create new session
session_id = await session_manager.create_session({
"user_agent": request.headers.get("user-agent"),
"ip": request.client.host
})
response.headers["Mcp-Session-Id"] = session_id
# Process request with session context
result = await process_mcp_request(body, session)
return result
Monitor session distribution across nodes:
async def get_session_metrics():
"""Collect session distribution metrics"""
metrics = {
"total_sessions": 0,
"sessions_by_node": {},
"avg_session_duration": 0
}
# Scan all sessions
cursor = 0
sessions = []
while True:
cursor, keys = await redis_client.scan(
cursor,
match="session:*",
count=100
)
sessions.extend(keys)
if cursor == 0:
break
metrics["total_sessions"] = len(sessions)
# Analyze session distribution
for session_key in sessions:
session_data = await redis_client.get(session_key)
if session_data:
session = json.loads(session_data)
node = session.get("node_id", "unknown")
metrics["sessions_by_node"][node] = \
metrics["sessions_by_node"].get(node, 0) + 1
return metrics
Session management ensures users maintain context across requests while enabling the system to scale horizontally. The Redis backend provides millisecond latency for session operations.
Performance Optimization
StreamableHTTP's architecture enables significant performance optimizations. By eliminating persistent connections and supporting request batching, it achieves superior throughput under load.
Implement connection pooling and request batching:
class OptimizedMCPServer {
private requestQueue: Map<string, Promise<any>> = new Map();
private batchTimer?: NodeJS.Timeout;
private pendingBatch: Array<{request: any, resolve: Function, reject: Function}> = [];
async handleRequest(request: any): Promise<any> {
// Deduplicate identical concurrent requests
const requestKey = JSON.stringify(request);
if (this.requestQueue.has(requestKey)) {
return this.requestQueue.get(requestKey);
}
// Create promise for this request
const promise = this.processBatchedRequest(request);
this.requestQueue.set(requestKey, promise);
// Clean up after completion
promise.finally(() => {
this.requestQueue.delete(requestKey);
});
return promise;
}
private processBatchedRequest(request: any): Promise<any> {
return new Promise((resolve, reject) => {
this.pendingBatch.push({ request, resolve, reject });
// Batch requests every 10ms or when batch size reaches 50
if (!this.batchTimer) {
this.batchTimer = setTimeout(() => this.flushBatch(), 10);
}
if (this.pendingBatch.length >= 50) {
this.flushBatch();
}
});
}
// ... flushBatch implementation ...
}
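The elided flushBatch would execute the queued requests together and settle each caller's promise. One possible shape inside the class above, assuming a hypothetical executeBatch helper that processes an array of MCP requests in a single pass:

private async flushBatch(): Promise<void> {
  // Take ownership of the current batch and reset the timer
  const batch = this.pendingBatch;
  this.pendingBatch = [];
  if (this.batchTimer) {
    clearTimeout(this.batchTimer);
    this.batchTimer = undefined;
  }
  if (batch.length === 0) return;
  try {
    // executeBatch is a hypothetical helper, not part of the SDK
    const results = await this.executeBatch(batch.map(b => b.request));
    batch.forEach((entry, i) => entry.resolve(results[i]));
  } catch (error) {
    batch.forEach(entry => entry.reject(error));
  }
}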
Configure nginx for optimal StreamableHTTP performance:
http {
upstream mcp_backend {
least_conn; # Better than round-robin for long requests
server mcp1:8080 max_fails=2 fail_timeout=30s;
server mcp2:8080 max_fails=2 fail_timeout=30s;
server mcp3:8080 max_fails=2 fail_timeout=30s;
keepalive 32; # Connection pooling
}
server {
listen 443 ssl http2;
server_name api.example.com;
# SSL configuration
ssl_certificate /etc/ssl/cert.pem;
ssl_certificate_key /etc/ssl/key.pem;
location /mcp {
proxy_pass http://mcp_backend;
# StreamableHTTP optimizations
proxy_http_version 1.1;
proxy_set_header Connection "";
# SSE support
proxy_set_header Accept-Encoding "";
proxy_buffering off;
proxy_cache off;
# Timeouts for long-running operations
proxy_read_timeout 300s;
proxy_connect_timeout 10s;
proxy_send_timeout 60s;
            # Forward client cookies; cookie-based affinity itself requires
            # the sticky directive (nginx Plus) or ip_hash in the upstream
            proxy_set_header Cookie $http_cookie;
# Headers
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
location /health {
proxy_pass http://mcp_backend;
proxy_connect_timeout 5s;
proxy_read_timeout 5s;
}
}
}
These optimizations help StreamableHTTP sustain its throughput advantage over SSE transport even at scale. Request deduplication and batching reduce backend load, while connection pooling minimizes latency.
Common Issues
Error: Session affinity not working with load balancer
Session affinity failures occur when load balancers don't properly route subsequent requests to the same backend instance. This breaks stateful operations and causes authentication issues.
// Solution: Implement session validation and recovery
app.use(async (req, res, next) => {
const sessionId = req.headers['mcp-session-id'];
if (sessionId) {
const session = await redis.get(`session:${sessionId}`);
if (!session) {
// Session lost - create new one with recovery
const newSessionId = generateSessionId();
// Attempt to recover last known state
const lastEventId = req.headers['last-event-id'];
if (lastEventId) {
const streamState = await redis.get(`stream:${sessionId}:${lastEventId}`);
if (streamState) {
await redis.setex(
`session:${newSessionId}`,
3600,
streamState
);
}
}
res.setHeader('Mcp-Session-Id', newSessionId);
req.session = { id: newSessionId, recovered: true };
} else {
req.session = JSON.parse(session);
}
}
next();
});
To prevent session affinity issues, implement client-side session persistence and configure your load balancer with appropriate sticky session duration. Monitor session distribution to detect imbalances early.
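One way to implement the client-side session persistence mentioned above is to store the session ID outside the process, so a restarted client resumes the same sticky session. A minimal Node.js sketch that could plug into the MCPClient class from earlier (the file path is an arbitrary choice for illustration):

import { readFileSync, writeFileSync, existsSync } from 'fs';

const SESSION_FILE = '/tmp/mcp-session-id'; // arbitrary location for this sketch

function loadSessionId(): string | undefined {
  // Reuse a previously issued session ID across client restarts
  return existsSync(SESSION_FILE)
    ? readFileSync(SESSION_FILE, 'utf8').trim()
    : undefined;
}

function saveSessionId(sessionId: string): void {
  // Persist the ID the server returned in the Mcp-Session-Id header
  writeFileSync(SESSION_FILE, sessionId);
}

Call loadSessionId() when constructing the client and saveSessionId() whenever the server returns a new Mcp-Session-Id.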
Error: Memory leaks during streaming responses
Memory leaks in streaming responses typically occur when event streams aren't properly closed or when backpressure isn't handled. This leads to server crashes under sustained load.
// Solution: Implement proper stream cleanup and backpressure
app.post('/mcp', async (req, res) => {
const streamController = new AbortController();
let streamClosed = false;
// Clean up on client disconnect
req.on('close', () => {
streamClosed = true;
streamController.abort();
});
res.on('error', () => {
streamClosed = true;
streamController.abort();
});
try {
const response = await mcpServer.handleRequest(req.body, {
signal: streamController.signal
});
if (response.streaming && req.accepts('text/event-stream')) {
res.setHeader('Content-Type', 'text/event-stream');
for await (const message of response.stream) {
if (streamClosed) break;
// Handle backpressure
const canWrite = res.write(`data: ${JSON.stringify(message)}\n\n`);
if (!canWrite) {
// Pause until drain event
await new Promise(resolve => res.once('drain', resolve));
}
}
if (!streamClosed) {
res.end();
}
} else {
res.json(response);
}
} finally {
// Ensure cleanup
streamController.abort();
}
});
Monitor memory usage patterns and implement circuit breakers to prevent cascading failures. Use Node.js heap snapshots to identify memory leak sources during development.
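The circuit breaker mentioned above can be sketched in a few lines. This is a simplified version: the thresholds are assumptions, and production code would also want explicit half-open probing:

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private threshold = 5,      // failures before opening (assumed)
    private cooldownMs = 30_000 // how long to stay open (assumed)
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    // While open, fail fast instead of piling load onto a sick backend
    if (this.failures >= this.threshold &&
        Date.now() - this.openedAt < this.cooldownMs) {
      throw new Error('Circuit open');
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (error) {
      this.failures++;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw error;
    }
  }
}

Wrap downstream calls such as mcpServer.handleRequest(req.body) in breaker.call(...) so a failing dependency trips the breaker instead of cascading.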
Error: High latency during traffic spikes
Traffic spikes can overwhelm StreamableHTTP servers if auto-scaling isn't properly configured. This manifests as increased response times and timeout errors.
# Solution: Configure aggressive auto-scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: mcp-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: mcp-streamablehttp
minReplicas: 3
maxReplicas: 50
behavior:
scaleUp:
stabilizationWindowSeconds: 30 # Fast scale-up
policies:
- type: Percent
value: 100 # Double pods
periodSeconds: 60
- type: Pods
value: 5 # Add 5 pods minimum
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300 # Slow scale-down
policies:
- type: Percent
value: 10
periodSeconds: 60
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 50 # Aggressive threshold
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 70
- type: External
external:
metric:
name: response_time_p95
target:
type: Value
value: "100" # 100ms p95 target
Implement request queuing and rate limiting to handle burst traffic gracefully. Pre-warm containers during anticipated traffic increases to minimize cold start latency.
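The rate limiting mentioned above can be as simple as a token bucket in front of the /mcp route, shedding burst traffic with 429s while autoscaling catches up. A minimal Express sketch (capacity and refill rate are assumptions to tune against your scaling headroom):

class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private capacity = 100,    // burst size (assumed)
    private refillPerSec = 50  // sustained rate (assumed)
  ) {
    this.tokens = capacity;
  }

  tryRemove(): boolean {
    // Refill proportionally to elapsed time, capped at capacity
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

const bucket = new TokenBucket();
app.use('/mcp', (req, res, next) => {
  // Shed load with 429 + Retry-After instead of timing out under a spike
  if (!bucket.tryRemove()) {
    res.setHeader('Retry-After', '1');
    return res.status(429).json({ error: 'Too many requests' });
  }
  next();
});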
Examples
Multi-Region Deployment with Failover
Deploy StreamableHTTP across multiple AWS regions for global availability and disaster recovery. This architecture ensures sub-100ms latency worldwide while maintaining high availability.
// Global StreamableHTTP deployment with DynamoDB Global Tables
import { DynamoDB } from '@aws-sdk/client-dynamodb';
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
class GlobalMCPServer {
private region: string;
private dynamodb: DynamoDB;
private server: Server;
constructor(region: string) {
this.region = region;
this.dynamodb = new DynamoDB({ region });
this.server = new Server({
name: "global-mcp",
version: "1.0.0"
});
}
async handleRequest(request: any, headers: any): Promise<any> {
const sessionId = headers['mcp-session-id'];
// Use DynamoDB Global Tables for session state
if (sessionId) {
const session = await this.getGlobalSession(sessionId);
if (session) {
await this.updateSessionRegion(sessionId, this.region);
}
}
const response = await this.server.handleRequest(request);
response.headers = {
...response.headers,
'X-MCP-Region': this.region
};
return response;
}
private async getGlobalSession(sessionId: string) {
const result = await this.dynamodb.getItem({
TableName: 'MCPSessions',
Key: { sessionId: { S: sessionId } },
ConsistentRead: false // Eventually consistent for global reads
});
return result.Item;
}
// ... updateSessionRegion method ...
}
Configure multi-region infrastructure with CloudFormation:
# cloudformation.yaml
Resources:
GlobalTable:
Type: AWS::DynamoDB::GlobalTable
Properties:
TableName: MCPSessions
BillingMode: PAY_PER_REQUEST
Replicas:
- Region: us-east-1
- Region: eu-west-1
- Region: ap-southeast-1
# ... attribute definitions ...
Production deployment uses Route 53 with health checks for automatic regional failover. DynamoDB Global Tables provide millisecond replication of session state across regions, ensuring users experience no interruption during regional failures.
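A sketch of the Route 53 failover records using CDK's low-level CfnRecordSet (the hosted zone ID, domain, health check, and load balancer variables are placeholders; the high-level record constructs don't expose failover routing directly):

import * as route53 from 'aws-cdk-lib/aws-route53';

// Primary record: served while the regional health check passes
new route53.CfnRecordSet(this, 'PrimaryRecord', {
  hostedZoneId: 'Z123EXAMPLE', // placeholder
  name: 'mcp.example.com',
  type: 'A',
  failover: 'PRIMARY',
  setIdentifier: 'us-east-1',
  healthCheckId: primaryHealthCheck.attrHealthCheckId,
  aliasTarget: {
    dnsName: primaryAlb.loadBalancerDnsName,
    hostedZoneId: primaryAlb.loadBalancerCanonicalHostedZoneId,
  },
});

// Secondary record: Route 53 fails over here automatically
new route53.CfnRecordSet(this, 'SecondaryRecord', {
  hostedZoneId: 'Z123EXAMPLE',
  name: 'mcp.example.com',
  type: 'A',
  failover: 'SECONDARY',
  setIdentifier: 'eu-west-1',
  aliasTarget: {
    dnsName: secondaryAlb.loadBalancerDnsName,
    hostedZoneId: secondaryAlb.loadBalancerCanonicalHostedZoneId,
  },
});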
Serverless StreamableHTTP with AWS Lambda
Deploy StreamableHTTP as serverless functions for effectively unlimited scale with zero infrastructure management. Lambda's event-driven model complements StreamableHTTP's stateless design.
# lambda_function.py
import asyncio
import json
import time
import uuid
import boto3
from mcp.server import Server
# Initialize outside handler for connection reuse
dynamodb = boto3.resource('dynamodb')
sessions_table = dynamodb.Table('MCPSessions')
server = Server("lambda-mcp")
def lambda_handler(event, context):
"""AWS Lambda handler for StreamableHTTP"""
# Handle OPTIONS for CORS
if event['httpMethod'] == 'OPTIONS':
return {
'statusCode': 200,
'headers': {
'Access-Control-Allow-Origin': '*',
'Access-Control-Allow-Methods': 'POST, GET, OPTIONS',
'Access-Control-Allow-Headers': 'Content-Type, Mcp-Session-Id'
}
}
try:
body = json.loads(event['body'])
# Session management with DynamoDB
session_id = event['headers'].get('mcp-session-id')
if not session_id:
session_id = str(uuid.uuid4())
sessions_table.put_item(Item={
'sessionId': session_id,
'created': int(time.time()),
'ttl': int(time.time()) + 3600 # 1 hour TTL
})
# Process MCP request
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
result = loop.run_until_complete(
server.handle_request(body)
)
return {
'statusCode': 200,
'headers': {
'Content-Type': 'application/json',
'Access-Control-Allow-Origin': '*',
'Mcp-Session-Id': session_id
},
'body': json.dumps(result)
}
except Exception as e:
# ... error handling ...
return {'statusCode': 500}
Deploy with Serverless Framework for automatic scaling:
# serverless.yml
service: mcp-streamablehttp
provider:
name: aws
runtime: python3.11
environment:
DYNAMODB_TABLE: ${self:service}-sessions
functions:
mcp:
handler: lambda_function.lambda_handler
events:
- http:
path: /mcp
method: post
cors: true
    provisionedConcurrency: 100 # Keep instances warm to avoid cold starts
resources:
Resources:
SessionsTable:
Type: AWS::DynamoDB::Table
Properties:
TableName: ${self:provider.environment.DYNAMODB_TABLE}
BillingMode: PAY_PER_REQUEST
# ... table configuration ...
Serverless deployment largely eliminates scaling concerns: Lambda scales automatically with concurrent requests while DynamoDB provides consistent session storage. Cost scales with usage, making this approach ideal for variable workloads.
Related Guides
Building a StreamableHTTP MCP server
Deploy scalable MCP servers using StreamableHTTP for cloud environments and remote access.
Configuring MCP installations for production deployments
Configure MCP servers for production with security, monitoring, and deployment best practices.
Comparing stdio vs. SSE vs. StreamableHTTP
Compare MCP transport protocols to choose between stdio, SSE, and StreamableHTTP for your use case.