Setting up StreamableHTTP for scalable deployments

Kashish Hora

Co-founder of MCPcat

The Quick Answer

Configure StreamableHTTP transport with a load balancer for horizontally scalable MCP deployments:

// server.ts
app.post('/mcp', async (req, res) => {
  const acceptsSSE = req.headers.accept?.includes('text/event-stream');
  const response = await mcpServer.handleRequest(req.body);
  
  if (response.streaming && acceptsSSE) {
    res.setHeader('Content-Type', 'text/event-stream');
    for await (const message of response.stream) {
      res.write(`data: ${JSON.stringify(message)}\n\n`);
    }
    res.end();  // close the stream once all messages are sent
  } else {
    res.json(response);
  }
});

StreamableHTTP's single-endpoint architecture enables 200x better performance under load compared to SSE transport.

Prerequisites

  • Node.js 18+ or Python 3.10+ runtime environment
  • Load balancer supporting HTTP/2 (AWS ALB, nginx, HAProxy)
  • Redis 7+ for distributed session management
  • Docker and Kubernetes/ECS for container orchestration

Installation

Install the MCP SDK with StreamableHTTP support:

# TypeScript/JavaScript
npm install @modelcontextprotocol/sdk express

# Python
pip install fastmcp uvicorn redis

Configure your load balancer for sticky sessions:

# nginx.conf
upstream mcp_servers {
    ip_hash;  # Session affinity
    server mcp1:8080;
    server mcp2:8080;
    server mcp3:8080;
}

Basic Implementation

StreamableHTTP consolidates all MCP communication through a single HTTP endpoint. Unlike the deprecated SSE transport that required separate endpoints for requests and responses, StreamableHTTP handles both request-response and streaming patterns through intelligent content negotiation. For a detailed comparison of transport options, see our guide on comparing stdio, SSE, and StreamableHTTP.
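
Seen from the client, the negotiation is just an Accept header. A minimal sketch (the endpoint URL and method name are placeholders, and this is illustrative rather than SDK code):

// client-negotiation.ts -- sketch of client-side content negotiation
const res = await fetch('https://api.example.com/mcp', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    // Advertise both formats; the server returns JSON for simple
    // request-response calls and SSE for streaming ones
    'Accept': 'application/json, text/event-stream'
  },
  body: JSON.stringify({ jsonrpc: '2.0', method: 'tools/list', id: 1 })
});

if (res.headers.get('content-type')?.includes('text/event-stream')) {
  // Streaming response: consume SSE frames from the body
  const decoder = new TextDecoder();
  for await (const chunk of res.body as any) {
    process.stdout.write(decoder.decode(chunk, { stream: true }));
  }
} else {
  // Plain JSON-RPC response
  console.log(await res.json());
}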

Create a basic StreamableHTTP server that can scale horizontally:

import express from 'express';
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { createClient } from 'redis';

const app = express();
app.use(express.json());  // parse JSON-RPC request bodies

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

const mcpServer = new Server({ 
  name: "scalable-mcp", 
  version: "1.0.0" 
});

// Session middleware: load session state from Redis when the client sends an ID
app.use(async (req, res, next) => {
  const sessionId = req.headers['mcp-session-id'];
  if (sessionId) {
    const data = await redis.get(`session:${sessionId}`);
    req.session = data ? JSON.parse(data) : undefined;
  }
  next();
});

app.post('/mcp', async (req, res) => {
  try {
    const response = await mcpServer.handleRequest(req.body);
    
    // Handle streaming responses
    if (response.streaming && req.accepts('text/event-stream')) {
      res.setHeader('Content-Type', 'text/event-stream');
      res.setHeader('Cache-Control', 'no-cache');
      res.setHeader('X-Accel-Buffering', 'no');
      
      for await (const message of response.stream) {
        res.write(`data: ${JSON.stringify(message)}\n\n`);
        // ... store last event ID for recovery ...
      }
      res.end();
    } else {
      res.json(response);
    }
  } catch (error) {
    // ... error handling ...
  }
});

// Health checks
app.get('/health', async (req, res) => {
  const redisOk = await redis.ping() === 'PONG';
  res.status(redisOk ? 200 : 503).json({ 
    status: redisOk ? 'healthy' : 'unhealthy' 
  });
});

app.listen(8080);

The server shares session state through Redis, so any instance can handle requests from an established session and strict load-balancer affinity becomes an optimization rather than a requirement. Health checks ensure the load balancer only routes traffic to healthy instances.
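
Because instances are routinely replaced during deploys and scale-downs, pair the health checks with graceful shutdown so in-flight streams can drain. A minimal sketch, assuming the Express app and Redis client from above (it replaces the bare app.listen(8080) call):

// Graceful shutdown sketch: stop accepting connections on SIGTERM,
// let in-flight requests drain, then disconnect from Redis.
const httpServer = app.listen(8080);

process.on('SIGTERM', () => {
  httpServer.close(async () => {
    await redis.quit();  // flush pending Redis commands, then close
    process.exit(0);
  });
  // Force-exit if draining outlasts the orchestrator's grace period
  setTimeout(() => process.exit(1), 25_000).unref();
});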

Container Deployment

Deploy StreamableHTTP servers in containers for consistent scaling across environments. The stateless nature of StreamableHTTP makes it ideal for container orchestration platforms. For additional Docker deployment patterns, see our guide on configuring MCP transport with Docker.

Create a production-ready Dockerfile:

FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production

FROM node:20-alpine
RUN apk add --no-cache tini
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
COPY . .

# Run as non-root user
USER node

# Health check (healthcheck.js is shown below)
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD node healthcheck.js || exit 1

# Use tini for proper signal handling
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "server.js"]
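
The healthcheck.js script referenced by HEALTHCHECK isn't defined elsewhere in this guide; a minimal version, assuming the /health route from the server above, could be:

// healthcheck.js -- minimal sketch; exits non-zero if /health is unhealthy
const http = require('http');

const req = http.get('http://localhost:8080/health', (res) => {
  process.exit(res.statusCode === 200 ? 0 : 1);
});

req.on('error', () => process.exit(1));
req.setTimeout(2000, () => {
  req.destroy();
  process.exit(1);
});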

Deploy with Kubernetes for automatic scaling:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-streamablehttp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
      - name: mcp
        image: mcp-server:latest
        ports:
        - containerPort: 8080
        env:
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: redis-secret
              key: url
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: mcp-service
spec:
  selector:
    app: mcp-server
  ports:
  - port: 80
    targetPort: 8080
  type: LoadBalancer
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600

The Kubernetes configuration ensures proper resource allocation, health monitoring, and session affinity. A Horizontal Pod Autoscaler can dynamically adjust replicas based on CPU and memory usage; a full HPA manifest appears in the Common Issues section below.

Load Balancing Strategies

Effective load balancing is crucial for StreamableHTTP deployments. Different strategies suit different use cases, from simple round-robin to sophisticated session-aware routing.

Configure AWS Application Load Balancer for production:

// AWS CDK configuration
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

const loadBalancer = new elbv2.ApplicationLoadBalancer(this, 'MCPLoadBalancer', {
  vpc,
  internetFacing: true,
  http2Enabled: true
});

const targetGroup = new elbv2.ApplicationTargetGroup(this, 'MCPTargetGroup', {
  vpc,
  port: 8080,
  protocol: elbv2.ApplicationProtocol.HTTP,
  targetType: elbv2.TargetType.IP,
  healthCheck: {
    path: '/health',
    interval: cdk.Duration.seconds(30),
    timeout: cdk.Duration.seconds(5),
    healthyThresholdCount: 2,
    unhealthyThresholdCount: 3
  },
  stickinessCookieDuration: cdk.Duration.hours(1),
  stickinessCookieName: 'MCP-SESSION'
});

// Add targets from ECS service
targetGroup.addTarget(ecsService);

loadBalancer.addListener('MCPListener', {
  port: 443,
  certificates: [certificate],
  defaultTargetGroups: [targetGroup]
});

Implement client-side connection management for failover:

class MCPClient {
  private endpoints: string[];
  private currentEndpoint: number = 0;
  private sessionId?: string;
  
  constructor(endpoints: string[]) {
    this.endpoints = endpoints;
  }
  
  async request(method: string, params?: any): Promise<any> {
    const maxRetries = this.endpoints.length;
    let lastError: Error | null = null;
    
    for (let i = 0; i < maxRetries; i++) {
      try {
        const endpoint = this.endpoints[this.currentEndpoint];
        const response = await fetch(`${endpoint}/mcp`, {
          method: 'POST',
          headers: {
            'Content-Type': 'application/json',
            'Accept': 'application/json, text/event-stream',
            ...(this.sessionId && { 'Mcp-Session-Id': this.sessionId })
          },
          body: JSON.stringify({
            jsonrpc: "2.0",
            method,
            params,
            id: Date.now()
          })
        });
        
        // Capture session ID from response
        const newSessionId = response.headers.get('Mcp-Session-Id');
        if (newSessionId) {
          this.sessionId = newSessionId;
        }
        
        if (!response.ok) throw new Error(`HTTP ${response.status}`);
        
        return await response.json();
      } catch (error) {
        lastError = error as Error;
        // Try next endpoint
        this.currentEndpoint = (this.currentEndpoint + 1) % this.endpoints.length;
      }
    }
    
    throw lastError || new Error('All endpoints failed');
  }
}

The client implementation provides automatic failover between multiple endpoints while maintaining session affinity through the Mcp-Session-Id header. This ensures continuous service even during partial outages.
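
Usage is then a drop-in replacement for a single-endpoint client; the endpoints and method name below are placeholders:

// Hypothetical usage of the MCPClient above
const client = new MCPClient([
  'https://mcp-us.example.com',
  'https://mcp-eu.example.com'
]);

const tools = await client.request('tools/list');
console.log(tools);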

Session Management

Distributed session management enables horizontal scaling while maintaining stateful connections. Redis provides the ideal backend for session storage with its low latency and built-in expiration.

Implement robust session handling:

from fastapi import FastAPI, Request, Response
from datetime import datetime
import redis.asyncio as redis
import json
import os
import uuid

app = FastAPI()
redis_client = redis.from_url("redis://localhost")

class SessionManager:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.session_ttl = 3600  # 1 hour
    
    async def create_session(self, client_info: dict) -> str:
        session_id = str(uuid.uuid4())
        session_data = {
            "id": session_id,
            "created_at": datetime.utcnow().isoformat(),
            "client_info": client_info,
            "last_activity": datetime.utcnow().isoformat(),
            # Tag the creating node so the metrics below can report distribution
            "node_id": os.environ.get("NODE_ID", "unknown")
        }
        
        await self.redis.setex(
            f"session:{session_id}",
            self.session_ttl,
            json.dumps(session_data)
        )
        return session_id
    
    async def get_session(self, session_id: str) -> dict | None:
        data = await self.redis.get(f"session:{session_id}")
        if not data:
            return None
        
        session = json.loads(data)
        # Update last activity and extend TTL
        session["last_activity"] = datetime.utcnow().isoformat()
        await self.redis.setex(
            f"session:{session_id}",
            self.session_ttl,
            json.dumps(session)
        )
        return session

session_manager = SessionManager(redis_client)

@app.post("/mcp")
async def handle_mcp(request: Request, response: Response):
    body = await request.json()
    
    # Session handling
    session_id = request.headers.get("mcp-session-id")
    session = await session_manager.get_session(session_id) if session_id else None
    
    if not session:
        # Create new session
        session_id = await session_manager.create_session({
            "user_agent": request.headers.get("user-agent"),
            "ip": request.client.host
        })
        response.headers["Mcp-Session-Id"] = session_id
    
    # Process request with session context
    result = await process_mcp_request(body, session)
    return result

Monitor session distribution across nodes:

async def get_session_metrics():
    """Collect session distribution metrics"""
    metrics = {
        "total_sessions": 0,
        "sessions_by_node": {},
        "avg_session_duration": 0
    }
    
    # Scan all sessions
    cursor = 0
    sessions = []
    
    while True:
        cursor, keys = await redis_client.scan(
            cursor, 
            match="session:*", 
            count=100
        )
        sessions.extend(keys)
        if cursor == 0:
            break
    
    metrics["total_sessions"] = len(sessions)
    
    # Analyze session distribution
    for session_key in sessions:
        session_data = await redis_client.get(session_key)
        if session_data:
            session = json.loads(session_data)
            node = session.get("node_id", "unknown")
            metrics["sessions_by_node"][node] = \
                metrics["sessions_by_node"].get(node, 0) + 1
    
    return metrics

Session management ensures users maintain context across requests while enabling the system to scale horizontally. The Redis backend provides millisecond latency for session operations.

Performance Optimization

StreamableHTTP's architecture enables significant performance optimizations. By eliminating persistent connections and supporting request batching, it achieves superior throughput under load.

Implement connection pooling and request batching:

class OptimizedMCPServer {
  private requestQueue: Map<string, Promise<any>> = new Map();
  private batchTimer?: NodeJS.Timeout;
  private pendingBatch: Array<{request: any, resolve: Function, reject: Function}> = [];
  
  async handleRequest(request: any): Promise<any> {
    // Deduplicate identical concurrent requests.
    // Exclude the JSON-RPC id from the key: two calls with the same method
    // and params should share one backend round trip even though their ids differ.
    const { id, ...rest } = request;
    const requestKey = JSON.stringify(rest);
    
    if (this.requestQueue.has(requestKey)) {
      return this.requestQueue.get(requestKey);
    }
    
    // Create promise for this request
    const promise = this.processBatchedRequest(request);
    this.requestQueue.set(requestKey, promise);
    
    // Clean up after completion
    promise.finally(() => {
      this.requestQueue.delete(requestKey);
    });
    
    return promise;
  }
  
  private processBatchedRequest(request: any): Promise<any> {
    return new Promise((resolve, reject) => {
      this.pendingBatch.push({ request, resolve, reject });
      
      // Batch requests every 10ms or when batch size reaches 50
      if (!this.batchTimer) {
        this.batchTimer = setTimeout(() => this.flushBatch(), 10);
      }
      
      if (this.pendingBatch.length >= 50) {
        this.flushBatch();
      }
    });
  }
  
  // ... flushBatch implementation ...
}
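
The flushBatch implementation is elided above. One plausible version, assuming a hypothetical processBatch() helper that takes an array of JSON-RPC requests and resolves to results in the same order:

// Inside OptimizedMCPServer -- sketch of flushBatch
private async flushBatch(): Promise<void> {
  if (this.batchTimer) {
    clearTimeout(this.batchTimer);
    this.batchTimer = undefined;
  }

  // Take ownership of the current batch so new requests start a fresh one
  const batch = this.pendingBatch;
  this.pendingBatch = [];

  try {
    // processBatch() is a hypothetical backend call that accepts an array
    // of requests and returns results in matching order
    const results = await this.processBatch(batch.map(b => b.request));
    batch.forEach((entry, i) => entry.resolve(results[i]));
  } catch (error) {
    batch.forEach(entry => entry.reject(error));
  }
}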

Configure nginx for optimal StreamableHTTP performance:

http {
    upstream mcp_backend {
        least_conn;  # Better than round-robin for long requests
        
        server mcp1:8080 max_fails=2 fail_timeout=30s;
        server mcp2:8080 max_fails=2 fail_timeout=30s;
        server mcp3:8080 max_fails=2 fail_timeout=30s;
        
        keepalive 32;  # Connection pooling
    }
    
    server {
        listen 443 ssl http2;
        server_name api.example.com;
        
        # SSL configuration
        ssl_certificate /etc/ssl/cert.pem;
        ssl_certificate_key /etc/ssl/key.pem;
        
        location /mcp {
            proxy_pass http://mcp_backend;
            
            # StreamableHTTP optimizations
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            
            # SSE support
            proxy_set_header Accept-Encoding "";
            proxy_buffering off;
            proxy_cache off;
            
            # Timeouts for long-running operations
            proxy_read_timeout 300s;
            proxy_connect_timeout 10s;
            proxy_send_timeout 60s;
            
            # Forward cookies to the app (note: cookie-based stickiness itself
            # requires nginx Plus; open-source nginx can instead hash on the
            # Mcp-Session-Id header in the upstream block)
            proxy_set_header Cookie $http_cookie;
            
            # Headers
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
        
        location /health {
            proxy_pass http://mcp_backend;
            proxy_connect_timeout 5s;
            proxy_read_timeout 5s;
        }
    }
}

These optimizations ensure StreamableHTTP maintains its 200x performance advantage over SSE transport even at scale. Request deduplication and batching reduce backend load while connection pooling minimizes latency.

Common Issues

Error: Session affinity not working with load balancer

Session affinity failures occur when load balancers don't properly route subsequent requests to the same backend instance. This breaks stateful operations and causes authentication issues.

// Solution: Implement session validation and recovery
app.use(async (req, res, next) => {
  const sessionId = req.headers['mcp-session-id'];
  
  if (sessionId) {
    const session = await redis.get(`session:${sessionId}`);
    
    if (!session) {
      // Session lost - create new one with recovery
      const newSessionId = generateSessionId();
      
      // Attempt to recover last known state
      const lastEventId = req.headers['last-event-id'];
      if (lastEventId) {
        const streamState = await redis.get(`stream:${sessionId}:${lastEventId}`);
        if (streamState) {
          await redis.setex(
            `session:${newSessionId}`,
            3600,
            streamState
          );
        }
      }
      
      res.setHeader('Mcp-Session-Id', newSessionId);
      req.session = { id: newSessionId, recovered: true };
    } else {
      req.session = JSON.parse(session);
    }
  }
  
  next();
});

To prevent session affinity issues, implement client-side session persistence and configure your load balancer with appropriate sticky session duration. Monitor session distribution to detect imbalances early.
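
Client-side session persistence can be as simple as writing the session ID to disk so a restarted client resumes with the same Mcp-Session-Id. A sketch (the file path is arbitrary):

// persist-session.ts -- sketch of client-side session persistence
import { readFileSync, writeFileSync, existsSync } from 'fs';

const SESSION_FILE = '/tmp/mcp-session-id';  // arbitrary path for the sketch

export function loadSessionId(): string | undefined {
  return existsSync(SESSION_FILE)
    ? readFileSync(SESSION_FILE, 'utf8').trim()
    : undefined;
}

export function saveSessionId(sessionId: string): void {
  writeFileSync(SESSION_FILE, sessionId);
}

Load the stored ID in the MCPClient constructor and call saveSessionId whenever a response returns a new Mcp-Session-Id.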

Error: Memory leaks during streaming responses

Memory leaks in streaming responses typically occur when event streams aren't properly closed or when backpressure isn't handled. This leads to server crashes under sustained load.

// Solution: Implement proper stream cleanup and backpressure
app.post('/mcp', async (req, res) => {
  const streamController = new AbortController();
  let streamClosed = false;
  
  // Clean up on client disconnect
  req.on('close', () => {
    streamClosed = true;
    streamController.abort();
  });
  
  res.on('error', () => {
    streamClosed = true;
    streamController.abort();
  });
  
  try {
    const response = await mcpServer.handleRequest(req.body, {
      signal: streamController.signal
    });
    
    if (response.streaming && req.accepts('text/event-stream')) {
      res.setHeader('Content-Type', 'text/event-stream');
      
      for await (const message of response.stream) {
        if (streamClosed) break;
        
        // Handle backpressure
        const canWrite = res.write(`data: ${JSON.stringify(message)}\n\n`);
        
        if (!canWrite) {
          // Pause until drain event
          await new Promise(resolve => res.once('drain', resolve));
        }
      }
      
      if (!streamClosed) {
        res.end();
      }
    } else {
      res.json(response);
    }
  } finally {
    // Ensure cleanup
    streamController.abort();
  }
});

Monitor memory usage patterns and implement circuit breakers to prevent cascading failures. Use Node.js heap snapshots to identify memory leak sources during development.
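
Node.js can write heap snapshots on demand through the built-in v8 module; a small sketch that dumps one on SIGUSR2 (the signal choice is arbitrary):

// heap-snapshot.ts -- sketch: dump a heap snapshot on SIGUSR2
import { writeHeapSnapshot } from 'v8';

process.on('SIGUSR2', () => {
  // Writes Heap-<date>-<pid>.heapsnapshot to the working directory;
  // open it in Chrome DevTools' Memory tab to find retained streams
  const file = writeHeapSnapshot();
  console.log(`Heap snapshot written to ${file}`);
});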

Error: High latency during traffic spikes

Traffic spikes can overwhelm StreamableHTTP servers if auto-scaling isn't properly configured. This manifests as increased response times and timeout errors.

# Solution: Configure aggressive auto-scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-streamablehttp
  minReplicas: 3
  maxReplicas: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30  # Fast scale-up
      policies:
      - type: Percent
        value: 100  # Double pods
        periodSeconds: 60
      - type: Pods
        value: 5  # Add 5 pods minimum
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Slow scale-down
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50  # Aggressive threshold
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  - type: External
    external:
      metric:
        name: response_time_p95
      target:
        type: Value
        value: "100"  # 100ms p95 target

Implement request queuing and rate limiting to handle burst traffic gracefully. Pre-warm containers during anticipated traffic increases to minimize cold start latency.
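
Request queuing and rate limiting can start as a dependency-free token bucket in front of the MCP route; the capacity and refill rate below are illustrative:

// rate-limit.ts -- sketch of a per-process token bucket middleware
import type { Request, Response, NextFunction } from 'express';

const CAPACITY = 100;        // max burst size (illustrative)
const REFILL_PER_SEC = 50;   // sustained requests/second (illustrative)

let tokens = CAPACITY;
let lastRefill = Date.now();

export function rateLimit(req: Request, res: Response, next: NextFunction) {
  // Refill the bucket based on elapsed time, capped at CAPACITY
  const now = Date.now();
  tokens = Math.min(CAPACITY, tokens + ((now - lastRefill) / 1000) * REFILL_PER_SEC);
  lastRefill = now;

  if (tokens < 1) {
    // Tell well-behaved clients when to retry instead of letting them time out
    res.setHeader('Retry-After', '1');
    return res.status(429).json({ error: 'rate limited' });
  }

  tokens -= 1;
  next();
}

// Wire it up: app.use('/mcp', rateLimit);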

Examples

Multi-Region Deployment with Failover

Deploy StreamableHTTP across multiple AWS regions for global availability and disaster recovery. This architecture ensures sub-100ms latency worldwide while maintaining high availability.

// Global StreamableHTTP deployment with DynamoDB Global Tables
import { DynamoDB } from '@aws-sdk/client-dynamodb';
import { Server } from '@modelcontextprotocol/sdk/server/index.js';

class GlobalMCPServer {
  private region: string;
  private dynamodb: DynamoDB;
  private server: Server;
  
  constructor(region: string) {
    this.region = region;
    this.dynamodb = new DynamoDB({ region });
    this.server = new Server({
      name: "global-mcp",
      version: "1.0.0"
    });
  }
  
  async handleRequest(request: any, headers: any): Promise<any> {
    const sessionId = headers['mcp-session-id'];
    
    // Use DynamoDB Global Tables for session state
    if (sessionId) {
      const session = await this.getGlobalSession(sessionId);
      if (session) {
        await this.updateSessionRegion(sessionId, this.region);
      }
    }
    
    const response = await this.server.handleRequest(request);
    response.headers = {
      ...response.headers,
      'X-MCP-Region': this.region
    };
    
    return response;
  }
  
  private async getGlobalSession(sessionId: string) {
    const result = await this.dynamodb.getItem({
      TableName: 'MCPSessions',
      Key: { sessionId: { S: sessionId } },
      ConsistentRead: false  // Eventually consistent for global reads
    });
    return result.Item;
  }
  
  // ... updateSessionRegion method ...
}
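
The elided updateSessionRegion could be a single UpdateItem call; a sketch (the lastRegion and lastSeen attribute names are assumptions):

// Inside GlobalMCPServer -- sketch of updateSessionRegion
private async updateSessionRegion(sessionId: string, region: string): Promise<void> {
  await this.dynamodb.updateItem({
    TableName: 'MCPSessions',
    Key: { sessionId: { S: sessionId } },
    UpdateExpression: 'SET lastRegion = :r, lastSeen = :t',
    ExpressionAttributeValues: {
      ':r': { S: region },
      ':t': { N: String(Date.now()) }
    }
  });
}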

Configure multi-region infrastructure with CloudFormation:

# cloudformation.yaml
Resources:
  GlobalTable:
    Type: AWS::DynamoDB::GlobalTable
    Properties:
      TableName: MCPSessions
      BillingMode: PAY_PER_REQUEST
      Replicas:
        - Region: us-east-1
        - Region: eu-west-1
        - Region: ap-southeast-1
      # ... attribute definitions ...

Production deployment uses Route 53 with health checks for automatic regional failover. DynamoDB Global Tables provide millisecond replication of session state across regions, ensuring users experience no interruption during regional failures.
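
One way to express that failover in infrastructure code, shown here as a CDK sketch to match the earlier load balancer example (the hosted zone, domain names, and regional endpoints are placeholders):

// Sketch: Route 53 failover records for the regional deployments (CDK)
import * as route53 from 'aws-cdk-lib/aws-route53';

const healthCheck = new route53.CfnHealthCheck(this, 'PrimaryHealth', {
  healthCheckConfig: {
    type: 'HTTPS',
    fullyQualifiedDomainName: 'us-east-1.mcp.example.com',  // placeholder
    resourcePath: '/health',
    requestInterval: 30,
    failureThreshold: 3
  }
});

new route53.CfnRecordSet(this, 'PrimaryRecord', {
  hostedZoneId: zone.hostedZoneId,  // assumes an existing hosted zone `zone`
  name: 'mcp.example.com',
  type: 'CNAME',
  setIdentifier: 'primary',
  failover: 'PRIMARY',
  healthCheckId: healthCheck.attrHealthCheckId,
  ttl: '60',
  resourceRecords: ['us-east-1.mcp.example.com']
});

new route53.CfnRecordSet(this, 'SecondaryRecord', {
  hostedZoneId: zone.hostedZoneId,
  name: 'mcp.example.com',
  type: 'CNAME',
  setIdentifier: 'secondary',
  failover: 'SECONDARY',
  ttl: '60',
  resourceRecords: ['eu-west-1.mcp.example.com']
});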

Serverless StreamableHTTP with AWS Lambda

Deploy StreamableHTTP as serverless functions for infinite scalability and zero infrastructure management. Lambda's event-driven architecture perfectly complements StreamableHTTP's stateless design.

# lambda_function.py
import json
import time
import uuid
import boto3
from mcp.server import Server
import asyncio

# Initialize outside handler for connection reuse
dynamodb = boto3.resource('dynamodb')
sessions_table = dynamodb.Table('MCPSessions')
server = Server("lambda-mcp")

def lambda_handler(event, context):
    """AWS Lambda handler for StreamableHTTP"""
    
    # Handle OPTIONS for CORS
    if event['httpMethod'] == 'OPTIONS':
        return {
            'statusCode': 200,
            'headers': {
                'Access-Control-Allow-Origin': '*',
                'Access-Control-Allow-Methods': 'POST, GET, OPTIONS',
                'Access-Control-Allow-Headers': 'Content-Type, Mcp-Session-Id'
            }
        }
    
    try:
        body = json.loads(event['body'])
        
        # Session management with DynamoDB
        session_id = event['headers'].get('mcp-session-id')
        if not session_id:
            session_id = str(uuid.uuid4())
            sessions_table.put_item(Item={
                'sessionId': session_id,
                'created': int(time.time()),
                'ttl': int(time.time()) + 3600  # 1 hour TTL
            })
        
        # Process MCP request
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        result = loop.run_until_complete(
            server.handle_request(body)
        )
        
        return {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json',
                'Access-Control-Allow-Origin': '*',
                'Mcp-Session-Id': session_id
            },
            'body': json.dumps(result)
        }
        
    except Exception as e:
        # ... error handling ...
        return {'statusCode': 500}

Deploy with Serverless Framework for automatic scaling:

# serverless.yml
service: mcp-streamablehttp

provider:
  name: aws
  runtime: python3.11
  environment:
    DYNAMODB_TABLE: ${self:service}-sessions
  
functions:
  mcp:
    handler: lambda_function.lambda_handler
    events:
      - http:
          path: /mcp
          method: post
          cors: true
    reservedConcurrency: 100  # Prevent cold starts
    
resources:
  Resources:
    SessionsTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: ${self:provider.environment.DYNAMODB_TABLE}
        BillingMode: PAY_PER_REQUEST
        # ... table configuration ...

Serverless deployment eliminates scaling concerns entirely. Lambda automatically handles millions of concurrent requests while DynamoDB provides consistent session storage. Cost scales linearly with usage, making it ideal for variable workloads.
