Rate limiting controls how many requests a client can make to an API within a given time window. It protects backend services from abuse, prevents resource exhaustion, and ensures fair usage across clients. Every major API — GitHub (5,000 req/hr), Stripe (100 req/sec), OpenAI (tokens-per-minute) — enforces rate limits. The choice of algorithm determines how "bursty" traffic is handled and how evenly requests are distributed. This guide covers the four main algorithms, Redis implementations, and edge-runtime patterns. For broader API design context, see REST API best practices.
How Does the Token Bucket Algorithm Work?
The token bucket is one of the most widely used rate limiting algorithms. A bucket holds tokens up to a maximum capacity. Each request consumes one token. Tokens are refilled at a fixed rate and never exceed the capacity. If the bucket is empty, the request is rejected. This allows controlled bursts up to the bucket size while enforcing a long-term average rate equal to the refill rate.
interface TokenBucket {
tokens: number;
lastRefill: number;
capacity: number;
refillRate: number; // tokens per second
}
function consumeToken(bucket: TokenBucket): boolean {
const now = Date.now();
const elapsed = (now - bucket.lastRefill) / 1000;
// Refill tokens based on elapsed time
bucket.tokens = Math.min(
bucket.capacity,
bucket.tokens + elapsed * bucket.refillRate
);
bucket.lastRefill = now;
if (bucket.tokens >= 1) {
bucket.tokens -= 1;
return true; // Request allowed
}
return false; // Rate limited
}
// Example: 100 requests/minute, burst up to 20
const bucket: TokenBucket = {
tokens: 20,
lastRefill: Date.now(),
capacity: 20,
refillRate: 100 / 60, // ~1.67 tokens/sec
};
How Does the Sliding Window Algorithm Work?
The sliding window log tracks the timestamp of every request and counts how many fall within the current window. It is the most accurate algorithm but requires more memory. The sliding window counter is a memory-efficient approximation that uses weighted counts from the current and previous window.
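The log variant described above can be sketched in memory for a single client. This is a minimal illustration, not a production implementation: the class name `SlidingWindowLog` and the injectable `now` parameter are assumptions made for the example, and a real deployment would keep one log per client key.

```typescript
// Hypothetical in-memory sliding window log for one client.
class SlidingWindowLog {
  private timestamps: number[] = [];
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // `now` is injectable so the behavior can be checked without waiting.
  allow(now: number = Date.now()): boolean {
    const windowStart = now - this.windowMs;
    // Drop timestamps that have fallen out of the window.
    this.timestamps = this.timestamps.filter((t) => t > windowStart);
    if (this.timestamps.length >= this.limit) {
      return false; // Rate limited; rejected requests are not logged
    }
    this.timestamps.push(now);
    return true;
  }
}
```

Memory grows with the limit because every accepted request keeps a timestamp until it ages out, which is the cost the counter approximation avoids.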
function slidingWindowCounter(
prevCount: number,
currCount: number,
windowSize: number, // in milliseconds
limit: number
): boolean {
const now = Date.now();
const currWindowStart = Math.floor(now / windowSize) * windowSize;
const elapsed = now - currWindowStart;
// Weight previous window by how much of it overlaps
const weight = 1 - elapsed / windowSize;
const estimatedCount = prevCount * weight + currCount;
return estimatedCount < limit;
}
// Example: 100 requests per 60-second window
// Previous window had 80 requests, current window has 30
// We're 45 seconds into the current window
// Estimate: 80 * (1 - 45/60) + 30 = 80 * 0.25 + 30 = 50 → under limit
How Do Fixed Window and Leaky Bucket Compare?
Fixed window counts requests in discrete time windows (e.g., per minute). It is simple but allows bursts at window boundaries — a client can make 2× the limit by sending requests at the end of one window and the start of the next.
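The boundary burst is easy to reproduce with a small in-memory counter. The sketch below is illustrative only: `createFixedWindow` is a hypothetical helper, and timestamps are passed in explicitly so the two bursts can be placed on either side of the window boundary.

```typescript
// Hypothetical in-memory fixed window counter for one client.
function createFixedWindow(limit: number, windowMs: number) {
  let windowId = -1;
  let count = 0;
  return function allowRequest(now: number = Date.now()): boolean {
    const currentId = Math.floor(now / windowMs);
    if (currentId !== windowId) {
      // A new window started: reset the counter.
      windowId = currentId;
      count = 0;
    }
    if (count >= limit) return false;
    count += 1;
    return true;
  };
}

// Limit of 100 per minute: 100 requests at 59.9 s, then 100 more at 60.0 s.
const allowRequest = createFixedWindow(100, 60_000);
let accepted = 0;
for (let i = 0; i < 100; i++) if (allowRequest(59_900)) accepted++;
for (let i = 0; i < 100; i++) if (allowRequest(60_000)) accepted++;
console.log(accepted); // 200 requests accepted within 100 ms
```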
Leaky bucket processes requests at a fixed rate, like water dripping from a bucket. Excess requests queue up (up to a max queue size) and are processed in order. It produces the smoothest output rate but adds latency for queued requests.
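A minimal sketch of the queue variant follows. The `LeakyBucket` class and its method names are assumptions for illustration; the fixed output rate comes from whoever calls `leak()` on a timer, not from the class itself.

```typescript
// Hypothetical leaky bucket (queue variant).
class LeakyBucket<T> {
  private queue: T[] = [];
  private readonly maxQueueSize: number;

  constructor(maxQueueSize: number) {
    this.maxQueueSize = maxQueueSize;
  }

  // Enqueue a request; returns false when the queue is full and the request is dropped.
  offer(request: T): boolean {
    if (this.queue.length >= this.maxQueueSize) return false;
    this.queue.push(request);
    return true;
  }

  // Release the oldest queued request, or undefined if the queue is empty.
  // Calling this once per tick caps output at one request per tick.
  leak(): T | undefined {
    return this.queue.shift();
  }
}
```

Driving `leak()` from a timer that fires every 100 ms would process at most 10 requests per second in arrival order, and a request that arrives behind a full queue waits up to `maxQueueSize` ticks, which is the added latency mentioned above.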
| Algorithm | Burst handling | Memory | Accuracy | Best for |
|---|---|---|---|---|
| Token bucket | Allows controlled bursts | Low (counter + timestamp) | Good | APIs, general use |
| Sliding window log | Precise per-request | High (stores each timestamp) | Exact | Billing, auditing |
| Sliding window counter | Smoothed approximation | Low (2 counters) | Good | High-scale APIs |
| Fixed window | Boundary bursts possible | Low (1 counter) | Moderate | Simple use cases |
| Leaky bucket | Queued, no bursts | Medium (queue) | Exact rate | Smoothing traffic |
How Do You Implement Rate Limiting with Redis?
Redis is the standard backing store for distributed rate limiting because of its atomic operations, sub-millisecond latency, and built-in expiration. Here are two common patterns:
import type { Redis } from 'ioredis';
async function fixedWindowLimit(
redis: Redis,
key: string,
limit: number,
windowSeconds: number
): Promise<{ allowed: boolean; remaining: number }> {
const current = await redis.incr(key);
if (current === 1) {
// First request in this window — set expiration
await redis.expire(key, windowSeconds);
}
return {
allowed: current <= limit,
remaining: Math.max(0, limit - current),
};
}
// Usage: 100 requests per minute per user
const windowKey = `ratelimit:${userId}:${Math.floor(Date.now() / 60000)}`;
const result = await fixedWindowLimit(redis, windowKey, 100, 60);
async function slidingWindowLog(
redis: Redis,
key: string,
limit: number,
windowMs: number
): Promise<{ allowed: boolean; remaining: number }> {
const now = Date.now();
const windowStart = now - windowMs;
const pipeline = redis.pipeline();
// Remove expired entries
pipeline.zremrangebyscore(key, 0, windowStart);
// Count remaining entries
pipeline.zcard(key);
// Add current request
// Keep the member name so this exact entry can be removed if the request is rejected
const member = `${now}:${Math.random()}`;
pipeline.zadd(key, now.toString(), member);
// Set key expiration
pipeline.pexpire(key, windowMs);
const results = await pipeline.exec();
const count = (results?.[1]?.[1] as number) ?? 0;
if (count >= limit) {
// Over limit: remove only the entry we just added. Removing by score would
// also delete entries from other requests that arrived in the same millisecond.
await redis.zrem(key, member);
return { allowed: false, remaining: 0 };
}
return { allowed: true, remaining: limit - count - 1 };
}
How Should APIs Communicate Rate Limits?
The IETF draft RateLimit header fields (draft-ietf-httpapi-ratelimit-headers, revision 10 published September 2025) standardize how APIs communicate rate limit status. Many APIs already use the X-RateLimit-* convention.
HTTP/1.1 200 OK
X-RateLimit-Limit: 100 # Max requests in the window
X-RateLimit-Remaining: 57 # Requests remaining in current window
X-RateLimit-Reset: 1704067200 # Unix timestamp when the window resets
Retry-After: 30 # Seconds to wait (only on 429 responses)
# When limit is exceeded
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1704067200
Retry-After: 30
Content-Type: application/json
{"error": "rate_limit_exceeded", "message": "Too many requests. Retry after 30 seconds."}
Order middleware carefully: rate limiting should sit after CORS preflight handling so 429s carry the right headers; see the CORS guide for the layering rules.
import type { Request, Response, NextFunction } from 'express';
function rateLimitMiddleware(limit: number, windowSeconds: number) {
return async (req: Request, res: Response, next: NextFunction) => {
const nowSeconds = Math.floor(Date.now() / 1000);
const windowIndex = Math.floor(nowSeconds / windowSeconds);
// A fixed window resets at the start of the next window, not windowSeconds from now
const resetAt = (windowIndex + 1) * windowSeconds;
const windowKey = `ratelimit:${req.ip}:${windowIndex}`;
// Assumes a shared ioredis client named `redis` is in scope
const result = await fixedWindowLimit(redis, windowKey, limit, windowSeconds);
res.set('X-RateLimit-Limit', limit.toString());
res.set('X-RateLimit-Remaining', result.remaining.toString());
res.set('X-RateLimit-Reset', resetAt.toString());
if (!result.allowed) {
// Seconds left until the current window ends
const retryAfter = Math.max(1, resetAt - nowSeconds);
res.set('Retry-After', retryAfter.toString());
res.status(429).json({
error: 'rate_limit_exceeded',
message: `Too many requests. Retry after ${retryAfter} seconds.`,
});
return;
}
next();
};
}
// Apply: 100 requests per minute
app.use('/api', rateLimitMiddleware(100, 60));
How Do API Gateways Handle Rate Limiting?
In production, rate limiting is typically handled at the API gateway layer (Nginx, Kong, AWS API Gateway, Cloudflare) rather than in application code. This offloads the work before requests reach your services. The Nginx configuration guide covers the surrounding directives (proxy_pass, upstreams, TLS termination) that pair with rate limiting.
# Define a rate limit zone (10 req/sec per IP, 10MB shared memory)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
server {
location /api/ {
# Allow bursts of 20 requests, delay after 10
limit_req zone=api burst=20 delay=10;
limit_req_status 429;
proxy_pass http://backend;
}
}
Multi-tier limiting: apply different limits at different layers, such as a global per-IP limit at the gateway (e.g., 1000 req/min), a per-user limit at the application layer (e.g., 100 req/min for free tier, 1000 for paid), and a per-endpoint limit for expensive operations (e.g., 10 req/min for search).
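At the application layer, the per-user and per-endpoint tiers can be expressed as a small policy lookup. The sketch below is one possible shape, not a prescribed design: the `Tier` type, the `/api/search` prefix, the key format, and the `policiesFor` helper are all assumptions, and the numbers simply mirror the examples above.

```typescript
type Tier = "free" | "paid";

interface LimitPolicy {
  limit: number; // max requests per window
  windowSeconds: number;
}

interface LimitCheck {
  key: string; // counter key to pass to the rate limiter
  policy: LimitPolicy;
}

// Per-user limits by plan (values taken from the examples in the text).
const TIER_LIMITS: Record<Tier, LimitPolicy> = {
  free: { limit: 100, windowSeconds: 60 },
  paid: { limit: 1000, windowSeconds: 60 },
};

// Per-endpoint overrides for expensive operations (the prefix is hypothetical).
const ENDPOINT_LIMITS: Array<{ prefix: string; policy: LimitPolicy }> = [
  { prefix: "/api/search", policy: { limit: 10, windowSeconds: 60 } },
];

// Returns every check that applies; a request must pass all of them.
function policiesFor(userId: string, tier: Tier, path: string): LimitCheck[] {
  const checks: LimitCheck[] = [
    { key: `ratelimit:user:${userId}`, policy: TIER_LIMITS[tier] },
  ];
  for (const { prefix, policy } of ENDPOINT_LIMITS) {
    if (path.startsWith(prefix)) {
      checks.push({ key: `ratelimit:endpoint:${prefix}:${userId}`, policy });
    }
  }
  return checks;
}
```

Each returned key gets its own counter, so a free-tier user searching heavily exhausts the 10 req/min search budget without touching the rest of their 100 req/min allowance check logic.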
How Do You Rate Limit at the Edge?
Edge runtimes (Cloudflare Workers, Vercel Edge Functions, Deno Deploy) run in dozens of regions, so a single Redis primary in us-east-1 turns every check into a cross-continent round trip. Two patterns dominate in 2026: Cloudflare Durable Objects for strongly consistent per-key counters, and Upstash Redis REST for centralized counters with regional replicas.
Durable Objects pin all writes for a given key to a single object instance, giving you atomic increments without a separate Redis. The trade-off is locality: the object lives in one region, so requests from far away pay extra latency. Upstash flips it — REST calls hit the nearest replica, but cross-replica consistency is eventual.
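The Durable Object side can be sketched as a per-key fixed-window counter. To keep the example runnable outside a Worker, it depends only on a minimal `CounterStorage` interface that mirrors the `get` and `put` methods of a Durable Object's storage; that interface, the `RateLimiterObject` class, and the `"window"` storage key are assumptions for illustration. In a real Worker the object would receive its storage from the Durable Object state and be reached through a namespace binding.

```typescript
// Minimal shape of the storage methods this sketch relies on (an assumption
// modeled on the get/put methods of Durable Object storage).
interface CounterStorage {
  get<T>(key: string): Promise<T | undefined>;
  put<T>(key: string, value: T): Promise<void>;
}

interface WindowState {
  windowId: number;
  count: number;
}

// Hypothetical per-key counter; one object instance exists per rate-limit key.
class RateLimiterObject {
  private readonly storage: CounterStorage;

  constructor(storage: CounterStorage) {
    this.storage = storage;
  }

  async hit(
    limit: number,
    windowMs: number,
    now: number = Date.now()
  ): Promise<{ allowed: boolean; remaining: number }> {
    const windowId = Math.floor(now / windowMs);
    const saved = await this.storage.get<WindowState>("window");
    // Start from zero when there is no state or the window has rolled over.
    const count = saved !== undefined && saved.windowId === windowId ? saved.count : 0;
    if (count >= limit) {
      return { allowed: false, remaining: 0 };
    }
    await this.storage.put<WindowState>("window", { windowId, count: count + 1 });
    return { allowed: true, remaining: limit - count - 1 };
  }
}
```

Because every request for a given key is routed to the same instance, the read and write above act on a single authoritative counter rather than on per-region copies.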
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis/cloudflare';
export default {
async fetch(req: Request, env: Env): Promise<Response> {
const ratelimit = new Ratelimit({
redis: Redis.fromEnv(env),
// Sliding window: 100 requests per 60 seconds
limiter: Ratelimit.slidingWindow(100, '60 s'),
analytics: true,
prefix: 'ratelimit:api',
});
const ip = req.headers.get('cf-connecting-ip') ?? 'anonymous';
const { success, limit, remaining, reset } = await ratelimit.limit(ip);
if (!success) {
return new Response('Rate limit exceeded', {
status: 429,
headers: {
'X-RateLimit-Limit': limit.toString(),
'X-RateLimit-Remaining': remaining.toString(),
'X-RateLimit-Reset': Math.ceil(reset / 1000).toString(), // reset is in ms; the header uses Unix seconds
'Retry-After': Math.ceil((reset - Date.now()) / 1000).toString(),
},
});
}
return fetch(req);
},
};
What Are the Common Rate Limiting Pitfalls?
Trusting raw client IP behind a proxy. req.ip behind a CDN or load balancer is the proxy IP, not the client. Configure trust proxy explicitly (Express: app.set('trust proxy', 'loopback, linklocal, uniquelocal')) and only honor X-Forwarded-For from known infrastructure — otherwise attackers spoof it to evade limits.
Non-atomic counter updates. A read-modify-write outside Redis (e.g., SQL SELECT then UPDATE) races under concurrency. Use INCR, Lua scripts, or Durable Objects for atomicity.
Boundary bursts on fixed windows. A naive 1-minute fixed window lets a client send the full quota in the last second of one window and the first second of the next — 2× the limit in 2 seconds. Use sliding window counter or token bucket for steady throughput.
Hard-blocking instead of returning 429 with Retry-After. Closing the connection or returning 503 leaves clients with no signal to back off. Always return 429 Too Many Requests with Retry-After so well-behaved clients pause instead of retry-storming.
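For the atomicity pitfall, one common approach is to move the increment and the expiration into a single Lua script, so no other command can run between them and the key cannot be left without a TTL if the caller crashes after `INCR`. The sketch below is an assumption-laden illustration: `EvalClient` is a minimal interface standing in for a Redis client (an ioredis client's `eval` method is assumed to be compatible), and `atomicFixedWindow` is a hypothetical name.

```typescript
// Minimal shape of the client call used here (an assumption; not a library type).
interface EvalClient {
  eval(script: string, numKeys: number, ...args: Array<string | number>): Promise<unknown>;
}

// INCR and EXPIRE execute as one unit inside Redis.
const FIXED_WINDOW_SCRIPT = `
local current = redis.call('INCR', KEYS[1])
if current == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current
`;

async function atomicFixedWindow(
  redis: EvalClient,
  key: string,
  limit: number,
  windowSeconds: number
): Promise<{ allowed: boolean; remaining: number }> {
  // One key (KEYS[1]) and one argument (ARGV[1] = TTL in seconds).
  const current = Number(await redis.eval(FIXED_WINDOW_SCRIPT, 1, key, windowSeconds));
  return {
    allowed: current <= limit,
    remaining: Math.max(0, limit - current),
  };
}
```

The return shape matches `fixedWindowLimit`, so the two are interchangeable at the call site.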
How Should You Layer Rate Limits in Production?
- Token bucket is the best general-purpose algorithm: it allows bursts while enforcing an average rate
- Sliding window counter offers the best accuracy-to-memory ratio for high-scale distributed systems
- Use Redis for centralized rate limiting and Durable Objects or Upstash for edge runtimes; pick by where your compute runs
- Always return X-RateLimit-* headers and 429 status codes with Retry-After so clients can back off gracefully
- Apply rate limits at the API gateway layer for per-IP throttling and at the application layer for per-user/tier limits
- Use different limits per endpoint: a health check and a database-heavy search should not share the same budget
- Rate limiting protects against per-identity abuse; volumetric L3/L4 floods need DDoS scrubbing or WAF at the edge, not application counters
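Backing off gracefully also depends on the client reading the header. As a rough sketch of that side, the hypothetical helper below converts a Retry-After value, which HTTP allows to be either a number of seconds or an HTTP-date, into a delay in milliseconds. The name `retryAfterMs` and the injectable `now` parameter are assumptions for the example.

```typescript
// Hypothetical client-side helper: Retry-After header value -> delay in ms.
// Returns null when the header is missing or cannot be parsed.
function retryAfterMs(headerValue: string | null, now: number = Date.now()): number | null {
  if (headerValue === null) return null;
  const trimmed = headerValue.trim();
  // Form 1: delay in whole seconds, e.g. "30".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  // Form 2: an HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const date = Date.parse(trimmed);
  if (Number.isNaN(date)) return null;
  // A date in the past means the client may retry immediately.
  return Math.max(0, date - now);
}
```

A client would typically call this with `response.headers.get('Retry-After')` after a 429 and sleep for the returned duration before retrying, falling back to its own backoff schedule when the result is `null`.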
References
- IETF draft-ietf-httpapi-ratelimit-headers — the standard RateLimit and RateLimit-Policy HTTP header fields
- Stripe API rate limits — production rate-limit policy and 429 handling guidance from Stripe
- GitHub REST API rate limits — primary and secondary rate limits with header conventions
- Cloudflare rate limiting rules — edge rate-limit configuration and characteristics for the Cloudflare WAF
- @upstash/ratelimit — TypeScript rate-limit library for serverless and edge runtimes
- Nginx ngx_http_limit_req_module — official documentation for limit_req_zone and limit_req directives