Rate limiting controls how many requests a client can make to an API within a given time window. It protects backend services from abuse, prevents resource exhaustion, and ensures fair usage across clients. Every major API — GitHub (5,000 req/hr), Stripe (100 req/sec), OpenAI (tokens-per-minute) — enforces rate limits. The choice of algorithm determines how "bursty" traffic is handled and how evenly requests are distributed. This guide covers the main algorithms, Redis implementations, and API gateway patterns.
How Does the Token Bucket Algorithm Work?
The token bucket is the most widely used rate limiting algorithm. A bucket holds tokens up to a maximum capacity. Each request consumes one token. Tokens are added at a fixed rate. If the bucket is empty, the request is rejected. This allows controlled bursts up to the bucket size while enforcing a long-term average rate.
```typescript
interface TokenBucket {
  tokens: number;
  lastRefill: number;
  capacity: number;
  refillRate: number; // tokens per second
}

function consumeToken(bucket: TokenBucket): boolean {
  const now = Date.now();
  const elapsed = (now - bucket.lastRefill) / 1000;
  // Refill tokens based on elapsed time
  bucket.tokens = Math.min(
    bucket.capacity,
    bucket.tokens + elapsed * bucket.refillRate
  );
  bucket.lastRefill = now;
  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    return true; // Request allowed
  }
  return false; // Rate limited
}

// Example: 100 requests/minute, burst up to 20
const bucket: TokenBucket = {
  tokens: 20,
  lastRefill: Date.now(),
  capacity: 20,
  refillRate: 100 / 60, // ~1.67 tokens/sec
};
```

How Does the Sliding Window Algorithm Work?
The sliding window log tracks the timestamp of every request and counts how many fall within the current window. It is the most accurate algorithm but requires more memory. The sliding window counter is a memory-efficient approximation that uses weighted counts from the current and previous window.
```typescript
function slidingWindowCounter(
  prevCount: number,
  currCount: number,
  windowSize: number, // in milliseconds
  limit: number
): boolean {
  const now = Date.now();
  const currWindowStart = Math.floor(now / windowSize) * windowSize;
  const elapsed = now - currWindowStart;
  // Weight previous window by how much of it overlaps
  const weight = 1 - elapsed / windowSize;
  const estimatedCount = prevCount * weight + currCount;
  return estimatedCount < limit;
}

// Example: 100 requests per 60-second window
// Previous window had 80 requests, current window has 30
// We're 45 seconds into the current window
// Estimate: 80 * (1 - 45/60) + 30 = 80 * 0.25 + 30 = 50 → under limit
```

How Do Fixed Window and Leaky Bucket Compare?
Fixed window counts requests in discrete time windows (e.g., per minute). It is simple but allows bursts at window boundaries — a client can make 2× the limit by sending requests at the end of one window and the start of the next.
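The boundary problem is easy to demonstrate. In the sketch below (the `fixedWindowAllows` helper and in-memory counter map are illustrative, not from the implementations later in this guide), a client lands 200 requests in about one second of wall-clock time against a 100-per-minute limit:

```typescript
// Illustrative fixed-window check: one counter per discrete window id.
function fixedWindowAllows(
  counts: Map<number, number>, // window id -> request count
  timestampMs: number,
  windowMs: number,
  limit: number
): boolean {
  const windowId = Math.floor(timestampMs / windowMs);
  const count = counts.get(windowId) ?? 0;
  if (count >= limit) return false;
  counts.set(windowId, count + 1);
  return true;
}

const counts = new Map<number, number>();
let allowed = 0;
// 100 requests at t=59.5s, the tail end of window 0
for (let i = 0; i < 100; i++) {
  if (fixedWindowAllows(counts, 59_500, 60_000, 100)) allowed++;
}
// 100 more at t=60.5s, the start of window 1
for (let i = 0; i < 100; i++) {
  if (fixedWindowAllows(counts, 60_500, 60_000, 100)) allowed++;
}
// All 200 requests pass, one second apart, despite the 100/min limit
```

Each burst lands in a different window id, so both pass their window's counter independently; this is exactly the 2× boundary burst the sliding window variants are designed to prevent.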
Leaky bucket processes requests at a fixed rate, like water dripping from a bucket. Excess requests queue up (up to a max queue size) and are processed in order. It produces the smoothest output rate but adds latency for queued requests.
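A minimal in-memory sketch of the leaky bucket's queueing behavior (the names `offer`, `leakRatePerSec`, and `maxQueue` are illustrative, not from any specific library):

```typescript
// Leaky bucket sketch: arrivals join a bounded queue that drains at a fixed rate.
interface LeakyBucket {
  queue: number[];        // timestamps (ms) of queued requests
  maxQueue: number;       // max requests allowed to wait
  leakRatePerSec: number; // requests processed per second
  lastLeak: number;       // last drain time (ms)
}

function offer(bucket: LeakyBucket, now: number): boolean {
  // Drain the requests that would have been processed since lastLeak
  const leaked = Math.floor(((now - bucket.lastLeak) / 1000) * bucket.leakRatePerSec);
  if (leaked > 0) {
    bucket.queue.splice(0, leaked);
    bucket.lastLeak = now;
  }
  if (bucket.queue.length >= bucket.maxQueue) {
    return false; // queue full -> reject
  }
  bucket.queue.push(now);
  return true;
}

// Drain 10 req/sec with a queue of 5: a burst of 8 simultaneous requests
const leaky: LeakyBucket = { queue: [], maxQueue: 5, leakRatePerSec: 10, lastLeak: 0 };
const results = Array.from({ length: 8 }, () => offer(leaky, 0));
// First 5 are queued, the remaining 3 are rejected
```

Note the trade-off the paragraph describes: the 5 queued requests are not rejected, but the later ones wait behind earlier ones before being processed, which shows up as added latency.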
| Algorithm | Burst handling | Memory | Accuracy | Best for |
|---|---|---|---|---|
| Token bucket | Allows controlled bursts | Low (counter + timestamp) | Good | APIs, general use |
| Sliding window log | Precise per-request | High (stores each timestamp) | Exact | Billing, auditing |
| Sliding window counter | Smoothed approximation | Low (2 counters) | Good | High-scale APIs |
| Fixed window | Boundary bursts possible | Low (1 counter) | Moderate | Simple use cases |
| Leaky bucket | Queued, no bursts | Medium (queue) | Exact rate | Smoothing traffic |
How Do You Implement Rate Limiting with Redis?
Redis is the standard backing store for distributed rate limiting because of its atomic operations, sub-millisecond latency, and built-in expiration. Here are two common patterns:
```typescript
import type { Redis } from 'ioredis';

async function fixedWindowLimit(
  redis: Redis,
  key: string,
  limit: number,
  windowSeconds: number
): Promise<{ allowed: boolean; remaining: number }> {
  const current = await redis.incr(key);
  if (current === 1) {
    // First request in this window — set expiration
    await redis.expire(key, windowSeconds);
  }
  return {
    allowed: current <= limit,
    remaining: Math.max(0, limit - current),
  };
}

// Usage: 100 requests per minute per user
// (assumes an ioredis client `redis` and a `userId` in scope)
const windowKey = `ratelimit:${userId}:${Math.floor(Date.now() / 60000)}`;
const result = await fixedWindowLimit(redis, windowKey, 100, 60);
```

```typescript
async function slidingWindowLog(
  redis: Redis,
  key: string,
  limit: number,
  windowMs: number
): Promise<{ allowed: boolean; remaining: number }> {
  const now = Date.now();
  const windowStart = now - windowMs;
  // Unique member so concurrent requests in the same millisecond don't collide
  const member = `${now}:${Math.random()}`;
  const pipeline = redis.pipeline();
  // Remove expired entries
  pipeline.zremrangebyscore(key, 0, windowStart);
  // Count remaining entries
  pipeline.zcard(key);
  // Add current request
  pipeline.zadd(key, now, member);
  // Set key expiration
  pipeline.pexpire(key, windowMs);
  const results = await pipeline.exec();
  const count = (results?.[1]?.[1] as number) ?? 0;
  if (count >= limit) {
    // Over limit — remove exactly the member we just added
    await redis.zrem(key, member);
    return { allowed: false, remaining: 0 };
  }
  return { allowed: true, remaining: limit - count - 1 };
}
```

How Should APIs Communicate Rate Limits?
The IETF draft RateLimit header fields (draft-ietf-httpapi-ratelimit-headers) standardize how APIs communicate rate limit status. Many APIs already use the X-RateLimit-* convention.
```http
HTTP/1.1 200 OK
X-RateLimit-Limit: 100        # Max requests in the window
X-RateLimit-Remaining: 57     # Requests remaining in current window
X-RateLimit-Reset: 1704067200 # Unix timestamp when the window resets

# When the limit is exceeded
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1704067200
Retry-After: 30               # Seconds to wait (only on 429 responses)
Content-Type: application/json

{"error": "rate_limit_exceeded", "message": "Too many requests. Retry after 30 seconds."}
```

```typescript
import type { Request, Response, NextFunction } from 'express';

// Assumes the ioredis client `redis`, the `fixedWindowLimit` helper above,
// and an Express `app` are in scope.
function rateLimitMiddleware(limit: number, windowSeconds: number) {
  return async (req: Request, res: Response, next: NextFunction) => {
    const windowMs = windowSeconds * 1000;
    const windowKey = `ratelimit:${req.ip}:${Math.floor(Date.now() / windowMs)}`;
    const result = await fixedWindowLimit(redis, windowKey, limit, windowSeconds);
    // Reset at the start of the next fixed window (Unix seconds)
    const resetTime = (Math.floor(Date.now() / windowMs) + 1) * windowSeconds;
    res.set('X-RateLimit-Limit', limit.toString());
    res.set('X-RateLimit-Remaining', result.remaining.toString());
    res.set('X-RateLimit-Reset', resetTime.toString());
    if (!result.allowed) {
      const retryAfter = Math.max(1, resetTime - Math.floor(Date.now() / 1000));
      res.set('Retry-After', retryAfter.toString());
      res.status(429).json({
        error: 'rate_limit_exceeded',
        message: `Too many requests. Retry after ${retryAfter} seconds.`,
      });
      return;
    }
    next();
  };
}

// Apply: 100 requests per minute
app.use('/api', rateLimitMiddleware(100, 60));
```

How Do API Gateways Handle Rate Limiting?
In production, rate limiting is typically handled at the API gateway layer (Nginx, Kong, AWS API Gateway, Cloudflare) rather than in application code. This offloads the work before requests reach your services.
```nginx
# Define a rate limit zone (10 req/sec per IP, 10MB shared memory)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    location /api/ {
        # Allow bursts of 20 requests, delay after 10
        limit_req zone=api burst=20 delay=10;
        limit_req_status 429;
        proxy_pass http://backend;
    }
}
```

Multi-tier limiting: Apply different limits at different layers — a global per-IP limit at the gateway (e.g., 1000 req/min), a per-user limit at the application layer (e.g., 100 req/min for free tier, 1000 for paid), and a per-endpoint limit for expensive operations (e.g., 10 req/min for search).
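To make the layering concrete, here is a hypothetical sketch of how the tiers compose in application code. The in-memory counters stand in for the Redis patterns shown earlier, and the tier keys and limits are illustrative:

```typescript
// A request must pass every applicable tier: global per-IP,
// per-user plan, and per-endpoint budget.
interface RequestInfo {
  ip: string;
  userId: string;
  path: string;
}

interface Tier {
  key: (req: RequestInfo) => string;
  limit: number;
}

const tiers: Tier[] = [
  { key: (r) => `ip:${r.ip}`, limit: 1000 },              // gateway-style per-IP
  { key: (r) => `user:${r.userId}`, limit: 100 },         // free-tier per-user
  { key: (r) => `ep:${r.path}:${r.userId}`, limit: 10 },  // expensive endpoint
];

// In-memory counters for illustration; in production each tier would be
// a separate Redis key with its own window and TTL.
const counters = new Map<string, number>();

function allowRequest(req: RequestInfo): boolean {
  // Check every tier first so a rejected request consumes no budget
  for (const tier of tiers) {
    if ((counters.get(tier.key(req)) ?? 0) >= tier.limit) return false;
  }
  for (const tier of tiers) {
    const k = tier.key(req);
    counters.set(k, (counters.get(k) ?? 0) + 1);
  }
  return true;
}
```

The strictest matching tier wins: a free-tier user hitting an expensive endpoint is capped at 10 req/min even though the per-user and per-IP budgets still have room.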
Key Takeaways
- Token bucket is the best general-purpose algorithm — it allows bursts while enforcing an average rate
- Sliding window counter offers the best accuracy-to-memory ratio for high-scale distributed systems
- Use Redis for distributed rate limiting — its atomic operations and TTL support are purpose-built for this
- Always return `X-RateLimit-*` headers and `429` status codes with `Retry-After` so clients can back off gracefully
- Apply rate limits at the API gateway layer for per-IP throttling and at the application layer for per-user/tier limits
- Use different limits per endpoint — a health check and a database-heavy search should not share the same budget