Rate Limiting Strategies: Token Bucket, Sliding Window & More

Implement rate limiting with token bucket, sliding window, fixed window, and Redis-based strategies. Includes code examples and API gateway patterns.

Rate limiting controls how many requests a client can make to an API within a given time window. It protects backend services from abuse, prevents resource exhaustion, and ensures fair usage across clients. Every major API — GitHub (5,000 req/hr), Stripe (100 req/sec), OpenAI (tokens-per-minute) — enforces rate limits. The choice of algorithm determines how "bursty" traffic is handled and how evenly requests are distributed. This guide covers the four main algorithms, Redis implementations, and API gateway patterns.

How Does the Token Bucket Algorithm Work?

The token bucket is the most widely used rate limiting algorithm. A bucket holds tokens up to a maximum capacity. Each request consumes one token. Tokens are added at a fixed rate. If the bucket is empty, the request is rejected. This allows controlled bursts up to the bucket size while enforcing a long-term average rate.

Token bucket implementation
interface TokenBucket {
  tokens: number;
  lastRefill: number;
  capacity: number;
  refillRate: number; // tokens per second
}

function consumeToken(bucket: TokenBucket): boolean {
  const now = Date.now();
  const elapsed = (now - bucket.lastRefill) / 1000;

  // Refill tokens based on elapsed time
  bucket.tokens = Math.min(
    bucket.capacity,
    bucket.tokens + elapsed * bucket.refillRate
  );
  bucket.lastRefill = now;

  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    return true; // Request allowed
  }
  return false; // Rate limited
}

// Example: 100 requests/minute, burst up to 20
const bucket: TokenBucket = {
  tokens: 20,
  lastRefill: Date.now(),
  capacity: 20,
  refillRate: 100 / 60, // ~1.67 tokens/sec
};

How Does the Sliding Window Algorithm Work?

The sliding window log tracks the timestamp of every request and counts how many fall within the current window. It is the most accurate algorithm but requires more memory. The sliding window counter is a memory-efficient approximation that uses weighted counts from the current and previous window.

Sliding window counter
function slidingWindowCounter(
  prevCount: number,
  currCount: number,
  windowSize: number, // in milliseconds
  limit: number
): boolean {
  const now = Date.now();
  const currWindowStart = Math.floor(now / windowSize) * windowSize;
  const elapsed = now - currWindowStart;

  // Weight previous window by how much of it overlaps
  const weight = 1 - elapsed / windowSize;
  const estimatedCount = prevCount * weight + currCount;

  return estimatedCount < limit;
}

// Example: 100 requests per 60-second window
// Previous window had 80 requests, current window has 30
// We're 45 seconds into the current window
// Estimate: 80 * (1 - 45/60) + 30 = 80 * 0.25 + 30 = 50 → under limit
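
The log variant itself is only a few lines in memory. The sketch below is illustrative (the `SlidingLog` type and `allowRequest` name are hypothetical, not from any library) and makes the memory cost visible: one timestamp is stored per allowed request.

```typescript
// In-memory sliding window log: exact counting, memory grows with traffic.
interface SlidingLog {
  timestamps: number[]; // one stored timestamp per allowed request
  windowMs: number;
  limit: number;
}

function allowRequest(log: SlidingLog, now: number = Date.now()): boolean {
  const windowStart = now - log.windowMs;
  // Drop timestamps that have aged out of the rolling window
  log.timestamps = log.timestamps.filter((t) => t > windowStart);
  if (log.timestamps.length >= log.limit) {
    return false; // Rate limited
  }
  log.timestamps.push(now);
  return true;
}
```

At steady state this holds up to `limit` timestamps per client, which is exactly why the two-counter approximation above is preferred at scale.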

How Do Fixed Window and Leaky Bucket Compare?

Fixed window counts requests in discrete time windows (e.g., per minute). It is simple but allows bursts at window boundaries — a client can make 2× the limit by sending requests at the end of one window and the start of the next.
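
The boundary burst is easy to demonstrate with a minimal in-memory counter. This is an illustrative sketch; the `FixedWindow` type and function name are hypothetical:

```typescript
// Fixed window: one counter per discrete window, reset at each boundary.
interface FixedWindow {
  windowStart: number; // start of the current window (ms)
  count: number;
  windowMs: number;
  limit: number;
}

function fixedWindowAllow(w: FixedWindow, now: number = Date.now()): boolean {
  const currentWindow = Math.floor(now / w.windowMs) * w.windowMs;
  if (currentWindow !== w.windowStart) {
    // Crossed a window boundary: reset the counter
    w.windowStart = currentWindow;
    w.count = 0;
  }
  if (w.count >= w.limit) {
    return false; // Rate limited
  }
  w.count += 1;
  return true;
}
```

With a limit of 2 per 1-second window, requests at 900 ms, 950 ms, 1000 ms, and 1001 ms all pass: four requests in about 100 ms, twice the configured rate, because the counter resets at the 1000 ms boundary.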

Leaky bucket processes requests at a fixed rate, like water dripping from a bucket. Excess requests queue up (up to a max queue size) and are processed in order. It produces the smoothest output rate but adds latency for queued requests.
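
A leaky bucket can be sketched as a bounded queue with a fixed drain rate. The names here (`LeakyBucket`, `tryEnqueue`) are illustrative; a production version would also need a worker that actually processes the queued requests:

```typescript
// Leaky bucket: requests join a bounded queue and drain at a fixed rate.
interface LeakyBucket {
  queue: number[]; // timestamps of queued requests
  maxQueue: number;
  leakRatePerSec: number; // requests drained per second
  lastLeak: number; // last drain time (ms)
}

function tryEnqueue(bucket: LeakyBucket, now: number = Date.now()): boolean {
  // Drain whole requests that have "leaked" since the last check
  const elapsed = (now - bucket.lastLeak) / 1000;
  const leaked = Math.floor(elapsed * bucket.leakRatePerSec);
  if (leaked > 0) {
    bucket.queue.splice(0, leaked); // oldest requests drain first
    // Advance lastLeak by the drained amount to keep fractional progress
    bucket.lastLeak += (leaked / bucket.leakRatePerSec) * 1000;
  }
  if (bucket.queue.length >= bucket.maxQueue) {
    return false; // Queue full: reject
  }
  bucket.queue.push(now);
  return true;
}
```

Note the trade-off the paragraph describes: an accepted request may still sit in the queue until its turn to drain, so smoothness is bought with latency.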

| Algorithm              | Burst handling           | Memory                       | Accuracy   | Best for          |
| ---------------------- | ------------------------ | ---------------------------- | ---------- | ----------------- |
| Token bucket           | Allows controlled bursts | Low (counter + timestamp)    | Good       | APIs, general use |
| Sliding window log     | Precise per-request      | High (stores each timestamp) | Exact      | Billing, auditing |
| Sliding window counter | Smoothed approximation   | Low (2 counters)             | Good       | High-scale APIs   |
| Fixed window           | Boundary bursts possible | Low (1 counter)              | Moderate   | Simple use cases  |
| Leaky bucket           | Queued, no bursts        | Medium (queue)               | Exact rate | Smoothing traffic |

How Do You Implement Rate Limiting with Redis?

Redis is the standard backing store for distributed rate limiting because of its atomic operations, sub-millisecond latency, and built-in expiration. Here are two common patterns:

Redis fixed window (INCR + EXPIRE)
import type { Redis } from 'ioredis';

async function fixedWindowLimit(
  redis: Redis,
  key: string,
  limit: number,
  windowSeconds: number
): Promise<{ allowed: boolean; remaining: number }> {
  const current = await redis.incr(key);

  if (current === 1) {
    // First request in this window — set expiration
    await redis.expire(key, windowSeconds);
  }

  return {
    allowed: current <= limit,
    remaining: Math.max(0, limit - current),
  };
}

// Usage: 100 requests per minute per user
const windowKey = `ratelimit:${userId}:${Math.floor(Date.now() / 60000)}`;
const result = await fixedWindowLimit(redis, windowKey, 100, 60);

Redis sliding window log (sorted set)
async function slidingWindowLog(
  redis: Redis,
  key: string,
  limit: number,
  windowMs: number
): Promise<{ allowed: boolean; remaining: number }> {
  const now = Date.now();
  const windowStart = now - windowMs;
  const member = `${now}:${Math.random()}`; // unique member for this request

  const pipeline = redis.pipeline();
  // Remove entries that fall outside the window
  pipeline.zremrangebyscore(key, 0, windowStart);
  // Count remaining entries
  pipeline.zcard(key);
  // Add current request
  pipeline.zadd(key, now, member);
  // Set key expiration
  pipeline.pexpire(key, windowMs);

  const results = await pipeline.exec();
  const count = (results?.[1]?.[1] as number) ?? 0;

  if (count >= limit) {
    // Over limit: remove only the entry we just added. Removing the exact
    // member (rather than a score range) avoids deleting concurrent
    // requests that share the same millisecond timestamp.
    await redis.zrem(key, member);
    return { allowed: false, remaining: 0 };
  }

  return { allowed: true, remaining: limit - count - 1 };
}

How Should APIs Communicate Rate Limits?

The IETF draft on RateLimit header fields (draft-ietf-httpapi-ratelimit-headers) standardizes how APIs communicate rate limit status. In the meantime, many APIs use the de facto X-RateLimit-* convention.

Rate limit response headers
HTTP/1.1 200 OK
X-RateLimit-Limit: 100          # Max requests in the window
X-RateLimit-Remaining: 57       # Requests remaining in current window
X-RateLimit-Reset: 1704067200   # Unix timestamp when the window resets
Retry-After: 30                 # Seconds to wait (only on 429 responses)

# When limit is exceeded
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1704067200
Retry-After: 30
Content-Type: application/json

{"error": "rate_limit_exceeded", "message": "Too many requests. Retry after 30 seconds."}
Express middleware example
import type { Request, Response, NextFunction } from 'express';

function rateLimitMiddleware(limit: number, windowSeconds: number) {
  return async (req: Request, res: Response, next: NextFunction) => {
    const windowIndex = Math.floor(Date.now() / (windowSeconds * 1000));
    const windowKey = `ratelimit:${req.ip}:${windowIndex}`;
    const result = await fixedWindowLimit(redis, windowKey, limit, windowSeconds);

    res.set('X-RateLimit-Limit', limit.toString());
    res.set('X-RateLimit-Remaining', result.remaining.toString());
    // Reset at the start of the next fixed window (a stable boundary,
    // not "now + window", which would drift with every request)
    res.set('X-RateLimit-Reset', ((windowIndex + 1) * windowSeconds).toString());

    if (!result.allowed) {
      res.set('Retry-After', windowSeconds.toString());
      res.status(429).json({
        error: 'rate_limit_exceeded',
        message: `Too many requests. Retry after ${windowSeconds} seconds.`,
      });
      return;
    }

    next();
  };
}

// Apply: 100 requests per minute
app.use('/api', rateLimitMiddleware(100, 60));

How Do API Gateways Handle Rate Limiting?

In production, rate limiting is typically handled at the API gateway layer (Nginx, Kong, AWS API Gateway, Cloudflare) rather than in application code. This offloads the work before requests reach your services.

Nginx rate limiting
# Define a rate limit zone (10 req/sec per IP, 10MB shared memory)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    location /api/ {
        # Allow bursts of 20 requests, delay after 10
        limit_req zone=api burst=20 delay=10;
        limit_req_status 429;

        proxy_pass http://backend;
    }
}

Multi-tier limiting: Apply different limits at different layers — a global per-IP limit at the gateway (e.g., 1000 req/min), a per-user limit at the application layer (e.g., 100 req/min for free tier, 1000 for paid), and a per-endpoint limit for expensive operations (e.g., 10 req/min for search).
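
The layering above can be sketched as an ordered list of checks, broadest first. This is an illustrative sketch (the `TierLimit` shape and `checkTiers` name are hypothetical); each tier would be backed by one of the counters described earlier:

```typescript
// Multi-tier limiting: a request must pass every tier to be allowed.
interface TierLimit {
  name: string; // e.g. "per-ip", "per-user", "per-endpoint"
  limit: number; // max requests in the tier's current window
  count: number; // requests already counted in that window
}

// Returns the name of the first exhausted tier, or null if the request passes.
function checkTiers(tiers: TierLimit[]): string | null {
  // First pass: reject without consuming quota if any tier is exhausted
  for (const tier of tiers) {
    if (tier.count >= tier.limit) {
      return tier.name; // useful for the 429 response body
    }
  }
  // Second pass: the request is allowed, so consume one unit from every tier
  for (const tier of tiers) {
    tier.count += 1;
  }
  return null;
}
```

Checking every tier before consuming any quota means a request rejected by the per-user tier does not burn per-IP budget.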

Key Takeaways

  • Token bucket is the best general-purpose algorithm — it allows bursts while enforcing an average rate
  • Sliding window counter offers the best accuracy-to-memory ratio for high-scale distributed systems
  • Use Redis for distributed rate limiting — its atomic operations and TTL support are purpose-built for this
  • Always return X-RateLimit-* headers and 429 status codes with Retry-After so clients can back off gracefully
  • Apply rate limits at the API gateway layer for per-IP throttling and at the application layer for per-user/tier limits
  • Use different limits per endpoint — a health check and a database-heavy search should not share the same budget

Frequently Asked Questions

What is rate limiting?

Rate limiting controls how many requests a client can make in a given time period. It protects APIs from abuse, prevents resource exhaustion, and ensures fair usage. Common limits: 100 requests per minute per API key.

Which rate limiting algorithm should I use?

Token bucket is the most popular — it allows bursts while enforcing an average rate. Sliding window is more precise but uses more memory. Fixed window is simplest but allows burst at window boundaries.

How do I implement rate limiting with Redis?

Use Redis INCR with EXPIRE for fixed window, or sorted sets with ZADD/ZRANGEBYSCORE for sliding window. Redis is ideal because it is fast, atomic, and shared across multiple application instances.
