
Designing a Rate Limiter

Difficulty: Beginner–Intermediate 📋 Prerequisites: None (this is a great first system design problem) ⏱️ Reading time: 15 min


TL;DR

A rate limiter blocks users who send too many requests. It protects your servers from being overwhelmed.

flowchart LR
    CLIENT["Client"]:::client
    EDGE["Edge<br/>IP blocking"]:::edge
    GW["API Gateway<br/>per-user limits"]:::edge
    RL["Rate Limiter<br/>checks Redis"]:::service
    REDIS[("Redis<br/>counters")]:::data
    API["Your API"]:::service

    CLIENT --> EDGE
    EDGE --> GW
    GW --> RL
    RL --> REDIS
    GW -->|"allowed"| API
    GW -->|"rejected 429"| CLIENT

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef edge fill:#42A5F5,stroke:#0D47A1,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef data fill:#FFCA28,stroke:#F57F17,color:#000

In 3 sentences: Every request passes through a rate limiter before reaching your API. The limiter checks a counter in Redis — if under the limit, allow and increment; if over, reject with HTTP 429. Multiple layers (edge + gateway + service) protect different things.


Understanding the Problem

What is a rate limiter? When you use an API — say Twitter or Stripe — you can only make a certain number of requests per minute. Go over the limit and you get a “429 Too Many Requests” error. That’s a rate limiter.

Why do we need it?

Real examples:


Naive First Cut

The simplest possible rate limiter:

flowchart LR
    CLIENT["Client"]:::client
    API["API Server<br/>HashMap counter"]:::service
    DB[("Your DB")]:::data

    CLIENT --> API
    API --> DB

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef data fill:#FFCA28,stroke:#F57F17,color:#000

Keep a HashMap<userId, requestCount> inside the API server. On each request: if count < limit → allow, else → reject.
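The check described above can be sketched in a few lines of Python. This is only an illustration of the naive design: the class name `NaiveRateLimiter` and the `allow` method are hypothetical, and the counts are never reset and are visible only to the one process that holds them.

```python
from collections import defaultdict


class NaiveRateLimiter:
    """Per-process request counter, mirroring HashMap<userId, requestCount>."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.counts = defaultdict(int)  # user_id -> requests seen so far

    def allow(self, user_id: str) -> bool:
        # Allow while the user is under the limit, and count the request.
        if self.counts[user_id] < self.limit:
            self.counts[user_id] += 1
            return True
        # At or over the limit: the caller should answer with HTTP 429.
        return False


limiter = NaiveRateLimiter(limit=3)
results = [limiter.allow("alice") for _ in range(4)]
```

With a limit of 3, the fourth call for the same user is rejected, while a different user still starts from zero.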

Why this breaks:


The Solution: Shared Counter in Redis

flowchart LR
    CLIENT["Client"]:::client
    POD1["API Pod 1"]:::service
    POD2["API Pod 2"]:::service
    POD3["API Pod 3"]:::service
    REDIS[("Redis<br/>shared counters")]:::data

    CLIENT --> POD1
    CLIENT --> POD2
    CLIENT --> POD3
    POD1 --> REDIS
    POD2 --> REDIS
    POD3 --> REDIS

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef data fill:#FFCA28,stroke:#F57F17,color:#000

All pods check the SAME counter in Redis. It doesn’t matter which pod handles the request: every pod sees the same global count, so the limit holds across the fleet as long as each check-and-increment is atomic.

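💡 What is Redis? An in-memory database that typically answers a simple command in well under 1 millisecond on the server side (the network round trip comes on top of that). It suits counters because it is fast enough to check on every single request without noticeably slowing down your API.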
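A minimal sketch of the shared counter, assuming a client object with redis-py style `incr` and `expire` methods. The function name `allow_request`, the key format, and the `FakeRedis` stand-in are all hypothetical; the stand-in exists only so the example runs without a Redis server.

```python
def allow_request(client, user_id: str, limit: int, window_seconds: int = 60) -> bool:
    """Count one request against a shared counter and report if it is allowed."""
    key = f"ratelimit:{user_id}"  # hypothetical key naming scheme
    count = client.incr(key)  # increments atomically; a missing key starts at 1
    if count == 1:
        # First request of the window: start the expiry clock on the counter.
        client.expire(key, window_seconds)
    return count <= limit


class FakeRedis:
    """In-memory stand-in exposing only the two methods used above."""

    def __init__(self) -> None:
        self.store = {}
        self.ttl = {}

    def incr(self, key: str) -> int:
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]

    def expire(self, key: str, seconds: int) -> bool:
        self.ttl[key] = seconds
        return True


shared = FakeRedis()
results = [allow_request(shared, "alice", limit=2) for _ in range(3)]
```

Note that `incr` and `expire` are two separate calls here. If a pod dies between them, the key is left without an expiry, which is one reason to move the whole check into a single Lua script.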


Rate Limiting Algorithms

There are 4 main approaches. You need to know all 4 for interviews, but Token Bucket is the most common in production.

Algorithm 1: Fixed Window

flowchart LR
    subgraph Window1["Window: 12:00 - 12:01"]
        C1["count = 73"]
    end
    subgraph Window2["Window: 12:01 - 12:02"]
        C2["count = 12"]
    end

    classDef default fill:#1e1e2e,stroke:#6366f1,color:#e2e8f0
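A fixed window limiter keeps one counter per clock-aligned window, as the diagram suggests. The sketch below is an in-memory illustration under that assumption; `FixedWindowLimiter` is a hypothetical name, time is passed in as `now` so the behavior is easy to follow, and old windows are never evicted.

```python
class FixedWindowLimiter:
    """One counter per (user, window index); a new window starts from zero."""

    def __init__(self, limit: int, window_seconds: int = 60) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = {}  # (user_id, window index) -> count; never evicted here

    def allow(self, user_id: str, now: float) -> bool:
        # Requests in the same clock-aligned window share a counter.
        window = int(now // self.window_seconds)
        key = (user_id, window)
        count = self.counts.get(key, 0)
        if count >= self.limit:
            return False
        self.counts[key] = count + 1
        return True


limiter = FixedWindowLimiter(limit=2)
# Two requests just before the boundary and two just after it.
boundary_burst = [limiter.allow("alice", t) for t in (59.0, 59.5, 60.0, 60.5)]
```

All four requests in `boundary_burst` are allowed even though the limit is 2 per minute, because the counter resets at the 60-second boundary. This is the boundary burst that the sliding window variants avoid.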

Algorithm 2: Sliding Window (approximate)

Instead of hard window boundaries, use a weighted combination:

estimated_count = current_window_count + previous_window_count × overlap_ratio

Example: At 12:00:45 (45 seconds into the new window):

✅ Smooth. No boundary bursts. Error of about 1%.
✅ Only stores 2 counters (current + previous). Same memory as fixed window.
❌ Approximate: could allow 1-2% over the limit.

💡 Cloudflare uses this for billions of rate-limit checks per day. The 1% inaccuracy is acceptable.
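The weighted estimate can be worked through in code. This sketch assumes `overlap_ratio` is the fraction of the previous window that still falls inside the sliding window, i.e. (window length minus time elapsed in the current window) divided by window length. The counts 30 and 80 are made-up inputs for illustration.

```python
def estimated_count(current_count: int, previous_count: int,
                    elapsed_in_window: float, window_seconds: float = 60.0) -> float:
    """Approximate the request count over the last `window_seconds`."""
    # Share of the previous window still covered by the sliding window.
    overlap_ratio = (window_seconds - elapsed_in_window) / window_seconds
    return current_count + previous_count * overlap_ratio


# 45 seconds into the current window, so 15 of the previous 60 seconds still count.
estimate = estimated_count(current_count=30, previous_count=80, elapsed_in_window=45)
```

Here the overlap ratio is 0.25, so the estimate is 30 + 80 × 0.25 = 50. The request is allowed if that estimate is below the limit.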

Algorithm 3: Token Bucket ⭐ (most common)

flowchart TD
    BUCKET["Bucket<br/>max = 10 tokens<br/>current = 7"]:::data
    REFILL["Refill: +1 token every 6 seconds"]:::service
    REQ["Request arrives<br/>costs 1 token"]:::client

    REFILL -->|"adds tokens"| BUCKET
    REQ -->|"removes 1 token"| BUCKET

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef data fill:#FFCA28,stroke:#F57F17,color:#000

How it works:

  1. Each user has a “bucket” with a maximum capacity (say 10 tokens).
  2. Tokens refill at a steady rate (say 1 token every 6 seconds = 10/minute).
  3. Each request costs 1 token.
  4. If bucket is empty → reject (429).

Why it’s the best for APIs:

Redis implementation (pseudocode):

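-- refill for the time elapsed since the last refill (refill_rate is tokens per second)
tokens = current_tokens + (now - last_refill) * refill_rate
tokens = min(tokens, max_tokens)   -- cap at bucket size
last_refill = now                  -- otherwise the same elapsed time is credited again on the next call
if tokens >= 1:
    tokens -= 1
    ALLOW
else:
    REJECT (429)
-- save tokens and last_refill for the next request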

💡 Stripe has publicly described using Token Bucket, and it is a common choice for production APIs. It tends to give good UX because it allows natural burst behavior.
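The same steps can be written as a small in-memory Python class. This is a sketch, not the Redis version: `TokenBucket` and its field names are hypothetical, the defaults mirror the 10-token bucket refilling at 1 token every 6 seconds from the diagram, and `now` is passed in explicitly.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    max_tokens: float = 10.0
    refill_rate: float = 1 / 6  # tokens per second (10 per minute)
    tokens: float = 10.0        # the bucket starts full
    last_refill: float = 0.0    # timestamp of the last refill, in seconds

    def allow(self, now: float) -> bool:
        # Credit tokens for the time elapsed since the last call, capped at bucket size.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.max_tokens, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1  # each request costs one token
            return True
        return False  # empty bucket: reject with 429


bucket = TokenBucket()
burst = [bucket.allow(0.0) for _ in range(11)]
```

A full bucket absorbs a burst of 10 requests at once and rejects the 11th. After a few seconds of refill, one more request gets through.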

Algorithm 4: Sliding Window Log
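As commonly described, a sliding window log stores a timestamp for every accepted request and counts how many fall inside the last window. The sketch below assumes that definition; `SlidingWindowLog` is a hypothetical name, and one instance tracks a single user.

```python
from collections import deque


class SlidingWindowLog:
    """Exact limit over a rolling window, kept as a log of request timestamps."""

    def __init__(self, limit: int, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()  # accepted request times, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the rolling window.
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


log = SlidingWindowLog(limit=2)
results = [log.allow(t) for t in (0.0, 59.0, 59.5, 60.0, 60.5)]
```

The count is exact at every instant, but memory grows with the number of requests inside the window, since each one stores its own timestamp.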

Which to pick?

flowchart TD
    START["What do you need?"]:::client
    BURST["Allow short bursts?"]:::service
    EXACT["Need exact precision?"]:::service
    TB["Token Bucket ⭐"]:::data
    SWC["Sliding Window Counter"]:::data
    SWL["Sliding Window Log"]:::data
    FW["Fixed Window"]:::data

    START -->|"Yes bursts OK"| BURST
    START -->|"No just hard cap"| EXACT
    BURST -->|"Yes"| TB
    EXACT -->|"Yes exact"| SWL
    EXACT -->|"No approx fine"| SWC
    START -->|"Simplest possible"| FW

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef data fill:#FFCA28,stroke:#F57F17,color:#000

Where to Rate Limit (3 layers)

flowchart LR
    CLIENT["Client"]:::client
    L1["Layer 1: Edge<br/>Cloudflare WAF<br/>IP-level DDoS"]:::edge
    L2["Layer 2: Gateway<br/>Kong or Envoy<br/>per-API-key limits"]:::edge
    L3["Layer 3: Service<br/>your code<br/>domain-specific"]:::service
    API["Backend"]:::service

    CLIENT --> L1
    L1 --> L2
    L2 --> L3
    L3 --> API

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef edge fill:#42A5F5,stroke:#0D47A1,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff

| Layer | What it blocks | Key | Example |
| --- | --- | --- | --- |
| Edge (Cloudflare/WAF) | DDoS, bots, abusive IPs | IP address | “No IP can send >1000 req/sec” |
| Gateway (Kong/Envoy) | Per-user quota enforcement | API key or user ID | “Free tier: 100/min. Paid: 10000/min” |
| Service level | Domain-specific limits | Per resource | “Max 5 password reset emails/hour” |

💡 Why multiple layers? Edge blocks volumetric attacks cheaply (before they hit your servers). Gateway enforces business rules. Service handles logic that only your code understands.


What Happens When Redis Goes Down?

This is a classic interview question. Three options:

flowchart TD
    DOWN["Redis is down"]:::data
    FC["Fail-Closed<br/>reject all requests"]:::service
    FO["Fail-Open<br/>allow all requests"]:::service
    FB["Fallback<br/>local in-memory bucket"]:::service

    DOWN --> FC
    DOWN --> FO
    DOWN --> FB

    FC -->|"❌ Global outage"| BAD["Users locked out"]:::client
    FO -->|"⚠️ No protection"| OK["Backend might overload"]:::client
    FB -->|"✅ Graceful"| GOOD["Slightly inaccurate but safe"]:::client

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef data fill:#FFCA28,stroke:#F57F17,color:#000

Best answer for interviews: “Fail-open with a local fallback. If Redis is unreachable, each pod switches to a local in-memory token bucket. Less accurate (each pod enforces limit/N independently) but the API stays up. Alert on Redis being down so ops investigates.”
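A rough sketch of that fallback path. To stay short it uses a plain local counter rather than a full token bucket, and it assumes a hypothetical `remote_check` callable that raises the built-in `ConnectionError` when Redis is unreachable (a real client library may raise its own exception type, which would need to be caught instead).

```python
class FallbackLimiter:
    """Use the shared limiter when reachable, else a per-pod share of the limit."""

    def __init__(self, remote_check, limit: int, pod_count: int) -> None:
        self.remote_check = remote_check  # hypothetical: user_id -> bool
        self.local_limit = max(1, limit // pod_count)  # each pod enforces limit/N
        self.local_counts = {}
        self.degraded = False

    def allow(self, user_id: str) -> bool:
        try:
            allowed = self.remote_check(user_id)
            self.degraded = False
            return allowed
        except ConnectionError:
            # Shared store unreachable: keep serving, but on a local budget.
            self.degraded = True  # a real system would also fire an alert here
            used = self.local_counts.get(user_id, 0)
            if used >= self.local_limit:
                return False
            self.local_counts[user_id] = used + 1
            return True


def redis_down(user_id: str) -> bool:
    raise ConnectionError("redis unreachable")


limiter = FallbackLimiter(redis_down, limit=100, pod_count=50)
results = [limiter.allow("alice") for _ in range(3)]
```

With a limit of 100 spread over 50 pods, each pod allows 2 requests per user while degraded. The local counts in this sketch are never reset; a fuller version would use a time-based bucket.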


Complete Flow (Sequence Diagram)

Request allowed:

sequenceDiagram
    autonumber
    participant C as Client
    participant GW as API Gateway
    participant R as Redis
    participant API as Backend API

    C->>GW: GET /api/search
    GW->>GW: extract API key from header
    GW->>R: EVALSHA token_bucket_check
    R->>R: refill tokens based on elapsed time
    R->>R: tokens >= 1 so decrement
    R-->>GW: ALLOWED remaining=87
    GW->>API: forward request
    API-->>GW: 200 response
    GW-->>C: 200 with X-RateLimit-Remaining 87

Request rejected:

sequenceDiagram
    autonumber
    participant C as Client
    participant GW as API Gateway
    participant R as Redis

    C->>GW: GET /api/search
    GW->>R: EVALSHA token_bucket_check
    R->>R: tokens = 0
    R-->>GW: REJECTED reset_in=42s
    GW-->>C: 429 Too Many Requests with Retry-After 42

Response Headers

When your API has rate limiting, always return these headers so clients can self-throttle:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100        ← max requests per window
X-RateLimit-Remaining: 87     ← how many left
X-RateLimit-Reset: 1750860060 ← when the window resets (unix timestamp)

On rejection:

HTTP/1.1 429 Too Many Requests
Retry-After: 42               ← seconds to wait before retrying
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
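One way to assemble these headers in code. The function name `rate_limit_headers` and its parameters are hypothetical; `reset_at` and `now` are Unix timestamps in seconds, and `Retry-After` is derived as the time remaining until the reset.

```python
def rate_limit_headers(allowed: bool, limit: int, remaining: int,
                       reset_at: int, now: int) -> tuple[int, dict[str, str]]:
    """Return the HTTP status code and the rate-limit headers for one response."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_at),
    }
    if allowed:
        return 200, headers
    # Rejected: tell the client how many seconds to wait before retrying.
    headers["Retry-After"] = str(max(0, reset_at - now))
    return 429, headers


status, headers = rate_limit_headers(
    allowed=False, limit=100, remaining=0, reset_at=1750860060, now=1750860018
)
```

In this example the reset is 42 seconds away, so the rejected response carries `Retry-After: 42`.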

Deep Dive: Distributed Rate Limiting

Problem: You have 10 API servers. If each uses its own counter, the total allowed = 10× the limit.

flowchart LR
    subgraph Problem["Without shared state"]
        P1["Pod 1: count=50"]:::service
        P2["Pod 2: count=50"]:::service
        P3["Pod 3: count=50"]:::service
        TOTAL["Total: 150 but limit is 100!"]:::client
    end

    subgraph Solution["With Redis"]
        R1["Pod 1"]:::service
        R2["Pod 2"]:::service
        R3["Pod 3"]:::service
        REDIS["Redis: count=100"]:::data
        R1 --> REDIS
        R2 --> REDIS
        R3 --> REDIS
    end

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef data fill:#FFCA28,stroke:#F57F17,color:#000

Three approaches:

| Approach | How | Trade-off |
| --- | --- | --- |
| Centralized Redis | Every request checks Redis | Accurate but adds 0.5-2ms latency per request |
| Local + periodic sync | Each pod counts locally, syncs to Redis every 100ms | Fast but can overshoot by ~10% |
| Sticky routing | Load balancer always sends same user to same pod | Simple but uneven load distribution |

Interview answer: “For protective limits (abuse prevention), local + periodic sync is fine — 10% overshoot is acceptable. For strict limits (billing, credits), always check centralized Redis.”
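To see where the overshoot in local + periodic sync comes from, here is a simplified sketch for a single user. `LocalSyncCounter` and `push_delta` are hypothetical: `push_delta` stands in for whatever call adds the pod's unsynced count to the shared total and returns the new total, and `sync` would run on a timer.

```python
class LocalSyncCounter:
    """Counts locally and reconciles with a shared total only on sync()."""

    def __init__(self, limit: int, push_delta) -> None:
        self.limit = limit
        self.push_delta = push_delta  # hypothetical: delta -> new shared total
        self.global_count = 0         # shared total as of the last sync
        self.pending = 0              # local requests not yet reported

    def allow(self) -> bool:
        # Decide from the last known total plus local traffic; no network call.
        if self.global_count + self.pending >= self.limit:
            return False
        self.pending += 1
        return True

    def sync(self) -> None:
        # Report local traffic and pick up what the other pods reported.
        self.global_count = self.push_delta(self.pending)
        self.pending = 0


shared = {"total": 0}


def push(delta: int) -> int:
    shared["total"] += delta
    return shared["total"]


pod_a = LocalSyncCounter(limit=5, push_delta=push)
pod_b = LocalSyncCounter(limit=5, push_delta=push)
before_sync = [pod_a.allow() for _ in range(4)] + [pod_b.allow() for _ in range(4)]
pod_a.sync()
pod_b.sync()
```

Between syncs neither pod knows about the other's traffic, so together they admit 8 requests against a limit of 5. Once both have synced, further requests are rejected.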


Deep Dive: Handling Burst Traffic

Problem: Limit is 100/minute. Client sends all 100 in the first second. Technically within quota, but backend can’t handle 100 concurrent requests from one client.

Solution: Two-tier limiting.

flowchart TD
    REQ["Incoming request"]:::client
    BURST["Burst check<br/>max 10 per second"]:::service
    SUSTAIN["Sustained check<br/>max 100 per minute"]:::service
    ALLOW["✅ Allow"]:::data
    REJECT["❌ 429 Reject"]:::client

    REQ --> BURST
    BURST -->|"pass"| SUSTAIN
    BURST -->|"fail"| REJECT
    SUSTAIN -->|"pass"| ALLOW
    SUSTAIN -->|"fail"| REJECT

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef data fill:#FFCA28,stroke:#F57F17,color:#000

Both limits must pass:

Stripe’s engineering blog describes a comparable layered setup: a per-second request rate limiter paired with a separate limit on concurrent requests.
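A small sketch of the two-tier check, using the 10-per-second and 100-per-minute limits from the diagram. For brevity it tracks each tier with a rolling log of timestamps rather than a token bucket; `WindowLimit` and `TwoTierLimiter` are hypothetical names, and one instance covers a single client.

```python
from collections import deque


class WindowLimit:
    """Exact limit over a rolling window, kept as a log of timestamps."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def has_room(self, now: float) -> bool:
        # Forget requests that have aged out, then compare against the limit.
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) < self.limit

    def record(self, now: float) -> None:
        self.timestamps.append(now)


class TwoTierLimiter:
    def __init__(self) -> None:
        self.burst = WindowLimit(limit=10, window_seconds=1.0)
        self.sustained = WindowLimit(limit=100, window_seconds=60.0)

    def allow(self, now: float) -> bool:
        # A request is counted only if BOTH tiers have room for it.
        if self.burst.has_room(now) and self.sustained.has_room(now):
            self.burst.record(now)
            self.sustained.record(now)
            return True
        return False


limiter = TwoTierLimiter()
first_second = [limiter.allow(0.0) for _ in range(100)]
```

A client that fires 100 requests in the same instant gets only 10 through. The remaining quota is still available, but it has to be spread over the following seconds.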


Final Architecture

flowchart LR
    CLIENTS["Clients"]:::client
    EDGE["Edge WAF<br/>IP limits and DDoS"]:::edge
    GW["API Gateway<br/>per-user limits"]:::edge
    RL["Rate Limit Check<br/>Redis Lua script"]:::service
    REDIS[("Redis Cluster<br/>token buckets")]:::data
    RULES[("Rules Config<br/>limit per tier")]:::data
    API["Backend Services"]:::service
    K["Analytics<br/>limit events"]:::async

    CLIENTS --> EDGE
    EDGE --> GW
    GW --> RL
    RL --> REDIS
    RL --> RULES
    GW -->|"allowed"| API
    RL --> K

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef edge fill:#42A5F5,stroke:#0D47A1,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef async fill:#AB47BC,stroke:#4A148C,color:#fff
    classDef data fill:#FFCA28,stroke:#F57F17,color:#000

Interview Cheat Sheet

| Question | Answer |
| --- | --- |
| “Which algorithm?” | Token Bucket: allows bursts, caps sustained rate |
| “Where to store counters?” | Redis: sub-ms latency, atomic Lua, built-in TTL |
| “How to make it atomic?” | Redis Lua script: read + check + decrement in one operation |
| “What if Redis is down?” | Fail-open + local fallback. Never be a single point of failure. |
| “Where to put it?” | 3 layers: Edge (IP/DDoS) → Gateway (per-user) → Service (domain logic) |
| “How to handle distributed?” | Centralized Redis for strict limits; local sync for soft limits |
| “What headers to return?” | X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After |

Key Technologies Mentioned

| Term | What it is |
| --- | --- |
| Redis | An in-memory database. Responds in < 1ms. Used for counters, caches, and fast lookups. |
| Lua script | A tiny program that runs INSIDE Redis. Lets you read + check + write atomically in one network call. |
| API Gateway | A server that sits in front of your APIs. Handles auth, rate limiting, routing. Examples: Kong, Envoy, AWS API Gateway. |
| CDN / Edge | Servers at the “edge” of the network, close to users worldwide. Cloudflare, CloudFront. First line of defense. |
| Token Bucket | Algorithm: bucket of tokens, refills at steady rate. Each request costs a token. Empty bucket = rejected. |
| HTTP 429 | Standard HTTP status code meaning “Too Many Requests.” Client should back off and retry later. |

Related: System Design Fundamentals · URL Shortener · Chat System
