Designing a Scalable Rate Limiting Architecture

A few days ago, during a system design interview, I was asked a simple-looking but tricky question:

“How would you design a scalable rate-limiting algorithm for millions of API requests per second?”

At first glance, rate limiting sounds straightforward — “just block requests after a certain threshold.”
But when you think about scale, fairness, consistency, and fault tolerance, it quickly becomes an interesting system design problem.

In this blog, I’ll break down my thought process, different algorithms, and how I’d design a production-ready rate limiting architecture.

Why Do We Need Rate Limiting?

Before diving into algorithms, let’s revisit why we even need rate limiting:

Protect Infrastructure
Prevent your servers and databases from getting overloaded.
Ensure Fair Usage
Stop one user from hogging all the resources.
Control Costs
If you’re using paid APIs (e.g., OpenAI, Google Maps), you don’t want a spike burning your wallet.
Prevent Abuse
Avoid DDoS attacks, brute force attempts, and malicious scraping.

In short, rate limiting is about balancing performance, reliability, and fairness.

Core Rate Limiting Algorithms

Let’s break down the five most commonly discussed algorithms, including their trade-offs and suitability for interviews.

1. Fixed Window Counter

How it works:

You maintain a counter per user/IP per time window (e.g., 100 requests/minute).
If the counter exceeds the limit, block the request.
Reset the counter at the start of the next window.

Example:

User A — Allowed 100 requests/minute
09:00:00 → Counter = 0
09:00:30 → Counter = 99 ✅
09:00:59 → Counter = 100 ✅
09:00:59 → Counter = 101 ❌ (blocked)
09:01:00 → Counter resets to 0
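
To make the steps above concrete, here is a minimal in-memory sketch. The class name `FixedWindowLimiter`, the `allow` method, and the injectable `clock` are illustrative assumptions, not part of any library; a real deployment would keep the counters in a shared store and expire old windows.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` requests per key in each fixed window."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the sketch is easy to test
        # (key, window index) -> request count; old windows are never
        # cleaned up here, which a production version would need to do.
        self.counters = defaultdict(int)

    def allow(self, key):
        # Every timestamp inside the same window maps to the same index,
        # so the counter "resets" simply because the index changes.
        window_index = int(self.clock() // self.window)
        bucket = (key, window_index)
        if self.counters[bucket] >= self.limit:
            return False
        self.counters[bucket] += 1
        return True
```

Calling `allow("user-a")` returns `True` until the limit is reached inside the current window, then `False` until the next window starts.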

Pros ✅

Super simple to implement.
Works well at low scale.

Cons ❌

Burstiness problem: A user can send 100 requests at 09:00:59 and another 100 at 09:01:00 — effectively 200 requests in 2 seconds.

Interview Tip: If asked, mention burstiness — shows you understand trade-offs.

2. Sliding Window Log

How it works:

Store timestamps of each request in a sorted log.
For each new request, remove old timestamps beyond the allowed window.
Count how many requests remain in the window — if above threshold, block.
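
A rough sketch of those three steps, assuming an in-memory `deque` per key. `SlidingWindowLog` and its `allow` method are hypothetical names, and this version chooses not to log rejected requests, which is one of several reasonable policies.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLog:
    """Keep one timestamp per accepted request and count those in the window."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.logs = defaultdict(deque)  # key -> timestamps, oldest first

    def allow(self, key):
        now = self.clock()
        log = self.logs[key]
        # Drop timestamps that have fallen out of the trailing window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)  # timestamps arrive in order, so the log stays sorted
        return True
```

Memory grows with the number of accepted requests per key inside the window, which is the cost listed under the cons below.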

Pros ✅

No boundary bursts: the window slides with every request, so the limit holds over any window-length span.
Precise control over request timing.

Cons ❌

High memory usage for popular users.
Log maintenance cost is higher at scale.

3. Sliding Window Counter (Hybrid)

How it works:

Combines Fixed Window and Log approaches.
You still use counters but smooth out burstiness by interpolating counts between windows.

Example:

Limit = 100 req/minute.
Current window (09:00 → 09:01) = 80 requests.
Previous window (08:59 → 09:00) = 40 requests.
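Estimated requests in the sliding window = 80 + ((1 - fraction_of_current_window_elapsed) * 40).
For example, 25% into the current window: 80 + 0.75 * 40 = 110, which is over the limit, so the request is blocked.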
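
As a small sketch of the interpolation: the previous window is weighted by the share of it that still overlaps the trailing window, which is one minus the fraction of the current window that has elapsed. The function names `estimated_count` and `is_allowed` are made up for this example.

```python
def estimated_count(current_count, previous_count, elapsed_fraction):
    """Estimate requests in the trailing window from two fixed-window counters.

    elapsed_fraction is how far we are into the current window (0.0 to 1.0).
    The previous window contributes only the part that still overlaps.
    """
    previous_weight = 1.0 - elapsed_fraction
    return current_count + previous_count * previous_weight


def is_allowed(current_count, previous_count, elapsed_fraction, limit):
    """Admit a request only while the estimate is still below the limit."""
    return estimated_count(current_count, previous_count, elapsed_fraction) < limit
```

With the numbers above (80 in the current window, 40 in the previous one), the estimate shrinks toward 80 as the current window progresses and the previous one drops out of view.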

Pros ✅

More efficient than storing full logs.
Better burst control than fixed windows.

Cons ❌

Slightly more complex math.
Approximate: it assumes the previous window’s requests were evenly spread, so it can slightly over- or under-count.

4. Token Bucket (Most Common in Production)

How it works:

Imagine a bucket that holds tokens, up to a fixed capacity.
Tokens are added at a fixed rate (e.g., 10 tokens/sec).
Each request consumes a token.
If the bucket is empty → block request.
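
Here is one way this might look in code. It is a sketch with assumed names (`TokenBucket`, `rate`, `capacity`), and it refills lazily on each call based on elapsed time rather than with a background timer; it also assumes the bucket has a maximum capacity, which is what bounds the size of a burst.

```python
import time


class TokenBucket:
    """Refill at `rate` tokens per second, holding at most `capacity` tokens."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last_refill = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Lazy refill: credit the tokens earned since the last call,
        # never exceeding the bucket capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True
```

A bucket created with `TokenBucket(rate=10, capacity=50)` would let a quiet client burst up to 50 requests, then settle to roughly 10 per second.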

Why it’s awesome 🚀

Smooths bursts: Allows short bursts if tokens accumulate.
Simple math: Add tokens periodically, consume when used.
Widely used in API gateways and cloud providers.

5. Leaky Bucket

How it works:

Similar to token bucket but instead of adding tokens, requests enter a queue.
Requests leave the queue at a constant rate.
If the queue is full → drop requests.
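
The queue behaviour can be sketched like this. `LeakyBucket`, `offer`, and `drain` are hypothetical names, and the sketch assumes some worker calls `drain()` regularly to pick up the requests that are due to leave.

```python
import time
from collections import deque


class LeakyBucket:
    """Bounded queue that releases requests at a constant rate."""

    def __init__(self, leak_rate, capacity, clock=time.monotonic):
        self.leak_rate = leak_rate  # requests released per second
        self.capacity = capacity    # maximum queued requests
        self.clock = clock
        self.queue = deque()
        self.last_leak = clock()

    def offer(self, request):
        """Enqueue a request; return False (drop it) if the queue is full."""
        if len(self.queue) >= self.capacity:
            return False
        if not self.queue:
            # Restart the leak timer so idle time does not turn into a burst.
            self.last_leak = self.clock()
        self.queue.append(request)
        return True

    def drain(self):
        """Return the requests that are due to leave since the last leak."""
        due = int((self.clock() - self.last_leak) * self.leak_rate)
        if due <= 0:
            return []
        # Advance only by the time that the whole leaks account for.
        self.last_leak += due / self.leak_rate
        count = min(due, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]
```

Unlike the token bucket, a burst of arrivals here never produces a burst downstream: requests wait in the queue and leave at `leak_rate`.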

Best suited for:

Systems where steady outflow is required, like payment gateways or media streaming.

Designing a Scalable Architecture

Rate limiting algorithms are just the local logic. But interviews often want you to scale it to millions of requests/sec.

High-Level Architecture

                ┌─────────────┐
     Client ───▶│ API Gateway │───▶ Services
                └─────┬───────┘
                      │
              Rate Limiting Service
                      │
          ┌─────────────────────────┐
          │ Centralized Data Store  │
          │ (Redis / DynamoDB etc.) │
          └─────────────────────────┘

Key Components

1. API Gateway

First entry point for requests.
Integrates with the rate limiting service.
Examples: Nginx, Kong, AWS API Gateway, Envoy.

2. Centralized Rate Limiter

Implements one of the algorithms above.
Needs low latency (<1ms ideally).
Redis is a popular choice for distributed counters.

3. Token Synchronization

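Use Redis atomic operations, setting the expiry only when INCR returns 1 (the first request in the window):

  INCR user:123:counter
  EXPIRE user:123:counter 60

Each command is atomic on its own, but the pair is not. Calling EXPIRE on every request keeps pushing the reset out, and a crash between the two commands leaves a counter that never expires.

Or use Lua scripts for atomic check + update in one step.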
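
A sketch of the Lua approach, assuming a redis-py style client whose `eval(script, numkeys, *keys_and_args)` runs the script on the server. The helper `allow_request` and the constant `FIXED_WINDOW_LUA` are illustrative names, and the script implements a fixed window counter rather than a token bucket.

```python
# Runs atomically inside Redis: increment the counter and, only on the
# first hit of a window, attach the expiry that ends the window.
FIXED_WINDOW_LUA = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current
"""


def allow_request(client, user_id, limit, window_seconds):
    """Return True if this request is within the user's limit.

    `client` is assumed to expose eval(script, numkeys, *keys_and_args),
    as redis-py clients do.
    """
    key = f"user:{user_id}:counter"
    count = client.eval(FIXED_WINDOW_LUA, 1, key, window_seconds)
    return int(count) <= limit
```

Because the increment and the expiry happen inside one script, no other client can observe the counter between the two steps.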

4. Horizontal Scaling

Use consistent hashing or sharded Redis clusters.
Ensure counters for the same user/IP always land on the same shard.
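
To illustrate how a key can always land on the same shard, here is a small consistent-hash ring. The `HashRing` class, the shard names, and the choice of 100 virtual nodes per shard are assumptions made for this sketch; a managed Redis cluster handles key placement with its own scheme.

```python
import bisect
import hashlib


class HashRing:
    """Map keys to shards so that adding a shard moves only some keys."""

    def __init__(self, shards, vnodes=100):
        # Each shard gets several points on the ring to even out the load.
        ring = []
        for shard in shards:
            for i in range(vnodes):
                ring.append((self._hash(f"{shard}#{i}"), shard))
        ring.sort()
        self._ring = ring
        self._hashes = [h for h, _ in ring]

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def shard_for(self, key):
        # Walk clockwise to the first ring point after the key's hash,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

For example, `HashRing(["redis-a", "redis-b", "redis-c"]).shard_for("user:123")` returns the same shard on every call, so that user's counter is never split across nodes.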

Interview Insights & Trade-offs

If traffic is huge → avoid sliding logs; prefer token buckets.
If precision matters (like payments) → sliding window log.
If you expect bursts → token bucket is best.
If you need a steady outflow → leaky bucket drains requests at a constant rate.

Optimizations for Real-World Systems

Shadow Mode: Log potential violations without blocking — helps tune thresholds.
User vs. Global Limits: Apply both per-user and global caps.
Distributed Consistency: Use CRDT-based counters or Redis streams for cross-region scaling.
Monitoring: Expose metrics like requests_blocked, requests_allowed, and visualize in Grafana.
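
Shadow mode and the two metrics can be combined in a thin wrapper. This is a sketch: `ShadowLimiter` and the `enforce` flag are hypothetical, and the wrapped limiter is assumed to be any object with an `allow(key)` method that returns a boolean.

```python
from collections import Counter


class ShadowLimiter:
    """Wrap a limiter; in shadow mode, record violations but let traffic through."""

    def __init__(self, limiter, enforce=False):
        self.limiter = limiter  # any object with allow(key) -> bool
        self.enforce = enforce  # False = shadow mode, True = really block
        self.metrics = Counter()

    def allow(self, key):
        within_limit = self.limiter.allow(key)
        if within_limit:
            self.metrics["requests_allowed"] += 1
            return True
        # In shadow mode this counts requests that *would* have been blocked.
        self.metrics["requests_blocked"] += 1
        return not self.enforce
```

Running with `enforce=False` for a while and watching `requests_blocked` gives a sense of how many real users a threshold would hit before it is switched on.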

Final Thoughts

Rate limiting seems like a simple interview question, but it tests system design depth:

Can you pick the right algorithm? ✅
Can you scale it? ✅
Can you handle bursts & fairness? ✅

In production, I personally prefer token bucket + Redis + Lua scripts — it’s fast, reliable, and widely adopted.

If you get asked this in an interview, start small, explain the algorithms, and then talk about distributed architecture. That’s what interviewers look for.

Key Takeaways

Understand at least four algorithms deeply.
Always talk about trade-offs.
Mention scalability challenges.
Bonus points for Redis atomic ops and API gateway integration.
