Designing a Scalable Rate Limiting Architecture — Lessons from an Interview

building shits
A few days ago, during a system design interview, I was asked a simple-looking but tricky question:
“How would you design a scalable rate-limiting algorithm for millions of API requests per second?”
At first glance, rate limiting sounds straightforward — “just block requests after a certain threshold.”
But when you think about scale, fairness, consistency, and fault tolerance, it quickly becomes an interesting system design problem.
In this blog, I’ll break down my thought process, different algorithms, and how I’d design a production-ready rate limiting architecture.
Why Do We Need Rate Limiting?
Before diving into algorithms, let’s revisit why we even need rate limiting:
Protect Infrastructure
Prevent your servers and databases from getting overloaded.
Ensure Fair Usage
Stop one user from hogging all the resources.
Control Costs
If you’re using paid APIs (e.g., OpenAI, Google Maps), you don’t want a spike burning your wallet.
Prevent Abuse
Avoid DDoS attacks, brute force attempts, and malicious scraping.
In short, rate limiting is about balancing performance, reliability, and fairness.
Core Rate Limiting Algorithms
Let’s break down the five most commonly discussed algorithms, including their trade-offs and suitability for interviews.
1. Fixed Window Counter
How it works:
You maintain a counter per user/IP per time window (e.g., 100 requests/minute).
If the counter exceeds the limit, block the request.
Reset the counter at the start of the next window.
Example:
User A — Allowed 100 requests/minute
09:00:00 → Counter = 0
09:00:30 → Counter = 99 ✅
09:00:59 → Counter = 100 ✅
09:00:59 → Counter = 101 ❌ (blocked)
09:01:00 → Counter resets to 0
Pros ✅
Super simple to implement.
Works well at low scale.
Cons ❌
- Burstiness problem: A user can send 100 requests at 09:00:59 and another 100 at 09:01:00, effectively 200 requests in 2 seconds.
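A minimal sketch of the counter logic, which also reproduces the window-boundary burst. The class name and the explicit `now` argument are illustrative assumptions; timestamps are passed in (as seconds) so the behaviour is deterministic, and old counters are never evicted here, which a real store would handle with a TTL.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Fixed window counter: one counter per (key, window index)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key: str, now: float) -> bool:
        # All timestamps inside the same window share one counter.
        window = int(now // self.window_seconds)
        bucket = (key, window)
        if self.counters[bucket] >= self.limit:
            return False
        self.counters[bucket] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)

# 100 requests in the last second of one window, 100 more one second later.
late = sum(limiter.allow("user-a", now=59.0) for _ in range(100))
early = sum(limiter.allow("user-a", now=60.0) for _ in range(100))
extra = limiter.allow("user-a", now=60.5)  # the new window is now full

print(late, early, extra)
```

Both batches are accepted in full because they fall into different windows, even though they arrive one second apart.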
Interview Tip: If asked, mention burstiness; it shows you understand trade-offs.
2. Sliding Window Log
How it works:
Store timestamps of each request in a sorted log.
For each new request, remove old timestamps beyond the allowed window.
Count how many requests remain in the window — if above threshold, block.
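The three steps above can be sketched with a per-user deque of timestamps. This is an in-memory illustration under assumed names (`SlidingWindowLog`, `allow`), not a distributed implementation; it also assumes rejected requests are not added to the log, which is a design choice rather than a requirement.

```python
from collections import defaultdict, deque


class SlidingWindowLog:
    """Sliding window log: keep one timestamp per accepted request."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = defaultdict(deque)  # key -> timestamps, oldest first

    def allow(self, key: str, now: float) -> bool:
        log = self.logs[key]
        # Step 1: drop timestamps that have fallen out of the window.
        while log and log[0] <= now - self.window_seconds:
            log.popleft()
        # Step 2: count what remains and compare against the limit.
        if len(log) >= self.limit:
            return False
        # Step 3: record the accepted request.
        log.append(now)
        return True


limiter = SlidingWindowLog(limit=3, window_seconds=60)
results = [limiter.allow("user-a", now=t) for t in (0, 10, 20, 30, 60)]
print(results)
```

Memory grows with the number of accepted requests per user inside the window, which is the cost listed under the cons.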
Pros ✅
Handles burstiness better.
Precise control over request timing.
Cons ❌
High memory usage for popular users.
Log maintenance cost is higher at scale.
3. Sliding Window Counter (Hybrid)
How it works:
Combines Fixed Window and Log approaches.
You still use counters but smooth out burstiness by interpolating counts between windows.
Example:
Limit = 100 req/minute.
Current window (09:00 → 09:01) = 80 requests.
Previous window (08:59 → 09:00) = 40 requests.
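Estimated requests in the sliding window =
80 + (fraction_of_previous_window_still_in_range * 40).
For example, 15 seconds into the current window, 75% of the previous window still overlaps the last 60 seconds, so the estimate is 80 + 0.75 * 40 = 110 and the request is blocked.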
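A small sketch of the interpolation, assuming the weight applied to the previous window is the share of it that still overlaps the sliding window (one common formulation). The function names and parameters are hypothetical.

```python
def estimated_count(current: int, previous: int, elapsed: float, window: float) -> float:
    """Estimate requests in the last `window` seconds from two fixed-window counters.

    `elapsed` is how far we are into the current window, in seconds.
    """
    overlap = 1.0 - (elapsed / window)  # share of the previous window still in range
    return current + previous * overlap


def allow(current: int, previous: int, elapsed: float, window: float, limit: int) -> bool:
    return estimated_count(current, previous, elapsed, window) < limit


# Numbers from the example: limit 100, current window 80, previous window 40.
early = estimated_count(80, 40, elapsed=15, window=60)  # 75% of previous window counts
late = estimated_count(80, 40, elapsed=45, window=60)   # 25% of previous window counts
print(early, allow(80, 40, 15, 60, 100))
print(late, allow(80, 40, 45, 60, 100))
```

The same two counters give a different answer depending on how far the current window has progressed, which is what smooths out the boundary burst.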
Pros ✅
More efficient than storing full logs.
Better burst control than fixed windows.
Cons ❌
Slightly more complex math.
Only approximate: it assumes requests in the previous window were evenly spread.
4. Token Bucket (Most Common in Production)
How it works:
Imagine a bucket that holds tokens.
Tokens are added at a fixed rate (e.g., 10 tokens/sec), up to the bucket’s maximum capacity.
Each request consumes a token.
If the bucket is empty → block request.
Why it’s awesome 🚀
Smooths bursts: Allows short bursts if tokens accumulate.
Simple math: Add tokens periodically, consume when used.
Widely used in API gateways and cloud providers.
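A minimal in-memory sketch of the idea. Instead of a background timer, it refills lazily: on each request it adds the tokens earned since the last call. The class name, the `capacity` parameter, and the explicit `now` argument are assumptions made for the illustration.

```python
class TokenBucket:
    """Token bucket with lazy refill based on elapsed time."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.tokens = capacity      # start with a full bucket
        self.updated_at = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time that has passed, capped at capacity.
        if now > self.updated_at:
            earned = (now - self.updated_at) * self.rate
            self.tokens = min(self.capacity, self.tokens + earned)
            self.updated_at = now
        # Spend a token if one is available.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=10, capacity=20)
burst = sum(bucket.allow(now=0.0) for _ in range(25))   # burst limited by capacity
later = sum(bucket.allow(now=0.5) for _ in range(10))   # limited by refill: 0.5s * 10/s
print(burst, later)
```

The capacity bounds the size of a burst, while the refill rate bounds the sustained throughput.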
5. Leaky Bucket
How it works:
Similar to token bucket but instead of adding tokens, requests enter a queue.
Requests leave the queue at a constant rate.
If the queue is full → drop requests.
Best suited for:
- Systems where steady outflow is required, like payment gateways or media streaming.
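A rough sketch of the queue behaviour, assuming a caller invokes `drain` on a regular tick to release requests. The names `LeakyBucket`, `offer`, and `drain` are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import deque


class LeakyBucket:
    """Leaky bucket: bounded queue drained at a constant rate."""

    def __init__(self, leak_rate: float, capacity: int, now: float = 0.0):
        self.leak_rate = leak_rate  # requests released per second
        self.capacity = capacity    # maximum queued requests
        self.queue = deque()
        self.last_leak = now

    def offer(self, request) -> bool:
        # A full queue means the request is dropped.
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self, now: float) -> list:
        # Number of requests that were due to leave since the last leak.
        due = int((now - self.last_leak) * self.leak_rate)
        if due <= 0:
            return []
        self.last_leak += due / self.leak_rate
        return [self.queue.popleft() for _ in range(min(due, len(self.queue)))]


bucket = LeakyBucket(leak_rate=2, capacity=3)
accepted = [bucket.offer(f"r{i}") for i in range(5)]
first = bucket.drain(now=1.0)
second = bucket.drain(now=1.25)
third = bucket.drain(now=1.5)
print(accepted, first, second, third)
```

Unlike the token bucket, a burst is never passed downstream: it is either queued and released at the leak rate, or dropped.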
Designing a Scalable Architecture
Rate limiting algorithms are just the local logic. But interviews often want you to scale it to millions of requests/sec.
High-Level Architecture
           ┌─────────────┐
Client ───▶│ API Gateway │───▶ Services
           └──────┬──────┘
                  │
        Rate Limiting Service
                  │
     ┌─────────────────────────┐
     │ Centralized Data Store  │
     │ (Redis / DynamoDB etc.) │
     └─────────────────────────┘
Key Components
1. API Gateway
First entry point for requests.
Integrates with the rate limiting service.
Examples: Nginx, Kong, AWS API Gateway, Envoy.
2. Centralized Rate Limiter
Implements one of the algorithms above.
Needs low latency (<1ms ideally).
Redis is a popular choice for distributed counters.
3. Token Synchronization
Use Redis atomic operations:
INCR user:123:counter
EXPIRE user:123:counter 60
Or use a Lua script to run the check and update atomically in one step, since INCR and EXPIRE issued separately are not atomic as a pair.
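One possible shape for the Lua approach, sketched as a fixed-window check. The script increments the counter, sets the expiry only when the key is first created, and returns whether the request is within the limit. The `client` argument is assumed to be a redis-py style client exposing `eval(script, numkeys, *keys_and_args)`; the function name and constant are hypothetical.

```python
# Lua script executed atomically inside Redis.
# KEYS[1] = counter key, ARGV[1] = limit, ARGV[2] = window in seconds.
FIXED_WINDOW_LUA = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
  return 0
end
return 1
"""


def allow_request(client, key: str, limit: int, window_seconds: int) -> bool:
    """Return True if the request identified by `key` is within the limit.

    `client` is assumed to behave like a redis-py client (an assumption of
    this sketch, not something the article specifies).
    """
    result = client.eval(FIXED_WINDOW_LUA, 1, key, limit, window_seconds)
    return result == 1
```

With redis-py this might be called as `allow_request(redis.Redis(), "user:123:counter", 100, 60)`. Setting the expiry only on the first increment avoids extending the window on every request.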
4. Horizontal Scaling
Use consistent hashing or sharded Redis clusters.
Ensure counters for the same user/IP always land on the same shard.
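A compact sketch of a consistent hash ring with virtual nodes, to illustrate how a user key can be pinned to one shard. The shard names, the `vnodes` value, and the use of SHA-256 are assumptions for the example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        ring = [(self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)]
        ring.sort()
        self._hashes = [h for h, _ in ring]
        self._nodes = [n for _, n in ring]

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[idx]


shards = ["redis-0", "redis-1", "redis-2"]
ring = HashRing(shards)
bigger = HashRing(shards + ["redis-3"])

keys = [f"user:{i}" for i in range(1000)]
moved = [k for k in keys if ring.node_for(k) != bigger.node_for(k)]
print(ring.node_for("user:123"), len(moved))
```

When a shard is added, only the keys that now belong to the new shard move; the rest keep their counters in place. Redis Cluster handles key placement internally with hash slots, so a client-side ring like this mainly applies when sharding across independent instances.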
Interview Insights & Trade-offs
If traffic is huge → avoid sliding logs; prefer token buckets.
If precision matters (like payments) → sliding window log.
If you expect bursts → token bucket is best.
If you want fairness → leaky bucket ensures a constant flow.
Optimizations for Real-World Systems
Shadow Mode: Log potential violations without blocking — helps tune thresholds.
User vs. Global Limits: Apply both per-user and global caps.
Distributed Consistency: Use CRDT-based counters or Redis streams for cross-region scaling.
Monitoring: Expose metrics like requests_blocked and requests_allowed, and visualize them in Grafana.
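A small sketch of how shadow mode, combined per-user and global limits, and the metrics above might fit together. The `decide` function and the `requests_would_block` metric are hypothetical additions for illustration; the two boolean inputs stand in for the results of whichever limiter is in use.

```python
from collections import Counter

# In-process counters; a real system would export these to a metrics backend.
metrics = Counter()


def decide(user_ok: bool, global_ok: bool, shadow_mode: bool) -> bool:
    """Combine per-user and global limiter results into one decision."""
    if user_ok and global_ok:
        metrics["requests_allowed"] += 1
        return True
    if shadow_mode:
        # Record the violation but let the request through.
        metrics["requests_would_block"] += 1
        metrics["requests_allowed"] += 1
        return True
    metrics["requests_blocked"] += 1
    return False


outcomes = [
    decide(user_ok=True, global_ok=True, shadow_mode=False),
    decide(user_ok=True, global_ok=False, shadow_mode=True),
    decide(user_ok=False, global_ok=True, shadow_mode=False),
]
print(outcomes, dict(metrics))
```

Comparing `requests_would_block` against real traffic before switching shadow mode off gives a sense of how many legitimate requests a threshold would reject.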
Final Thoughts
Rate limiting seems like a simple interview question, but it tests system design depth:
Can you pick the right algorithm? ✅
Can you scale it? ✅
Can you handle bursts & fairness? ✅
In production, I personally prefer token bucket + Redis + Lua scripts — it’s fast, reliable, and widely adopted.
If you get asked this in an interview, start small, explain the algorithms, and then talk about distributed architecture. That’s what interviewers look for.
Key Takeaways
Understand at least four algorithms deeply.
Always talk about trade-offs.
Mention scalability challenges.
Bonus points for Redis atomic ops and API gateway integration.

