Rate limiting is the unglamorous infrastructure that keeps an API alive: it stops one user's runaway script from drowning everyone else, protects you from accidental and malicious overload, and enforces your pricing tiers. It's also the simplest possible system that breaks the instant you scale past one server, which makes it a perfect lens on distributed state.
What we're building
Functional · what it does
- Limit each user to N requests per time window
- Return 429 with a Retry-After when over the limit
- Different limits for different tiers and endpoints
- Allow short bursts where it makes sense
Non-functional · what it must survive
- Correct across many API servers, not just one
- Add almost no latency to every request
- Survive the rate-limit store being slow or down
- Scale to millions of users and high request rates
The whole difficulty is hidden in one non-functional line: correct across many API servers. With one server, a counter in memory does the job. With fifty servers, that approach lets a user make fifty times their limit: one full limit's worth per server.
The algorithms
Two designs cover almost everything. Know both and when each fits.
Token bucket
A bucket holds tokens up to a capacity and refills at a steady rate. Each request takes one token; an empty bucket means 429. This allows bursts (you can spend a full bucket at once) while enforcing an average rate over time. Cheap and the most common choice.
Sliding window
Count requests over a rolling window (the last 60 seconds, continuously) rather than a fixed clock-minute. More accurate at window boundaries (no double-rate burst at the edge of two fixed windows) but more state and computation per request.
The token bucket is the workhorse. Note that refill is computed from elapsed time on each request rather than by a background timer.
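A minimal in-memory sketch of such a bucket, assuming a single process. The class name, the `clock` parameter, and the `retry_after` helper are illustrative choices, not part of any library.

```python
import time


class TokenBucket:
    """In-memory token bucket; refill is derived from elapsed time on each call."""

    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock        # injectable so tests can control time
        self.tokens = capacity    # start full, so an initial burst is allowed
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Lazy refill: credit tokens for the time that has passed, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False              # the caller would answer with HTTP 429

    def retry_after(self, cost: float = 1.0) -> float:
        """Seconds until `cost` tokens are available, based on the state left by the last allow()."""
        missing = max(0.0, cost - self.tokens)
        return missing / self.refill_per_sec
```

With `capacity=3` and `refill_per_sec=1.0`, three back-to-back calls succeed, the fourth is rejected, and `retry_after()` reports one second. This version is only correct on a single server.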
The fixed-window boundary bug
A plain fixed-window counter (reset every clock-minute) lets a user send their full limit at 11:00:59 and again at 11:01:00, double their rate across the boundary in two seconds. The token bucket avoids this because it refills smoothly; the sliding window avoids it by counting a rolling period. If burst-at-the-boundary matters for you, don't use a naive fixed window.
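The boundary burst is easy to reproduce. The sketch below compares a naive fixed-window counter with a sliding-window log, assuming timestamps in seconds are passed in explicitly; both class names are hypothetical.

```python
from collections import deque


class FixedWindow:
    """Naive counter that resets at every window boundary (e.g. each clock-minute)."""

    def __init__(self, limit: int, window_s: int):
        self.limit, self.window_s = limit, window_s
        self.window_id, self.count = None, 0

    def allow(self, now: float) -> bool:
        wid = int(now // self.window_s)
        if wid != self.window_id:    # new window: the counter starts over
            self.window_id, self.count = wid, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowLog:
    """Keeps the timestamps of accepted requests from the last `window_s` seconds."""

    def __init__(self, limit: int, window_s: int):
        self.limit, self.window_s = limit, window_s
        self.stamps = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have slid out of the rolling window.
        while self.stamps and self.stamps[0] <= now - self.window_s:
            self.stamps.popleft()
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False


def accepted(limiter, times):
    return sum(limiter.allow(t) for t in times)


# 5 requests in the last second of one minute, 5 more in the first second of the next.
burst = [59.0] * 5 + [60.0] * 5
print(accepted(FixedWindow(limit=5, window_s=60), burst))       # 10: double the limit
print(accepted(SlidingWindowLog(limit=5, window_s=60), burst))  # 5
```

The log stores one timestamp per accepted request, which is the extra state and computation the sliding window costs.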
Why the counter must be central
Here's the crux. If each API server keeps its own counter, a user spreading requests across servers gets limit × server_count. The limit becomes meaningless. So the counter has to live in one place all servers share, and that place is almost always Redis: it's fast (sub-millisecond), it's central, and it has atomic operations built for exactly this.
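To make the arithmetic concrete, here is a toy simulation that assumes a round-robin load balancer and plain in-process counters. All names are illustrative.

```python
import itertools


class LocalCounter:
    """Per-process counter: what each API server would hold in its own memory."""

    def __init__(self, limit: int):
        self.limit, self.count = limit, 0

    def allow(self) -> bool:
        if self.count < self.limit:
            self.count += 1
            return True
        return False


def accepted_requests(n_servers: int, limit: int, n_requests: int, shared: bool) -> int:
    one = LocalCounter(limit)
    # shared=True: every server consults the same counter; otherwise each has its own.
    counters = [one if shared else LocalCounter(limit) for _ in range(n_servers)]
    round_robin = itertools.cycle(counters)   # stand-in for a load balancer
    return sum(next(round_robin).allow() for _ in range(n_requests))


print(accepted_requests(50, 100, 10_000, shared=False))  # 5000 = limit x server_count
print(accepted_requests(50, 100, 10_000, shared=True))   # 100
```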
The basic move is an atomic increment with an expiry: INCR rl:user:42:minute, and set a TTL of 60 seconds on the first increment. If the value exceeds the limit, return 429. The increment is atomic, so two servers hitting it at the same instant can't both read a stale value and both think there's room.
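A sketch of that move, written against any client object exposing `incr`, `expire`, and `ttl` (the redis-py client has methods with these names). `FakeRedis` is a hypothetical in-memory stand-in so the example runs without a server. The `check` function and its return shape are assumptions.

```python
class FakeRedis:
    """In-memory stand-in exposing only incr/expire/ttl; not a real Redis client."""

    def __init__(self, clock):
        self.clock, self.values, self.expiry = clock, {}, {}

    def _purge(self, key):
        if key in self.expiry and self.clock() >= self.expiry[key]:
            self.values.pop(key, None)
            self.expiry.pop(key, None)

    def incr(self, key):
        self._purge(key)
        self.values[key] = self.values.get(key, 0) + 1
        return self.values[key]

    def expire(self, key, seconds):
        self.expiry[key] = self.clock() + seconds
        return True

    def ttl(self, key):
        self._purge(key)
        if key not in self.values:
            return -2                   # mirrors Redis: key does not exist
        if key not in self.expiry:
            return -1                   # mirrors Redis: key has no expiry
        return int(self.expiry[key] - self.clock())


def check(client, user_id, limit, window_s=60):
    """Return (status, headers) for one request under a fixed-window counter."""
    key = f"rl:user:{user_id}:minute"
    count = client.incr(key)            # atomic on a real Redis
    if count == 1:
        client.expire(key, window_s)    # first hit in this window starts the TTL
    if count > limit:
        retry_after = max(1, client.ttl(key))
        return 429, {"Retry-After": str(retry_after)}
    return 200, {}
```

One caveat: `INCR` and `EXPIRE` are two separate commands here. If the process dies between them, the key is left without a TTL.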
Atomicity: the read-modify-write trap
A subtle bug: "read the counter, check it, then write" is three operations, and another server can slip in between the read and the write. Even with Redis, if your limiter does GET then SET, you've reintroduced the race you were trying to kill.
The fix is to make the whole check-and-update one atomic operation. For a simple counter, INCR is already atomic. For anything more complex (a token bucket with refill, a sliding window with cleanup), you run a small Lua script on Redis, which executes the entire read-modify-write as a single indivisible step. Redis runs one script at a time, so there's no window for another server to interleave.
Decision · Push the whole rate-limit decision into one atomic operation in Redis.
Doing the logic in your application (read from Redis, compute, write back) reintroduces a race across servers and adds round trips. A Lua script (or an atomic command) runs the entire decision in Redis as one step, eliminating the race and cutting it to a single round trip. The cost is a bit of Lua you have to maintain, which is a small price for correctness.
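As an illustration, a token bucket can be expressed as one Lua script and called through redis-py's `register_script`. This is a sketch under assumptions: the key layout, the hash field names `tokens` and `ts`, and the caller-supplied clock are all hypothetical, and multi-field `HSET` needs Redis 4.0 or later.

```python
import time

# Lua source executed inside Redis; the read, refill, spend, and write run as one step.
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])   -- tokens added per second
local now      = tonumber(ARGV[3])   -- caller-supplied clock, in seconds

local state  = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity   -- unseen key: start with a full bucket
local ts     = tonumber(state[2]) or now

local elapsed = math.max(0, now - ts)           -- ignore clocks that run behind
tokens = math.min(capacity, tokens + elapsed * rate)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', math.max(ts, now))
-- Let idle buckets expire once they would have refilled completely anyway.
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / rate) + 1)
return allowed
"""


def make_limiter(client, capacity: int, rate: float, clock=time.time):
    """Bind the script to a client; `client` is assumed to be a redis-py style object."""
    script = client.register_script(TOKEN_BUCKET_LUA)   # returns a callable script object

    def allow(user_id) -> bool:
        key = f"rl:bucket:user:{user_id}"
        # One round trip: Redis runs the whole decision and returns 1 or 0.
        return script(keys=[key], args=[capacity, rate, clock()]) == 1

    return allow
```

Passing the time from the application keeps the script simple, but clock skew between servers then affects refill slightly. Reading the time inside Redis is an alternative.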
What happens when Redis is slow or down?
You've now made every single API request depend on Redis. That's a new failure mode and a real design decision: if the rate-limit store is unreachable, do you fail open (allow the request) or fail closed (reject it)?
Fail open
If Redis is down, allow requests through unlimited. Your API stays up; you've just temporarily lost rate limiting. Right for most APIs, where availability matters more than perfect enforcement for a few seconds.
Fail closed
If Redis is down, reject everything. Safer against abuse, but a rate-limiter outage now takes down your whole API. Usually the wrong trade unless the thing you're protecting is more precious than availability.
Most systems fail open: a rate limiter is a guard rail, and a guard rail being temporarily absent shouldn't crash the road. You also keep the latency cost tiny with short timeouts (if Redis doesn't answer in a couple of milliseconds, treat it as a miss and allow), and you can add a small local in-memory limiter as a coarse backstop so "fail open" doesn't mean "completely unprotected."
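One way to wire this together, as a sketch: `remote_check` stands for whatever call reaches the shared store and is assumed to raise on timeout or connection failure. Both class names and the 2 ms default are illustrative.

```python
import time


class LocalBackstop:
    """Coarse per-process cap, used only while the shared store is unreachable."""

    def __init__(self, limit: int, window_s: float = 1.0, clock=time.monotonic):
        self.limit, self.window_s, self.clock = limit, window_s, clock
        self.windows = {}   # key -> (window_id, count); unbounded, fine for a sketch

    def allow(self, key: str) -> bool:
        wid = int(self.clock() // self.window_s)
        prev_wid, count = self.windows.get(key, (wid, 0))
        if prev_wid != wid:
            count = 0       # a new local window has started
        if count >= self.limit:
            return False
        self.windows[key] = (wid, count + 1)
        return True


class FailOpenLimiter:
    def __init__(self, remote_check, local_backstop, timeout_s: float = 0.002):
        self.remote_check = remote_check      # callable(key, timeout_s) -> bool, may raise
        self.local_backstop = local_backstop  # callable(key) -> bool, no network involved
        self.timeout_s = timeout_s

    def allow(self, key: str) -> bool:
        try:
            return self.remote_check(key, self.timeout_s)
        except Exception:
            # Store slow or down: fail open, but keep a generous local cap in place.
            # In real code, narrow this to the client's timeout/connection errors.
            return self.local_backstop(key)
```

The backstop limit would normally be set well above the real limit, since each server only sees a fraction of a user's traffic.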
Scaling the limiter itself
At very high request rates, even one Redis becomes a hot spot. The escapes:
- Shard by key. Rate-limit keys are independent (user:42, user:99), so they can live on different Redis nodes. The key's hash picks the node, spreading load with no coordination.
- Approximate locally, reconcile centrally. For extreme scale, each server enforces a local share of the limit and syncs with the central store periodically. You trade some precision (the global limit becomes approximate) for far less load on the central store, the same accuracy-for-scale trade seen in the streaming counter.
- Limit at the edge. Coarse limits (per-IP, obvious abuse) can be enforced at the CDN or load balancer before traffic ever reaches your servers, so the expensive precise limiter only sees traffic worth checking.
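The first option, sharding by key, can be sketched in a few lines. This assumes a fixed list of node addresses and simple modulo hashing. The node names are made up, and `shard_for` is a hypothetical helper.

```python
import zlib


def shard_for(key: str, nodes: list) -> str:
    """Pick the node that owns `key`; every API server computes the same answer."""
    # crc32 is stable across processes, unlike Python's built-in hash() for strings.
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


nodes = ["redis-a:6379", "redis-b:6379", "redis-c:6379"]   # hypothetical addresses
print(shard_for("rl:user:42", nodes))
print(shard_for("rl:user:99", nodes))
```

Modulo hashing remaps most keys when the node count changes. For rate limits that usually means some counters reset once. Redis Cluster handles the mapping itself by hashing keys into fixed slots.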
The one idea to take away
A rate limiter is a lesson in distributed state disguised as a utility. The counter must be central or the limit is a lie; the check-and-update must be atomic or you've reintroduced the race; and because every request now depends on that store, you decide deliberately to fail open so the guard rail's failure doesn't become the API's failure. Everything else (token bucket vs sliding window, sharding, edge limits) is detail on top of those three calls.
Test yourself
Questions · say the answer out loud before you open it. If you can't, the chapter isn't done.
Q: Why does an in-memory counter per server fail to rate-limit correctly?
Because a user spreading requests across N servers gets N times their limit, one limit's worth per server. The limit only means something if all servers consult the same counter. The fix is a central shared store, almost always Redis, with atomic increments.
Q: Token bucket vs sliding window: when do you pick which?
Token bucket for the common case: it's cheap, allows controlled bursts, and enforces an average rate via smooth refill. Sliding window when boundary accuracy matters and you can't tolerate a double-rate burst at the edge of two fixed windows. The sliding window costs more state and computation per request.
Q: What's the read-modify-write trap in a rate limiter?
If you GET the counter, check it in your app, then SET it, another server can slip in between the read and the write, so both allow a request that should have been blocked. Even with Redis you've reintroduced the race. The fix is to make the whole decision one atomic operation: INCR for a simple counter, or a Lua script for token bucket / sliding window logic.
Q: Why use a Lua script in Redis for rate limiting?
Because a token bucket or sliding window needs a multi-step read-modify-write, and a Lua script runs that entire sequence on Redis as one indivisible step (Redis executes one script at a time). That removes the cross-server race and collapses several round trips into one, at the cost of a little Lua to maintain.
Q: Redis is down. Should the rate limiter fail open or fail closed?
Usually fail open: allow requests and temporarily lose rate limiting, so a guard-rail outage doesn't take down your whole API. Use short timeouts so a slow Redis is treated as a miss, and optionally a coarse local backstop. Fail closed only when the protected resource is more precious than availability.
Q: One Redis node can't handle your request rate. How do you scale the limiter?
Shard by key so independent rate-limit keys live on different nodes (the key hash picks the node, no coordination needed). For extreme scale, let each server enforce a local share and reconcile with the central store periodically, trading precision for load. And push coarse limits (per-IP) to the edge so the precise limiter sees less traffic.