
Rate Limiters

Yash Sachdeva
Software Engineer | Turning Complex Problems into Simple Solutions

Life Without a Rate Limiter
#

Imagine a public web API that allows clients to fetch user data without any rate limiting. Under normal conditions this might work, but during traffic spikes or abuse (e.g., bots or scrapers) the backend can be overwhelmed, leading to resource exhaustion, cascading failures, and poor availability for legitimate users. Without any form of control, a single noisy neighbor can starve others, increase infrastructure costs, and make it difficult to meet SLAs.

Simple Java service without limiting
#

public class UserService {

    public String getUser(String userId) {
        // Pretend this hits database, cache, etc.
        return "User-" + userId;
    }

    public static void main(String[] args) {
        UserService service = new UserService();
        for (int i = 0; i < 1_000_000; i++) {
            service.getUser("123");
        }
    }
}

In this world, every call is accepted; there is no back-pressure, fairness, or abuse protection. The first step is to introduce a minimal gate that can reject excessive traffic.


Version 1: In-Memory Fixed Window Counter
#

A fixed window rate limiter divides time into discrete windows (e.g., 1 second or 1 minute) and maintains a counter of requests per window. If the count exceeds a threshold in that window, further requests in that window are rejected, and when the window expires, the counter is reset to zero.

Implementation
#

This first implementation is a single in-process limiter.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class FixedWindowPerKeyRateLimiter {

    private static class WindowCounter {
        final AtomicInteger count = new AtomicInteger(0);
        volatile long windowStart;

        WindowCounter(long windowStart) {
            this.windowStart = windowStart;
        }
    }

    private final int maxRequests;
    private final long windowSizeMillis;
    private final Map<String, WindowCounter> counters = new ConcurrentHashMap<>();

    public FixedWindowPerKeyRateLimiter(int maxRequests, long windowSizeMillis) {
        this.maxRequests = maxRequests;
        this.windowSizeMillis = windowSizeMillis;
    }

    public boolean allow(String key) {
        long now = System.currentTimeMillis();
        WindowCounter wc = counters.computeIfAbsent(key, k -> new WindowCounter(now));

        synchronized (wc) {
            // Start a fresh window once the current one has expired
            if (now - wc.windowStart >= windowSizeMillis) {
                wc.windowStart = now;
                wc.count.set(0);
            }
            // Accept the request only while the per-window quota is not exhausted
            if (wc.count.get() < maxRequests) {
                wc.count.incrementAndGet();
                return true;
            }
            return false;
        }
    }
}

Usage test and boundary problem
#

public class FixedWindowPerKeyRateLimiterTest {

    public static void main(String[] args) throws InterruptedException {
        // 5 requests per 10 seconds per user
        FixedWindowPerKeyRateLimiter limiter =
                new FixedWindowPerKeyRateLimiter(5, 10_000);

        String user = "user-1";

        // Open the window with one request, then wait until close to its end
        System.out.println(limiter.allow(user) ? "ALLOWED" : "BLOCKED");
        Thread.sleep(9_500);

        // 4 more requests near the end of the window (5 allowed in total)
        for (int i = 2; i <= 5; i++) {
            boolean allowed = limiter.allow(user);
            System.out.println(allowed ? "ALLOWED" : "BLOCKED");
        }

        // Sleep just enough to cross the window boundary
        Thread.sleep(600);

        // 5 more requests at the start of the new window are all allowed,
        // so 9 requests slip through in roughly one second
        for (int i = 6; i <= 10; i++) {
            boolean allowed = limiter.allow(user);
            System.out.println(allowed ? "ALLOWED" : "BLOCKED");
        }
    }
}

This implementation protects the system from unlimited traffic, but it has two notable issues:

  • Boundary spikes: Bursts at window edges can slip through. Because the counter resets at fixed boundaries, a client that clusters requests just before and just after a boundary can push through nearly twice the configured limit in a short interval; in the test above, 9 requests are allowed within roughly one second, violating the intuitive “5 per 10 seconds” limit. To smooth out these edge effects, a sliding window algorithm evaluates requests over a rolling time window instead of fixed slots.

  • Single process only: It doesn’t work across multiple service instances.


Version 2: In-Process Sliding Window Log
#

The sliding window log algorithm tracks the timestamp of each accepted request in a per-key log and, on each new request, drops timestamps older than the window and counts the remaining ones. This yields an accurate sliding window that enforces “N requests per W seconds” regardless of window boundaries.

Implementation
#

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SlidingWindowLogRateLimiter {

    private final int maxRequests;
    private final long windowSizeMillis;
    private final Map<String, Deque<Long>> logs = new ConcurrentHashMap<>();

    public SlidingWindowLogRateLimiter(int maxRequests, long windowSizeMillis) {
        this.maxRequests = maxRequests;
        this.windowSizeMillis = windowSizeMillis;
    }

    public boolean allow(String key) {
        long now = System.currentTimeMillis();
        Deque<Long> deque = logs.computeIfAbsent(key, k -> new ArrayDeque<>());

        synchronized (deque) {
            // Evict timestamps that have fallen out of the sliding window
            long threshold = now - windowSizeMillis;
            while (!deque.isEmpty() && deque.peekFirst() < threshold) {
                deque.pollFirst();
            }
            // Accept the request if the remaining log is still below the limit
            if (deque.size() < maxRequests) {
                deque.addLast(now);
                return true;
            }
            return false;
        }
    }
}

Usage test and limitation
#

public class SlidingWindowLogRateLimiterTest {

    public static void main(String[] args) throws InterruptedException {
        // 5 requests per 10 seconds per key
        SlidingWindowLogRateLimiter limiter =
                new SlidingWindowLogRateLimiter(5, 10_000);

        String user = "user-1";
        // 5 requests back-to-back, all allowed
        for (int i = 1; i <= 5; i++) {
            boolean allowed = limiter.allow(user);
            System.out.println(allowed ? "ALLOWED" : "BLOCKED");
        }
        Thread.sleep(9_000);
        // 6th request within 10-second sliding window should be blocked
        boolean allowed = limiter.allow(user);
        System.out.println((allowed ? "ALLOWED" : "BLOCKED"));

        Thread.sleep(1_001);
        // 7th request should be allowed as the first request is now outside the sliding window
        allowed = limiter.allow(user);
        System.out.println((allowed ? "ALLOWED" : "BLOCKED"));
    }
}

This approach correctly enforces the rate across arbitrary boundaries, but the log can grow large for high-traffic keys and is still local to a single process, making it hard to use across multiple instances or data centers.

Production systems often prefer algorithms like the token bucket, which controls the average rate while still allowing bursts (often a business requirement) and lends itself to efficient implementations in distributed stores like Redis.


Version 3: In-Process Token Bucket
#

The token bucket algorithm maintains a bucket of tokens that replenishes at a fixed rate up to a maximum capacity; each request consumes one token and is allowed only if a token is available. This allows short bursts (up to the bucket size) while enforcing a long-term average rate defined by the refill rate.

Implementation
#

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TokenBucketRateLimiter {

    private static class Bucket {
        int tokens;
        long lastRefillTimestamp;

        Bucket(int tokens, long lastRefillTimestamp) {
            this.tokens = tokens;
            this.lastRefillTimestamp = lastRefillTimestamp;
        }
    }

    private final int refillTokensPerSecond;
    private final int capacity;
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    public TokenBucketRateLimiter(int refillTokensPerSecond, int capacity) {
        this.refillTokensPerSecond = refillTokensPerSecond;
        this.capacity = capacity;
    }

    public boolean allow(String key) {
        long now = System.nanoTime();
        Bucket bucket = buckets.computeIfAbsent(key,
                k -> new Bucket(capacity, now));

        synchronized (bucket) {
            // Refill tokens for each full second elapsed since the last refill
            int secondsSinceLast = (int)((now - bucket.lastRefillTimestamp) / 1_000_000_000.0);
            int tokensToAdd = secondsSinceLast * refillTokensPerSecond;
            if (tokensToAdd > 0) {
                bucket.tokens = Math.min(capacity, bucket.tokens + tokensToAdd);
                bucket.lastRefillTimestamp = now;
            }

            // Consume one token if available; otherwise reject the request
            if (bucket.tokens >= 1) {
                bucket.tokens -= 1;
                return true;
            }
            return false;
        }
    }
}

Usage test and limitation
#

public class TokenBucketRateLimiterTest {

    public static void main(String[] args) throws InterruptedException {
        // Capacity 5, refill 1 token per second per key
        TokenBucketRateLimiter limiter =
                new TokenBucketRateLimiter(1, 5);

        String user = "user-1";

        // A freshly created bucket starts at full capacity (5 tokens),
        // so the user can burst immediately

        // Burst of 6 requests
        for (int i = 1; i <= 6; i++) {
            boolean allowed = limiter.allow(user);
            System.out.println((allowed ? "ALLOWED" : "BLOCKED"));
        }

        Thread.sleep(2_000);
        // 2 requests after 2 seconds should be allowed
        for (int i = 1; i <= 2; i++) {
            boolean allowed = limiter.allow(user);
            System.out.println((allowed ? "ALLOWED" : "BLOCKED"));
        }
    }
}

The token bucket enforces a sustainable rate with configurable burst tolerance and is widely used in API gateways, microservices, and infrastructure rate limiting. However, this implementation is still in-process and not safe across multiple instances; concurrent instances with their own buckets would each allow the full quota unless a shared store and atomic updates are introduced.


Version 4: Distributed Token Bucket with Redis and Lua
#

In a distributed system with many application servers behind a load balancer, rate limiting must be coordinated across instances so that the limit is enforced globally for each key. A common pattern is to store rate limiter state in a centralized store such as Redis and use Lua scripts so that the read–modify–write sequence (refill tokens, check, consume) is executed atomically.

High-level architecture
#

  1. Each application server receives requests and computes an identity key (user ID, API key, IP, tenant ID, etc.).
  2. Before processing the request, the server calls a rate limiter component that runs a Redis Lua script implementing token bucket logic for that key.
  3. The script reads the current token count and last refill time, refills tokens based on elapsed time, consumes one token if available, and writes back the updated state in a single atomic operation.

Java + Redis + Lua example (conceptual)
#

import redis.clients.jedis.Jedis;

public class RedisTokenBucketRateLimiter {

    private final Jedis jedis;
    private final String scriptSha;

    public RedisTokenBucketRateLimiter(Jedis jedis) {
        this.jedis = jedis;
        String script =
                "local key = KEYS[1] " +
                "local capacity = tonumber(ARGV[1]) " +
                "local refill_per_sec = tonumber(ARGV[2]) " +
                "local now = tonumber(ARGV[3]) " +
                "local state = redis.call('HMGET', key, 'tokens', 'timestamp') " +
                "local tokens = tonumber(state[1]) or capacity " +
                "local last_ts = tonumber(state[2]) or now " +
                "local delta = now - last_ts " +
                "if delta > 0 then " +
                "  local refill = delta * refill_per_sec " +
                "  tokens = math.min(capacity, tokens + refill) " +
                "  last_ts = now " +
                "end " +
                "if tokens < 1 then " +
                "  redis.call('HMSET', key, 'tokens', tokens, 'timestamp', last_ts) " +
                "  return 0 " +
                "else " +
                "  tokens = tokens - 1 " +
                "  redis.call('HMSET', key, 'tokens', tokens, 'timestamp', last_ts) " +
                "  redis.call('EXPIRE', key, 3600) " +
                "  return 1 " +
                "end";
        this.scriptSha = jedis.scriptLoad(script);
    }

    public boolean allow(String key, double capacity, double refillPerSecond) {
        long nowSeconds = System.currentTimeMillis() / 1000L;
        Object result = jedis.evalsha(
                scriptSha,
                1,
                key,
                String.valueOf(capacity),
                String.valueOf(refillPerSecond),
                String.valueOf(nowSeconds)
        );
        Long allowed = (Long) result;
        return allowed == 1L;
    }
}

Usage test (integration-style)
#

import redis.clients.jedis.Jedis;

public class RedisTokenBucketRateLimiterTest {

    public static void main(String[] args) throws InterruptedException {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            RedisTokenBucketRateLimiter limiter = new RedisTokenBucketRateLimiter(jedis);

            String user = "user-1";
            double capacity = 5.0;
            double refillPerSecond = 1.0;

            // Simulate concurrent calls from multiple application instances
            for (int i = 1; i <= 7; i++) {
                boolean allowed = limiter.allow("rate:" + user, capacity, refillPerSecond);
                System.out.printf("Global req %d -> %s%n", i, allowed ? "ALLOWED" : "BLOCKED");
            }
        }
    }
}

In this design, all instances share the same Redis-backed bucket per key, and the Lua script ensures the rate-limiting decision is atomic even under heavy concurrency; this is a common, industry-standard approach. It removes the single-instance limitation and provides strong, globally consistent rate limiting, though it adds network latency on every decision and a dependency on Redis availability, both of which can be mitigated via caching, sharding, or fallback behaviors.

Systemic Resilience and Fault Tolerance
#

If the rate limiting middleware is coded defensively to deny all incoming traffic when Redis is down or unreachable, a Redis outage turns into a complete outage of the application. The middleware should instead fail open (allow traffic) when Redis is unavailable, for example by falling back to in-memory rate limiting with a lower per-instance threshold. This keeps the application available even when the rate limiting infrastructure is not.
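
A minimal sketch of such a fallback, reusing the RedisTokenBucketRateLimiter and TokenBucketRateLimiter classes from the earlier sections (the ResilientRateLimiter name and the stricter local limit of 1 token per second with a burst of 2 are illustrative choices, not prescriptions):

import redis.clients.jedis.exceptions.JedisException;

public class ResilientRateLimiter {

    private final RedisTokenBucketRateLimiter primary; // shared, Redis-backed
    private final TokenBucketRateLimiter fallback;     // local, per instance

    public ResilientRateLimiter(RedisTokenBucketRateLimiter primary) {
        this.primary = primary;
        // Deliberately stricter local limit: 1 token per second, burst of 2
        this.fallback = new TokenBucketRateLimiter(1, 2);
    }

    public boolean allow(String key, double capacity, double refillPerSecond) {
        try {
            return primary.allow(key, capacity, refillPerSecond);
        } catch (JedisException e) {
            // Redis is down or unreachable: fail open, but still apply a
            // conservative in-memory limit so a single instance cannot pass
            // unbounded traffic to the backend.
            return fallback.allow(key);
        }
    }
}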

The API Contract
#

A rate limiter shouldn’t be viewed as merely a gatekeeper but as a contract between the service provider and the consumer. It defines the expected usage patterns and ensures fair resource allocation, protecting the service from abuse while providing predictable performance for legitimate users.

When a limit is exceeded, the API should return a 429 Too Many Requests status code along with a Retry-After header indicating when the client can retry the request. This allows clients to gracefully handle rate limiting and adjust their request patterns accordingly.
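
As a rough sketch of this contract, the snippet below wires the SlidingWindowLogRateLimiter from Version 2 into a tiny endpoint built on the JDK's built-in com.sun.net.httpserver and returns 429 with a Retry-After header when the limit is exceeded. The port, path, response body, and the fixed 10-second retry hint are illustrative assumptions, not something the sections above prescribe.

import com.sun.net.httpserver.HttpServer;

import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class RateLimitedHttpServer {

    public static void main(String[] args) throws Exception {
        // 5 requests per 10 seconds per client, using the Version 2 limiter
        SlidingWindowLogRateLimiter limiter = new SlidingWindowLogRateLimiter(5, 10_000);

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/users", exchange -> {
            // Identity key: the client IP; a real service might use an API key or user ID
            String key = exchange.getRemoteAddress().getAddress().getHostAddress();

            if (limiter.allow(key)) {
                byte[] body = "User-123".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
            } else {
                // Tell the client when it is reasonable to retry
                exchange.getResponseHeaders().set("Retry-After", "10");
                exchange.sendResponseHeaders(429, -1); // 429 Too Many Requests, empty body
            }
            exchange.close();
        });
        server.start();
    }
}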
