System Design — FM India

System design is the part most frontend developers walk into cold. The trick is that it isn't about memorising architectures. It's about naming trade-offs clearly and choosing one on purpose. Let's build the vocabulary.

The axes of scalability

"Scalable" means nothing without a dimension. Scalable in requests per second? In data volume? In team size? Each one needs different tactics.

Vertical scaling

Make the machine bigger: more RAM, more CPU, faster disk. Simple, and often the right call up to a point, especially for databases. The limits are physics, non-linear cost, and one big single point of failure.

Horizontal scaling

Add more machines doing the same thing. This needs the work to be parallelisable, which is easy for stateless web servers and hard for stateful databases, and it needs a load balancer to route the work.

The key distinction underneath both is stateless vs stateful. A stateless service keeps no important in-memory state, so any request can hit any instance, and you scale it by adding boxes. A stateful service (a database, a cache, a session store) needs real care to scale.

The mantra

Keep state out of stateless services. Concentrate it in purpose-built systems (databases, caches, queues). Scale the stateless tier by adding boxes, and scale the stateful tier with the technique that fits.

Load balancing

A load balancer sits in front of your servers and spreads traffic across them. Three things to understand.

A Layer 4 balancer works at the TCP level and forwards connections without reading them. A Layer 7 balancer works at the HTTP level, so it can route by path, header, or cookie, terminate TLS, and rewrite requests. Most modern balancers (nginx, Envoy, AWS ALB) are Layer 7.

The routing algorithm can be round robin (in order), least connections (to the least busy backend), consistent hash (the same key always goes to the same backend), or weighted (bigger backends get more).
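To make two of these concrete, here is a minimal sketch of round robin and least connections. The backend objects and their `active` counter are assumptions for illustration, not any real balancer's API.

```javascript
// Round robin: hand out backends in order, wrapping at the end.
function roundRobin(backends) {
  let next = 0;
  return () => {
    const chosen = backends[next];
    next = (next + 1) % backends.length;
    return chosen;
  };
}

// Least connections: pick the backend with the fewest open connections.
// Assumes each backend object carries a hypothetical `active` counter.
function leastConnections(backends) {
  return () => backends.reduce((best, b) => (b.active < best.active ? b : best));
}

const pool = [
  { name: "a", active: 3 },
  { name: "b", active: 1 },
  { name: "c", active: 2 },
];

const pickRoundRobin = roundRobin(pool);
const rrPicks = [1, 2, 3, 4].map(() => pickRoundRobin().name); // a, b, c, then a again

const pickLeastBusy = leastConnections(pool);
const lcPick = pickLeastBusy().name; // b has the fewest open connections
```

Round robin ignores how busy each backend is, which is fine when requests cost about the same. Least connections adapts when some requests are much slower than others.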

And it runs health checks: pinging each backend and pulling unhealthy ones out of rotation. There are two kinds, and they're not the same. Liveness asks "is the process up?" Readiness asks "can it serve traffic yet?" A box still warming its caches is alive but not ready.
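A small sketch of how the two checks lead to different decisions. The `live` and `ready` flags are hypothetical fields, and the restart-on-failed-liveness policy is an assumption borrowed from how orchestrators commonly treat it.

```javascript
// Only backends that are both alive and ready receive traffic.
function inRotation(backends) {
  return backends.filter((b) => b.live && b.ready).map((b) => b.name);
}

// A failed liveness check usually means "restart it"; a failed readiness
// check only means "don't send traffic yet". (Assumed policy, not from the text.)
function needsRestart(backends) {
  return backends.filter((b) => !b.live).map((b) => b.name);
}

const backends = [
  { name: "a", live: true, ready: true },
  { name: "b", live: true, ready: false }, // up, but still warming its caches
  { name: "c", live: false, ready: false }, // process is down
];

const serving = inRotation(backends);
const restarting = needsRestart(backends);
```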

Caching, layer by layer

A cache trades freshness for speed. Every layer has the same shape: check the cache, hit (fast) or miss (fetch and store). The interesting choices are where to cache and when to invalidate.

Browser cache: Cache-Control headers · free, closest to user

CDN: Cloudflare, Fastly · edge, great for static + public HTML

Application cache: Redis, Memcached · the cache you control

Database cache: Postgres keeps hot pages in shared memory

Cache as close to the user as you can. Each layer up shields everything below it.

Most app caches are some flavour of LRU, least-recently-used: when the cache fills up, the entry untouched the longest gets evicted. It's a satisfying thing to build, and short in JavaScript.
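One possible version, as a minimal sketch: it leans on the fact that a JavaScript `Map` iterates in insertion order, so the first key is always the least recently used. The class and method names are illustrative choices.

```javascript
class LRUCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.map = new Map(); // insertion order doubles as recency order, oldest first
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert so this key becomes the most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      // The first key in iteration order is the one untouched the longest.
      const oldest = this.map.keys().next().value;
      this.map.delete(oldest);
    }
  }
}

const cache = new LRUCache(2);
cache.set("a", 1);
cache.set("b", 2);
cache.get("a"); // touch a, so b is now the oldest
cache.set("c", 3); // over capacity: b is evicted
```

Both operations are O(1). The classic textbook version uses a hash map plus a doubly linked list; the `Map` trick gets the same behaviour with less code.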

For invalidation, you've got TTLs (entries expire on a timer), explicit deletion (drop the key on change, if you know every key derived from a fact), and versioned keys. There's also stale-while-revalidate: serve the stale value, fetch a fresh one in the background, swap it in. Great for read-heavy public pages.
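The first two options can be sketched together: a cache with a TTL and an explicit `invalidate`. The function names are hypothetical, the loader is synchronous for brevity, and the clock is injected so expiry can be shown without waiting.

```javascript
function createTtlCache(ttlMs, now = Date.now) {
  const entries = new Map();
  return {
    // Check the cache; on a miss or an expired entry, load and store.
    getOrLoad(key, load) {
      const hit = entries.get(key);
      if (hit && hit.expiresAt > now()) return hit.value;
      const value = load(key);
      entries.set(key, { value, expiresAt: now() + ttlMs });
      return value;
    },
    // Explicit deletion: call this when the underlying fact changes.
    invalidate(key) {
      entries.delete(key);
    },
  };
}

let clock = 0; // fake clock in milliseconds
let loads = 0;
const profiles = createTtlCache(1000, () => clock);
const loadProfile = (id) => {
  loads += 1;
  return `profile-${id}-v${loads}`;
};

const first = profiles.getOrLoad("42", loadProfile); // miss: loads
const second = profiles.getOrLoad("42", loadProfile); // hit: no load
clock = 1500;
const third = profiles.getOrLoad("42", loadProfile); // TTL expired: reloads
profiles.invalidate("42");
const fourth = profiles.getOrLoad("42", loadProfile); // invalidated: reloads
```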

The genuinely hard problem

"There are only two hard things in computer science: cache invalidation and naming things." It's not really a joke. Most consistency bugs in real systems are cache-invalidation bugs. When in doubt, shorter TTLs. When in real doubt, no cache.

Database scaling

The hardest part of any system, because databases are stateful and have to stay consistent, and both properties make scaling hard.

Start with vertical scaling. A single Postgres on a big box handles a startling amount: tens of thousands of writes a second, hundreds of thousands of cached reads. It's boring and reliable. Don't move past it until you must.

Next come read replicas: read-only copies that the primary streams its WAL to. Writes go to the primary, reads can go to either. The catch is replication lag. A replica trails the primary by some delay, so a user who writes and immediately reads might hit a replica that hasn't caught up. The fixes are to route that user's reads to the primary for a short window, or to use synchronous replication and pay the latency.
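The first fix can be sketched as a small router that pins a user's reads to the primary for a window after their last write. The names and the five-second window are assumptions; in a multi-instance deployment the last-write timestamps would need to live somewhere shared, such as a session cookie or a cache.

```javascript
function createReadRouter(pinMs, now = Date.now) {
  const lastWriteAt = new Map(); // userId -> time of their most recent write
  return {
    recordWrite(userId) {
      lastWriteAt.set(userId, now());
    },
    targetForRead(userId) {
      const wroteAt = lastWriteAt.get(userId);
      const pinned = wroteAt !== undefined && now() - wroteAt < pinMs;
      return pinned ? "primary" : "replica";
    },
  };
}

let t = 0; // fake clock in milliseconds
const router = createReadRouter(5000, () => t);

const beforeWrite = router.targetForRead("alice"); // replica
router.recordWrite("alice");
t = 1000;
const justAfterWrite = router.targetForRead("alice"); // primary: inside the window
const otherUser = router.targetForRead("bob"); // replica: bob wrote nothing
t = 7000;
const afterWindow = router.targetForRead("alice"); // replica again
```

The window should comfortably exceed typical replication lag; pick it from what you actually measure.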

Only when all that runs out do you shard: split data across many databases by a key.

Range sharding

Split by key ranges (the first million users on shard A, the next million on shard B). Simple, but hot spots are common, since all the newest users land on one shard.

Hash sharding

Hash the key and pick a shard. Even distribution. Use consistent hashing so adding a shard moves only a small slice of keys instead of nearly all of them.

That difference between plain modulo and consistent hashing is worth feeling for yourself. Measure how many keys move when you add one shard.
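Here is one way to run that measurement, as a sketch. The hash function (FNV-1a with an extra mixing step), the 100 virtual nodes per shard, and the shard names are all assumptions chosen for the demo. Going from four shards to five, modulo should move about four keys in five in expectation, while the ring should move about one in five.

```javascript
// 32-bit FNV-1a with a final mixing step so similar keys spread out.
function hash32(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// Place each shard on the ring many times (virtual nodes) to even out the slices.
function buildRing(shards, vnodes = 100) {
  const ring = [];
  for (const shard of shards) {
    for (let v = 0; v < vnodes; v++) ring.push({ pos: hash32(`${shard}#${v}`), shard });
  }
  return ring.sort((a, b) => a.pos - b.pos);
}

// Walk clockwise: the first ring position at or after the key's hash owns it.
function ringLookup(ring, key) {
  const h = hash32(key);
  let lo = 0;
  let hi = ring.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (ring[mid].pos < h) lo = mid + 1;
    else hi = mid;
  }
  return ring[lo % ring.length].shard; // wrap around past the last position
}

const keys = Array.from({ length: 10000 }, (_, i) => `user:${i}`);
const oldShards = ["shard-0", "shard-1", "shard-2", "shard-3"];
const newShards = [...oldShards, "shard-4"];

const moduloMoved = keys.filter(
  (k) => hash32(k) % oldShards.length !== hash32(k) % newShards.length
).length;

const oldRing = buildRing(oldShards);
const newRing = buildRing(newShards);
const ringMovedKeys = keys.filter((k) => ringLookup(oldRing, k) !== ringLookup(newRing, k));

console.log(`modulo moved ${moduloMoved} of ${keys.length} keys`);
console.log(`ring moved ${ringMovedKeys.length} of ${keys.length} keys`);
```

Every key the ring moves goes to the new shard. Existing shards never trade keys with each other, which is what keeps a rebalance cheap.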

The unspoken rule

Don't shard if you don't have to. Sharding adds permanent complexity to every query, and cross-shard queries are painful. A single big Postgres with read replicas and good caching gets you very far. Teams that shard early usually regret it.

Consistency models

When data lives in more than one place (replicas, caches, shards), they can disagree. Consistency models are the rules about what they're allowed to disagree on.

Strong consistency. Every read sees the latest write, everywhere, immediately. Easy to reason about, expensive to provide.
Eventual consistency. Updates spread to all copies eventually; reads might see stale data for a moment. Cheap and scales beautifully. DNS and CDNs work this way.
Causal consistency. The middle ground. Operations that depend on each other are seen in order everywhere; unrelated ones can be seen in any order.

The CAP theorem is usually quoted wrong. The real statement: during a network partition, you must choose between consistency and availability. Either you refuse to serve (keeping consistency, losing availability) or you serve possibly-stale data (keeping availability, losing consistency). The more useful framing is PACELC: if partitioned, choose availability or consistency; else, choose latency or consistency. That last part captures the everyday trade-off that exists even when nothing is broken.

Monolith, microservices, modular monolith

The most over-argued topic in backend engineering. Let's ground it.

A monolith is one codebase, one deployable, one database. Simple to build, deploy, and debug. It hurts when the team grows past roughly fifty engineers all editing the same code, or when one part of the app scales very differently from another.

Microservices are many small services, each deployed and stored separately. Amazon and Netflix adopted them at a scale where team coordination was the bottleneck. The cost is that distributed-system problems appear everywhere, and observability stops being optional.

The honest default

For most teams, a modular monolith is right: one deployable and one database, but the code is organised into modules with clear boundaries that don't reach into each other's tables. You get the discipline of microservices without the operational tax, and the seams are already drawn for the day you actually need to split. Microservices solve organisational problems first and technical ones second.

Event-driven architecture

Instead of services calling each other directly, they publish events and other services react. The order service publishes OrderPlaced; the shipping service and the email service both subscribe and act on their own.

Why bother: decoupling (the publisher doesn't know who's listening, so adding a subscriber changes nothing upstream), failure isolation (if email is down, orders still get placed and email retries later), and replay (persisted events can rebuild state or feed a new consumer).

The patterns worth knowing are pub/sub (events to a topic, subscribers receive them), event sourcing (the events are the source of truth and current state is a projection), CQRS (writes and reads use different models), and the saga (a multi-step process coordinated by events, with compensating "undo" steps when a later step fails). And once more: duplicate deliveries are unavoidable, so build idempotent consumers and stop chasing exactly-once.
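A minimal sketch of an idempotent consumer: remember which event IDs have been handled and skip repeats. The `id` field on events and the in-memory `Set` are assumptions; a real consumer would keep the processed IDs in a durable store, ideally updated in the same transaction as the side effect.

```javascript
function createIdempotentConsumer(handle) {
  const processed = new Set(); // assumption: stands in for a durable store
  return (event) => {
    if (processed.has(event.id)) return false; // duplicate delivery: skip
    handle(event);
    // Mark as processed only after the handler succeeds, so a handler
    // that throws leaves the event eligible for redelivery.
    processed.add(event.id);
    return true;
  };
}

const shipments = [];
const onOrderPlaced = createIdempotentConsumer((event) => shipments.push(event.orderId));

const event = { id: "evt-1", type: "OrderPlaced", orderId: "order-9" };
const firstDelivery = onOrderPlaced(event); // handled
const secondDelivery = onOrderPlaced(event); // same event again: ignored
```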

Multi-tenancy

A SaaS serves many customers (tenants) from one system. Three choices, simple to complex:

Shared schema

One database, every table has a tenant_id, every query filters by it. Cheap, simple to operate, and cross-tenant analytics are trivial. The risk is a missed WHERE tenant_id = ? leaking data, which Postgres Row-Level Security mitigates.

Schema- or database-per-tenant

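Stronger isolation: the tenant is chosen by which schema or database the connection points at, so there is no per-query filter to forget. Enterprise customers love database-per-tenant. The cost is operational: every migration runs once per tenant, and that stops being manageable once tenants number in the thousands.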

Watch for the noisy neighbour: in shared systems, one tenant's huge query degrades everyone's performance. Per-tenant rate limits, connection limits, and query timeouts keep it contained. Start with shared schema plus RLS, and move heavy customers to dedicated infrastructure as a paid tier.

Canonical designs to internalise

A few designs come up again and again. Sketch each until you can do it on a whiteboard.

URL shortener. Generate short IDs (base62 of a counter, or random with a collision check), expect massively more reads than writes so cache hard, and don't update a row on every click. Push clicks to a queue and batch them.
News feed. The hard "simple" one. Fanout-on-write (when Alice posts, write into every follower's feed) gives fast reads and slow writes, and breaks for celebrities. Fanout-on-read computes the feed at read time. Real systems go hybrid: fanout-on-write for normal users, fanout-on-read for the few with millions of followers.
Chat. WebSockets or SSE for delivery, messages stored by conversation ID, presence in Redis, push notifications for offline users. The hard part is reliability: persist every message before acknowledging it.
Payment ledger. Double-entry bookkeeping in an append-only table, balances derived by summing entries, idempotency keys on everything. This is where strong consistency earns its cost.
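For the URL shortener, the "base62 of a counter" step can be sketched like this. The alphabet ordering is an arbitrary choice for illustration; any fixed ordering of the 62 characters works as long as encode and decode agree.

```javascript
const ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

// Turn a non-negative integer counter value into a short ID.
function encodeBase62(n) {
  if (!Number.isSafeInteger(n) || n < 0) throw new RangeError("expected a non-negative safe integer");
  if (n === 0) return ALPHABET[0];
  let out = "";
  while (n > 0) {
    out = ALPHABET[n % 62] + out;
    n = Math.floor(n / 62);
  }
  return out;
}

// Reverse the encoding to recover the counter value from a short ID.
function decodeBase62(id) {
  let n = 0;
  for (const ch of id) {
    const digit = ALPHABET.indexOf(ch);
    if (digit === -1) throw new RangeError(`invalid character: ${ch}`);
    n = n * 62 + digit;
  }
  return n;
}

const shortId = encodeBase62(123456789);
const roundTrip = decodeBase62(shortId);
```

Sequential counters produce guessable IDs. If that matters, the random-with-collision-check option avoids it at the cost of an extra lookup.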

How to attack a system-design interview

Four steps. Clarify requirements (what it does, plus scale, read/write ratio, latency targets). Sketch the high level (boxes and arrows: clients, balancer, services, databases, queues, caches). Deep-dive one or two components (usually the data model and the busiest path). Scale and harden (what breaks at 10x, where to cache, how to shard, what fails). The interview rewards talking through trade-offs out loud more than writing a complete answer.

Now go deep

This chapter is the vocabulary. The System Design Deep Dives are the long form: full case studies with real scale numbers, architecture, and failure modes. Start with live cricket streaming at JioHotstar scale, then surviving the IRCTC Tatkal rush and a payments ledger for UPI.

Test yourself

Questions: say the answer out loud before you read it. If you can't, the chapter isn't done.

Q: Design a URL shortener. What's the hardest part?

Usually ID generation at scale, not storage. A single auto-incrementing counter is a bottleneck, UUIDs are too long, hashing collides. A common answer: a central ID service that hands out ranges to web servers. The second-hardest part is counting clicks without writing a row on every redirect, which you solve by pushing to a queue and aggregating.

Q: You have read-after-write inconsistency on read replicas. Your options?

Route that user's reads to the primary for a short window after their write. Or pass a logical timestamp with the response and require the next read to be at least that fresh. Or use synchronous replication and accept the latency. Or embrace eventual consistency and design the UX around it with optimistic updates.

Q: When should you NOT use microservices?

Under roughly fifty engineers, when team coordination isn't the bottleneck, when the domain isn't well understood yet, when you have no platform team for the operational overhead, or when latency is tight (every call becomes a network hop). For most teams, a modular monolith is the answer.

Q: Design a Twitter-like feed for 100M users including celebrities.

Hybrid fanout. For normal users, fanout-on-write into each follower's timeline cache. For celebrities (say 10K+ followers), don't pre-write; at read time, merge their small set of posts into the timeline. That bounds write amplification while keeping reads fast. Cache timelines in Redis, persist posts in a sharded store.
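A toy sketch of the hybrid approach, with in-memory maps standing in for the timeline cache and the post store. The function names are hypothetical, and the threshold is set to 3 only so the demo stays small.

```javascript
function createFeed(followers, celebrityThreshold) {
  const timelines = new Map(); // followerId -> posts written at post time
  const celebrityPosts = new Map(); // authorId -> posts merged in at read time
  const isCelebrity = (author) => (followers.get(author) || []).length >= celebrityThreshold;
  const append = (map, key, post) => map.set(key, [...(map.get(key) || []), post]);

  return {
    post(author, text, at) {
      const post = { author, text, at };
      if (isCelebrity(author)) {
        append(celebrityPosts, author, post); // stored once, no fanout
      } else {
        for (const follower of followers.get(author) || []) append(timelines, follower, post);
      }
    },
    // Merge the pre-written timeline with posts pulled from followed celebrities.
    read(user, following) {
      const pulled = following
        .filter(isCelebrity)
        .flatMap((author) => celebrityPosts.get(author) || []);
      return [...(timelines.get(user) || []), ...pulled].sort((a, b) => b.at - a.at);
    },
    prewrittenCount(user) {
      return (timelines.get(user) || []).length;
    },
  };
}

const followers = new Map([
  ["alice", ["dave"]],
  ["star", ["dave", "erin", "frank"]],
]);
const feed = createFeed(followers, 3);
feed.post("alice", "hello", 1); // normal user: written into dave's timeline
feed.post("star", "tour dates", 2); // celebrity: not fanned out

const daveTimeline = feed.read("dave", ["alice", "star"]).map((p) => p.text);
```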

Q: Consistent hashing vs modulo hashing for sharding?

With modulo (hash(k) % N), adding a shard changes almost every key's destination, so you'd move nearly all your data. Consistent hashing arranges shards on a ring and sends each key to the next shard clockwise, so adding one moves only about 1/N of keys. Used everywhere from Cassandra to CDNs.

Q: Caching strategy for a profile page loaded constantly but rarely changed?

Cache-aside with a long TTL. On profile update, explicitly invalidate the user's key, and optionally version it so stale entries age out. Layer a CDN on top for public profiles. Skip write-through: it adds latency to every write and buys little when writes are rare and explicit invalidation already keeps reads fresh.

Q: Periodic 5-second latency spikes, but the database looks fine. What might it be?

Candidates: V8 garbage-collection pauses, libuv thread-pool saturation, a "fresh" read hitting a lagged replica, connection-pool exhaustion (requests waiting for a slot), or background-job interference. Diagnose with request IDs through every layer, tracing that times each span, and runtime metrics like event-loop lag and GC duration.

Q: Rate limiting across 10 instances of your API. How?

Centralise the counter in Redis. Each instance does an atomic increment on a key like rl:{user}:{minute} with a TTL, and returns 429 over the limit. Per-instance limiting is faster but lets a user get limit×N by spreading across instances.
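A sketch of that fixed-window counter. The `createCounterStore` object is a hypothetical in-memory stand-in for Redis (where the same step would be an INCR plus an EXPIRE), so the example runs without a server; the limit of 3 is arbitrary.

```javascript
// In-memory stand-in for the shared store. Assumption: the real thing is Redis.
function createCounterStore(now) {
  const counters = new Map();
  return {
    // Increment the key, starting a fresh count with a TTL if it is new or expired.
    incrementWithTtl(key, ttlMs) {
      const entry = counters.get(key);
      if (!entry || entry.expiresAt <= now()) {
        counters.set(key, { count: 1, expiresAt: now() + ttlMs });
        return 1;
      }
      entry.count += 1;
      return entry.count;
    },
  };
}

// Each API instance builds its own limiter, but they all share one store.
function createRateLimiter(store, limitPerMinute, now = Date.now) {
  return (userId) => {
    const minute = Math.floor(now() / 60000);
    const count = store.incrementWithTtl(`rl:${userId}:${minute}`, 60000);
    return count <= limitPerMinute ? 200 : 429;
  };
}

let nowMs = 0; // fake clock
const store = createCounterStore(() => nowMs);
const instanceA = createRateLimiter(store, 3, () => nowMs);
const instanceB = createRateLimiter(store, 3, () => nowMs);

// The user alternates between instances but still hits one shared limit.
const statuses = [instanceA("u1"), instanceB("u1"), instanceA("u1"), instanceB("u1")];
nowMs = 60000; // the next minute starts a new key
const nextWindow = instanceA("u1");
```

A fixed window allows a burst of up to twice the limit across a minute boundary. Sliding windows or token buckets smooth that out if it matters.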

Q: Design a notification system: email, push, SMS.

Event-driven. The originating service publishes NotificationRequested. A dispatcher reads it, checks the user's channel preferences and quiet hours, and publishes channel-specific events. Channel workers consume them, each idempotent, each retrying with backoff, each writing to a dead-letter queue on permanent failure. Track delivery status per notification.
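The channel worker's retry loop might look something like this sketch. The function name, the three-attempt cap, and the 100 ms base delay are assumptions. To stay synchronous and runnable, it records the backoff delays it would wait rather than sleeping, and it treats running out of attempts as the permanent failure.

```javascript
function deliverWithRetry(send, notification, deadLetters, maxAttempts = 3, baseDelayMs = 100) {
  const plannedDelaysMs = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      send(notification);
      return { status: "delivered", attempts: attempt, plannedDelaysMs };
    } catch (err) {
      // Exponential backoff: 100 ms, 200 ms, 400 ms, ...
      // A real worker would wait this long before the next attempt.
      if (attempt < maxAttempts) plannedDelaysMs.push(baseDelayMs * 2 ** (attempt - 1));
    }
  }
  deadLetters.push(notification); // out of attempts: park it for inspection
  return { status: "dead-lettered", attempts: maxAttempts, plannedDelaysMs };
}

const deadLetters = [];
let calls = 0;
const flakySend = () => {
  calls += 1;
  if (calls < 3) throw new Error("provider timeout");
};
const alwaysFails = () => {
  throw new Error("invalid address");
};

const delivered = deliverWithRetry(flakySend, { id: "n1", channel: "email" }, deadLetters);
const failed = deliverWithRetry(alwaysFails, { id: "n2", channel: "sms" }, deadLetters);
```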

Q: Your read replicas are 30 seconds behind. What's likely wrong?

Either the replica can't apply WAL fast enough (CPU, disk, or a long query holding a snapshot), the network between primary and replica is saturated, or the primary is writing too much. Check the replica's CPU/disk/network, check pg_stat_replication on the primary, and look for long-running queries on the replica.

Q: What's a saga and when do you reach for one?

A multi-step process across services where each step has an "undo" (a compensating action). If step three fails, you run the undos for two and one. You use it when you'd want a distributed transaction but can't have one across separate databases. Classic example: book flight, book hotel, charge card; if the charge fails, cancel the hotel and flight.
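The flight, hotel, and card example can be sketched as a tiny orchestrator. The step shape (`name`, `run`, `undo`) is an assumption for illustration, and the steps are synchronous for brevity; real ones would be network calls with their own retries.

```javascript
function runSaga(steps) {
  const completed = [];
  for (const step of steps) {
    try {
      step.run();
      completed.push(step);
    } catch (err) {
      // Compensate in reverse order: undo the most recent step first.
      for (const done of completed.reverse()) done.undo();
      return { ok: false, failedAt: step.name, error: err.message };
    }
  }
  return { ok: true };
}

const log = [];
const result = runSaga([
  {
    name: "bookFlight",
    run: () => log.push("flight booked"),
    undo: () => log.push("flight cancelled"),
  },
  {
    name: "bookHotel",
    run: () => log.push("hotel booked"),
    undo: () => log.push("hotel cancelled"),
  },
  {
    name: "chargeCard",
    run: () => {
      throw new Error("card declined");
    },
    undo: () => log.push("charge refunded"),
  },
]);
```

The failed step's own `undo` is never called, since it never completed. Compensations can fail too, so in practice they need to be idempotent and retried.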

Q: Scale a single-DB app to 10x traffic without microservices.

In order: vertically scale the database (this often handles 10x alone), add read replicas for read-heavy endpoints, cache aggressively at the CDN, app, and database levels, move slow work to background jobs, and add a connection pooler. Only then consider sharding. A surprising amount of "we need microservices" really means "we haven't cached enough."