The system design chapter told you to start with one big Postgres and not shard until you must. This deep dive is why that advice works: a single Postgres is a remarkably capable machine once you understand what it's doing underneath. We're going to open the box, look at how a row lives on disk, and trace what happens when you read it, update it, and index it.
The heap: where rows actually live
A Postgres table is stored as a heap: a set of fixed-size pages (8 KB each) on disk. Each page holds a bunch of rows. A row, internally, is called a tuple.
The word "heap" is literal. Rows are not stored in any sorted order. A new row goes wherever there's free space in some page. So "the third row I inserted" is not "the third row on disk," and there is no cheap way to get rows "in order" without either sorting them or using an index. This single fact explains a lot of later behaviour.
To find a row by value with no index, Postgres does a sequential scan: read every page, check every tuple. For a 200-row table that's instant. For a 200-million-row table, it's a disaster. The whole job of an index is to avoid that scan.
MVCC: why an UPDATE is secretly an INSERT
Here's the idea that surprises everyone coming from "a row is a row." Postgres uses MVCC, multi-version concurrency control. When you UPDATE a row, Postgres does not overwrite it in place. It writes a brand-new version of the tuple and marks the old one as expired.
Why? So that readers never block writers and writers never block readers. A transaction that started before your update keeps seeing the old version (a consistent snapshot), while your new version waits to become visible once you commit. Each tuple carries hidden bookkeeping (xmin, the transaction that created it; xmax, the transaction that expired it) so Postgres can decide which version each transaction is allowed to see.
Three consequences you must internalise
Updates grow the table. Every update writes a new tuple, so an update-heavy table accumulates dead tuples faster than you'd expect. Deletes don't free space immediately. A deleted row is just marked expired; the space isn't reclaimed yet. Reads stay consistent without locks. A long-running SELECT sees a frozen snapshot of the database as of when it started, regardless of concurrent writes.
Vacuum and bloat
If updates and deletes leave dead tuples behind, something has to clean them up. That something is VACUUM. It scans tables, finds tuples no live transaction can still see, and marks that space reusable. Autovacuum runs it automatically in the background.
When autovacuum can't keep up, you get bloat: a table that's mostly dead tuples, far bigger on disk than its live data, with queries reading dead space they then discard. The classic cause is a long-running transaction: because that old transaction might still need to see old tuple versions, vacuum can't remove anything newer than its snapshot. One forgotten six-hour analytics query can hold back cleanup across the whole database.
DecisionMVCC trades disk space and a cleanup process for lock-free reads.
The price of "readers never block writers" is that old row versions pile up and must be vacuumed away. On a busy update-heavy table you tune autovacuum to run more aggressively, and you hunt down long transactions that pin old snapshots. In exchange you get a database where a reporting query never blocks a checkout. For most workloads that's a great trade, but it means disk usage and vacuum health are things you actually watch.
The WAL: how Postgres survives a crash
Before Postgres changes a data page, it first writes the change to the Write-Ahead Log (WAL): an append-only log of "here is what I'm about to do." Only after the WAL record is safely on disk does the change get applied to the actual pages (lazily, in memory first).
This ordering is the durability guarantee. If the server loses power mid-write, on restart Postgres replays the WAL from the last checkpoint and reconstructs every committed change. It's the same persist-before-you-acknowledge principle from the chat deep dive, applied to disk pages.
The WAL is also how replicas stay in sync: a replica receives the primary's WAL stream and replays it to mirror the primary's state. So when a read replica lags behind, the bottleneck is almost always how fast it can replay WAL, not the network.
Indexes: turning a scan into a few hops
An index is a separate data structure that maps values to row locations, so Postgres can find rows without scanning the heap. The default and by far most common type is the B-tree: a balanced, sorted tree.
Picture a sorted tree of key values. To find email = 'x', Postgres walks from the root down through a couple of internal nodes to a leaf, following pointers, comparing as it goes. A few hops instead of millions of row reads. Because the tree is sorted, the same structure also serves ranges (<, >, BETWEEN) and prefix matches (LIKE 'foo%') efficiently. It cannot help a suffix match (LIKE '%foo'), because the sort order starts from the front of the string.
You can feel the difference an index makes. Without one, a lookup scans every row; with one, it's a direct hit.
The index types worth knowing
- B-tree — the default. Equality, ranges, prefix, ordering. Ninety percent of your indexes.
- Composite — an index on
(a, b, c)helps filters ona, on(a, b), or on all three, working strictly left to right. It does nothing for a filter onbalone. Put the most-selective, most-often-filtered column first. - Partial — indexes only rows matching a condition (
WHERE status = 'active'). Smaller and faster for "active rows only" queries. - Covering — includes the extra columns a query needs (
INCLUDE (...)) so Postgres answers from the index alone, never touching the heap. This is an index-only scan, the fastest read there is. - GIN — for "many values per row": arrays, JSONB, full-text search.
Why you don't index everything
An index is a copy of data that must be kept in sync. Every INSERT, UPDATE, and DELETE has to update every index on the table. So indexes speed up reads and slow down writes, and an over-indexed table can have writes take many times longer than necessary.
Unused indexes are pure cost
An index nobody queries still gets updated on every write and still takes disk and memory. Postgres tracks index usage in pg_stat_user_indexes; an index with near-zero scans after a representative period is dead weight you can drop. The discipline isn't "add indexes until it's fast," it's "add the indexes your real queries use, and remove the rest."
Putting it together: why this model pays off
Almost every Postgres performance question reduces to this picture. A slow query is usually a sequential scan on a big table that should have used an index. A table that's huge on disk but small in live data is MVCC bloat from updates plus lagging vacuum, often caused by a long transaction. Slow writes on a hot table can be too many indexes. A lagging replica is WAL replay struggling to keep up. None of these are mysterious once you can see the heap, the tuples, the WAL, and the B-tree in your head.
The one idea to take away
Rows live unsorted in heap pages, so finding one by value is either a full scan or an index hop. Updates write new tuple versions (MVCC) rather than overwriting, which keeps reads lock-free but produces dead tuples that vacuum must reclaim. The WAL makes all of it durable and feeds replicas. Indexes (mostly B-trees) trade write cost for read speed. Hold that picture and database behaviour becomes predictable instead of magical.
Test yourself
Questions· say the answer out loud before you open it. If you can't, the chapter isn't done.
QWhy is a row sometimes called a 'tuple', and why does it matter that the heap is unordered?+
A tuple is Postgres's internal name for one stored row version. The heap stores tuples in no particular order, wherever free space exists, so there's no cheap way to read rows 'in order' or find one by value without either sorting or using an index. That's the entire reason indexes exist.
QWhat actually happens on disk when you UPDATE a row?+
Postgres writes a new version of the tuple and marks the old one expired; it does not overwrite in place. This is MVCC. It lets transactions that started earlier keep seeing the old version (a consistent snapshot) while the new version becomes visible after commit. The side effect is dead tuples that vacuum later reclaims.
QYour table is 60 GB on disk but holds 5 GB of live data. Diagnose it.+
MVCC bloat. Updates and deletes leave dead tuples, and autovacuum hasn't reclaimed them, often because a long-running transaction is pinning an old snapshot so vacuum can't remove anything newer. Find long transactions in pg_stat_activity, tune autovacuum to run more aggressively, and consider pg_repack to rebuild the table online.
QWhat does the WAL guarantee, and how does it relate to replicas?+
It guarantees durability: changes are written to the append-only Write-Ahead Log before being applied to data pages, so a crash can be recovered by replaying the WAL. Replicas stay in sync by replaying the primary's WAL stream, which is why a lagging replica is usually limited by WAL replay speed rather than the network.
QHow does a B-tree index find a row, and what can't it help with?+
It walks a sorted, balanced tree from the root through a few internal nodes to a leaf, following pointers and comparing values, so a lookup is a few hops instead of a full scan. Being sorted, it also serves ranges, prefix matches, and ordering. It cannot help a suffix match like LIKE '%foo' because sort order starts at the front of the value.
QYou have an index on (country, city, name). Which queries use it?+
Filters on country, on country + city, or on all three, because composite indexes work left to right. A filter on city alone, name alone, or city + name without country cannot use it. Order the columns so the most-filtered one comes first.
QWhy not just add an index on every column?+
Because every index must be updated on every insert, update, and delete, so over-indexing slows writes (sometimes by a lot) and wastes disk and memory. Indexes you don't query are pure cost. Add the indexes your real queries use, check pg_stat_user_indexes, and drop the unused ones.
Comments
Loading comments…