A Testing Strategy That Pays for Itself

Ask ten engineers why they write tests and you'll get ten answers: coverage, confidence, "best practice," because the linter complains. The honest answer is narrower and more useful: a test exists so that future-you can change this code and know within seconds whether you broke something that matters. That's it. Once you hold that as the goal, most arguments about testing resolve themselves, because you can ask of any test: does this let me change code with less fear, or does it just make change harder?

The simplest way to think about it

Imagine you're about to refactor a function. Before you touch it, you want a smoke alarm: something that goes off the moment the behavior changes in a way a user would notice. A good test is that smoke alarm. A bad test is a smoke alarm wired to the light switch — it goes off every time you do anything, so you learn to ignore it, and then it doesn't go off the day there's an actual fire.

So there are really only two ways a test fails you:

A flaky / brittle test

It fails when nothing real broke — you renamed an internal variable, reordered a list that has no defined order, or the clock ticked. You start re-running CI "until it goes green," and now the suite is decoration.

A missing test

The behavior that actually matters — money moves correctly, the wrong user can't read someone else's data — isn't checked at all, so the suite is green while the bug ships.

Every decision below is about avoiding both: test the behavior that matters, at the cheapest level that still catches the real bug, in a way that doesn't break when you change things that don't matter.
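Two of the brittle causes named above, undefined ordering and a ticking clock, can usually be designed out rather than retried away. Here is a minimal sketch, assuming a hypothetical `isExpired` function and `sameMembers` helper (both invented for this example, not from any library): the clock is passed in instead of read globally, and results with no defined order are compared as sorted copies.

```typescript
// Hypothetical types and helpers, for illustration only.
type Clock = () => number; // returns epoch milliseconds

interface Session {
  userId: string;
  expiresAt: number; // epoch milliseconds
}

// The clock is a parameter, so a test can pin "now" instead of
// depending on the wall clock.
function isExpired(session: Session, now: Clock = Date.now): boolean {
  return now() >= session.expiresAt;
}

// Compares two lists as unordered collections by sorting copies of both.
function sameMembers(actual: string[], expected: string[]): boolean {
  if (actual.length !== expected.length) return false;
  const left = [...actual].sort();
  const right = [...expected].sort();
  return left.every((value, i) => value === right[i]);
}

const session: Session = { userId: "u1", expiresAt: 1500 };

// A fixed clock gives the same answer on every run.
const expiredAtFixedNow = isExpired(session, () => 1000); // false: 1000 < 1500
const expiredLater = isExpired(session, () => 2000);      // true: 2000 >= 1500

// The result has no defined order, so the check must not rely on one.
const orderFree = sameMembers(["b", "a", "c"], ["a", "b", "c"]); // true
```

Neither change weakens the test. It still fails when expiry logic or list contents change, which is the fire you actually want the alarm for.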

The pyramid, and why the "trophy" replaced it

The classic advice is the testing pyramid: lots of fast unit tests at the bottom, fewer integration tests in the middle, very few slow end-to-end tests at the top. The shape encodes a cost argument — unit tests are cheap and fast, E2E tests are expensive and slow, so write more of the cheap ones.

Two shapes

End-to-end: few · slow · real browser/network

Integration: many · medium · real DB, in-process

Unit: some · fast · pure logic

Static (types, lint): free · instant · catches whole classes of bugs

The pyramid optimizes for speed; the trophy optimizes for confidence per test. For backend services the trophy usually wins.

The pyramid isn't wrong, but it was written before static type checkers such as TypeScript caught many of the bugs unit tests used to catch, and before integration tests got fast. The modern shape, the testing trophy (Kent C. Dodds' framing), pushes weight into the integration layer, because that's where the bang-for-buck is on a backend: a test that hits your real route handler, your real validation, and a real database catches the bugs users actually hit, while a wall of unit tests on mocked layers passes happily as the system breaks at the seams between them.

Decision: Default to integration tests for backend code; reserve unit tests for genuinely tricky pure logic.

A unit test on a function that just orchestrates calls to the DB and other services mostly tests your mocks, not your system — and mocks drift from reality silently. An integration test that drives the real handler against a real database tests the thing that ships. The cost is speed (milliseconds vs microseconds) and setup (you need a database in CI), and that cost is almost always worth it. Keep unit tests for the parts where the logic is the hard part: a date-range calculator, a pricing rule, a parser, a retry backoff. There, a unit test is precise and the integration test would be a clumsy way to probe the edge cases.
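As an illustration of code where the logic is the hard part, here is a sketch of a capped exponential backoff and the edge cases a unit test would pin down. The function name `backoffDelayMs`, its parameters, and the default values are assumptions made for this example, not taken from any library.

```typescript
// Hypothetical pure function: delay before retry number `attempt` (0-based).
// Doubles from `baseMs` on each attempt and never exceeds `capMs`.
function backoffDelayMs(attempt: number, baseMs = 100, capMs = 5000): number {
  if (!Number.isInteger(attempt) || attempt < 0) {
    throw new RangeError("attempt must be a non-negative integer");
  }
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Edge cases that are awkward to reach through a full integration test:
// the first attempt, the last value under the cap, the first capped value,
// and a very large attempt number.
const delays = [0, 1, 2, 5, 6, 50].map((attempt) => backoffDelayMs(attempt));
// attempt 0 gives 100, 1 gives 200, 2 gives 400, 5 gives 3200,
// 6 would be 6400 and is capped at 5000, 50 is capped at 5000
```

Reaching attempt 50 through a real handler would mean forcing fifty failures in a row. Here it is one function call.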

What to test: behavior, not implementation

The single most useful rule: test what the code does, not how it does it. A test should read like a description of a requirement.

// Brittle: couples the test to the implementation. Renaming the
// private method or reordering calls breaks it, though nothing real changed.
expect(service._calculateTax).toHaveBeenCalledWith(500);

// Robust: states the behavior a user/caller cares about.
const invoice = await createInvoice({ amount: 500, region: "KA" });
expect(invoice.total).toBe(590); // 500 + 18% GST

The second test survives any refactor that keeps the behavior. The first one is a tripwire on your own code's internals — it actively punishes you for cleaning up. If you find yourself asserting on private methods, spies, and call counts, you're testing implementation, and that's the brittle smoke alarm.

The refactor test

A good suite has this property: you can rewrite the internals of a module completely — different functions, different structure — and if the externally observable behavior is unchanged, not a single test should need editing. If a refactor forces you to rewrite tests, those tests were measuring the wrong thing.
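To make that property concrete, the sketch below gives a hypothetical `createInvoice` two different internal structures and runs one behavioral check against both. The 18% rate for region "KA" comes from the invoice example earlier; the integer-percent arithmetic, the `TAX_PERCENT` table, and the `checkInvoiceBehavior` helper are invented for illustration.

```typescript
interface Invoice {
  amount: number;
  total: number;
}
type CreateInvoice = (input: { amount: number; region: string }) => Invoice;

// Version 1: tax computed inline.
const createInvoiceV1: CreateInvoice = ({ amount, region }) => {
  const percent = region === "KA" ? 18 : 0;
  return { amount, total: amount + Math.round((amount * percent) / 100) };
};

// Version 2: same behavior, internals restructured around a lookup table
// and a separate helper function.
const TAX_PERCENT: Record<string, number | undefined> = { KA: 18 };
function taxFor(amount: number, region: string): number {
  return Math.round((amount * (TAX_PERCENT[region] ?? 0)) / 100);
}
const createInvoiceV2: CreateInvoice = ({ amount, region }) => ({
  amount,
  total: amount + taxFor(amount, region),
});

// The check sees only inputs and outputs, so it cannot tell the versions apart.
function checkInvoiceBehavior(createInvoice: CreateInvoice): boolean {
  const invoice = createInvoice({ amount: 500, region: "KA" });
  return invoice.total === 590; // 500 + 18%
}

const v1Passes = checkInvoiceBehavior(createInvoiceV1); // true
const v2Passes = checkInvoiceBehavior(createInvoiceV2); // true
```

A spy on `taxFor` would pass against version 2 and fail against version 1, even though both produce the same invoice.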

Why you test against a real database

This is the one most teams get wrong, and it's worth being firm about: integration tests should hit a real database, not a mock. The whole category of bugs that integration tests exist to catch lives in the gap between "what I think SQL does" and "what SQL does": a NULL that doesn't compare the way you expected, a unique constraint you forgot, a migration that's valid on an empty table and fails on a populated one, a transaction isolation level that allows a write skew under load. A mocked database returns exactly what you told it to return, so it can never surprise you — which means it can never catch the bug, because the bug is the surprise.

The practical setup: spin up a real Postgres for the test run (a container, or an ephemeral schema per worker), run your migrations against it, and let each test create the data it needs and roll back or truncate afterwards. Once the database is warm, each test runs in milliseconds.

One database, isolated per test
Either wrap each test in a transaction that rolls back at the end (fast, but you can't test code that itself commits), or truncate the relevant tables between tests (slower, but tests real commit behavior). Most teams pick truncation for handler tests and transaction-rollback for repository tests.
Run the real migrations
The test database is built by the same migration files as production. This means a broken migration fails your tests, not your deploy — which is exactly where you want to find it.
Seed the minimum, assert the outcome
Each test inserts only the rows it needs, calls the real code path, and asserts on what came back and what's now in the database. No mocks of your own data layer.
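The transaction-rollback option can be sketched as a small wrapper. The `SqlClient` interface and the `withRollback` name are assumptions for this example; the interface is deliberately tiny so that a real driver's connection object could plausibly be adapted to it. The `RecordingClient` is there only to make the statement order visible in a self-contained snippet. It is not a suggestion to mock the database, and in a real suite you would pass a connection to the test Postgres instead.

```typescript
// Minimal shape of a SQL client (hypothetical; adapt to your driver).
interface SqlClient {
  query(sql: string): Promise<unknown>;
}

// Runs `body` inside a transaction and always rolls back,
// so no rows written by the test survive it.
async function withRollback<T>(
  client: SqlClient,
  body: (client: SqlClient) => Promise<T>,
): Promise<T> {
  await client.query("BEGIN");
  try {
    return await body(client);
  } finally {
    // Runs whether the body succeeded or threw.
    await client.query("ROLLBACK");
  }
}

// Stand-in that records statements instead of executing them.
class RecordingClient implements SqlClient {
  readonly statements: string[] = [];
  async query(sql: string): Promise<unknown> {
    this.statements.push(sql);
    return undefined;
  }
}

const recorder = new RecordingClient();
const finished = withRollback(recorder, async (db) => {
  await db.query("INSERT INTO users (email) VALUES ('a@example.com')");
});
// Once `finished` resolves, recorder.statements holds BEGIN, the INSERT, ROLLBACK.
```

As noted above, this wrapper hides commit behavior from the code under test, which is why truncation remains the better fit for handler tests that commit on their own.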

The incident this prevents

The canonical disaster: a service with 95% coverage, all green, ships a migration that adds a NOT NULL column without a default. Every unit test passes because they mock the database. The deploy runs the migration against the populated production table, it fails halfway, and now you're in an incident with a half-applied schema. A single integration test that ran the migration against a table with one row in it would have caught it in CI. Mocks can't catch what they're told to fake.

Contract testing: the seam between services

Once you have more than one service, a new class of bug appears that neither unit nor integration tests catch: the producer changes the shape of its response, and the consumer breaks, and nobody finds out until production. Service A's tests pass (it tests itself). Service B's tests pass (they mock A's response — with the old shape). They only meet in production.

A contract test pins the agreement between two services. The consumer declares "when I call GET /users/:id, I expect a body with id, email, and displayName." That expectation (the contract) is shared with the producer, and the producer's CI runs a test that verifies its real responses still satisfy every consumer's contract. Now if the producer renames displayName to name, the producer's build goes red, before it ships, naming the consumer that would have broken.

Without contract tests

Each side mocks the other with its own assumptions. Both suites are green. The mismatch is discovered by a user, or by an on-call engineer reading a stack trace at midnight.

With contract tests

The consumer's expectations are executable and shared. The producer can't merge a breaking change without its own build failing and telling it exactly which consumer it would break.

You don't need a heavy framework to start: even a shared, versioned JSON schema for each endpoint that both sides validate against is a lightweight contract. The principle is what matters — the agreement between services is itself something you test, not something you hope holds.
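Here is one way the lightweight version could look. The contract format (field name mapped to an expected `typeof` result), the `checkContract` helper, and the sample bodies are all invented for illustration, and treating `id` as a string is an assumption. A real setup would more likely use JSON Schema or a contract-testing tool.

```typescript
// A deliberately tiny contract format: field name -> expected typeof result.
type FieldType = "string" | "number" | "boolean";
type Contract = Record<string, FieldType>;

// Published by the consumer: the fields it reads from GET /users/:id.
const userContract: Contract = {
  id: "string",
  email: "string",
  displayName: "string",
};

// Returns a list of violations; an empty list means the body satisfies the contract.
function checkContract(contract: Contract, body: unknown): string[] {
  if (typeof body !== "object" || body === null) {
    return ["body is not an object"];
  }
  const record = body as Record<string, unknown>;
  const violations: string[] = [];
  for (const field of Object.keys(contract)) {
    const actual = typeof record[field];
    if (actual !== contract[field]) {
      violations.push(`${field}: expected ${contract[field]}, got ${actual}`);
    }
  }
  return violations;
}

// In the producer's CI these bodies would come from its real handler;
// they are hard-coded here to keep the sketch self-contained.
const currentBody = { id: "u_1", email: "a@example.com", displayName: "Asha", plan: "free" };
const renamedBody = { id: "u_1", email: "a@example.com", name: "Asha" };

const currentViolations = checkContract(userContract, currentBody); // []
const renamedViolations = checkContract(userContract, renamedBody);
// ["displayName: expected string, got undefined"]
```

The check ignores fields the consumer never declared, such as `plan`, so the producer stays free to add data. Only removing or retyping something a consumer reads turns the build red.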

Testing the non-deterministic part: code that calls an LLM

The Frontier chapter's whole point is that an AI feature is a distributed system where the service in the middle gives different answers to the same input. That breaks the foundational assumption of a normal test: expect(output).toBe(expected). Run the same prompt twice and you may get two different strings, both correct. So you test it in layers, and you keep the non-deterministic part out of your normal suite.

Test everything around the model deterministically
The prompt you build, the retrieval that feeds it, the parsing of its response, your retry and timeout logic, your cost guardrails — all of that is ordinary code. Mock the LLM call (return a canned completion) and unit/integration test the scaffolding hard. This is most of your bug surface, and it's fully deterministic.
Assert on structure and properties, not exact text
For the model's actual output, assert what must be true regardless of wording: the JSON parses, the required fields exist, the summary is under N tokens, the classification is one of the allowed labels, no disallowed content. These are property checks, not equality checks.
Move quality measurement into evals, not CI gates
Whether the answer is good (not just well-formed) is measured by an eval set — a fixed set of inputs with graded expected outcomes — run on a schedule or before a prompt change, scored by exact match where possible and by a model-graded rubric where not. Evals tell you "this prompt change regressed quality by 4%." They don't belong as a hard pass/fail in the unit suite, because they're statistical, not binary.
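The property checks in the second layer might look like the sketch below. The output format (a JSON object with `label` and `summary`), the allowed labels, and the length limit are assumptions for this example. Length is measured in characters as a stand-in, because counting tokens needs the model's own tokenizer.

```typescript
// Hypothetical output format: the model is asked to return JSON such as
// {"label": "...", "summary": "..."}.
const ALLOWED_LABELS = new Set(["bug", "feature", "question"]);
const MAX_SUMMARY_CHARS = 200;

// Returns a list of problems; an empty list means the output is well-formed.
function checkModelOutput(raw: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (error) {
    return ["output is not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null) {
    return ["output is not a JSON object"];
  }
  const { label, summary } = parsed as { label?: unknown; summary?: unknown };
  const problems: string[] = [];
  if (typeof label !== "string" || !ALLOWED_LABELS.has(label)) {
    problems.push("label is missing or not an allowed value");
  }
  if (typeof summary !== "string" || summary.length === 0) {
    problems.push("summary is missing or empty");
  } else if (summary.length > MAX_SUMMARY_CHARS) {
    problems.push("summary is too long");
  }
  return problems;
}

// Two differently worded completions both pass; a chatty, malformed one does not.
const firstWording = checkModelOutput('{"label":"bug","summary":"Login fails on Safari."}');
const secondWording = checkModelOutput('{"label":"bug","summary":"Safari users cannot sign in."}');
const chatty = checkModelOutput('Sure! Here is the JSON: {"label":"bug"}');
```

Because the checker never compares text for equality, it gives the same verdict for every acceptable wording, which is what makes it safe to gate on in CI.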

The mental shift for AI tests

Normal tests answer "is it correct?" with yes/no. AI tests split into two questions: "is the output well-formed and safe?" (deterministic, gate it in CI) and "is the output good?" (statistical, measure it with evals and watch the trend). Conflating the two gives you a flaky suite that blocks deploys for no reason.
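The statistical half can be sketched just as briefly. Everything here is invented for illustration: the eval cases, the `passRate` helper, and the two stand-in functions that play the role of "model plus old prompt" and "model plus new prompt". A real eval would call the model (asynchronously), use a much larger set, and fall back to a graded rubric where exact match doesn't apply.

```typescript
interface EvalCase {
  input: string;
  expected: string;
}

// Fraction of cases (0 to 1) where the answer exactly matches the expected label.
function passRate(cases: EvalCase[], answer: (input: string) => string): number {
  if (cases.length === 0) return 0;
  const passed = cases.filter((c) => answer(c.input).trim() === c.expected).length;
  return passed / cases.length;
}

// Tiny invented eval set.
const evalSet: EvalCase[] = [
  { input: "refund not received", expected: "billing" },
  { input: "cannot reset password", expected: "account" },
  { input: "app crashes on launch", expected: "bug" },
  { input: "charged twice", expected: "billing" },
];

// Deterministic stand-ins for the system before and after a prompt change.
const oldPrompt = (input: string): string =>
  input.includes("password") ? "account" : input.includes("crash") ? "bug" : "billing";
const newPrompt = (input: string): string =>
  input.includes("password") ? "account" : "billing";

const baseline = passRate(evalSet, oldPrompt);  // 4 of 4 correct
const candidate = passRate(evalSet, newPrompt); // 3 of 4 correct
const regression = baseline - candidate;        // 0.25
```

The output is a number to compare against the previous run, not a pass or fail. Where to draw the line for "too much regression" is a product decision, which is why it sits outside the hard CI gate.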

What not to test

Confidence comes as much from what you skip as what you write. Don't test:

Third-party libraries. You don't test that Postgres' ORDER BY sorts, or that the HTTP framework parses headers. Test your use of them.
Trivial glue with no logic. A one-line getter, a passthrough. There's no behavior to protect; the test is pure ceremony.
The framework's behavior. That a route is registered, that DI wires up. If it's broken, every test fails anyway.
Exact wording of error messages and logs (unless a machine parses them). Asserting on prose makes copy edits break the build.

Coverage percentage is a weak proxy and a dangerous target. You can hit 100% by testing getters while never asserting that money moves correctly. Aim coverage at the code where a bug would hurt — money, auth, data integrity — and let the rest be thin.

Coverage is a map, not the territory

A coverage number tells you which lines ran during tests, not which behaviors are verified. A test that calls a function and asserts nothing gives you 100% coverage of that function and zero confidence. Use coverage to find what's completely untested (the zeros are informative), not to chase a percentage.
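A small sketch of that gap, using an invented `applyDiscount` function with a deliberate bug. Both "tests" execute every line of it, so a coverage tool would report them identically. Only one of them checks anything.

```typescript
// Buggy on purpose: the discount is added instead of subtracted.
function applyDiscount(priceCents: number, discountCents: number): number {
  return priceCents + discountCents;
}

// "Test" 1: runs the code, asserts nothing, always passes.
function testRunsTheCode(): boolean {
  applyDiscount(1000, 200);
  return true;
}

// Test 2: same lines covered, but it states the expected behavior.
function testChecksTheBehavior(): boolean {
  return applyDiscount(1000, 200) === 800;
}

const results = {
  runsTheCode: testRunsTheCode(),             // true: green suite, bug ships
  checksTheBehavior: testChecksTheBehavior(), // false: got 1200, expected 800
};
```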

The one idea to take away

Write the tests that let you change code without fear, at the cheapest level that still catches a real bug. For backend work that center of gravity is the integration test against a real database, not a wall of mocked unit tests. Test behavior, not implementation, so refactors don't break your suite. Pin the seams between services with contracts. For AI features, gate well-formedness deterministically and measure quality with evals. And skip the tests that only add ceremony — a smaller suite you trust beats a huge one you've learned to ignore.

Test yourself

Questions: say the answer out loud before you read it. If you can't, the chapter isn't done.

Q: What's the actual purpose of a test, in one sentence?

To let future-you change this code and know within seconds whether you broke a behavior that matters. Coverage, "best practice," and green badges are proxies; the real goal is fearless change. Any test that doesn't serve that — that breaks on harmless refactors or skips the behavior that matters — is failing at its job.

Q: Why does the 'testing trophy' push weight into integration tests instead of unit tests for backend code?

Because most backend bugs live in the seams between layers — the handler, the validation, and the database interacting — and a wall of unit tests on mocked layers passes while those seams break. Integration tests drive the real code path against a real database, so they catch the bugs users actually hit. Types and linting now catch many bugs unit tests used to, freeing you to invest in the integration layer. Unit tests stay valuable for genuinely tricky pure logic.

Q: What does 'test behavior, not implementation' mean, and how do you know you've violated it?

Assert on what the code does as observed by a caller (the returned invoice total), not how it does it (which private method was called with what). You've violated it when a refactor that preserves behavior forces you to rewrite tests — that means the tests were measuring internals. The litmus test: you should be able to rewrite a module's internals completely and, if behavior is unchanged, edit zero tests.

Q: Why test against a real database instead of a mock?

Because the bugs integration tests exist to catch live in the gap between what you think SQL/your schema does and what it actually does: null comparisons, unique constraints, migrations that fail on populated tables, isolation anomalies. A mock returns exactly what you told it to, so it can never surprise you — and the bug is the surprise. Spin up a real Postgres, run real migrations, isolate per test by transaction-rollback or truncation.

Q: What bug does contract testing catch that unit and integration tests miss?

A producer service changes its response shape and a consumer breaks, but each side's suite is green because each mocks the other with its own (now-stale) assumptions. A contract test makes the consumer's expectations executable and shared, and runs them against the producer's real responses in the producer's CI — so a breaking change turns the producer's build red before shipping, naming the consumer it would break.

Q: How do you test code that calls an LLM when the same input gives different outputs?

In layers. Test the scaffolding (prompt building, retrieval, parsing, retries, cost guards) deterministically by mocking the model call. For the model's real output, assert structure and properties (valid JSON, required fields, allowed labels, length, safety) rather than exact text. Measure whether answers are actually good with an eval set scored on a schedule — keep that out of the hard CI gate because it's statistical, not binary.

Q: Why is chasing a coverage percentage a poor goal?

Coverage measures which lines ran during tests, not which behaviors are verified — a test that asserts nothing still counts. So you can hit 100% by exercising getters while never checking that money moves correctly. Use coverage to find code that's completely untested (the zeros), and aim real testing effort at where a bug would hurt: money, auth, and data integrity.