Debugging a Production Incident, Like a Senior

There's a kind of debugging they don't teach, because it doesn't happen at a keyboard with a debugger attached. It's 2am, the error-rate graph is climbing, customers are tweeting, and you don't have a neat exception with a line number — you have a system, in production, behaving wrong, and no obvious reason why. How an engineer handles that hour is the single clearest signal of seniority, and almost none of it is about being clever. It's about having a method when your heart rate is up.

The instinct to suppress: "let me find the bug"

The junior move under pressure is to dive into the code and start reading, hunting for the flaw. In a 2am incident that's often exactly wrong, for two reasons. First, the system is currently harming users — every minute you spend understanding is a minute the harm continues, and understanding can take a long time. Second, production rarely hands you the bug; it hands you a symptom (latency up, errors up, a queue backing up), and the path from symptom to cause runs through data, not through code-reading.

So seniors invert it. The order is:

Stop the bleeding: Make the harm stop, even if you don't yet understand it. Mitigate first, diagnose second.
Find what changed: Production was fine; now it isn't. Something changed. Find the change before you theorize.
Narrow the search space: Use signals and bisection to go from "the system is broken" to "this component is broken," before reading a single line.
Then, and only then, read the code: Once the data points at a component, open the code with a specific question, not a general search.

The rest of this dive is each of those steps in depth.

Step 1: Stop the bleeding before you understand it

This is the hardest instinct to build, because it feels like cheating. You don't need to know why to make it stop. If a deploy 10 minutes ago lines up with the graph turning red, roll it back now — you can understand the bug at leisure once users are safe. If one feature is melting the database, disable that feature (a flag, a kill switch) and the rest of the product survives. If a downstream dependency is timing out and dragging everything with it, shed its load.

Decision: Mitigate first, diagnose second, even though mitigating may erase evidence.

There's real tension here: rolling back the deploy might hide the very change that would've told you the cause, and you'll have to reproduce it later in a safe environment. Do it anyway. The user impact is happening now and is concrete; your understanding can wait an hour. The senior judgment is recognizing that "I'll just quickly find the root cause first" is how a 5-minute incident becomes a 90-minute one. Capture what you can on the way out (a snapshot of the bad metrics, a few example failing requests) and then make it stop.
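A kill switch can be as small as a flag check with a cheap fallback. The sketch below is a minimal illustration, not a real flag system: `FLAGS`, `recommendations`, and `expensive_recommendations` are hypothetical names, and a production version would read the flag from whatever flag service or config store you already run, so it can be flipped without a deploy.

```python
# Hypothetical in-memory flag store; in practice this would be backed by
# a flag service or config store that can be changed without a deploy.
FLAGS = {"recommendations_enabled": True}


def expensive_recommendations(user_id: str) -> list[str]:
    """Stand-in for the feature that is melting the database."""
    return [f"item-for-{user_id}"]


def recommendations(user_id: str) -> list[str]:
    """Serve the feature only while its kill switch is on.

    When the flag is off, return a cheap degraded result so the rest of
    the product keeps working.
    """
    if not FLAGS.get("recommendations_enabled", False):
        return []  # degraded but safe: no database work at all
    return expensive_recommendations(user_id)


# During the incident: flip the switch first, diagnose afterwards.
FLAGS["recommendations_enabled"] = False
print(recommendations("u1"))  # prints []
```

The design choice worth noting is the default: a missing flag is treated as "off", so a broken flag store degrades the feature rather than re-enabling the thing that was hurting.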

The most useful question in any incident

"What changed?" Production doesn't usually break on its own — it breaks because something moved: a deploy, a config change, a feature flag flipped, a traffic spike, a dependency's own deploy, a certificate expiring, a disk filling, a cron job that fires at midnight. Before you theorize about subtle bugs, line your symptom's start time up against your deploy and change logs. The cause is sitting in that timeline far more often than in clever code analysis. If the graph turned at 02:14 and a deploy landed at 02:13, you're done theorizing.

Step 2: Read the four golden signals

When you don't have a stack trace, you have graphs — and there are four that, read together, tell you what kind of sick the system is. These are the golden signals, and knowing what each one's shape means is most of triage.

Latency — how long requests take

Rising latency with steady traffic usually means something the request depends on got slow or full (a slow query, a struggling dependency, an exhausted thread pool). Watch the tail (p99), not the average: the average hides the 1% of users having a terrible time.

Traffic — how many requests

A spike explains a lot: a sale, a campaign, a bot, a retry storm. Always check traffic first, because if load doubled, the "bug" might just be the system doing exactly what it does under 2x load.

Errors — what fraction fail

The rate and the kind. A wall of 5xx is your code or your dependencies failing. A wall of 4xx might be a client/contract change or an auth misconfiguration. A sudden change in the ratio is the signal.

Saturation — how full the resources are

CPU, memory, connection pool, disk, queue depth. Saturation is the leading indicator: a connection pool at 100% or a queue growing without bound tells you where the squeeze is, often before errors even start.
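The advice to watch the tail rather than the average is easy to see with numbers. This is a small sketch on made-up latencies, using a simple nearest-rank percentile written for the example (the `percentile` helper is not from any monitoring library).

```python
import math
import statistics


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up sample: 98 fast requests and 2 very slow ones, in milliseconds.
latencies_ms = [20.0] * 98 + [3000.0] * 2

print(statistics.mean(latencies_ms))   # 79.6
print(percentile(latencies_ms, 50))    # 20.0
print(percentile(latencies_ms, 99))    # 3000.0
```

A mean of about 80ms looks tolerable, while the p99 shows that some users are waiting three seconds.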

The shapes combine into a diagnosis. Latency up, traffic flat, saturation up on the database connection pool → you're pool-starved, probably a slow query holding connections (go read the query plan). Errors up, traffic spiked, latency up → you're overloaded by real or retry traffic (shed load, scale out). Errors up, everything else flat, right after a deploy → it's the deploy (roll back). You're pattern-matching the graphs to a class of failure before you ever touch code.
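Those three patterns can be written down as a tiny rule table. The sketch below is only an illustration of the pattern-matching; the `Signals` fields and the `classify` function are hypothetical, and deciding whether a signal is "up" against its baseline is left to whoever is reading the graphs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Each field says whether that signal is clearly elevated versus baseline."""
    latency_up: bool
    traffic_up: bool
    errors_up: bool
    db_pool_saturated: bool
    recent_deploy: bool


def classify(s: Signals) -> str:
    """Map a combination of signal shapes to a class of failure."""
    if s.latency_up and not s.traffic_up and s.db_pool_saturated:
        return "pool starvation: look for a slow query holding connections"
    if s.errors_up and s.traffic_up and s.latency_up:
        return "overload: shed load or scale out"
    everything_else_flat = not (s.latency_up or s.traffic_up or s.db_pool_saturated)
    if s.errors_up and s.recent_deploy and everything_else_flat:
        return "bad deploy: roll back"
    return "no known pattern: keep narrowing"


print(classify(Signals(latency_up=True, traffic_up=False, errors_up=False,
                       db_pool_saturated=True, recent_deploy=False)))
```

The fall-through case matters as much as the matches: when the graphs fit no known shape, the honest answer is to keep narrowing rather than force a diagnosis.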

Step 3: Bisect the system

You've stopped the bleeding and you have a class of failure. Now find where. The technique is the same one git bisect uses on commits, applied to the architecture: cut the system in half and ask which half has the problem.

A request flows browser → CDN → load balancer → your service → database (and maybe → a queue → a worker → a third party). The question at each hop is "is it healthy when it arrives here, and healthy when it leaves?"

Figure: Bisecting the request path. Client (ok?) → Load balancer (ok?) → Service (slow here) → Database (slow query). Trace the request across each boundary and ask where it's still healthy and where it isn't. The break is between the last good hop and the first bad one.
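The hop-by-hop question can be mechanized as a binary search, sketched here under two simplifying assumptions that real systems often violate: the request path is a single chain, and there is exactly one fault, so every hop from the fault outward looks unhealthy. The hop names and the `looks_healthy` check are hypothetical; in practice that check is you reading a dashboard or a log at that boundary.

```python
from typing import Callable, Optional, Sequence


def first_unhealthy_hop(hops: Sequence[str],
                        looks_healthy: Callable[[str], bool]) -> Optional[str]:
    """Binary-search for the first hop that looks unhealthy.

    hops must be ordered from the deepest dependency outward, so that a
    single fault makes every hop after it look unhealthy as well.
    Returns None if every hop looks healthy.
    """
    lo, hi = 0, len(hops)
    while lo < hi:
        mid = (lo + hi) // 2
        if looks_healthy(hops[mid]):
            lo = mid + 1   # the fault, if any, is further out
        else:
            hi = mid       # this hop or a deeper one is the first bad hop
    return hops[lo] if lo < len(hops) else None


path = ["database", "service", "load balancer", "CDN", "client"]
observed_healthy = {"database"}   # made-up observation for the example

print(first_unhealthy_hop(path, lambda hop: hop in observed_healthy))  # service
```

With five hops the search needs at most three checks, which is the point of cutting in half instead of walking the path end to end.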

This is exactly what distributed tracing gives you for free if you have it: a single request's journey, broken into timed spans across every service it touched, so you can see "the request spent 12ms in the service and 1,800ms waiting on the database" without guessing. If you don't have tracing, you bisect manually with logs and metrics at each boundary: is the load balancer returning errors, or passing them through from the service? Is the service slow on its own, or slow waiting on the database? Each answer halves the space. You keep cutting until you've cornered the failure in one component — and now the data has earned you the right to read that component's code.
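To make the "12ms in the service, 1,800ms waiting on the database" reading concrete, here is a minimal sketch of how per-component time falls out of a trace. The `Span` type is a hypothetical simplification and not the data model of any particular tracing library, and the timings are invented to match the example in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_times(spans: list[Span]) -> dict[str, float]:
    """Time spent in each span itself, excluding its direct children.

    Assumes the children of a span run one after another; overlapping
    children would need interval merging instead of a plain sum.
    """
    child_total: dict[str, float] = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


trace = [
    Span("a", None, "load balancer", 0, 1815),
    Span("b", "a", "service", 2, 1814),
    Span("c", "b", "database", 8, 1808),
]
times = self_times(trace)
print(times)                        # load balancer 3ms, service 12ms, database 1800ms
print(max(times, key=times.get))    # database
```

Subtracting child time is what keeps the outer hops from looking guilty: the load balancer span lasts 1,815ms, but only 3ms of that is its own.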

The three pillars, used in anger

The Edge chapter named the three pillars of observability — metrics, logs, traces. An incident is when you find out whether you actually invested in them. Metrics (the graphs) tell you something is wrong and what kind. Traces tell you where in the request path. Logs tell you what exactly happened at that spot. You move down the pillars as you narrow: graph says "errors up," trace says "in the payment service," logs say "connection refused to the fraud-check API." If you can't do that walk, the gap you found tonight is your top priority tomorrow.

Correlation is not cause: the trap that wastes the most time

Two graphs move together and the brain screams "there's the cause." Sometimes. Often both are effects of a third thing. The classic: latency spiked and CPU spiked, so you "fix" the CPU by scaling out — and the latency stays, because both were caused by a slow downstream dependency that was making every request pile up (which raised latency) and spin while waiting (which raised CPU). You treated a symptom that happened to be next to the cause.

Test the cause before you commit to it

Before you act on "X caused Y," ask: if X is really the cause, what else must be true? If a slow query is the cause, you should see that query in the slow-query log and connections piling up on it. If the deploy is the cause, rolling it back should fix it. A real cause makes specific, checkable predictions. If your theory predicts nothing you can verify, it's a guess wearing a lab coat — and acting on guesses at 2am is how you make an incident worse. The discipline of "what else must be true?" is what separates a diagnosis from a hunch.
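One way to keep yourself honest is to write the predictions down and check each one against evidence. The sketch below is a toy: `Prediction`, `check_hypothesis`, and the `evidence` dictionary are hypothetical, and in a real incident each check would be a query against your logs or metrics rather than a dictionary lookup.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Prediction:
    claim: str
    holds: Callable[[], bool]   # a check against real evidence


def check_hypothesis(name: str, predictions: list[Prediction]) -> list[str]:
    """Return the claims that failed; an empty list means every prediction held."""
    if not predictions:
        raise ValueError(f"{name!r} predicts nothing checkable; it is a guess")
    return [p.claim for p in predictions if not p.holds()]


# Made-up evidence gathered during the incident.
evidence = {"query_in_slow_log": True, "connections_waiting_on_query": 0}

failed = check_hypothesis("slow query is starving the pool", [
    Prediction("query appears in the slow-query log",
               lambda: evidence["query_in_slow_log"]),
    Prediction("connections are piling up on that query",
               lambda: evidence["connections_waiting_on_query"] > 0),
])
print(failed)  # ['connections are piling up on that query']
```

One failed prediction is enough to send you back to narrowing: the query is slow, but it is not what is holding the connections.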

The senior habit is to hold the hypothesis loosely and seek the mechanism: not "these moved together" but "here is the actual path by which this thing caused that thing, and here's the evidence of each link." When you can narrate the mechanism end to end, you've found the cause. When you can only narrate the correlation, you haven't yet.

After: the blameless postmortem that actually changes something

The incident's over; users are safe. The work isn't done, because an incident you don't learn from is one you'll have again. The postmortem's job is to change the system so this class of failure can't recur — and it only works if it's blameless.

Decision: Write blameless postmortems. Focus on the system that allowed the failure, never the person who triggered it.

The instinct after a painful night is to find who pushed the bad deploy. Resist it, and not out of niceness: the moment postmortems assign blame, engineers stop reporting incidents, stop volunteering "I think my change did this," and hide mistakes — and you lose the information that prevents the next one. Blameless doesn't mean "no accountability"; it means the question is "what about our system let a single human error become a customer-facing outage?" If one engineer's typo could take down production, the bug is the missing guardrail (no review, no staging check, no gradual rollout, no automatic rollback on error spike), not the typo. People will always make mistakes; the postmortem's job is to make the system survive them.

A postmortem that changes something has a specific shape: a timeline (what happened, when, in absolute timestamps), the impact (who was affected and how much — this is what justifies the follow-up work), the root cause told as a mechanism not a name, and — the only part that matters long-term — action items with owners and dates that remove the class of failure. "Be more careful" is not an action item. "Add an automatic rollback when error rate exceeds 2% within 5 minutes of a deploy, owned by Priya, by next sprint" is. The test of a postmortem is whether the same incident could happen again next month; if the answer is yes, it didn't produce a real fix.
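The example action item is specific enough to sketch as a rule. This is one possible reading of "error rate exceeds 2% within 5 minutes of a deploy", not a real deployment system's API: `should_roll_back` and the sample data are hypothetical, and a production rule would probably require several consecutive bad samples so that one noisy data point does not trigger a rollback.

```python
from datetime import datetime, timedelta


def should_roll_back(deployed_at: datetime,
                     samples: list[tuple[datetime, float]],
                     threshold: float = 0.02,
                     window: timedelta = timedelta(minutes=5)) -> bool:
    """True if any error-rate sample inside the post-deploy window exceeds
    the threshold.

    samples are (timestamp, error_rate) pairs, where error_rate is a
    fraction of requests (0.02 means 2%).
    """
    window_end = deployed_at + window
    return any(deployed_at <= ts <= window_end and rate > threshold
               for ts, rate in samples)


# Made-up data for illustration.
deployed_at = datetime(2024, 1, 10, 2, 13)
samples = [
    (datetime(2024, 1, 10, 2, 12), 0.001),  # before the deploy: ignored
    (datetime(2024, 1, 10, 2, 14), 0.050),  # 5% one minute after the deploy
]
print(should_roll_back(deployed_at, samples))  # True
```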

The one idea to take away

Production debugging is the inverse of code debugging: you narrow the search space under pressure instead of reading code until you spot the bug. Stop the bleeding first (roll back, flag off, shed load) even before you understand it — user harm is happening now and understanding can wait. Ask "what changed?" and check the timeline against deploys. Read the four golden signals (latency, traffic, errors, saturation) to classify the failure, then bisect the request path with traces and logs to corner it in one component — and only then read that component's code, with a specific question. Distrust correlation: demand the mechanism and check what else must be true. Afterwards, run a blameless postmortem whose action items remove the class of failure, because the goal was never to find a culprit — it was to make the system survive the next human mistake.

Test yourself

Questions: say the answer out loud before you read it. If you can't, the chapter isn't done.

Q: Why shouldn't you start a 2am incident by reading code to find the bug?

Two reasons. The system is harming users right now, and understanding can take a long time you don't have — so mitigation has to come first. And production hands you a symptom (latency, errors, a backing-up queue), not a stack trace, so the path from symptom to cause runs through data, not code-reading. Seniors invert the order: stop the bleeding, find what changed, narrow the search space with signals, and read code only once the data points at a component.

Q: What does 'stop the bleeding before you understand it' mean in practice, and what's the tension?

Make the harm stop without knowing the root cause: roll back the recent deploy, flip off the offending feature, or shed a failing dependency's load. The tension is that mitigating can erase the evidence you'd need to diagnose — rolling back hides the change. You do it anyway, capturing a snapshot of the bad metrics on the way out, because concrete user impact now outweighs your understanding later. "I'll just find the root cause first" is how a 5-minute incident becomes a 90-minute one.

Q: What are the four golden signals and what does each tell you?

Latency (how long requests take — watch the p99 tail, rising latency means something downstream got slow), traffic (how many requests — a spike may explain everything, so check it first), errors (what fraction fail and what kind — 5xx vs 4xx point at different culprits), and saturation (how full your resources are — pool, CPU, queue depth, the leading indicator of where the squeeze is). Read together, their shapes classify the kind of failure before you touch code.

Q: What does it mean to 'bisect the system'?

Apply git-bisect logic to the architecture: trace a request across each hop (client → LB → service → DB → worker → third party) and ask at each boundary whether it's healthy arriving and healthy leaving. The break is between the last good hop and the first bad one. Distributed tracing gives you this directly (timed spans per service); without it you bisect manually with logs and metrics at each boundary. Each answer halves the search space until one component is cornered.

Q: How do the three pillars of observability map onto narrowing an incident?

Metrics (graphs) tell you something is wrong and what kind; traces tell you where in the request path it's happening; logs tell you what exactly happened at that spot. You move down the pillars as you narrow: "errors up" → "in the payment service" → "connection refused to the fraud-check API." An incident is the test of whether you actually invested in all three — any pillar you're missing tonight is tomorrow's top priority.

Q: Why is correlation dangerous in debugging, and how do you guard against it?

Two graphs moving together feels like cause, but both are often effects of a third thing — latency and CPU both spiking because a slow dependency made requests pile up and spin, so scaling CPU fixes nothing. Guard against it by asking "if X is really the cause, what else must be true?" A real cause makes specific, checkable predictions (the slow query shows in the slow-query log; rolling back the deploy fixes it). Seek the mechanism end to end, not the coincidence.

Q: What makes a postmortem blameless, and why does it matter?

It focuses on what about the system let a human error become an outage, never on who triggered it. It matters for information, not niceness: blame makes engineers hide mistakes and stop reporting incidents, so you lose the data that prevents the next one. If one typo could take down production, the real bug is the missing guardrail (review, staging check, gradual rollout, auto-rollback), not the typo. Good postmortems end in action items with owners and dates that remove the class of failure — "be more careful" is not one.