Shipping code is the easy part. Keeping it running, knowing when it's sick, and not getting surprised by the bill is the rest of the job. None of this is glamorous, and all of it is what separates a working demo from a system people depend on.
Containers, properly
A container packages your app so it runs the same way everywhere. It bundles your code, the Node runtime, your dependencies, and the OS-level libraries into one image. The host runs that image in an isolated environment using kernel features (namespaces and cgroups on Linux). It isn't a virtual machine (there's no separate kernel; every container shares the host's), but it feels like one.
Three things make a good image: small size, fast builds, and security. A multi-stage Dockerfile gets you all three.
# Build stage: install everything, compile.
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Runtime stage: start fresh, install production dependencies only, copy the compiled output.
FROM node:20-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY package*.json ./
# --omit=dev skips devDependencies, so build tools never reach the shipped image.
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
USER node
EXPOSE 3000
CMD ["node", "dist/index.js"]

The build stage has all your dev tools; the runtime stage installs only production dependencies and copies the compiled output, so the shipped image stays small.
Aim for under 200 MB
Use alpine or distroless base images. Put rarely-changing steps (installing dependencies) before frequently-changing ones (copying code) so the layer cache gets reused. Add a .dockerignore for node_modules, .git, and build artefacts. If your Node image is 1.5 GB, you're shipping your build tools or a stack of unused dependencies.
Where to deploy: the honest landscape
There's no "best," only "best for your stage."
A managed PaaS takes a Dockerfile or a Git repo and handles scaling, zero-downtime deploys, and basic observability. It costs more per CPU-hour than raw cloud, but it saves you a platform engineer's worth of time. Serverless is great for spiky, event-driven work and scales to zero, but it's a poor fit for long-running processes and for anything that holds persistent database connections. If a function has to talk to a relational database, put a connection pooler in front of it so a burst of instances doesn't exhaust the database's connection limit.
The honest answer for your first project
Use Fly.io or Cloud Run. You'll learn the real concepts (containers, scaling, deploys, rollouts) without spending a month on Kubernetes YAML. Learn enough Kubernetes vocabulary (pod, service, deployment, ingress) to read a manifest, and adopt the real thing only if you join a team that already runs it.
CI/CD pipelines
Continuous integration means every push runs your tests automatically. Continuous deployment means passing tests ship on their own (often to staging, then to production behind a manual gate). A reasonable pipeline runs in this order, fastest checks first:
1. Lint and typecheck. Fast, runs first, fails early.
2. Unit tests. Fast and parallelisable.
3. Integration tests. Slower, hit a real database in a container.
4. Build the image, deploy to staging. Cache the layers, then run smoke tests against staging.
5. Deploy to production. Automatic, or behind a manual gate depending on risk.
For the deploy itself, rolling replaces instances a few at a time (the default), blue-green runs two full environments and swaps traffic (easy rollback, double capacity briefly), and canary sends a small slice of traffic to the new version and expands only if the metrics hold (the safest for high traffic).
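The canary decision is easy to state as code. Here is a minimal sketch, assuming a hypothetical metrics snapshot with an error rate and a p95 latency. The rollout steps and the tolerances are illustrative choices, not values from any particular platform.

```typescript
// Hypothetical metric snapshot; the field names are assumptions for this sketch.
interface CanaryMetrics {
  errorRate: number;    // fraction of failed requests, e.g. 0.002
  p95LatencyMs: number; // 95th-percentile latency in milliseconds
}

type CanaryDecision =
  | { action: "expand"; nextPercent: number }
  | { action: "done" }
  | { action: "rollback"; reason: string };

// Assumed rollout steps: percentage of traffic on the new version.
const ROLLOUT_STEPS = [1, 10, 50, 100];

// Compare the canary with the stable baseline and decide the next move.
// Tolerances (0.5 percentage points of errors, 20% latency) are illustrative.
function decideCanary(
  currentPercent: number,
  baseline: CanaryMetrics,
  canary: CanaryMetrics,
): CanaryDecision {
  if (canary.errorRate > baseline.errorRate + 0.005) {
    return { action: "rollback", reason: "error rate regressed" };
  }
  if (canary.p95LatencyMs > baseline.p95LatencyMs * 1.2) {
    return { action: "rollback", reason: "p95 latency regressed" };
  }
  for (const step of ROLLOUT_STEPS) {
    if (step > currentPercent) {
      return { action: "expand", nextPercent: step };
    }
  }
  return { action: "done" }; // already at 100%
}
```

In practice a deploy tool would run a check like this after each soak period and send traffic back to the old version as soon as it returns a rollback.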
Migrations are where teams break production
Database migrations must run before the code that needs them, and they must be backward-compatible because the old version is still live during the deploy. The expand-migrate-contract pattern from the data chapter is not optional here.
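To make the ordering concrete, here is a sketch of an expand-migrate-contract plan for a hypothetical column rename (users.full_name to users.display_name). The table, the columns, and the safeWithOldCode flag are all assumptions made for illustration; the point is that only the steps the old code can survive may run before the new code ships.

```typescript
type Phase = "expand" | "migrate" | "contract";

interface MigrationStep {
  phase: Phase;
  description: string;
  sql: string;
  // True when the currently deployed (old) code keeps working after this step.
  safeWithOldCode: boolean;
}

// Hypothetical plan: rename users.full_name to users.display_name.
const renameColumnPlan: MigrationStep[] = [
  {
    phase: "expand",
    description: "Add the new column as nullable; old code ignores it",
    sql: "ALTER TABLE users ADD COLUMN display_name TEXT",
    safeWithOldCode: true,
  },
  {
    phase: "migrate",
    description: "Backfill existing rows while new code writes both columns",
    sql: "UPDATE users SET display_name = full_name WHERE display_name IS NULL",
    safeWithOldCode: true,
  },
  {
    phase: "contract",
    description: "Drop the old column once no deployed code reads it",
    sql: "ALTER TABLE users DROP COLUMN full_name",
    safeWithOldCode: false,
  },
];

// Steps that may run while the old version is still serving traffic.
function stepsRunnableBeforeDeploy(plan: MigrationStep[]): MigrationStep[] {
  const runnable: MigrationStep[] = [];
  for (const step of plan) {
    if (!step.safeWithOldCode) break; // stop at the first step that would break old code
    runnable.push(step);
  }
  return runnable;
}
```

The contract step waits for a later deploy, after every running instance has stopped reading the old column.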
Observability: the three pillars
When your app misbehaves in production, you need to answer three questions, and each has its own tool.
Logs — what happened
Structured, JSON, with a request ID on every line. Use Pino in Node and ship to a central system. Don't log secrets or PII.
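A dependency-free sketch of what one such log line looks like. Pino gives you this shape, plus its own redaction option, out of the box; the function below only illustrates the idea. The field names and the list of redacted keys are assumptions, and only top-level keys are checked.

```typescript
// Keys whose values must never reach the log stream (an illustrative list).
const REDACTED_KEYS = ["password", "token", "authorization", "email"];

type LogFields = { [key: string]: unknown };

// Build one JSON log line. Every line carries the request ID.
function formatLogLine(
  level: "info" | "warn" | "error",
  requestId: string,
  message: string,
  fields: LogFields = {},
): string {
  const safe: LogFields = {};
  for (const key of Object.keys(fields)) {
    const isSensitive = REDACTED_KEYS.indexOf(key.toLowerCase()) !== -1;
    safe[key] = isSensitive ? "[redacted]" : fields[key];
  }
  // Caller-supplied fields are spread first so they cannot overwrite
  // the standard fields that follow.
  return JSON.stringify({
    ...safe,
    time: new Date().toISOString(),
    level,
    requestId,
    msg: message,
  });
}

console.log(formatLogLine("info", "req-123", "order created", { orderId: 42, token: "abc" }));
```

Because each line is a single JSON object, the central log system can index requestId and filter on it directly.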
Metrics — how often
Numbers over time: request rate, error rate, p95/p99 latency, queue depth, pool usage. Prometheus and Grafana are the open standard.
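Percentiles are worth computing once by hand to see why they matter more than the average. Below is a small sketch using the nearest-rank method on raw samples. A metrics system like Prometheus estimates percentiles from histogram buckets instead, so treat this as an illustration of the idea rather than how your dashboard computes it.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// 95 fast requests and 5 slow ones.
const latencies: number[] = [];
for (let i = 0; i < 95; i++) latencies.push(20);
for (let i = 0; i < 5; i++) latencies.push(900);

const meanMs = latencies.reduce((sum, ms) => sum + ms, 0) / latencies.length;
console.log(meanMs);                    // 64
console.log(percentile(latencies, 50)); // 20
console.log(percentile(latencies, 99)); // 900
```

The mean of 64 ms hides the fact that one request in twenty takes 900 ms; the p99 shows it immediately.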
The third pillar is traces: a single request flows through many services, and a trace shows the whole journey as nested spans with timing for each. When your homepage takes two seconds, the trace tells you which of the twelve downstream calls is to blame.
The one habit that pays off most
Propagate a single trace ID through every log line, every error, and every queue message. When something breaks, you search one ID and see the entire causal chain in one place. This single discipline turns an hour of detective work into a 30-second search.
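Here is a sketch of that habit with explicit passing, so you can see every place the ID has to travel. The header name x-trace-id and the helper names are assumptions for illustration (the W3C Trace Context standard uses a traceparent header, and a tracing library would normally handle this for you). The ID generator is deliberately simplistic.

```typescript
// Assumed header name for this sketch. Node lower-cases incoming header names.
const TRACE_HEADER = "x-trace-id";

type HeaderMap = { [name: string]: string | undefined };

// Illustrative generator only; use a real UUID or your tracing library's IDs in practice.
function newTraceId(): string {
  let id = "";
  for (let i = 0; i < 32; i++) {
    id += Math.floor(Math.random() * 16).toString(16);
  }
  return id;
}

// Reuse the caller's trace ID when present, otherwise start a new trace.
function traceIdFrom(incoming: HeaderMap): string {
  return incoming[TRACE_HEADER] || newTraceId();
}

// 1. Every log line carries the ID.
function tracedLogLine(traceId: string, msg: string): string {
  return JSON.stringify({ traceId, msg });
}

// 2. Every downstream HTTP call forwards it.
function outgoingHeaders(traceId: string): HeaderMap {
  return { [TRACE_HEADER]: traceId };
}

// 3. Every queue message embeds it, so the consumer can continue the trace.
function queueMessage<T>(traceId: string, payload: T): { traceId: string; payload: T } {
  return { traceId, payload };
}
```

A request handler would call traceIdFrom once at the edge and hand the result to everything it touches. In a larger Node codebase, AsyncLocalStorage can carry the ID implicitly instead of threading it through every function signature.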
SLOs and error budgets
The most important idea from Site Reliability Engineering: perfect reliability is not the goal. You define an acceptable amount of unreliability and manage to it.
Three acronyms. An SLI is a measurement, like "percent of requests that finish under 200ms without a 5xx." An SLO is your target for it, like "99.9% of requests." An SLA is a contractual SLO with consequences if you miss it.
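The SLI in that example can be computed directly from request records. A minimal sketch, assuming a hypothetical RequestSample shape; in production the count would come from your metrics system rather than an in-memory array.

```typescript
// Hypothetical per-request record.
interface RequestSample {
  statusCode: number;
  durationMs: number;
}

// A request is "good" if it finished under the threshold without a 5xx.
function isGoodRequest(sample: RequestSample, maxMs: number = 200): boolean {
  return sample.statusCode < 500 && sample.durationMs < maxMs;
}

// SLI: percentage of good requests in the window.
function sliPercent(samples: RequestSample[], maxMs: number = 200): number {
  if (samples.length === 0) return 100; // no traffic means nothing was violated
  const good = samples.filter((s) => isGoodRequest(s, maxMs)).length;
  return (good / samples.length) * 100;
}

// SLO check: is the measured SLI at or above the target?
function meetsSlo(samples: RequestSample[], targetPercent: number): boolean {
  return sliPercent(samples) >= targetPercent;
}
```

Note that a 404 counts as good here: the service did its job quickly, even though the caller asked for something that doesn't exist.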
The error budget
An SLO of 99.9% means a 0.1% error budget, about 43 minutes a month of allowed unreliability. Under budget, ship features fast. Over budget, slow down and focus on reliability. It turns the safety conversation into a number instead of an argument. Pick SLIs that reflect what users feel (did the page load?), not internal numbers (CPU usage). Two or three SLOs per service is plenty.
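The 43-minute figure is simple arithmetic: a 30-day window has 30 × 24 × 60 = 43,200 minutes, and 0.1% of that is 43.2. The sketch below does the same calculation for any target, plus a request-based variant that reports how much budget is left. The function names are hypothetical.

```typescript
// Allowed minutes of unreliability for a given SLO over a window.
function errorBudgetMinutes(sloPercent: number, windowDays: number = 30): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - sloPercent / 100);
}

// Fraction of the error budget still unspent, counted in requests.
// A negative result means the budget is blown.
function budgetRemaining(
  sloPercent: number,
  totalRequests: number,
  failedRequests: number,
): number {
  const allowedFailures = totalRequests * (1 - sloPercent / 100);
  if (allowedFailures === 0) return failedRequests === 0 ? 1 : -Infinity;
  return 1 - failedRequests / allowedFailures;
}

console.log(errorBudgetMinutes(99.9).toFixed(1));  // "43.2"
console.log(errorBudgetMinutes(99.99).toFixed(1)); // "4.3"
```

Each extra nine cuts the budget by a factor of ten, which is why the target deserves an explicit decision instead of a reflexive "as high as possible."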
Security at the API
Most of what a senior engineer needs here is "know enough not to ship the obvious bugs." The OWASP API Security Top 10 is the canonical list. The ones you'll meet most:
- Broken object-level authorization. GET /orders/42 returns the order without checking it belongs to the caller. The single most common API security bug.
- Excessive data exposure. Your API returns the whole user object, password_hash included, because you forgot to pick the fields.
- Mass assignment. A PATCH lets a user set isAdmin: true because you accepted the whole body.
- SQL injection. String-concatenated SQL. Always use parameterised queries.
- Insecure direct object references. Predictable IDs let an attacker walk /users/1, /users/2. Use UUIDs or scoped opaque IDs, in addition to the authorization check rather than instead of it.
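The first three items in that list share a shape: the code trusted something it should have checked. Below is a sketch of the corresponding fixes. The record types, field names, and the list of patchable fields are hypothetical, and the functions stand in for what would normally live in route handlers or middleware.

```typescript
interface OrderRecord {
  id: string;
  ownerId: string;
  total: number;
}

interface UserRecord {
  id: string;
  email: string;
  displayName: string;
  passwordHash: string;
  isAdmin: boolean;
}

// Broken object-level authorization: check ownership, not just authentication.
function getOrderFor(callerId: string, order: OrderRecord | undefined): OrderRecord {
  // Same error for "missing" and "not yours", so the endpoint doesn't reveal which IDs exist.
  if (!order || order.ownerId !== callerId) {
    throw new Error("404 Not Found");
  }
  return order;
}

// Excessive data exposure: build the response from an explicit field list.
function toPublicUser(user: UserRecord): { id: string; displayName: string } {
  return { id: user.id, displayName: user.displayName };
}

// Mass assignment: copy only allow-listed fields from the request body.
const PATCHABLE_USER_FIELDS = ["email", "displayName"] as const;

function applyUserPatch(user: UserRecord, body: { [key: string]: unknown }): UserRecord {
  const updated: UserRecord = { ...user };
  for (const field of PATCHABLE_USER_FIELDS) {
    const value = body[field];
    if (typeof value === "string") {
      updated[field] = value;
    }
  }
  return updated; // isAdmin and passwordHash are never touched by client input
}
```

An allow-list fails safe: a field added to UserRecord later stays unpatchable and unexposed until someone deliberately opts it in.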
Secrets and transport
Never commit, log, or print secrets. Use a secrets manager and rotate regularly. Serve HTTPS only, always (certs are free from Let's Encrypt). Configure CORS tightly with specific origins, never *. Add a Content Security Policy and start strict.
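"Specific origins, never *" comes down to an exact-match allow-list. A minimal sketch, with placeholder origins; most frameworks ship a CORS middleware that accepts a list like this, so you would rarely write the check by hand.

```typescript
// Placeholder origins for this sketch; list the front ends you actually serve.
const ALLOWED_ORIGINS = ["https://app.example.com", "https://admin.example.com"];

// Return the CORS response headers for a request, or none if the origin isn't allowed.
function corsHeadersFor(requestOrigin: string | undefined): { [header: string]: string } {
  if (!requestOrigin || ALLOWED_ORIGINS.indexOf(requestOrigin) === -1) {
    return {}; // no CORS headers: the browser blocks the cross-origin read
  }
  return {
    // Echo the specific origin that matched, never a wildcard.
    "Access-Control-Allow-Origin": requestOrigin,
    // Tell caches the response differs per origin.
    "Vary": "Origin",
  };
}
```

Exact string comparison is deliberate: suffix or substring matching lets a lookalike domain such as app.example.com.evil.test slip through.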
Cost literacy
Senior engineers know what their architecture costs. It shapes design decisions and is one of the cheapest ways to look senior in a budget conversation.
Where cloud bills actually come from
Data transfer (egress) is the silent killer: moving data out to the internet costs around $0.09/GB, and many a startup has found a five-figure egress bill on the third of the month. NAT gateways charge per GB and add up fast. Logs charge per GB ingested and stored, so verbose logging is a real line item. Compute is the part everyone watches and the part that's easiest to reason about.
The optimisation moves: reserved instances or savings plans for steady workloads (30–60% off), spot instances for fault-tolerant work, a CDN in front of object storage to dodge egress, tight log-retention policies, and right-sizing instances every quarter because most run half-idle. Above all, set billing alerts at several thresholds. The worst bill is the one you didn't see coming.
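Two of these are worth turning into back-of-the-envelope code: the egress estimate and the alert thresholds. The sketch below uses the roughly $0.09/GB figure quoted above as a default. Real pricing is tiered and varies by provider and region, and the threshold fractions are an illustrative choice.

```typescript
// Rough egress cost at a flat per-GB rate (real pricing is tiered).
function egressCostUsd(gigabytes: number, usdPerGb: number = 0.09): number {
  return gigabytes * usdPerGb;
}

// Which alert thresholds has month-to-date spend already crossed?
function crossedThresholds(
  spendUsd: number,
  budgetUsd: number,
  fractions: number[] = [0.5, 0.8, 1.0],
): number[] {
  return fractions.filter((fraction) => spendUsd >= budgetUsd * fraction);
}

// Serving 120 TB (120,000 GB) to the internet in a month:
console.log(egressCostUsd(120000).toFixed(0)); // "10800"

// $850 spent against a $1,000 budget has crossed the 50% and 80% alerts.
console.log(crossedThresholds(850, 1000)); // [ 0.5, 0.8 ]
```

At this rate it takes only about 111,000 GB in a month to reach a five-figure egress bill, which a single popular download or an unthrottled export job can manage.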
Test yourself
Questions: say the answer out loud before you read it. If you can't, the chapter isn't done.
Q: Your Docker image is 1.5 GB. Where's the weight, and how do you fix it?
Most likely build tools in the final image, dev dependencies not pruned, and a heavy base image (node:20 is about 1 GB versus node:20-alpine at about 150 MB). Use a multi-stage build that compiles in stage one and copies only compiled output and production dependencies into an Alpine or distroless stage two. Add a .dockerignore.
Q: Liveness probe vs readiness probe?
Liveness asks "is the process up?" and if it fails the orchestrator restarts the container. Readiness asks "can it serve traffic?" and if it fails the orchestrator stops sending traffic without restarting. A pod warming a cache is alive but not ready. During graceful shutdown you flip readiness to fail so traffic drains, then exit.
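The difference fits in one small function. This is a sketch only: the lifecycle states and the mapping to status codes are assumptions, and a real service would expose the two results on separate HTTP endpoints that the orchestrator polls.

```typescript
type LifecycleState = "starting" | "ready" | "draining";

interface ProbeResult {
  liveness: number;  // HTTP status for the liveness endpoint
  readiness: number; // HTTP status for the readiness endpoint
}

function probeStatus(state: LifecycleState): ProbeResult {
  return {
    // The process is up in every state modelled here, so never ask for a restart.
    liveness: 200,
    // Only accept traffic once warm-up is done and before shutdown begins.
    readiness: state === "ready" ? 200 : 503,
  };
}
```

On a shutdown signal the app moves to the draining state, readiness starts returning 503, in-flight requests finish, and then the process exits.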
Q: Deploying a risky change. What strategy?
Canary. Roll out to about 1% of traffic, watch error rate, latency, and business metrics for a while, then expand to 10%, 50%, 100% with checkpoints. Roll back to zero if anything regresses. It needs good observability so you see the problem before users complain.
Q: Define an SLO for a checkout endpoint.
Something like: "99.9% of POST /checkout requests over a rolling 30-day window return a non-5xx response within 1500ms." That's about 43 minutes of allowed badness a month. Checkout failures cost real money, so this is a tight one, and you watch both error rate and latency in the same SLO.
Q: What's broken object-level authorization and why is it #1?
Returning objects by ID without checking that the caller has access to that specific object. GET /invoices/42 returns an invoice belonging to another customer. It's common because developers check authentication and role but forget per-resource ownership, and it's damaging because it leaks data between customers. Fix it with ownership checks at the resource level, ideally enforced in one place: middleware, or row-level security (RLS) in the database.
Q: Your cloud bill jumped 3x this month. Where do you look first?
Your provider's cost breakdown (Cost Explorer on AWS), grouped by service and usage type. The usual culprits, in order: data transfer/egress (someone pulled gigabytes out), log ingestion (a verbose new log line times millions of requests), object-storage requests, or an instance left running. Egress is the most common because it's invisible until the bill arrives.
Q: Value of propagating a trace ID through your logs?
One ID ties together every log line and span across every service for a single request. When something fails you search one ID and see the whole causal chain, instead of correlating timestamps by hand. It's the difference between an hour of work and a 30-second search.
Q: Intermittent failures only in production, and logs show nothing. What's missing?
Probably tracing and metrics. Logs tell you what code ran, not about resource contention, network blips, or queue depth. Add metrics on every external call, traces across service boundaries, and runtime metrics like event-loop lag and GC pauses. Most "logs show nothing" mysteries become obvious in traces or metrics.