So far every query returned rows. But business questions are usually summaries: "how many orders?", "what's the average price?", "revenue per order?". For those you aggregate: collapse many rows into one number, optionally one number per group.
Aggregate functions: many rows, one number
An aggregate function takes a whole column of values and returns a single result. The common ones:
COUNT(*)— how many rowsSUM(column)— totalAVG(column)— averageMIN(column)/MAX(column)— smallest / largest
On their own they squash the whole table to one row. SELECT COUNT(*) FROM products; returns just 6. That's useful, but the real power comes from doing it per group.
Aggregates skip NULL (except COUNT(*))
One detail that surprises people: SUM, AVG, MIN, MAX, and COUNT(column) all ignore NULL values. So AVG(price) is the average of the rows that have a price, not "treat missing as zero," and those two give different answers. COUNT(*) is the exception, it counts every row including ones full of NULLs, whereas COUNT(price) counts only rows where price isn't NULL. When a total or average looks off, check whether NULLs are quietly being left out.
GROUP BY: one summary row per group
GROUP BY splits the rows into groups that share a value, then runs the aggregate within each group. "How many products in each category, and their average price?" groups by category and counts within each.
Press Run to execute your query.
Read it as: form one group per distinct category, then for each group report the count and the average price. The rule to internalise: every column in your SELECT must either be in the GROUP BY or wrapped in an aggregate. It can't be a plain column that isn't grouped, because the database wouldn't know which row's value to show for the group. (One quirk: rows where the grouped column is NULL all collapse into a single group, since GROUP BY treats every NULL as the same group even though NULL never equals itself elsewhere.)
Aggregating across a join
Aggregation gets powerful when combined with joins. "Total value of each order" needs the order's items, each item's quantity, and each product's price, multiplied and summed per order. Join first, then group.
Press Run to execute your query.
The shape here is the everyday workhorse of analytics: join the tables you need, then GROUP BY the thing you're summarising over, then SUM/COUNT/AVG the measure. Revenue per order, orders per customer, items per category, all the same pattern.
HAVING: filter the groups
You already know WHERE filters rows. But what if you want to filter on the aggregate, like "only categories whose average price is above 1000"? WHERE can't see aggregates (it runs before grouping). That's what HAVING is for: it filters groups after aggregation.
Press Run to execute your query.
WHERE vs HAVING — the order they run
WHERE filters individual rows before they're grouped. HAVING filters whole groups after the aggregate is computed. So "only Electronics products" is a WHERE (it's about rows), while "only categories averaging over 1000" is a HAVING (it's about a group's aggregate). You can use both in one query: WHERE narrows the rows, then GROUP BY groups them, then HAVING filters the groups.
The full pipeline
Now you've met every major clause. They always run in this logical order, even though you write SELECT first:
FROM / JOIN
Gather and connect the rows.
WHERE
Drop rows that don't match (before grouping).
GROUP BY
Collapse the surviving rows into groups.
HAVING
Drop groups that don't match the aggregate condition.
SELECT
Choose columns and compute aggregates.
ORDER BY / LIMIT
Sort the result and cut it to size.
The mental model
Aggregation answers "how many / how much / on average," and GROUP BY answers it "per something." Join to bring the data together, WHERE to drop rows, GROUP BY to form groups, an aggregate to summarise each, HAVING to drop groups. Once this pipeline is in your head, most reporting queries write themselves.
Next: everything so far has only read data. Time to change it, with INSERT, UPDATE, and DELETE, and to see why the database guards you while you do.
Test yourself
Questions· say the answer out loud before you open it. If you can't, the chapter isn't done.
QWhat does an aggregate function do?+
It takes a whole column of values and returns a single result: COUNT how many, SUM total, AVG average, MIN/MAX the extremes. On its own it collapses the whole table to one number; with GROUP BY it produces one number per group.
QWhat's the rule about columns in a SELECT with GROUP BY?+
Every selected column must either appear in the GROUP BY or be wrapped in an aggregate. A plain ungrouped column isn't allowed, because the database wouldn't know which row's value to show for the group. So you group by category and select category plus aggregates of the other columns.
QWhat's the difference between WHERE and HAVING?+
WHERE filters individual rows before grouping; HAVING filters whole groups after the aggregate is computed. 'Only Electronics products' is a WHERE (about rows); 'only categories averaging over 1000' is a HAVING (about a group's aggregate). You can use both together.
QHow do you compute revenue per order across joined tables?+
Join orders to order_items to products, then GROUP BY the order id and SUM(quantity * price). The pattern is: join the tables you need, group by the thing you're summarising over, and aggregate the measure. It generalises to orders per customer, items per category, and so on.
QHow do aggregate functions handle NULL values?+
They skip them. SUM, AVG, MIN, MAX, and COUNT(column) all ignore NULLs, so AVG(price) averages only the rows that have a price rather than treating missing as zero. The one exception is COUNT(*), which counts every row regardless. This is why a SUM or AVG can look wrong when a column has missing values.
QIn what logical order do the clauses actually run?+
FROM/JOIN gather rows, WHERE drops rows, GROUP BY forms groups, HAVING drops groups, SELECT picks columns and computes aggregates, then ORDER BY and LIMIT shape the output. You write SELECT first, but it's evaluated near the end.
Comments
Loading comments…