Get AI summaries of any video or article — Sign up free
Session 36 - Window Functions in SQL | DSMP 2023 thumbnail

Session 36 - Window Functions in SQL | DSMP 2023

CampusX·
5 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Window functions compute analytical metrics across related rows while returning results for each original row, unlike GROUP BY which collapses to one row per group.

Briefing

Window functions in SQL are positioned as the key upgrade from basic aggregation: they let analysts compute metrics like averages, ranks, and “next/previous” values while keeping the original row-level detail. Instead of collapsing data into one row per group (as with GROUP BY), window functions return results for every row, which makes it possible to run more complex analytical queries without losing context.

The session starts by grounding the concept. A window function is described as an analytical function that performs calculations across a set of rows related to the current row. That “set” is defined using window specifications—most importantly PARTITION BY (which splits rows into groups based on a column) and ORDER BY (which defines the sequence within each partition). The instructor emphasizes that this is conceptually similar to GROUP BY, but with a crucial behavioral difference: GROUP BY produces one output row per group, while window functions produce output row-by-row.

A worked example uses a student marks table with columns like Student ID, Name, Branch, and Marks. The goal is to compute average marks per branch. With GROUP BY, the output becomes a smaller table: Branch plus average marks. With a window function such as AVG(Marks) OVER (PARTITION BY Branch), the average is computed per branch but repeated on every original row belonging to that branch—so each student row carries the branch-level average alongside their own marks. The practical payoff is highlighted: even small differences in output shape enable a wide range of analytics.

The session then contrasts “overall” aggregation with windowed aggregation. An AVG over the entire dataset yields a single overall average repeated nowhere useful for row-level comparisons. Adding OVER and PARTITION BY Branch turns it into a per-branch average repeated for each student, enabling queries like “show only students whose marks are greater than their branch average.” The approach avoids correlated subqueries by using the windowed average directly in a comparison.

Next comes a core set of ranking functions that analysts repeatedly rely on: RANK, DENSE_RANK, and ROW_NUMBER. All three rank rows within a partition based on ORDER BY (typically Marks descending). RANK assigns the same rank to ties but skips subsequent ranks; DENSE_RANK also assigns the same rank to ties but does not skip; ROW_NUMBER assigns a unique sequential number to every row, even when marks tie. The instructor also notes that ORDER BY is essential for deterministic ranking.

The session expands into “value extraction” functions—FIRST_VALUE, LAST_VALUE, and NTH_VALUE—used to pull the first, last, or nth element within a window after sorting. A major conceptual warning appears here: LAST_VALUE depends on the window frame. That leads to the idea of FRAMES (the subset of rows included in the calculation for the current row). Using the default frame can produce surprising LAST_VALUE results; adjusting the frame to something like “UNBOUNDED FOLLOWING” ensures the last value truly reflects the end of the partition for each row.

Finally, the session covers LAG and LEAD for previous/next row values within a partition. These functions enable rolling analytics such as month-over-month growth: compute monthly revenue totals, then use LAG to subtract the previous month and calculate growth percentage. The session closes by previewing that the remaining window functions and more difficult query patterns will be tackled next, with practice-focused tasks and scenarios.

Cornell Notes

Window functions let SQL compute analytical metrics across related rows while still returning results for every original row. By using PARTITION BY to split data into groups and ORDER BY to define row order, functions like AVG, RANK, and ROW_NUMBER can produce group-aware outputs without collapsing rows like GROUP BY does. The session highlights ranking behavior differences: RANK skips ranks after ties, DENSE_RANK doesn’t, and ROW_NUMBER always assigns unique numbers. It also stresses that FIRST_VALUE/LAST_VALUE depend on the window frame, so adjusting frames (e.g., UNBOUNDED FOLLOWING) is necessary for correct “last” results. LAG/LEAD then support row-to-row comparisons such as month-over-month revenue growth.

How does a window function differ from GROUP BY in output shape, and why does that matter?

GROUP BY collapses rows into one output row per group (e.g., Branch + average marks). A window function computes the same group-level metric but returns it for every row in the partition. In the student example, AVG(Marks) OVER (PARTITION BY Branch) repeats the branch average on each student row, enabling comparisons like “marks greater than branch average” while keeping student-level detail.

What do PARTITION BY and ORDER BY control inside a window function?

PARTITION BY defines how rows are grouped into separate windows (e.g., one window per Branch). ORDER BY defines the sequence within each partition, which is crucial for functions that depend on ordering—especially ranking (RANK/ROW_NUMBER) and offset functions (LAG/LEAD). Without ORDER BY, ranking and “next/previous” logic can become nondeterministic.

When should RANK, DENSE_RANK, and ROW_NUMBER be used, and what’s the tie behavior difference?

All three rank within a partition based on ORDER BY (e.g., Marks DESC). RANK gives the same rank to ties and skips the next rank (e.g., two students tied for 1st both get rank 1, next becomes rank 3). DENSE_RANK gives the same rank to ties but does not skip (next becomes rank 2). ROW_NUMBER always assigns a unique sequential number to every row, even if marks tie—useful when you need exactly one row per position.

Why can LAST_VALUE produce unexpected results, and how do window frames fix it?

LAST_VALUE depends on the window frame—the subset of rows considered for the current row. With the default frame, LAST_VALUE may only look within a limited range (e.g., up to the current row), so it can return the “last value so far” rather than the true last value of the partition. Setting the frame to include the rest of the partition (such as UNBOUNDED FOLLOWING) makes LAST_VALUE reflect the actual end of the partition for every row.

How do LAG and LEAD enable month-over-month growth calculations?

After aggregating revenue per month, LAG(revenue) fetches the previous month’s revenue for the current month within an ORDER BY month sequence. Growth can then be computed as (current_month_revenue - previous_month_revenue) / previous_month_revenue, producing a month-over-month growth percentage. The key is partitioning (if needed) and ordering by month so “previous” is well-defined.

How can a query filter rows using a windowed aggregate without correlated subqueries?

Compute the windowed metric (e.g., AVG(Marks) OVER (PARTITION BY Branch)) and then filter based on it. For example, to keep only students whose Marks exceed their branch average, compare Marks > branch_avg directly in a WHERE clause (or equivalent filtering step). This avoids correlated subqueries because the branch average is already available per row via the window function.

Review Questions

  1. In the student marks example, what exact change turns a GROUP BY average into a windowed average that repeats per row?
  2. Given ties in marks within a branch, how would the output ranks differ between RANK and DENSE_RANK?
  3. What frame adjustment is needed so LAST_VALUE returns the true last value of the partition rather than the last value within the default frame?

Key Points

  1. 1

    Window functions compute analytical metrics across related rows while returning results for each original row, unlike GROUP BY which collapses to one row per group.

  2. 2

    PARTITION BY defines the window’s groups (e.g., one partition per Branch), and ORDER BY defines the row order inside each partition.

  3. 3

    RANK, DENSE_RANK, and ROW_NUMBER differ mainly in how they handle ties: RANK skips ranks, DENSE_RANK doesn’t, and ROW_NUMBER always produces unique positions.

  4. 4

    FIRST_VALUE, LAST_VALUE, and NTH_VALUE depend on the window frame; incorrect default frames can make LAST_VALUE behave like “last so far.”

  5. 5

    Adjusting the frame (e.g., using UNBOUNDED FOLLOWING) ensures LAST_VALUE reflects the end of the partition.

  6. 6

    LAG and LEAD provide previous/next row values within a partition and are ideal for rolling comparisons like month-over-month revenue growth.

  7. 7

    Windowed aggregates can be used for filtering (e.g., marks above branch average) without correlated subqueries by computing the needed metric per row.

Highlights

AVG(Marks) OVER (PARTITION BY Branch) repeats the branch average on every student row, enabling row-level comparisons that GROUP BY can’t support directly.
RANK skips ranks after ties, DENSE_RANK doesn’t, and ROW_NUMBER always assigns a unique number—so the choice depends on how ties should be treated.
LAST_VALUE can be wrong unless the window frame is set correctly; UNBOUNDED FOLLOWING is a common fix to get the true last value.
LAG/LEAD turn “previous month” and “next month” into simple columns, making growth-rate calculations straightforward.

Topics

  • Window Functions
  • Partitioning
  • Ranking Functions
  • Window Frames
  • LAG Lead Growth