Examples of using Window Functions in PostgreSQL

Introduction

Window functions often draw a blank stare when mentioned in a conversation about SQL. The support for using them in ORM's is usually limited or non-existent. They don't get mentioned in most courses on SQL/PostgreSQL, and so they slip under the radar.

But here's the thing: they're pretty handy. Not only do they help users remove cruft from their queries, they also provide real performance gains. They also make rows aware of the rows around them, which can help users to chain queries together when the data has a form of implicit association.

But knowing when to use them, and understanding the problems they can solve can be a bit of a minefield. Below is an example of a request to do with events stored on users and a dashboard to display them.

Our problem

Self-Description

You are storing user events in a very simple SQL table.

  1. Someone asks for a dashboard with the latest event for each user,
  2. Someone then asks that it shows the latest five events for each user,
  3. Someone else asks that it shows the latest five non-repeating events for each user.

For this example, let's not worry about displaying the dashboard. Instead, we'll just concern ourselves with preparing the data for it.

I have a simple SQL table with three columns, with tick expected to increment on each event for each name.

create table events(tick int, action varchar(10), name text);  

The order that events are created in the table is not guaranteed.

insert into events  
     values (1, 'update', 'andy'), 
            (2, 'update', 'beth'), 
            ...

Our solution

I think we should build an sql database

Latest event

This common request can be solved with a distinct on and an order by.

SELECT DISTINCT ON (name) tick, action, name  
FROM events  
ORDER BY name, tick DESC;

 tick | action |   name
------+--------+----------
  999 | delete | andy
  999 | edit   | beth
...
(8 rows)

Time: 8.856 ms  

Latest five events

This is often a difficult problem to solve as the key limiting factor is the last five events. You can't rely on some constant tick value as there's no guarantee that all the users are on the same tick.

  • We could request the last five events for each user individually, repeating this example by changing the name = 'andy' accordingly.
SELECT DISTINCT name FROM events;

   name
----------
 chris
 eric
...
(8 rows)

Time: 3.143 ms  

First we'd gather a list of unique names and then loop through each to get their latest events.

SELECT tick, action, name  
FROM events  
WHERE name = 'andy'  
ORDER BY tick DESC  
LIMIT 5;

 tick | action  | name
------+---------+------
  999 | delete  | andy
  998 | exit    | andy
...
(5 rows)

Time: 2.276 ms  

However this creates N number of queries to the database, where N is the number of users. Not very efficient for a dashboard that may have a lot of traffic.

  • We can move the first query as a inner SQL query that returns the last five ticks against the rows in the table. Then we can narrow it to only select the latest rows.
SELECT e.tick, e.action, e.name  
FROM events e  
WHERE e.tick IN (  
  SELECT tick 
  FROM events e2 
  WHERE e.name = e2.name 
  ORDER BY e2.tick DESC 
  LIMIT 5
)
ORDER BY e.name, e.tick DESC;

 tick |  action  |   name
------+----------+----------
  999 | delete   | andy
  998 | exit     | andy
...
(40 rows)

Time: 6783.961 ms  

The downside? This isn't very efficient as it has to check every row against all rows that match that user. What we want is the five maximum tick values PER user before requesting them, otherwise we are wasting CPU cycles.

Non-repeating events

This can be seen as a linked list problem. We want to read the latest events for each user, and only store each subsequent action that's different.

  • Check for non-repeating cycles in your favourite programming language.

We would read in a small number of events for each user. And if we can't get five non-repeating events then make another request to the database with an offset.

SELECT * FROM events  
WHERE name = 'andy'  
ORDER BY tick desc  
LIMIT 100  
OFFSET 100;

 tick |  action  | name
------+----------+------
  899 | enter    | andy
  898 | upvote   | andy
...
(100 rows)

Time: 2.298 ms  

While you'd expect user behaviour wouldn't be that repetitive, it will happen. This will increase the number of queries required, as well as the aforementioned requirement to repeat this for each user.

In exploring a pure SQL solution, I came up short in putting together a solution that did not utilise window functions. If you're able to provide a performant example please pass it onto us at uSwitch. We're always interested in learning new ways to solve these sort of issues.

Enter the window functions

Lead and lag will provide the linked list functionality that will help this query. Other window functions such as rank and row_number will enable us to select just the latest five.

Lead and Lag

These will give you the next and previous values. Those values depend on how you partition (similar to group by) and order your window function.

SELECT  
  tick,
  action,
  name,
  LAG(action) OVER w AS previous_action,
  LEAD(action) OVER w AS next_action
FROM  
  events e
WINDOW w AS (  
  PARTITION BY name
  ORDER BY tick
);

 tick |  action  |   name   | previous_action | next_action
------+----------+----------+-----------------+-------------
    0 | downvote | andy     |                 | comment
    1 | comment  | andy     | downvote        | comment
    2 | comment  | andy     | comment         | comment
    3 | comment  | andy     | comment         | edit
(8000 rows)

Time: 20.618 ms  

Here we're using the WINDOW w AS () so that we avoid repeating ourselves and making mistakes with the order or partitions in the separate columns.

These two window functions enable you to build lists of rows which are aware of what has come before or after. When combined with Common Table Expressions (CTE) you can filter for actions that are different from the next or previous action.

WITH events_with_next_action AS (  
    SELECT
      tick,
      action,
      name,
      LEAD(action) OVER w AS next_action
    FROM
      events e
    WINDOW w AS (
      PARTITION BY name
      ORDER BY tick
    )
)
SELECT  
  tick,
  action,
  name
FROM  
  events_with_next_action
WHERE  
  action IS DISTINCT FROM next_action
ORDER BY name, tick DESC;

 tick |  action  |   name
------+----------+----------
...
  961 | comment  | andy
  959 | edit     | andy
  958 | delete   | andy
  956 | upvote   | andy
  954 | create   | andy
...
(7011 rows)

Time: 22.364 ms  

We've pruned the list slightly to illustrate an area where we're skipping ticks because of the repetitions in the action. The performance of this across the total set of 8000 rows is very strong as we are simply narrowing the search space in our where clause.

You may have noticed the use of is distinct from. This avoids the problem with the first and last value in our partitions being compared against NULL and not being included in the returned rows.

Up until now we have been able to show that with window functions you can get the previous and next values without any heavy performance costs. But we haven't tackled the second and third question.

Rank & Row number

These functions, and their derivatives dense_rank and percent_rank, all provide a rather plain value. They usually return an accumulating number of the position, or rank, of the row within its partition and order. In this rather plain origin lies the answer to getting the latest events for each user.

WITH events_with_next_action AS (  
    SELECT
      tick,
      action,
      name,
      LEAD(action) OVER w AS next_action
    FROM
      events e
    WINDOW w AS (
      PARTITION BY name
      ORDER BY tick
    )
)
SELECT  
  tick,
  action,
  name,
  ROW_NUMBER() OVER w AS position 
FROM  
  events_with_next_action
WHERE  
  action IS DISTINCT FROM next_action
WINDOW w AS (  
  PARTITION BY name
  ORDER BY tick DESC
)

ORDER BY  
  name,
  tick DESC;

 tick |  action  |   name   | position
------+----------+----------+----------
  999 | delete   | andy     |        1
  998 | exit     | andy     |        2
  ...
    0 | downvote | andy     |      878
  999 | edit     | beth     |        1
  998 | enter    | beth     |        2
  ...
(7011 rows)

Time: 29.080 ms  

The position continues to rise until we move onto the next partition with "beth". Given this information we able to place this position into another CTE and simply return the rows that have a position of 5 or less.

Window function solutions

Here's the solution to the second problem of getting the latest five events for each user.

WITH events_with_row_numbers AS (  
    SELECT
      tick,
      action,
      name,
      ROW_NUMBER() OVER w AS position
    FROM
      events e
    WINDOW w AS (
      PARTITION BY name
      ORDER BY tick DESC
    )
)
SELECT  
  tick,
  action,
  name
FROM  
  events_with_row_numbers
WHERE  
  position < 6
ORDER BY  
  name,
  tick DESC;

 tick |  action  |   name
------+----------+----------
  999 | delete   | andy
  998 | exit     | andy
...
(40 rows)

Time: 13.851 ms  

And here's the solution to the third problem of getting the latest five non-repeating events for each user.

WITH events_with_next_action AS (  
    SELECT
      tick,
      action,
      name,
      LEAD(action) OVER w AS next_action
    FROM
      events e
    WINDOW w AS (
      PARTITION BY name
      ORDER BY tick
    )
), unique_events_with_rank AS (
  SELECT
    tick,
    action,
    name,
    ROW_NUMBER() OVER w AS position 
  FROM
    events_with_next_action
  WHERE
    action IS DISTINCT FROM next_action
  WINDOW w AS (
    PARTITION BY name
    ORDER BY tick DESC
  )
)
SELECT  
  tick,
  action,
  name
FROM  
  unique_events_with_rank
WHERE  
  position < 6
ORDER BY  
  name,
  tick DESC;

 tick |  action  |   name
------+----------+----------
  999 | delete   | andy
  998 | exit     | andy
...
(40 rows)

Time: 23.372 ms  

Both of these are very performant over the table, and are not complicated to read, once the behaviours of the window functions are explained.

Conclusion

Hopefully, this very basic example provides some insight into using window functions to solve specific problems. It's well worth using them anytime you find yourself running multiple queries against the same table set, especially where there is an implicit relationship in the rows returned. In this case the tick against each name/user was very important.

You can find a docker ready postgres instance here that has the data I built these queries around.

Reading list

Here are a few guides and documents I have used to help me understand the syntax behind window functions and some other examples of using them.