X

Celebrating the joy and power of Oracle SQL with the Oracle Developer Advocate team

How to select the top-N rows per group with SQL in Oracle Database

Chris Saxon
Developer Advocate

Fetching the first N rows from a result set is easy with the fetch first clause:

select * 
from   co.orders 
order  by order_datetime desc 
fetch  first 10 rows only;

Or if you're using an archaic version of Oracle Database you can use the rownum trick.

But what if you want the top-N rows in each group? For example, the three most recent orders for each customer?

There's no native syntax for this in SQL. In this post we'll explore how to:

To run the queries, we'll use the Customer Orders schema, which you can get from GitHub. The scripts for this post are available in this Live SQL script.

How to find the top-N rows in each group

Photo by Romain Chollet on Unsplash

Getting the top-N rows for a group is a two-step process:

  1. Assign row numbers starting at one for each group
  2. Filter the result of this down to the rows less than or equal to the number you want

Assigning values from one for each group

To set the row numbers, use the handy row_number() function. This assigns sequential integers starting at one to each row, according to the sort you specify. You define this order in the over clause.

This is also where you can get the row numbers to start at one for each group. Do this with the partition by clause. This splits the rows into groups for each unique set of values in partitioning columns.

Like a regular query, to define the sort, use the order by clause.

So to find the most recent orders for each customer, you should:

  • partition by customer_id
  • order by order_datetime desc

Giving this query:

select o.*,
       row_number () over (
         partition by customer_id
         order by order_datetime desc
       ) rn
from   co.orders o;

Restrict the results to the first N in each group

After generating the row numbers, you need to place this logic in a subquery. This could be an inline view or with clause; I find CTEs easier to read so tend to go with that.

Whichever method you choose, all that's left to do is filter the output of the row_number expression to the most rows you want in each group:

with rws as (
  select o.*, row_number () over (
           partition by customer_id
           order by order_datetime desc
         ) rn
  from   co.orders o
)
  select * from rws
  where  rn <= 3
  order  by customer_id, order_datetime desc;

Pretty easy, right?

Often this is what you want, but there are a few nuances to be aware of. Let's start with a special case: getting one row per customer. For example, the most recent order for each customer.

How to find the first row in each group

Photo by Benjamin Davies on Unsplash

You can get one row per customer and the date of their most recent order using group by:

select customer_id,
       max ( order_datetime )
from   co.orders
group  by customer_id;

The challenge here is adding other columns to the result.

If you add these unaggregated to the select list, you also need to group by them. Doing this gives you a row for each combination of values, so you could have many rows for each customer instead of one.

Placing the other columns in max or min also gives the incorrect result, as this returns the highest or lowest value for each customer. Not the values from the row with the most recent date.

To get only the values for the most recent row, you could use the row numbering method above.

Or there is another trick you can use: the keep clause.

How to get values for columns with the same value as the first or last

The keep clause – referred to as the first and last functions in the docs – enables you to choose which value from the group to show. The syntax for this is:

aggregate_function keep ( 
  dense_rank { first | last }
  order by <sorting columns>
)

This tells the database to get the top (first) or bottom (last) value for the column according to the sort. If two or more rows are tied for first or last position, you take the highest value with max and the lowest with min.

So to get the store where the most recent order was placed for a customer, you need to:

  • Take the min/max store_id
  • Keeping the last row
  • Sorted by order_datetime (ascending)

Which gives this query:

select customer_id,
       max ( order_datetime ),
       max ( store_id ) keep (
         dense_rank last
         order by order_datetime
       ) last_purchase_store_id
from   co.orders
group  by customer_id;

Note that taking the first value from an ascending sort is logically the same as the last value for a descending sort. So could also write the query above by flipping first to last and ordering oldest to newest:

select customer_id,
       max ( order_datetime ),
       max ( store_id ) keep (
         dense_rank first
         order by order_datetime desc
       ) last_purchase_store_id
from   co.orders
group  by customer_id;

Which combination of first/last and ascending or descending sorting you use is largely a matter of personal preference. Some clauses will more accurately reflect the question the query answers.

This is cool, but could lead to lots of extra typing. Particularly if you have hundreds of columns to return that aren't part of the grouping or sorting.

So you may be wondering: why choose this over a generic top-N/group solution, filtered to just one row?

In general you won't! But in rare cases you may need to get values from different rows in each group.

For example, say you've been asked to find for each customer:

  • The date of their last order
  • The store where they placed their first order
  • The status of the order with the highest order_id

Using the top-N per group method, you need to run three separate queries, sorting by order_datetime desc, order_datetime, and order_id desc respectively.

Whereas using keep first/last, you can get all these values with this query:

select customer_id,
       max ( order_datetime ) last_order_date,
       max ( store_id ) keep (
         dense_rank first
         order by order_datetime
       ) first_purchase_store_id,
       max ( order_status ) keep (
         dense_rank last
         order by order_id
       ) last_order_id_status
from   co.orders
group  by customer_id;

While it's unusual to have to answer these kinds of queries, it's worth being aware of these functions when they do crop up.

There is one final case to consider when fetching one row per group. That's where you just don't care which value you get for the unaggregated, ungrouped columns.

Oracle Database 19c added an easy option to do this – the any_value function.

How to get values from a random row in the group with any_value

Sometimes you want to write a query like this:

select customer_id,
       max ( order_datetime ),
       <other columns>
from   co.orders
group  by customer_id;

In this case it doesn't matter which values you fetch for the other columns. From Oracle Database 19c you can fetch a value from random row in each group using the any_value function:

select customer_id,
       max ( order_datetime ),
       any_value ( store_id ),
       any_value ( order_status ),
       any_value ( order_id )
from   co.orders
group  by customer_id;

So why choose this over the long-standing min or max functions?

Firstly it more accurately reflects your intent. You just want some value, rather than explicitly the highest or lowest in the group.

Secondly, it avoids any sorting needed by min and max functions. On large data sets this could lead to performance gains.

Like min and max this ignores null values. So it only returns a null value if the entire group is null.

This gives you several ways to get one row per group. Returning the first row for a customer highlights an important question:

What do you do if two or more rows have the same value for the sort key?

How to deal with ties

Image by TooMuchCoffeeMan from Pixabay

So far we've assumed a customer can only place one order at a time. While there's no unique constraint in the schema to guarantee this, in practice it's unlikely a customer will place two orders at exactly the same time.

But there are some scenarios where duplicates are more-or-less guaranteed. For example, for each store show the three customers that placed the most orders.

To do this, you need to group by store and customer, counting how many rows there are for each:

select store_id, customer_id, 
       count (*) num_orders
from   co.orders
group  by store_id, customer_id;

You want to number the rows for each store, sorted by the number of rows for each customer, most to least. This is:

  • partition by store_id
  • order by count (*) desc

Plugging these into the query gives:

with rws as (
  select store_id, customer_id, count (*) num_orders,
         row_number () over (
           partition by store_id
           order by count(*) desc
         ) rn
  from   co.orders
  group  by store_id, customer_id
)
  select * from rws
  where  rn <= 3
  order  by store_id, rn;

It's likely many customers will make same number of orders at each location. For example, at store 1:

  • 4 customers made 8 orders
  • 13 customers made 7 orders
  • 26 customers made 6 orders

So which customers should you show if you're limiting to the top three per store?

There are few ways to you could handle this:

  • Always return (at most) N rows.
    • If the sort has duplicates for the Nth row, return a random row
    • If the sort has duplicates for the Nth row, always return the same row
  • Return all the rows with the same value as the Nth
  • Return all the rows with the first N values for the sorting columns

The implementation so far returns the first option: at most N rows. Whether the Nth row chosen is deterministic depends on the order by clause. If the combination of this and the partition by clause contain all the columns in a primary key or unique constraint, the output is deterministic. So running the query many times will always give you the same result.

The sort key for the subquery is (store_id, count(*)). Clearly there's no unique constraint on this, so two or more rows can have the same value for store and count. Thus the current implementation is non-deterministic; running this query twice on the same data might give different results.

The easiest way to make this deterministic is to add the primary key to the end of the sort in row_number. We're dealing with an aggregation here, so this is impossible. To ensure predictable output, you need to put all the columns of the group by clause in the partition by or order by of row_number:

with rws as (
  select store_id, customer_id, count (*) num_orders,
         row_number () over (
           partition by store_id
           order by count(*) desc, customer_id
         ) rn
  from   co.orders
  group  by store_id, customer_id
)
  select * from rws
  where  rn <= 3
  order  by store_id, rn;

But what if you want to show all the customers who made the same number of orders as the Nth too?

You can do this by switching the numbering function you use. There are three available:

  • row_number: Sequential values starting at one. All rows have a unique value and there are no gaps
  • rank: Rows with the same sort value have the same value. After ties there is a gap, with the counting starting again at the row's position in the results. This is also known as the Olympic rank
  • dense_rank: As with rank, tied values have the same values. Unlike rank, there are no gaps: all numbers in a group are sequential starting at one.

Connor McDonald explains these further in this video:

How to return all rows with the same value as the Nth

You may want to show all rows tied for position N. When getting top three customers per store, this means you'll get all four that made eight orders at store one. To do this, you want the fourth row to have the same row number as the third. The rank function does this, so all you need to do is change row_number to this:

with rws as (
  select store_id, customer_id, count (*) num_orders,
         rank () over (
           partition by store_id
           order by count(*) desc
         ) rn
  from   co.orders
  group  by store_id, customer_id
)
  select * from rws
  where  rn <= 3
  order  by store_id, rn, customer_id;

How to return all rows with the top-N values

Sometimes you need to see all the rows that have the first N distinct values for the sort. For store one, this means all customers than placed six, seven or eight orders.

To do this you need ties to have the same row number, with no gaps in the numbering sequence. This is what dense_rank does, so slot it in:

with rws as (
  select store_id, customer_id, count (*) num_orders,
         dense_rank () over (
           partition by store_id
           order by count(*) desc
         ) rn
  from   co.orders
  group  by store_id, customer_id
)
  select * from rws
  where  rn <= 3
  order  by store_id, rn, customer_id;

Note that when using row_number, the maximum number of rows the query can return is:

number of groups * N

For the other methods it's possible you could fetch all the rows in the table!

Subtleties like this can be lost when you're in a hurry. To reduce the chance of mistakes, it'd be nice if you could define a function which accepts a table, the grouping and sorting columns, and the number of rows you want from each group.

The good news is – from 19.6 – you can!

Let's see how.

How to write a generic top-N per group function

Photo by Shahadat Rahman on Unsplash

The general form of the top-N per group query is:

with rws as (
  select t.*,
         row_number () over (
           partition by <grouping columns>
           order by <sorting columns>
         ) rn
  from   <tablename> t
)
  select * from rws
  where  rn <= :num_rows;

SQL macros enable you to take this template and use it with any table you want. SQL macros do this by returning the text of a query. At parse time the database merges this with calling statement to give the final query.

You can reference the function's parameters in the string this returns. When resolving the query, the database does a find/replace of the actual text of the arguments you pass in the text of the string.

The really cool part is you can pass tables and columns as arguments. In essence this enables you to create query templates you can pass any table to.

For example, you can create a generic top-N macro like this:

create or replace function top_n (
  tab        dbms_tf.table_t,
  num_rows   integer
)
  return varchar2 sql_macro as
begin
  return
    'select * from tab
     fetch first num_rows rows only';
end top_n;

At parse time the database searches for tab and num_rows in the return string, swapping in the text of the parameters you pass. So if you call this like so:

select * from top_n (
  co.orders, :num_rows
);

The database resolves it into a query like:

select * from co.orders
fetch first :num_rows only;

Note that the bind variable passed is still present in the final query. It's not evaluated or peeked in any way. Similar logic applies to functions: the name of the function itself is still in the query, not the result of executing the function.

To see what the final query is, pass it to dbms_utility.expand_sql_text:

declare
  l_clob clob;
begin
  dbms_utility.expand_sql_text (
    input_sql_text  => q'!select * from top_n (  
  co.orders, :num_rows  
)!',
    output_sql_text => l_clob  );
  dbms_output.put_line(l_clob);
end;
/

To make generic top-N/group function, take the template query above and place it in a (table) SQL macro. To allow developers to change the direction of the sort, you'll also want to add a list of columns to order descending.

create or replace function top_n_per_group (
  tab            dbms_tf.table_t, 
  group_cols     dbms_tf.columns_t,
  order_cols     dbms_tf.columns_t, 
  num_rows       integer,
  desc_cols      dbms_tf.columns_t 
    default null
)
  return clob sql_macro as
  grouping_cols clob;
  ordering_cols clob;
  stmt clob;
begin
  
  for col in 1 .. group_cols.count loop
    grouping_cols := grouping_cols || group_cols ( col ) || ',';
  end loop;
  
  for col in 1 .. order_cols.count loop
    if order_cols ( col ) member of desc_cols then 
      ordering_cols := ordering_cols || order_cols ( col ) || 'desc,' ;
    else 
      ordering_cols := ordering_cols || order_cols ( col ) || ',' ;
    end if;
  end loop;
    
  stmt := 'with rws as (
  select t.*, row_number () over ( 
           partition by ' || rtrim ( grouping_cols, ',' ) || '
           order by ' || rtrim ( ordering_cols, ',' ) || '
         ) rn
  from   tab t
)
  select * from rws
  where  rn <= num_rows'; 
      
  return stmt;
  
end top_n_per_group;
/

Note that you can't reference the parameters for the partition by and order by clauses in the string. This is because the columns pseudo-operator returns an array. So you need to iterate through these values, building up the comma-separated list of values and paste this into the find string.

This solution also assumes that the table has no existing column named rn! If this will be an issue for you, add another dbms_tf.columns_t parameter to override the default.

You can now call this like so:

select * from top_n_per_group (
  co.orders, 
  columns ( customer_id ),
  columns ( store_id, order_datetime ), 
  3,
  columns ( order_datetime )
);

The table arguments don't have to be physical tables in the schema. They could also be views or named subqueries. So you can use this function to fetch the three customers that placed the most orders at each store. Calculate this in a CTE and pass it to the function:

with customer_orders_per_store as ( 
  select store_id, customer_id, count(*) num_orders 
  from   co.orders 
  group  by store_id, customer_id 
) 
  select * from top_n_per_group ( 
    customer_orders_per_store,  
    columns ( store_id ), 
    columns ( num_orders ),  
    3, 
    columns ( num_orders ) 
  );

Or show the three most expensive line-items per customer:

with paid_per_item as (
  select customer_id, 
         unit_price * quantity total_paid
  from   co.order_items oi
  join   co.orders o
  on     o.order_id = oi.order_id
)
  select * from top_n_per_group (
    paid_per_item, 
    columns ( customer_id ),
    columns ( total_paid ), 
    3,
    columns ( total_paid )
  );

Conclusion

If you want exactly one row per group and the highest or lowest values in the ungrouped columns, a plain group by will do the job.

But when it comes to getting the top-N rows in each group, there's no native SQL syntax to do this. Fortunately it's easy to do by numbering rows in a subquery and filtering the result.

You need to know this trick though. With SQL macros you can make this a reusable component, making it easier for other developers to solve this problem in the future.

Using macros also enables you to standardize handling of edge cases, such as dealing with ties or nulls. Or change the implementation if you find a better or faster template query.

Do you have any other methods for solving the top-N per group problem? What do you think of using SQL macros to make these reusable templates? Let us know in the comments!

Get the code from this Live SQL script.

Want to know more about analytic functions? Take Connor's Analytic SQL course on Oracle Dev Gym.

UPDATED: 11 Feb 2021: Changing top_n_per_group to use clob instead of varchar2; adding note about column clashing column names in this function.

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.