Fetching a Row Plus N Rows Either Side in a Single SQL Statement

June 2, 2015 | 6 minute read
Chris Saxon
Developer Advocate
Text Size 100%:

A colleague recently posed me the following question:

I want to find a row matching a string and the N (non-matching) rows either side of it, ordered by primary key. How can I do this using SQL?

It's an interesting problem. Let's look at how you do in this in a single, efficient select statement.

When constructing SQL queries it's always beneficial to have example data along with the expected output. In this case, given the following rows:

ID STRING 
-- ------
 1    abc
 2    def
 3    ghi
 4   this
 5    jkl
 6    mno
 7    pqr

When a user searches for the string "this" and N = 2, the query should return the rows with IDs 2-6.

How can you do this?

A first it's tempting to look at analytic functions. Lag() and lead() return values from the rows N positions before or after the current row. Using the classic employees table and N = 1, you could try a query like:

select employee_id, 
       last_name, 
       lead(employee_id,1,0) over (order by employee_id) as nextr,
       lag(employee_id,1,0) over (order by employee_id) as prevr
from   hr.employees
where  last_name = 'Johnson'; 

Nextr and prevr will return the employee IDs either side of Johnson, right?

Wrong.

Analytic functions are processed after the where clause. The condition "last_name = 'Johnson'" returns one row. Oracle then applies lag and lead on just this one row - not all the rows in the underlying table. Therefore instead of getting the employee_IDs next to Johnson's, these functions return zero:

EMPLOYEE_ID LAST_NAME NEXTR PREVR
----------- --------- ----- -----
        179 Johnson       0     0 

Hmmm.

To get around this, you need to execute these functions before the where clause. You can do this by creating an in-line view including them. The where clause then goes outside the subquery, like so:

select employee_id, last_name, prevr, nextr 
from   (
  select employee_id, 
         last_name, 
         lead(employee_id,1,0) over (order by employee_id) as nextr ,
         lag(employee_id,1,0) over (order by employee_id) as prevr        
  from   hr.employees
)
where  last_name = 'Johnson';

EMPLOYEE_ID LAST_NAME NEXTR PREVR
----------- --------- ----- -----
        179 Johnson     178   180 

Now you have the previous and next employee_IDs. With these values, you can join back to the original table to find all the IDs between them:

with boundaries as (
  select prevr, nextr 
  from  (
    select employee_id, 
           last_name, 
           lead(employee_id,1,0) over  (order by employee_id) as nextr,
           lag(employee_id,1,0) over  (order by employee_id) as prevr        
    from hr.employees
  )
  where last_name = 'Johnson'
)
select emp.employee_id, emp.last_name 
from   hr.employees emp
join   boundaries b 
on     emp.employee_id between b.prevr and b.nextr
order  by emp.employee_id; 

EMPLOYEE_ID LAST_NAME               
----------- -------------------------
        178 Grant                    
        179 Johnson                  
        180 Taylor                    

Success!

To change the number of rows returned you just need to adjust the second parameter to lag() and lead(). You can convert these to bind variables. This enables users to select how many rows they want at runtime.

Now you have a working query it's worth checking its performance. Here's where this query falls down. An index on last_name isn't going to help. Oracle has to scan all the rows in the employees table first, before applying the last name check (otherwise lag() and lead() could give us incorrect results).

This isn't going to scale well for tables with millions or billions of rows.

What else could you try?

There's the naive approach. Just add and subtract the number of rows wanted from the target employee_ID. Then find all the rows between these values:

with target as (
  select employee_id
  from   hr.employees
  where  last_name = 'Johnson'
)
select emp.employee_id, emp.last_name
from   hr.employees emp
where  (select employee_id from target) between emp.employee_id-1 and emp.employee_id+1; 

EMPLOYEE_ID LAST_NAME               
----------- -------------------------
        178 Grant                    
        179 Johnson                  
        180 Taylor                   

Looks like it works! It gives the expected output. And it can use an index on employees.last_name. Time to roll this into production, right?

Unfortunately, no.

This example works because there are no gaps between employee_IDs. In general this isn't guaranteed (what if you can delete rows?). This method may return too few rows either side of the target. It also only works on gap-free integer columns. It doesn't work on a general version the problem: finding the rows ordered by something other than ID (e.g. insert date).

Let's look at the problem again, restating it slightly.

You want to find the upper and lower IDs, N rows away from a given row. Then find all the rows between them these IDs. Starting with the ID range, Oracle will be able to fetch the rows using a nice efficient index range scan on the primary key (assuming small N).

So the real question is:

How can you find the upper and lower bounds efficiently?

Enter the min/max index range scan.

Oracle is able to find the maximum or minimum value stored in an indexed column efficiently. This is because indexes are ordered structures, so it can just look at the first or last value in the index. This principle applies even if you're looking for a value in the middle of an index - provided you know the value immediately before (or after) it!

A query in the form:

select max(employee_id) from employees where employee_id < :value; 

can still do an efficient min/max index range scan. Oracle is able to search the index for the value provided. From there it goes to the previous value - the maximum less than the search value. This fetches the employee ID before the variable - even if there are gaps. To find the next ID, change the max() to a min() and flip the less than to a greater than.

You can exploit this feature to efficiently find the upper and lower IDs either side of our target. Plugging the previous query in to a basic select from employees, along with one to find the next row, gives the following:

select (select max(employee_id) 
        from   hr.employees e 
        where  e.employee_id < emp.employee_id) prevr,
       (select min(employee_id) 
        from   hr.employees e 
        where  e.employee_id > emp.employee_id) nextr
from   hr.employees emp
where  last_name = 'Johnson'; 
PREVR      NEXTR
---------- ----------
178        180

This is all very well and good. You've only returned the first value either side of the target however. This only works if N = 1. How can you get the rest when N is greater?

Thinking programmatically, you could re-run this query N times. At each execution, use the maximum/minimum employee_IDs found in the previous pass - i.e. recursively call the query with the new boundaries.

At this point you may be tempted to reach for your favourite programming language to build a recursive function including this query. There's no need however - you can do it all in SQL!

This is possible using recursive query subfactoring. To use this you need to place the query in a with clause, define your base case, the recursive case and the exit condition(s).

The previous query gives your base case. You need to add another column to this to count how many times you've recursed, starting with one. The recursive case starts with the same query. This time however, you increment the recursion count and add a check that this value is less than or equal to N (the terminating condition). Union-alling these two together inside a with clause gives:

with boundaries (prevr, nextr, lvl) as (
  select (select max(e.employee_id) 
          from   hr.employees e 
          where  e.employee_id < emp.employee_id) prevr,
         (select min(e.employee_id) 
          from   hr.employees e 
          where  e.employee_id > emp.employee_id) nextr,
         1 lvl
  from   hr.employees emp
  where  last_name = 'Johnson'
  union all
  select (select max(employee_id) 
          from   hr.employees e 
          where  e.employee_id < prevr) prevr,
         (select min(employee_id) 
          from   hr.employees e 
          where  e.employee_id > nextr) nextr,
         lvl+1 lvl
  from   boundaries
  where  lvl+1 <= :lvl
)
  select * from boundaries;

-- for lvl/N = 2
     PREVR      NEXTR        LVL
---------- ---------- ----------
       178        180          1
       177        181          2

Now you have the upper and lower bounds. To get the rows asked for, just join employees to the above where the employee_ID is between prevr and nextr.

Before you do so, there is some tidying up steps needed. To prevent duplicating rows in the output, include a check that the computed recursion count (lvl) equals N. You also need to be wary of searching for rows near the start or end of the employee IDs. If recursion will take you beyond the max or min IDs, then the subquery will return null. This will cause the whole output to be empty. To prevent this, nvl() the results of these queries with employee_ID in the base case and prevr and nextr in the recursive case. Finally, use a bind variable for the last_name check!

Putting this all together gives the following query:

with boundaries (prevr, nextr, lvl) as (
  select nvl((select max(e.employee_id) 
              from   hr.employees e 
              where  e.employee_id < emp.employee_id), 
              emp.employee_id) prevr,
         nvl((select min(e.employee_id) 
              from   hr.employees e 
              where  e.employee_id > emp.employee_id),
              emp.employee_id) nextr,
         1 lvl
  from   hr.employees emp
  where  last_name = :lname
  union all
  select nvl((select max(employee_id) 
              from   hr.employees e 
              where  e.employee_id < prevr),
              prevr
         ) prevr,
         nvl((select min(employee_id) 
              from   hr.employees e 
              where  e.employee_id > nextr),
              nextr
         ) nextr,
         lvl+1 lvl
  from   boundaries
  where  lvl+1 <= :lvl
)
  select e.employee_id, e.last_name
  from   hr.employees e
  join   boundaries b
  on     e.employee_id between b.prevr and b.nextr
  and    b.lvl = :lvl
  order  by e.employee_id;  

EMPLOYEE_ID LAST_NAME               
----------- -------------------------
        177 Livingston
        178 Grant
        179 Johnson
        180 Taylor
        181 Fleaur

A nice, efficient way to find the N rows either side of a target!

Note: I have to thank Sayan Malakshinov for providing the inspiration for the final solution for this, which came from his query to efficiently find the top-N results within multiple categories.

Chris Saxon

Developer Advocate

Chris Saxon is an Oracle Developer Advocate for SQL. His job is to help you get the best out of the Oracle Database and have fun with SQL!

To help you with this he blogs at All Things SQL. He also creates videos combining SQL and magic on YouTube at the The Magic of SQL.

If you have questions about working with Oracle Database technology, please reach out to him. You can do this via Twitter or on Ask Tom.


Previous Post

Query Tuning 101: Comparing Execution Plans and Access vs. Filter Predicates

Chris Saxon | 1 min read

Next Post


Optimizing the PL/SQL Challenge VI: The Problem with Group By Subqueries

Chris Saxon | 6 min read