X

Celebrating the joy and power of Oracle SQL with the Oracle Developer Advocate team

Fetching a Row Plus N Rows Either Side in a Single SQL Statement

Chris Saxon
Developer Advocate

A colleague recently posed me the following question:

I want to find a row matching a string and the N (non-matching) rows either side of it, ordered by primary key. How can I do this using SQL?

It's an interesting problem. Let's look at how you do in this in a single, efficient select statement.

When constructing SQL queries it's always beneficial to have example data along with the expected output. In this case, given the following rows:

ID STRING 
-- ------
 1    abc
 2    def
 3    ghi
 4   this
 5    jkl
 6    mno
 7    pqr

When a user searches for the string "this" and N = 2, the query should return the rows with IDs 2-6.

How can you do this?

A first it's tempting to look at analytic functions. Lag() and lead() return values from the rows N positions before or after the current row. Using the classic employees table and N = 1, you could try a query like:

select employee_id, 
       last_name, 
       lead(employee_id,1,0) over (order by employee_id) as nextr,
       lag(employee_id,1,0) over (order by employee_id) as prevr
from   hr.employees
where  last_name = 'Johnson'; 

Nextr and prevr will return the employee IDs either side of Johnson, right?

Wrong.

Analytic functions are processed after the where clause. The condition "last_name = 'Johnson'" returns one row. Oracle then applies lag and lead on just this one row - not all the rows in the underlying table. Therefore instead of getting the employee_IDs next to Johnson's, these functions return zero:

EMPLOYEE_ID LAST_NAME NEXTR PREVR
----------- --------- ----- -----
        179 Johnson       0     0 

Hmmm.

To get around this, you need to execute these functions before the where clause. You can do this by creating an in-line view including them. The where clause then goes outside the subquery, like so:

select employee_id, last_name, prevr, nextr 
from   (
  select employee_id, 
         last_name, 
         lead(employee_id,1,0) over (order by employee_id) as nextr ,
         lag(employee_id,1,0) over (order by employee_id) as prevr        
  from   hr.employees
)
where  last_name = 'Johnson';

EMPLOYEE_ID LAST_NAME NEXTR PREVR
----------- --------- ----- -----
        179 Johnson     178   180 

Now you have the previous and next employee_IDs. With these values, you can join back to the original table to find all the IDs between them:

with boundaries as (
  select prevr, nextr 
  from  (
    select employee_id, 
           last_name, 
           lead(employee_id,1,0) over  (order by employee_id) as nextr,
           lag(employee_id,1,0) over  (order by employee_id) as prevr        
    from hr.employees
  )
  where last_name = 'Johnson'
)
select emp.employee_id, emp.last_name 
from   hr.employees emp
join   boundaries b 
on     emp.employee_id between b.prevr and b.nextr
order  by emp.employee_id; 

EMPLOYEE_ID LAST_NAME               
----------- -------------------------
        178 Grant                    
        179 Johnson                  
        180 Taylor                    

Success!

To change the number of rows returned you just need to adjust the second parameter to lag() and lead(). You can convert these to bind variables. This enables users to select how many rows they want at runtime.

Now you have a working query it's worth checking its performance. Here's where this query falls down. An index on last_name isn't going to help. Oracle has to scan all the rows in the employees table first, before applying the last name check (otherwise lag() and lead() could give us incorrect results).

This isn't going to scale well for tables with millions or billions of rows.

What else could you try?

There's the naive approach. Just add and subtract the number of rows wanted from the target employee_ID. Then find all the rows between these values:

with target as (
  select employee_id
  from   hr.employees
  where  last_name = 'Johnson'
)
select emp.employee_id, emp.last_name
from   hr.employees emp
where  (select employee_id from target) between emp.employee_id-1 and emp.employee_id+1; 

EMPLOYEE_ID LAST_NAME               
----------- -------------------------
        178 Grant                    
        179 Johnson                  
        180 Taylor                   

Looks like it works! It gives the expected output. And it can use an index on employees.last_name. Time to roll this into production, right?

Unfortunately, no.

This example works because there are no gaps between employee_IDs. In general this isn't guaranteed (what if you can delete rows?). This method may return too few rows either side of the target. It also only works on gap-free integer columns. It doesn't work on a general version the problem: finding the rows ordered by something other than ID (e.g. insert date).

Let's look at the problem again, restating it slightly.

You want to find the upper and lower IDs, N rows away from a given row. Then find all the rows between them these IDs. Starting with the ID range, Oracle will be able to fetch the rows using a nice efficient index range scan on the primary key (assuming small N).

So the real question is:

How can you find the upper and lower bounds efficiently?

Enter the min/max index range scan.

Oracle is able to find the maximum or minimum value stored in an indexed column efficiently. This is because indexes are ordered structures, so it can just look at the first or last value in the index. This principle applies even if you're looking for a value in the middle of an index - provided you know the value immediately before (or after) it!

A query in the form:

select max(employee_id) from employees where employee_id < :value; 

can still do an efficient min/max index range scan. Oracle is able to search the index for the value provided. From there it goes to the previous value - the maximum less than the search value. This fetches the employee ID before the variable - even if there are gaps. To find the next ID, change the max() to a min() and flip the less than to a greater than.

You can exploit this feature to efficiently find the upper and lower IDs either side of our target. Plugging the previous query in to a basic select from employees, along with one to find the next row, gives the following:

select (select max(employee_id) 
        from   hr.employees e 
        where  e.employee_id < emp.employee_id) prevr,
       (select min(employee_id) 
        from   hr.employees e 
        where  e.employee_id > emp.employee_id) nextr
from   hr.employees emp
where  last_name = 'Johnson'; 
PREVR      NEXTR
---------- ----------
178        180

This is all very well and good. You've only returned the first value either side of the target however. This only works if N = 1. How can you get the rest when N is greater?

Thinking programmatically, you could re-run this query N times. At each execution, use the maximum/minimum employee_IDs found in the previous pass - i.e. recursively call the query with the new boundaries.

At this point you may be tempted to reach for your favourite programming language to build a recursive function including this query. There's no need however - you can do it all in SQL!

This is possible using recursive query subfactoring. To use this you need to place the query in a with clause, define your base case, the recursive case and the exit condition(s).

The previous query gives your base case. You need to add another column to this to count how many times you've recursed, starting with one. The recursive case starts with the same query. This time however, you increment the recursion count and add a check that this value is less than or equal to N (the terminating condition). Union-alling these two together inside a with clause gives:

with boundaries (prevr, nextr, lvl) as (
  select (select max(e.employee_id) 
          from   hr.employees e 
          where  e.employee_id < emp.employee_id) prevr,
         (select min(e.employee_id) 
          from   hr.employees e 
          where  e.employee_id > emp.employee_id) nextr,
         1 lvl
  from   hr.employees emp
  where  last_name = 'Johnson'
  union all
  select (select max(employee_id) 
          from   hr.employees e 
          where  e.employee_id < prevr) prevr,
         (select min(employee_id) 
          from   hr.employees e 
          where  e.employee_id > nextr) nextr,
         lvl+1 lvl
  from   boundaries
  where  lvl+1 <= :lvl
)
  select * from boundaries;

-- for lvl/N = 2
     PREVR      NEXTR        LVL
---------- ---------- ----------
       178        180          1
       177        181          2

Now you have the upper and lower bounds. To get the rows asked for, just join employees to the above where the employee_ID is between prevr and nextr.

Before you do so, there is some tidying up steps needed. To prevent duplicating rows in the output, include a check that the computed recursion count (lvl) equals N. You also need to be wary of searching for rows near the start or end of the employee IDs. If recursion will take you beyond the max or min IDs, then the subquery will return null. This will cause the whole output to be empty. To prevent this, nvl() the results of these queries with employee_ID in the base case and prevr and nextr in the recursive case. Finally, use a bind variable for the last_name check!

Putting this all together gives the following query:

with boundaries (prevr, nextr, lvl) as (
  select nvl((select max(e.employee_id) 
              from   hr.employees e 
              where  e.employee_id < emp.employee_id), 
              emp.employee_id) prevr,
         nvl((select min(e.employee_id) 
              from   hr.employees e 
              where  e.employee_id > emp.employee_id),
              emp.employee_id) nextr,
         1 lvl
  from   hr.employees emp
  where  last_name = :lname
  union all
  select nvl((select max(employee_id) 
              from   hr.employees e 
              where  e.employee_id < prevr),
              prevr
         ) prevr,
         nvl((select min(employee_id) 
              from   hr.employees e 
              where  e.employee_id > nextr),
              nextr
         ) nextr,
         lvl+1 lvl
  from   boundaries
  where  lvl+1 <= :lvl
)
  select e.employee_id, e.last_name
  from   hr.employees e
  join   boundaries b
  on     e.employee_id between b.prevr and b.nextr
  and    b.lvl = :lvl
  order  by e.employee_id;  

EMPLOYEE_ID LAST_NAME               
----------- -------------------------
        177 Livingston
        178 Grant
        179 Johnson
        180 Taylor
        181 Fleaur

A nice, efficient way to find the N rows either side of a target!

Note: I have to thank Sayan Malakshinov for providing the inspiration for the final solution for this, which came from his query to efficiently find the top-N results within multiple categories.

Join the discussion

Comments ( 9 )
  • Stew Ashton Tuesday, June 2, 2015

    In 12c, the MATCH_RECOGNIZE clause makes this easier:

    select match_number, employee_id, last_name
    from hr.employees
    match_recognize(
    order by employee_id
    measures match_number() match_number
    all rows per match
    pattern(a{2} b a{2})
    define b as last_name = 'Smith'
    );

  • Chris Saxon Tuesday, June 2, 2015

    Thanks Stew, that is easier than my final solution ;)

  • Koen Tuesday, June 2, 2015

    Well explained Chris - never would have found this solution myself

  • TheSQLGuru Monday, June 8, 2015

    You mention this is "an efficient way to find the N rows either side of a target". Did you do any benchmarking to see how efficient the final query really is? I would love to see a million row sample set and some testing of various solutions for this problem!!

  • Chris Saxon Thursday, June 11, 2015

    Great point SQLGuru. I did do testing on a million row table and my final solution was the best by far. I'll put together a blog post at some point with the full details.

  • Chris Saxon Thursday, June 25, 2015

    SQLGuru - I've now posted a follow up analysing the performance on a 1 million row table: https://blogs.oracle.com/sql/entry/fetching_a_row_plus_n

  • Enrique Aviles Thursday, January 7, 2016

    Excellent information as usual.

    The original requirement is to "find a row matching a string and the N (non-matching) rows either side of it, ordered by primary key."

    Does the non-matching part mean the two strings before the current string and the two strings after the current string cannot match?

    The sample two column table contains different values on the STRING column on all rows. I ran a test against the EMPLOYEE table but updated LAST_NAME to 'LIVINGSTON' on employee_id = 178 so the two previous rows have employees with the same last name.

    This is the output:

    EMPLOYEE_ID LAST_NAME
    ----------- ------------
    177 Livingston
    178 Livingston
    179 Johnson
    180 Taylor
    181 Fleaur

    The last name of the first two rows match. Doesn't this violate the non-matching requirement?

  • Chris Saxon Thursday, January 7, 2016

    Thanks Enrique.

    The requirement was to find the rows that don't match the string you're searching for. So the two Livingston rows are fine, because they're different to Johnson.

    Does that makes sense?

  • Enrique Aviles Thursday, January 7, 2016

    Yes, that makes sense. For some reason I thought the non-matching part applied within the two before and two after rows. Thanks!

Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.