A colleague recently posed me the following question:
I want to find a row matching a string and the N (non-matching) rows either side of it, ordered by primary key. How can I do this using SQL?
It’s an interesting problem. Let’s look at how you do in this in a single, efficient select statement.
When constructing SQL queries it’s always beneficial to have example data along with the expected output. In this case, given the following rows:
ID STRING -- ------ 1 abc 2 def 3 ghi 4 this 5 jkl 6 mno 7 pqr
When a user searches for the string “this” and N = 2, the query should return the rows with IDs 2-6.
How can you do this?
A first it’s tempting to look at analytic functions. Lag() and lead() return values from the rows N positions before or after the current row. Using the classic employees table and N = 1, you could try a query like:
select employee_id, last_name, lead(employee_id,1,0) over (order by employee_id) as nextr, lag(employee_id,1,0) over (order by employee_id) as prevr from hr.employees where last_name = 'Johnson';
Nextr and prevr will return the employee IDs either side of Johnson, right?
Wrong.
Analytic functions are processed after the where clause. The condition “last_name = ‘Johnson'” returns one row. Oracle then applies lag and lead on just this one row – not all the rows in the underlying table. Therefore instead of getting the employee_IDs next to Johnson’s, these functions return zero:
EMPLOYEE_ID LAST_NAME NEXTR PREVR ----------- --------- ----- ----- 179 Johnson 0 0
Hmmm.
To get around this, you need to execute these functions before the where clause. You can do this by creating an in-line view including them. The where clause then goes outside the subquery, like so:
select employee_id, last_name, prevr, nextr from ( select employee_id, last_name, lead(employee_id,1,0) over (order by employee_id) as nextr , lag(employee_id,1,0) over (order by employee_id) as prevr from hr.employees ) where last_name = 'Johnson'; EMPLOYEE_ID LAST_NAME NEXTR PREVR ----------- --------- ----- ----- 179 Johnson 178 180
Now you have the previous and next employee_IDs. With these values, you can join back to the original table to find all the IDs between them:
with boundaries as ( select prevr, nextr from ( select employee_id, last_name, lead(employee_id,1,0) over (order by employee_id) as nextr, lag(employee_id,1,0) over (order by employee_id) as prevr from hr.employees ) where last_name = 'Johnson' ) select emp.employee_id, emp.last_name from hr.employees emp join boundaries b on emp.employee_id between b.prevr and b.nextr order by emp.employee_id; EMPLOYEE_ID LAST_NAME ----------- ------------------------- 178 Grant 179 Johnson 180 Taylor
Success!
To change the number of rows returned you just need to adjust the second parameter to lag() and lead(). You can convert these to bind variables. This enables users to select how many rows they want at runtime.
Now you have a working query it’s worth checking its performance. Here’s where this query falls down. An index on last_name isn’t going to help. Oracle has to scan all the rows in the employees table first, before applying the last name check (otherwise lag() and lead() could give us incorrect results).
This isn’t going to scale well for tables with millions or billions of rows.
What else could you try?
There’s the naive approach. Just add and subtract the number of rows wanted from the target employee_ID. Then find all the rows between these values:
with target as ( select employee_id from hr.employees where last_name = 'Johnson' ) select emp.employee_id, emp.last_name from hr.employees emp where (select employee_id from target) between emp.employee_id-1 and emp.employee_id+1; EMPLOYEE_ID LAST_NAME ----------- ------------------------- 178 Grant 179 Johnson 180 Taylor
Looks like it works! It gives the expected output. And it can use an index on employees.last_name. Time to roll this into production, right?
Unfortunately, no.
This example works because there are no gaps between employee_IDs. In general this isn’t guaranteed (what if you can delete rows?). This method may return too few rows either side of the target. It also only works on gap-free integer columns. It doesn’t work on a general version the problem: finding the rows ordered by something other than ID (e.g. insert date).
Let’s look at the problem again, restating it slightly.
You want to find the upper and lower IDs, N rows away from a given row. Then find all the rows between them these IDs. Starting with the ID range, Oracle will be able to fetch the rows using a nice efficient index range scan on the primary key (assuming small N).
So the real question is:
How can you find the upper and lower bounds efficiently?
Enter the min/max index range scan.
Oracle is able to find the maximum or minimum value stored in an indexed column efficiently. This is because indexes are ordered structures, so it can just look at the first or last value in the index. This principle applies even if you’re looking for a value in the middle of an index – provided you know the value immediately before (or after) it!
A query in the form:
select max(employee_id) from employees where employee_id < :value;
can still do an efficient min/max index range scan. Oracle is able to search the index for the value provided. From there it goes to the previous value – the maximum less than the search value. This fetches the employee ID before the variable – even if there are gaps. To find the next ID, change the max() to a min() and flip the less than to a greater than.
You can exploit this feature to efficiently find the upper and lower IDs either side of our target. Plugging the previous query in to a basic select from employees, along with one to find the next row, gives the following:
select (select max(employee_id) from hr.employees e where e.employee_id < emp.employee_id) prevr, (select min(employee_id) from hr.employees e where e.employee_id > emp.employee_id) nextr from hr.employees emp where last_name = 'Johnson';
PREVR NEXTR ---------- ---------- 178 180
This is all very well and good. You’ve only returned the first value either side of the target however. This only works if N = 1. How can you get the rest when N is greater?
Thinking programmatically, you could re-run this query N times. At each execution, use the maximum/minimum employee_IDs found in the previous pass – i.e. recursively call the query with the new boundaries.
At this point you may be tempted to reach for your favourite programming language to build a recursive function including this query. There’s no need however – you can do it all in SQL!
This is possible using recursive query subfactoring. To use this you need to place the query in a with clause, define your base case, the recursive case and the exit condition(s).
The previous query gives your base case. You need to add another column to this to count how many times you’ve recursed, starting with one. The recursive case starts with the same query. This time however, you increment the recursion count and add a check that this value is less than or equal to N (the terminating condition). Union-alling these two together inside a with clause gives:
with boundaries (prevr, nextr, lvl) as ( select (select max(e.employee_id) from hr.employees e where e.employee_id < emp.employee_id) prevr, (select min(e.employee_id) from hr.employees e where e.employee_id > emp.employee_id) nextr, 1 lvl from hr.employees emp where last_name = 'Johnson' union all select (select max(employee_id) from hr.employees e where e.employee_id < prevr) prevr, (select min(employee_id) from hr.employees e where e.employee_id > nextr) nextr, lvl+1 lvl from boundaries where lvl+1 <= :lvl ) select * from boundaries; -- for lvl/N = 2 PREVR NEXTR LVL ---------- ---------- ---------- 178 180 1 177 181 2
Now you have the upper and lower bounds. To get the rows asked for, just join employees to the above where the employee_ID is between prevr and nextr.
Before you do so, there is some tidying up steps needed. To prevent duplicating rows in the output, include a check that the computed recursion count (lvl) equals N. You also need to be wary of searching for rows near the start or end of the employee IDs. If recursion will take you beyond the max or min IDs, then the subquery will return null. This will cause the whole output to be empty. To prevent this, nvl() the results of these queries with employee_ID in the base case and prevr and nextr in the recursive case. Finally, use a bind variable for the last_name check!
Putting this all together gives the following query:
with boundaries (prevr, nextr, lvl) as ( select nvl((select max(e.employee_id) from hr.employees e where e.employee_id < emp.employee_id), emp.employee_id) prevr, nvl((select min(e.employee_id) from hr.employees e where e.employee_id > emp.employee_id), emp.employee_id) nextr, 1 lvl from hr.employees emp where last_name = :lname union all select nvl((select max(employee_id) from hr.employees e where e.employee_id < prevr), prevr ) prevr, nvl((select min(employee_id) from hr.employees e where e.employee_id > nextr), nextr ) nextr, lvl+1 lvl from boundaries where lvl+1 <= :lvl ) select e.employee_id, e.last_name from hr.employees e join boundaries b on e.employee_id between b.prevr and b.nextr and b.lvl = :lvl order by e.employee_id; EMPLOYEE_ID LAST_NAME ----------- ------------------------- 177 Livingston 178 Grant 179 Johnson 180 Taylor 181 Fleaur
A nice, efficient way to find the N rows either side of a target!
Note: I have to thank Sayan Malakshinov for providing the inspiration for the final solution for this, which came from his query to efficiently find the top-N results within multiple categories.