X

Celebrating the joy and power of Oracle SQL with the Oracle Developer Advocate team

Finding the Previous Row in a Different Group Using SQL

Chris Saxon
Developer Advocate
Imagine you have a table with a series of transactions. These are classified into credits or debits:
  TRANS_ID TRANS_DATETIME    TRANS_TYPE     AMOUNT
  -------- ----------------- ----------   --------
         6 01-JAN-2015 00:00 CREDIT             40
         7 01-JAN-2015 00:00 CREDIT             90
         8 01-JAN-2015 00:00 DEBIT              80
         5 02-JAN-2015 00:00 CREDIT            200
         4 03-JAN-2015 00:00 DEBIT              99
         3 04-JAN-2015 00:00 CREDIT             50
         2 05-JAN-2015 00:00 DEBIT             200
         1 06-JAN-2015 00:00 DEBIT             100
Your boss has asked you to do the following:

"I want a report showing the ID of the previous transaction of the other type, according to the transaction date. Now!"

So for credits, you need the previous debit trans_id and vice-versa. In the case of ties (transactions at the same time), simply take the highest transaction id. So the output you want looks like:
TRANS_ID TRANS_DATETIME TRANS_TYPE    AMOUNT PREV_ID_OTHER_TYPE
-------- -------------  ------------- ------ ------------------
6        01-JAN-2015    00:00 CREDIT    40                   
7        01-JAN-2015    00:00 CREDIT    90                   
8        01-JAN-2015    00:00 DEBIT     80            7
5        02-JAN-2015    00:00 CREDIT    200           8
4        03-JAN-2015    00:00 DEBIT     99            5
3        04-JAN-2015    00:00 CREDIT    50            4
2        05-JAN-2015    00:00 DEBIT     200           3
1        06-JAN-2015    00:00 DEBIT     100           3

You need to write the SQL to do this.

How?

Let's look at three different approaches:

Analytics

First, let's analyze the problem.

You need to find values from previous rows. The obvious way to do this is with lag. Using this as-is doesn't do what you need though. A basic order by trans_datetime returns the value from the immediately preceding row. You need to skip back N rows.

The second parameter for lag is the offset. You can use this to provide the value for N. But the value for N is variable. You don't know how many credits (or debits) there are in the current series. Calculating the value for N is tricky.

Using partition by to group the results by trans_type doesn't help either. This just returns the previous ID of the same type.

So how can you do it?

You need a two-step process. First, check the type of the current transaction. If it's credit, then you want the previous row that was a debit (and the reverse). You can use a case expression to do this.

Next you need to find the trans_id value for this. Lag's out, so what can you use?

The last_value function enables you to look backwards in your results. This returns the final value according to the sorting and windowing you specify. You're looking for the last transaction of the other type. This looks like it fits.

Now put this all together. If your current transaction is credit, you want the last_value of debit transactions (and vice-versa). You can do this as follows:

  case 
    when trans_type = 'CREDIT' then
      last_value( case when trans_type = 'DEBIT' then trans_id end )
        over( order by trans_datetime, trans_id )
    else
      last_value( case when trans_type = 'CREDIT' then trans_id end )
        over( order by trans_datetime, trans_id )
  end
By default last_value considers all values, including null. It also has a windowing clause of:
  range between unbounded preceding and current row 
This isn't what you want. The case expression inside the last_value maps the current row to null. This is the final value in the ordered set (so far). So all the values for this are null. To overcome this, you need to skip back to the first non-null entry. You do this by adding the "ignore nulls" clause.

The complete query is then:
  select t.*,
         case 
           when trans_type = 'CREDIT' then
             last_value( case when trans_type = 'DEBIT' then trans_id end ) 
               ignore nulls over( order by trans_datetime, trans_id )
           else
             last_value( case when trans_type = 'CREDIT' then trans_id end ) 
               ignore nulls over( order by trans_datetime, trans_id )
         end prev_id_other_trans_type
  from   transactions t
  order  by trans_datetime, trans_id;
This does the job. Yet it takes a bit of thinking to understand the SQL.

Let's look at other ways to see if they are easier to understand.

Model Clause

Ny Nordisk mode, Catwalk by Benjamin Suomela (CC-BY-2.5-DK)

 

No, not that kind of model!

In 10g, Oracle introduced the model clause. This enables you to create columns using complex formulas. This gives spreadsheet-like abilities in SQL!

You can use it here to do the logic above.

To do this, you need to supply three parts for the model. Dimension by states how you're going to uniquely identify each cell. Measures lists the columns you're returning. Rules tells Oracle how to create the values for the previous ID.

Dimension By

You need to supply this to tell Oracle how to identify each row and column in the results. i.e. a unique identifier.

Row_number(), ordered by transaction_datetime, trans_id does this. To help with switching credits to debits and the reverse, you need a bit more. So also put trans_type in the dimension by, giving:
  dimension by (  
    trans_type tp,   
    row_number() over (order by trans_datetime, trans_id) rn  
  )

Measures

Here you just need all the columns in the table and the column we're creating, prev_id_other_type. You need to give this an initial value. It's a number, so zero works fine:
  measures ( trans_datetime, trans_type, trans_id, 0 prev_id_other_type )  

Rules

This is where the fun starts. You want to assign the last debit to credits and vice-versa. First, you need to state which cells you're providing values for. This is where the columns you defined in the dimension by come in.

To aid understanding, you can do this in two separate rules. One for CREDIT and one for DEBIT. So you have two assignment targets:
  prev_id_other_type['CREDIT', any] = <expression>,
  prev_id_other_type['DEBIT', any] = <expression>
The any keyword means you want to calculate the values for all the rows.

Now you need to figure out what goes in the expression. In words, this is:

"The most recent prior trans_id of the other type. If there are two transactions at the same time, take the one with the greatest trans_id."

You need the greatest trans_id, which is a clue to use max(). Next you need to state which rows you want the get this max from. For credits, this is the debits. You want to consider all these previous rows. You can do this with the slightly cryptic expression:
  rn <= cv()
CV() returns the current value of the dimension. This is rn. So you're saying look at all rows where the row number is less than or equal to the current. Remember that you defined rn as the order to look at rows (trans_datetime, trans_id). So gives you:
  prev_id_other_type['CREDIT', any] = max(trans_id) ['DEBIT', rn <= cv()]
You can do a similar thing for debits by switching CREDIT and DEBIT in the above.

But this isn't quite right. Transaction dates and ids are in different orders. So the previous row ordered by trans_id isn't necessarily the previous row ordered by trans_datetime.

To overcome this, you can use the last analytic function.

This enables you to return the trans_id of the rows ordered by date. You use it as follows:
  max(trans_id) keep ( dense_rank last order by trans_datetime, trans_id )
This gives a complete solution of:
  select * 
  from   transactions t  
  model  
    dimension by (   
      trans_type tp,   
      row_number() over (order by trans_datetime, trans_id) rn  
    )  
    measures ( trans_datetime, trans_type, trans_id, 0 prev_id_other_type )  
    rules (  
      prev_id_other_type['CREDIT', any] = 
        max(trans_id) keep (
          dense_rank last order by trans_datetime, trans_id
        ) ['DEBIT', rn <= cv()],
      prev_id_other_type['DEBIT', any] = 
        max(trans_id) keep (
          dense_rank last order by trans_datetime, trans_id
        ) ['CREDIT', rn <= cv()]
     )  
  order by trans_datetime, trans_id;
Again, this works. While I think it makes it more obvious you're returning credit ids for debits and vice-versa, exactly what this does is still unclear.

So let's look at the final method, pattern matching.

Pattern Matching (Match_Recognize)

 

12c brought the match_recognize clause. This enables you to look for patterns in your data using regular expressions.

As with the model clause you need to supply several parts. Let's take a bottom-up approach to using this.

First, you need to define the variables. You want one for each transaction type. This is as easy as stating which trans_type they are equal to:
  define 
     c as trans_type = 'CREDIT',
     d as trans_type = 'DEBIT' 
Next, you need to define the pattern you're looking for. You need any sequence of one or more occurrences of either variable. Here you need a regular expression. The | symbol is the choice operator. This matches the expression either side of it. So you want ( c | d ).

You want to match at least one of these. So place the + symbol after it. This gives the following pattern:
  pattern( ( c | d )+ )
You need to return all the rows in the table. So specify
  all rows per match
To finish, state which order you need to process the rows and how to calculate the prev_id_other_type column. As before, the order is simply trans_datetime, trans_id.

As with the model clause, you use the measures clause to define the new column. To "switch over" the transaction types, use the classifier function. This gives the name of the variable that matches the current row. So if a row's trans_type = 'CREDIT', this is C and its D for debits.

To get the previous trans_id of the other type, simply decode classifier. Use this to return the trans_id of the other variable:
  decode(classifier(), 'D', c.trans_id, d.trans_id)
This works because if the current row is a debit, c.trans_id keeps the value from the last credit transaction.

So your final pattern matching SQL is:
  select * 
  from   transactions
    match_recognize (
      order by trans_datetime, trans_id
      measures decode(classifier(), 'D', c.trans_id, d.trans_id) prev_id_other_trans_type
      all rows per match
      pattern( (c|d)+ )
      define 
        c as trans_type = 'CREDIT',
        d as trans_type = 'DEBIT' 
  )
  order  by trans_datetime, trans_id;
Phew! That was hard work. You now have three different methods to choose from.

How do you decide which to use?

All need a bit of thinking to understand. Picking one over the other based on clarity comes down to personal preferences. In all cases, good commenting is essential.

Let's take an objective approach. Which is fastest?

Performance

All three methods only access the table once. So it comes down to how efficiently each technique can process the data.

To test this, I've loaded the table with 10,000 rows. I ran it on an Oracle Cloud Database (Extreme Performance Service) on version 12.1.0.2.

Let's look at the times for three executions of each.

 

Analytics

Run 1: Elapsed: 00:00:15.86
Run 2: Elapsed: 00:00:13.10
Run 3: Elapsed: 00:00:12.84

Mean: 13.93 seconds

Model

Run 1: Elapsed: 00:00:38.73
Run 2: Elapsed: 00:00:35.80
Run 3: Elapsed: 00:00:40.88

Mean: 38.47 seconds

Match Recognize

Run 1: Elapsed: 00:00:12.13
Run 2: Elapsed: 00:00:12.30
Run 3: Elapsed: 00:00:12.21

Mean: 12.21 seconds

The model clause is the clear loser here. It takes around three times longer than the other two approaches. Pattern matching has a slight edge on standard analytics. So if you're on 12c, you may want to go for this method.

Want to try this out for yourself? Have a go on LiveSQL.

What do you think? Which approach do you like best? Are there any better ways to do this?

Please let us know in the comments!

Thanks to Stew Ashton and Rajeshwaran Jeyabal for providing the pattern matching and analytic solutions in this Ask Tom thread.

Join the discussion

Comments ( 5 )
  • Iudith Mentzel Thursday, January 7, 2016

    Very nice idea to use the 12c pattern matching not for finding a pattern per se, but rather for just "dividing" the rows into the two categories and exploit the semantics of the pattern variables usage as column prefixes :)

    This pattern matching feature is indeed an extremely powerful tool :)

    Cheers,
    Iudith

  • Chris Saxon Thursday, January 7, 2016

    Thanks Iudith, credit to Stew Ashton for the match_recognize solution.

  • Martin Rose Friday, March 18, 2016

    Chris, does using in-memory analyticals give a performance increase (to the appropriate solution) ?

  • Chris Saxon Friday, March 18, 2016

    You mean placing the tables in-memory?

    It depends ;)

    In Memory is most effective when working with large data sets and aggregating them or selecting just a few columns from the table. So if you're only fetching a few rows probably not. If you're fetching many, it could.

  • guest Thursday, June 30, 2016

    Another way but i think it is more expensive -

    with tmp ( TRANS_ID, TRANS_DATETIME, TRANS_TYPE , AMOUNT) as (
    select 7, to_date('01-JAN-2015', 'dd-mon-yyyy'), 'CREDIT' , 90 from dual union all
    select 6, to_date('01-JAN-2015', 'dd-mon-yyyy'), 'CREDIT' , 40 from dual union all
    select 2, to_date('05-JAN-2015', 'dd-mon-yyyy'), 'DEBIT' , 200 from dual union all
    select 8, to_date('01-JAN-2015', 'dd-mon-yyyy'), 'DEBIT' , 80 from dual union all
    select 1, to_date('06-JAN-2015', 'dd-mon-yyyy'), 'DEBIT' , 100 from dual union all
    select 5, to_date('02-JAN-2015', 'dd-mon-yyyy'), 'CREDIT' , 200 from dual union all
    select 4, to_date('03-JAN-2015', 'dd-mon-yyyy'), 'DEBIT' , 99 from dual union all
    select 3, to_date('04-JAN-2015', 'dd-mon-yyyy'), 'CREDIT' , 50 from dual
    )
    select
    y.*,
    (select TRANS_ID from
    (select z.TRANS_ID,
    row_number() over
    (order by TRANS_DATETIME desc, TRANS_ID desc
    ) rwnum
    from tmp z
    where z.TRANS_TYPE != y.TRANS_TYPE and y.TRANS_DATETIME >= z.TRANS_DATETIME
    and (case when y.TRANS_DATETIME = z.TRANS_DATETIME then
    y.TRANS_ID
    else 2 end )
    > (case when y.TRANS_DATETIME = z.TRANS_DATETIME then
    z.TRANS_ID
    else 1 end )
    )
    where rwnum = 1
    )
    from
    (
    select
    x.*
    from tmp x
    order by TRANS_DATETIME, TRANS_TYPE, TRANS_ID
    ) y

Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha