```
TRANS_ID TRANS_DATETIME    TRANS_TYPE AMOUNT
-------- ----------------- ---------- ------
       6 01-JAN-2015 00:00 CREDIT         40
       7 01-JAN-2015 00:00 CREDIT         90
       8 01-JAN-2015 00:00 DEBIT          80
       5 02-JAN-2015 00:00 CREDIT        200
       4 03-JAN-2015 00:00 DEBIT          99
       3 04-JAN-2015 00:00 CREDIT         50
       2 05-JAN-2015 00:00 DEBIT         200
       1 06-JAN-2015 00:00 DEBIT         100
```

Your boss has asked you to do the following: for each transaction, return the ID of the most recent prior transaction of the other type (the last credit for debits, the last debit for credits), like so:
```
TRANS_ID TRANS_DATETIME    TRANS_TYPE AMOUNT PREV_ID_OTHER_TYPE
-------- ----------------- ---------- ------ ------------------
       6 01-JAN-2015 00:00 CREDIT         40
       7 01-JAN-2015 00:00 CREDIT         90
       8 01-JAN-2015 00:00 DEBIT          80                  7
       5 02-JAN-2015 00:00 CREDIT        200                  8
       4 03-JAN-2015 00:00 DEBIT          99                  5
       3 04-JAN-2015 00:00 CREDIT         50                  4
       2 05-JAN-2015 00:00 DEBIT         200                  3
       1 06-JAN-2015 00:00 DEBIT         100                  3
```
You need to write the SQL to do this.
How?
Let's look at three different approaches:
First, let's analyze the problem.
You need to find values from previous rows. The obvious way to do this is with lag. Using this as-is doesn't do what you need though. A basic order by trans_datetime returns the value from the immediately preceding row. You need to skip back N rows.
The second parameter for lag is the offset. You can use this to provide the value for N. But the value for N is variable. You don't know how many credits (or debits) there are in the current series. Calculating the value for N is tricky.
Using partition by to group the results by trans_type doesn't help either. This just returns the previous ID of the same type.
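To see concretely why both forms of lag fall short, here's a small Python sketch (an illustration only, not the article's SQL) simulating a plain lag and a lag partitioned by trans_type over the sample rows:

```python
# Hypothetical in-memory copy of the transactions table, already sorted
# by trans_datetime, trans_id (dates dropped for brevity).
rows = [(6, "CREDIT"), (7, "CREDIT"), (8, "DEBIT"), (5, "CREDIT"),
        (4, "DEBIT"), (3, "CREDIT"), (2, "DEBIT"), (1, "DEBIT")]

# lag(trans_id) over (order by trans_datetime, trans_id):
# the id from the immediately preceding row, whatever its type.
plain_lag = [None] + [tid for tid, _ in rows[:-1]]

# lag(trans_id) over (partition by trans_type order by ...):
# the previous id of the *same* type -- not what we want either.
last_same = {}
part_lag = []
for tid, ttype in rows:
    part_lag.append(last_same.get(ttype))
    last_same[ttype] = tid

print(plain_lag)  # [None, 6, 7, 8, 5, 4, 3, 2]
print(part_lag)   # [None, 6, None, 7, 8, 5, 4, 2]
```

For row 8 (a debit) the plain lag happens to return 7 (a credit), but for row 7 (a credit) it returns 6, another credit. The partitioned version is always the wrong type. Neither matches the required output.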
So how can you do it?
You need a two-step process. First, check the type of the current transaction. If it's credit, then you want the previous row that was a debit (and the reverse). You can use a case expression to do this.
Next you need to find the trans_id value for this. Lag's out, so what can you use?
The last_value function enables you to look backwards in your results. This returns the final value according to the sorting and windowing you specify. You're looking for the last transaction of the other type. This looks like it fits.
Now put this all together. If your current transaction is credit, you want the last_value of debit transactions (and vice-versa). You can do this as follows:
```sql
case
  when trans_type = 'CREDIT' then
    last_value(
      case when trans_type = 'DEBIT' then trans_id end
    ) over (order by trans_datetime, trans_id)
  else
    last_value(
      case when trans_type = 'CREDIT' then trans_id end
    ) over (order by trans_datetime, trans_id)
end
```

By default last_value considers all values, including null. It also has a default windowing clause of:
```sql
range between unbounded preceding and current row
```

This isn't what you want. The case expression inside the last_value maps the current row to null. This is the final value in the ordered set (so far). So all the values for this are null. To overcome this, you need to skip back to the first non-null entry. You do this by adding the "ignore nulls" clause.
```sql
select t.*,
       case
         when trans_type = 'CREDIT' then
           last_value(
             case when trans_type = 'DEBIT' then trans_id end
           ) ignore nulls over (order by trans_datetime, trans_id)
         else
           last_value(
             case when trans_type = 'CREDIT' then trans_id end
           ) ignore nulls over (order by trans_datetime, trans_id)
       end prev_id_other_trans_type
from   transactions t
order  by trans_datetime, trans_id;
```

This does the job. Yet it takes a bit of thinking to understand the SQL.
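If the windowed SQL is hard to picture, this Python sketch (an analogy, not Oracle code) walks the rows in date order and mimics what last_value ... ignore nulls computes for each one:

```python
# Sample rows sorted by trans_datetime, trans_id (dates omitted).
rows = [(6, "CREDIT"), (7, "CREDIT"), (8, "DEBIT"), (5, "CREDIT"),
        (4, "DEBIT"), (3, "CREDIT"), (2, "DEBIT"), (1, "DEBIT")]

prev_other = []
last_credit = last_debit = None   # last non-null id seen for each type

for tid, ttype in rows:
    # "last_value(... ignore nulls)" over the rows so far:
    # the most recent id of the other transaction type.
    prev_other.append(last_debit if ttype == "CREDIT" else last_credit)
    if ttype == "CREDIT":
        last_credit = tid
    else:
        last_debit = tid

print(prev_other)  # [None, None, 7, 8, 5, 4, 3, 3]
```

The output matches the PREV_ID_OTHER_TYPE column in the target result set above.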
No, not that kind of model!
In 10g, Oracle introduced the model clause. This enables you to create columns using complex formulas. This gives spreadsheet-like abilities in SQL!
You can use it here to do the logic above.
To do this, you need to supply three parts for the model. Dimension by states how you're going to uniquely identify each cell. Measures lists the columns you're returning. Rules tells Oracle how to create the values for the previous ID.
```sql
dimension by (
  trans_type tp,
  row_number() over (order by trans_datetime, trans_id) rn
)
```
```sql
measures (
  trans_datetime, trans_type, trans_id, 0 prev_id_other_type
)
```
```sql
prev_id_other_type['CREDIT', any] = <expression>,
prev_id_other_type['DEBIT', any]  = <expression>
```

The any keyword means you want to calculate the values for all the rows.
To limit the cells to those at or before the current row, filter on:

```sql
rn <= cv()
```

CV() returns the current value of the dimension. This is rn. So you're saying: look at all rows where the row number is less than or equal to the current one. Remember that you defined rn as the order to look at rows (trans_datetime, trans_id). So this gives you:
```sql
prev_id_other_type['CREDIT', any] =
  max(trans_id)['DEBIT', rn <= cv()]
```

You can do a similar thing for debits by switching CREDIT and DEBIT in the above.

There's one catch though: max(trans_id) returns the largest matching ID, not the most recent one. You want the ID from the last row in date order. You can get that by adding a keep clause:

```sql
max(trans_id) keep (
  dense_rank last order by trans_datetime, trans_id
)
```

This gives a complete solution of:
```sql
select * from transactions t
model
  dimension by (
    trans_type tp,
    row_number() over (order by trans_datetime, trans_id) rn
  )
  measures (
    trans_datetime, trans_type, trans_id, 0 prev_id_other_type
  )
  rules (
    prev_id_other_type['CREDIT', any] =
      max(trans_id) keep (
        dense_rank last order by trans_datetime, trans_id
      )['DEBIT', rn <= cv()],
    prev_id_other_type['DEBIT', any] =
      max(trans_id) keep (
        dense_rank last order by trans_datetime, trans_id
      )['CREDIT', rn <= cv()]
  )
order by trans_datetime, trans_id;
```

Again, this works. While I think it makes it more obvious that you're returning credit IDs for debits and vice-versa, exactly what this does is still unclear.
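The rule's cell references can be mimicked in Python (a sketch for intuition, not Oracle's model clause): for each row, collect the IDs of the other type with rn <= cv(), then keep the one from the last row in date order:

```python
# Rows sorted by trans_datetime, trans_id -- the same order rn assigns.
rows = [(6, "CREDIT"), (7, "CREDIT"), (8, "DEBIT"), (5, "CREDIT"),
        (4, "DEBIT"), (3, "CREDIT"), (2, "DEBIT"), (1, "DEBIT")]

def prev_id_other_type(i):
    """Mimic max(trans_id) keep (dense_rank last ...)[other type, rn <= cv()]."""
    # Cells of the other type at or before the current row (rn <= cv()).
    cells = [tid for tid, ttype in rows[:i + 1] if ttype != rows[i][1]]
    # keep (dense_rank last order by ...): the id from the last such row.
    return cells[-1] if cells else None

result = [prev_id_other_type(i) for i in range(len(rows))]
print(result)  # [None, None, 7, 8, 5, 4, 3, 3]
```

Note this naive per-row scan is quadratic in the number of rows, which hints at why the model clause can be the slowest approach in the timings below.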
12c brought the match_recognize clause. This enables you to look for patterns in your data using regular expressions.
As with the model clause, you need to supply several parts. Let's take a bottom-up approach to using this.

Start with the define clause. Here you create variables identifying credits and debits:

```sql
define
  c as trans_type = 'CREDIT',
  d as trans_type = 'DEBIT'
```

Next, you need to define the pattern you're looking for. You need any sequence of one or more occurrences of either variable. Here you need a regular expression. The | symbol is the choice (alternation) operator. This matches the expression on either side of it. So you want ( c | d ).
```sql
pattern ( (c | d)+ )
```

You need to return all the rows in the table. So specify:
```sql
all rows per match
```

To finish, state the order in which you need to process the rows and how to calculate the prev_id_other_type column. As before, the order is simply trans_datetime, trans_id.
The classifier() function returns the name of the variable the current row matched ('C' or 'D'). You can use it in a decode to return the ID from the other variable:

```sql
decode(classifier(), 'D', c.trans_id, d.trans_id)
```

This works because, if the current row is a debit, c.trans_id keeps the value from the last credit transaction (and vice-versa).
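To build intuition for classifier() and the running variable references, here's a Python sketch (illustrative only) replaying the match row by row:

```python
# Rows in the order match_recognize processes them.
rows = [(6, "CREDIT"), (7, "CREDIT"), (8, "DEBIT"), (5, "CREDIT"),
        (4, "DEBIT"), (3, "CREDIT"), (2, "DEBIT"), (1, "DEBIT")]

bound = {"C": None, "D": None}  # running values of c.trans_id / d.trans_id
measure = []

for tid, ttype in rows:
    cls = "C" if ttype == "CREDIT" else "D"  # what classifier() returns here
    bound[cls] = tid                          # the matched variable now holds this id
    # decode(classifier(), 'D', c.trans_id, d.trans_id):
    # a debit row reads the last credit id, and vice-versa.
    measure.append(bound["C"] if cls == "D" else bound["D"])

print(measure)  # [None, None, 7, 8, 5, 4, 3, 3]
```

Because every row matches either c or d, the whole table forms one match and the "other" variable always holds the most recent ID of the other type.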
```sql
select * from transactions
match_recognize (
  order by trans_datetime, trans_id
  measures
    decode(classifier(), 'D', c.trans_id, d.trans_id)
      prev_id_other_trans_type
  all rows per match
  pattern ( (c | d)+ )
  define
    c as trans_type = 'CREDIT',
    d as trans_type = 'DEBIT'
)
order by trans_datetime, trans_id;
```

Phew! That was hard work. You now have three different methods to choose from.
All three methods only access the table once. So it comes down to how efficiently each technique can process the data.
To test this, I've loaded the table with 10,000 rows. I ran it on an Oracle Cloud Database (Extreme Performance Service) on version 12.1.0.2.
Let's look at the times for three executions of each.
```
Analytics
Run 1: Elapsed: 00:00:15.86
Run 2: Elapsed: 00:00:13.10
Run 3: Elapsed: 00:00:12.84
Mean:  13.93 seconds

Model
Run 1: Elapsed: 00:00:38.73
Run 2: Elapsed: 00:00:35.80
Run 3: Elapsed: 00:00:40.88
Mean:  38.47 seconds

Match Recognize
Run 1: Elapsed: 00:00:12.13
Mean:  12.21 seconds
```
The model clause is the clear loser here. It takes around three times longer than the other two approaches. Pattern matching has a slight edge on standard analytics. So if you're on 12c, you may want to go for this method.
Want to try this out for yourself? Have a go on LiveSQL.
What do you think? Which approach do you like best? Are there any better ways to do this?
Please let us know in the comments!
Thanks to Stew Ashton and Rajeshwaran Jeyabal for providing the pattern matching and analytic solutions in this Ask Tom thread.
Very nice idea to use the 12c pattern matching not for finding a pattern per se, but rather for just "dividing" the rows into the two categories and exploit the semantics of the pattern variables usage as column prefixes :)
This pattern matching feature is indeed an extremely powerful tool :)
Cheers,
Iudith
Thanks Iudith, credit to Stew Ashton for the match_recognize solution.
Chris, does using in-memory analyticals give a performance increase (to the appropriate solution) ?
You mean placing the tables in-memory?
It depends ;)
In Memory is most effective when working with large data sets and aggregating them or selecting just a few columns from the table. So if you're only fetching a few rows probably not. If you're fetching many, it could.
Another way, but I think it is more expensive:
with tmp ( TRANS_ID, TRANS_DATETIME, TRANS_TYPE , AMOUNT) as (
select 7, to_date('01-JAN-2015', 'dd-mon-yyyy'), 'CREDIT' , 90 from dual union all
select 6, to_date('01-JAN-2015', 'dd-mon-yyyy'), 'CREDIT' , 40 from dual union all
select 2, to_date('05-JAN-2015', 'dd-mon-yyyy'), 'DEBIT' , 200 from dual union all
select 8, to_date('01-JAN-2015', 'dd-mon-yyyy'), 'DEBIT' , 80 from dual union all
select 1, to_date('06-JAN-2015', 'dd-mon-yyyy'), 'DEBIT' , 100 from dual union all
select 5, to_date('02-JAN-2015', 'dd-mon-yyyy'), 'CREDIT' , 200 from dual union all
select 4, to_date('03-JAN-2015', 'dd-mon-yyyy'), 'DEBIT' , 99 from dual union all
select 3, to_date('04-JAN-2015', 'dd-mon-yyyy'), 'CREDIT' , 50 from dual
)
select
y.*,
(select TRANS_ID from
(select z.TRANS_ID,
row_number() over
(order by TRANS_DATETIME desc, TRANS_ID desc
) rwnum
from tmp z
where z.TRANS_TYPE != y.TRANS_TYPE and y.TRANS_DATETIME >= z.TRANS_DATETIME
and (case when y.TRANS_DATETIME = z.TRANS_DATETIME then
y.TRANS_ID
else 2 end )
> (case when y.TRANS_DATETIME = z.TRANS_DATETIME then
z.TRANS_ID
else 1 end )
)
where rwnum = 1
) prev_id_other_type
from
(
select
x.*
from tmp x
order by TRANS_DATETIME, TRANS_TYPE, TRANS_ID
) y