By Klaker-Oracle on Jan 27, 2015
I have just uploaded a new workshop on sessionization analytics using the 12c pattern matching feature, MATCH_RECOGNIZE, to the Oracle Learning Library. The workshop is based on analysis of the log files generated by the Big Data Lite MoviePlex application, which is part of our Big Data Lite virtual machine. Oracle MoviePlex is a fictitious on-line movie streaming company. Customers log into Oracle MoviePlex where they are presented with a targeted list of movies based on their past viewing behavior. Because of this personalised experience and reliable and fast performance, customers spend a lot of money with the company and it has become extremely profitable.
All the activity from our application is captured in a log file and we are going to analyze the data captured in that file by using SQL pattern matching to create a sessionization result set for our business users and data scientists to explore and analyze. The sections in the workshop (I have recorded a video of this workshop, see links below) will step you through the process of creating our sessionization result set using the Database 12c pattern matching features.
The workshop and video are available on the Oracle Learning Library using the following links:
For more information (whitepapers, multi-media Apple iBooks, tutorials etc) about SQL pattern matching and analytical SQL, check out our home page on OTN: http://www.oracle.com/technetwork/database/bi-datawarehousing/sql-analytics-index-1984365.html.
Since last year's OpenWorld I have posted quite a few articles on our new 12c pattern matching feature:
To date most of my online demos and tutorials have used a very simple data set consisting of a single table with 60 rows of fictitious trade data. Over the last few months I have been searching around for a slightly bigger and more challenging data set to use. Fortunately, our Big Data PMs (led by Marty Gubar) have put together a very rich big data schema as part of their recently launched Big Data Virtual Machine that you can download from our OTN web page, see here: http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html.
The data set is based around an online customer movie application. Here is a little background…
Oracle MoviePlex is a fictitious on-line movie streaming company. Customers log into Oracle MoviePlex where they are presented with a targeted list of movies based on their past viewing behavior. Because of this personalized experience and reliable and fast performance, customers spend a lot of money with the company and it has become extremely profitable.
As the users watch movies, search for their favorite actors and rate their movies, the system records all the clicks in a log file. The contents of that log file look like this:
In its raw state this information is not very helpful. It needs a lot of processing to convert the various data points into usable information. In many cases companies have been turning to Hadoop and its Java-based MapReduce programming model to process and convert these types of files into usable information. Most business users will want to summarise this information by customer and/or movie and then merge this information with other data sets. So how can we make access to and analysis of this type of data much easier? As part of this post I am going to compare the Hadoop-Java-MapReduce approach with an alternative approach using 12c SQL. My objective is not to determine which solution is the best because each approach has its own camp of supporters. Once we have reviewed the two solutions I will put everything into context and make some recommendations… so let's bring on the code!
At the moment, in the context of Big Data, everything seems to revolve around Hadoop, MapReduce and Java. It is quite natural for a big data developer to extend their basic MapReduce processing to include more complicated requirements. In the case of our movie demo there is a lot of processing that needs to be done using the native features of Hadoop to collect and process the weblog file being generated by our application. There is an excellent video prepared by Marty Gubar (Director of Product Management for Big Data) which explains this process. This is part four of a six-part series that explains the MoviePlex demo: Part 4. Turn Clicks into Value - Flume & Hive. The movie demo lasts about 5 mins and you can watch it here: https://www.youtube.com/watch?v=IwrjJUoUwXY.
The steps shown in the movie explain how to access the avro file and then how to clean the data to provide some interesting additional metrics.
Creating the sessionization analysis is a little more complicated. In fact, it requires 370 lines of Java code. Here is the Java code we created for doing the sessionization analysis (the code window is scrollable so you can review the very long code sample):
The actual logic for the sessionization analysis is about 100 lines of code as shown here (as before, the code window is scrollable so you can review the very long code sample):
As you can see from the code listing this requires a strong knowledge of Java and, with 370 lines of code, if we decide to change the details of the pattern that we are searching for it is going to be a lengthy process to make the required changes. Making any changes to the code to reflect changing business requirements is definitely going to be beyond the skills of your average business analyst and/or data scientist and this might negatively impact the level of project agility.
My personal view is that sessionization analysis is quite a sophisticated requirement and is best achieved using SQL because the requirements and parameters for the analysis will naturally evolve over time as new questions are triggered by existing answers. Therefore, you need an agile approach to analytics that can deal with constantly changing requirements during the data discovery phase.
The first step is to create a DIRECTORY object to access the raw data file:
CREATE DIRECTORY session_file_dir AS '/home/oracle/applog';
GRANT READ, WRITE ON DIRECTORY session_file_dir TO pmuser;
Next I created an external table over the data file. Notice that in the avro file, each key column has an identifier and a value.
The objective for this external table was to keep the processing as simple as possible:
CREATE TABLE RAW_SESSION_DATA (
  CUST_ID        VARCHAR2(20),
  MOVIE_ID       VARCHAR2(20),
  GENRE_ID       VARCHAR2(20),
  SESSION_DATE   VARCHAR2(50),
  RECOMMENDED_ID VARCHAR2(20),
  ACTIVITY_ID    VARCHAR2(20),
  RATING_ID      VARCHAR2(20)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY SESSION_FILE_DIR
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    NOBADFILE NODISCARDFILE NOLOGFILE
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY "'"
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION (SESSION_FILE_DIR: 'movieapp_30months.log')
)
REJECT LIMIT 1;
ALTER TABLE RAW_SESSION_DATA PROJECT COLUMN REFERENCED;
Now the next stage is to clean the data and remove the column identifiers such as custId, movieId, genreId etc. To do this processing I simply used the SQL SUBSTR() function. At the same time I decided to break out the activity ID column to create unique columns for each type of activity. This approach is useful if you are going to be using the data as part of a data mining project because in many cases it is useful to uniquely identify specific attributes. In many cases you will want to create a fact table from this process but in this case I have created a view while I do some basic prototyping:
CREATE OR REPLACE VIEW VW_SESSION_DATA AS
SELECT SUBSTR(CUST_ID,11) AS CUST_ID
     , CASE SUBSTR(MOVIE_ID,11) WHEN 'null' THEN null ELSE SUBSTR(MOVIE_ID,11) END AS MOVIE_ID
     , CASE SUBSTR(GENRE_ID,11) WHEN 'null' THEN null ELSE SUBSTR(GENRE_ID,11) END AS GENRE_ID
     , TO_DATE(SUBSTR(SESSION_DATE,9,19), 'YYYY-MM-DD HH24:MI:SS') AS SESS_DATE
     , CASE SUBSTR(RECOMMENDED_ID,15) WHEN 'null' THEN null ELSE SUBSTR(RECOMMENDED_ID,16,1) END AS REC_ID
     , (CASE SUBSTR(ACTIVITY_ID,12,2)
          WHEN '1'  THEN '1'
          WHEN '2}' THEN '2'
          WHEN '3}' THEN '3'
          WHEN '4}' THEN '4'
          WHEN '5}' THEN '5'
          WHEN '6}' THEN '6'
          WHEN '7}' THEN '7'
          WHEN '8}' THEN '8'
          WHEN '9}' THEN '9'
          WHEN '10' THEN '10'
          WHEN '11' THEN '11'
          ELSE null
        END) AS ACTIVITY_ID
     , CASE SUBSTR(RATING_ID,9) WHEN 'null' THEN null ELSE SUBSTR(RATING_ID,10,1) END AS RATING_ID
     -- One flag column per activity type. Activities 1, 10 and 11 are compared on
     -- two characters so that '1' is not confused with '10' or '11'.
     , CASE SUBSTR(ACTIVITY_ID,12,2) WHEN '1'  THEN 'Y' END AS act_rate
     , CASE SUBSTR(ACTIVITY_ID,12,1) WHEN '2'  THEN 'Y' END AS act_complete
     , CASE SUBSTR(ACTIVITY_ID,12,1) WHEN '3'  THEN 'Y' END AS act_pause
     , CASE SUBSTR(ACTIVITY_ID,12,1) WHEN '4'  THEN 'Y' END AS act_start
     , CASE SUBSTR(ACTIVITY_ID,12,1) WHEN '5'  THEN 'Y' END AS act_browse
     , CASE SUBSTR(ACTIVITY_ID,12,1) WHEN '6'  THEN 'Y' END AS act_list
     , CASE SUBSTR(ACTIVITY_ID,12,1) WHEN '7'  THEN 'Y' END AS act_search
     , CASE SUBSTR(ACTIVITY_ID,12,1) WHEN '8'  THEN 'Y' END AS act_login
     , CASE SUBSTR(ACTIVITY_ID,12,1) WHEN '9'  THEN 'Y' END AS act_logout
     , CASE SUBSTR(ACTIVITY_ID,12,2) WHEN '10' THEN 'Y' END AS act_incomplete
     , CASE SUBSTR(ACTIVITY_ID,12,2) WHEN '11' THEN 'Y' END AS act_purchase
FROM RAW_SESSION_DATA;
Running a query against this view generates our basic data set that we can then feed into our MATCH_RECOGNIZE clause to create the sessionization result set.
Is there a better, simpler way to deal with the avro file? Well yes there is. As part of the Database 12c release programme we will be adding in-database support for JSON data. This will allow us to take the raw avro file and access and query it directly from within the database without the need to go through the complicated cleaning process! At the moment we have not made any collateral (PPTs etc) on this feature publicly available so I can't provide you with any supporting links that will give you more information. However, once this feature is released I will revisit this little demo to explore how the new JSON feature can be used alongside SQL pattern matching.
The code to calculate the sessionization information is as follows:
SELECT *
FROM vw_session_data
MATCH_RECOGNIZE (
  PARTITION BY cust_id
  ORDER BY sess_date
  MEASURES match_number() session_id,
           COUNT(*) no_of_events,
           FIRST(sess_date) start_date,
           TO_CHAR(FIRST(sess_date), 'hh24:mi:ss') start_time,
           LAST(sess_date) end_date,
           TO_CHAR(LAST(sess_date), 'hh24:mi:ss') end_time,
           TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') duration,
           LAST(activity_id) last_act_id,
           COUNT(act_rate) act_rate,
           COUNT(act_complete) act_complete,
           COUNT(act_pause) act_pause,
           COUNT(act_start) act_start,
           COUNT(act_browse) act_browse,
           COUNT(act_list) act_list,
           COUNT(act_search) act_search,
           COUNT(act_login) act_login,
           COUNT(act_logout) act_logout,
           COUNT(act_incomplete) act_incomplete,
           COUNT(act_purchase) act_purchase
  ONE ROW PER MATCH
  PATTERN (strt s+)
  DEFINE s as (round(to_number(sess_date - prev(sess_date)) * 1440) <= 5)
);
This statement uses many of the MATCH_RECOGNIZE features that I have discussed in previous posts. In this specific example I have decided to set the duration between events within a session as 5 minutes. That means if the user does nothing for about 5 minutes then I will assume that a new session has started even if the user has not logged out.
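For readers who want to see the sessionization rule outside the database, here is a minimal sketch in Python of what the pattern is doing. It is an illustration only and not part of the workshop: the function name `sessionize`, the sample events and the dictionary keys are all hypothetical, and the gap is compared exactly rather than after rounding to whole minutes as the SQL does. Note that PATTERN (strt s+) needs at least two rows to match, so a lone event does not form a session; the sketch mirrors that with `min_events=2`.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter


def sessionize(events, max_gap=timedelta(minutes=5), min_events=2):
    """Group (cust_id, timestamp) events into sessions per customer.

    A new session starts whenever the gap to the previous event of the same
    customer exceeds max_gap. Runs with fewer than min_events rows are
    dropped, which mirrors PATTERN (strt s+).
    """
    sessions = []
    ordered = sorted(events, key=itemgetter(0, 1))  # PARTITION BY / ORDER BY
    for cust_id, rows in groupby(ordered, key=itemgetter(0)):
        timestamps = [ts for _, ts in rows]
        # Split the customer's timeline wherever the gap exceeds max_gap.
        runs = [[timestamps[0]]]
        for prev, cur in zip(timestamps, timestamps[1:]):
            if cur - prev <= max_gap:
                runs[-1].append(cur)
            else:
                runs.append([cur])
        kept = [run for run in runs if len(run) >= min_events]
        # Number the surviving sessions 1, 2, ... like match_number().
        for session_id, run in enumerate(kept, start=1):
            sessions.append({
                "cust_id": cust_id,
                "session_id": session_id,
                "no_of_events": len(run),
                "start": run[0],
                "end": run[-1],
                "duration_min": round((run[-1] - run[0]).total_seconds() / 60),
            })
    return sessions


# Hypothetical sample data: two customers, minutes offset from 09:00.
t0 = datetime(2014, 1, 1, 9, 0, 0)
events = [
    ("c1", t0),
    ("c1", t0 + timedelta(minutes=2)),
    ("c1", t0 + timedelta(minutes=4)),
    ("c1", t0 + timedelta(minutes=30)),  # isolated event, no session
    ("c1", t0 + timedelta(minutes=60)),
    ("c1", t0 + timedelta(minutes=63)),
    ("c2", t0 + timedelta(minutes=1)),   # single event, no session
]
result = sessionize(events)
for session in result:
    print(session)
```

Changing the partition key (for example, grouping by a movie id instead of a customer id) only means changing the first element of each tuple, which is the same idea as swapping the PARTITION BY column in the SQL.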
Beyond the basic sessionization model there is another interesting pattern that we can search for within our data set. If a user starts watching a movie then we might not get another event being logged for over two hours. Therefore, we could look for instances where the time between events is less than 150 minutes (assuming most movies last around 2 hours) and where at least one "start" event and at least one "complete" event are logged:
PATTERN (strt s* f+ c+) DEFINE f as act_start = 'Y', c as act_complete = 'Y', s as (round(to_number(sess_date - prev(sess_date))*1440) <= 150)
The output from this query is shown below:
Looking at the results what really jumps out is the level of interaction on the site before, during and after the user has watched a movie. Many of the records that are returned show users searching our movie database, going to their lists of favourite movies and rating the movies they have just watched. All this implies that if we can get customers to begin watching a movie there is a huge knock-on effect in terms of how long they remain on the site and the number of "events" they create while on the site. This is the sort of information that is useful to marketing and campaign management teams.
While developing the above analysis I began to wonder about analysing the data not from a customer perspective but from the point of view of each movie. This would allow me to look at the session activity over time for each movie and understand the usage patterns across movies and genres. For example, are there certain movies or types of movies that are more or less likely to be rated? Is the process of watching a movie the same across all movies in terms of the events that are registered during a session? There are all sorts of new avenues of analysis that could be created from looking at the data in this way.
So how easy would it be to change the focus of the analysis to movies? Well, the code took me approximately 30 seconds to write: I cut and pasted the customer sessionization example, switched the customer id for the movie id and finally removed a few fields. Here is the new code:
SELECT *
FROM SESSION_DATA td
MATCH_RECOGNIZE (
  PARTITION BY movie_id
  ORDER BY sess_date
  MEASURES COUNT(*) no_of_events,
           FIRST(sess_date) start_date,
           TO_CHAR(FIRST(sess_date), 'hh24:mi:ss') start_time,
           LAST(sess_date) end_date,
           TO_CHAR(LAST(sess_date), 'hh24:mi:ss') end_time,
           TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') duration,
           MIN(activity_id) act_id,
           MAX(activity_id) last_act_id,
           COUNT(act_rate) act_rate,
           COUNT(act_complete) act_complete,
           COUNT(act_pause) act_pause,
           COUNT(act_start) act_start,
           COUNT(act_browse) act_browse,
           COUNT(act_list) act_list,
           COUNT(act_search) act_search,
           COUNT(act_login) act_login,
           COUNT(act_logout) act_logout,
           COUNT(act_incomplete) act_incomplete,
           COUNT(act_purchase) act_purchase
  ONE ROW PER MATCH
  PATTERN (strt s*)
  DEFINE s as (ROUND(TO_NUMBER(sess_date - PREV(sess_date)) * 1440) <= 120)
) MR;
The speed at which you can create these additional/alternative result sets is really great! It is now relatively easy to continue this discovery process by analysing the information by movie genre or any of the other related attributes.
The main take-away from this post is that (hopefully) I have shown how easy it is to use SQL for sessionization analysis. Firstly, in terms of creating the framework to support the normal "sessionization" transformation process: taking the source log file, extracting the data points and then grouping the information by user. The great news is that processing of the source file will be greatly simplified when the JSON feature is released because it will be possible to query the log file in a much simpler way to extract the required data points. Secondly, SQL provides a much simpler and easier way to support the evolutionary nature of the discovery process. Changing the focus from customer analysis to movie analysis is both quick and easy. I am not convinced that a Java-based approach to data discovery can offer the same level of simplicity and agility but I will let you be the judge of that last statement….
Over the past six months I have posted a number of articles about SQL pattern matching, as shown here:
Most of these have been related to explaining the basic concepts along with some specific use cases.
In this post I want to review some of the internal performance tests that we have run during the development of this feature. In part 3 of the series of podcasts I covered a number of use cases for SQL pattern matching such as: stock market analysis, tracking governance-compliance, call service quality and sessionization. The most popular scenario is likely to be the analysis of sessionization data as this is currently a hot topic when we start considering machine data and, more generally, big data.
To help us create a meaningful test data set we decided to use the TPC-H schema because it contained approximately seven years of data which equated to approximately 1TB of data. One of the objectives of our performance tests was to compare and contrast the performance and scalability of code using the 12c MATCH_RECOGNIZE clause with code using 11g window functions.
To make things easy to understand I have divided our sessionization workflow into a number of steps.
Part 1 - For the purposes of this specific use case we defined a session as a sequence of one or more events with the same partition key where the gap between the timestamps is less than 10 seconds - obviously the figure for the gap is completely arbitrary and could be set to any number as required. The 1TB of source data looked like this:
The following sections compare the use of 11g window functions vs. 12c MATCH_RECOGNIZE clause.
Part 2 - To create the first part of the sessionization workflow we took the original source data and used the USER_ID as the PARTITION BY key and the timestamp for the ORDER BY clause. The objective for this first step is to detect the various sessions and assign a surrogate session id to each session within each partition (USER_ID).
This creates an output result set that delivers a simplified sessionization data set as shown here:
The 12c SQL to create the initial result set is as follows:
SELECT user_id, session_id, start_time, no_of_events, duration
FROM Events
MATCH_RECOGNIZE (
  PARTITION BY User_ID
  ORDER BY Time_Stamp
  MEASURES match_number() session_id,
           count(*) as no_of_events,
           first(time_stamp) start_time,
           last(time_stamp) - first(time_stamp) duration
  ONE ROW PER MATCH
  PATTERN (b s*)
  DEFINE s as (s.Time_Stamp - prev(s.Time_Stamp) < 10)
);
As a comparison, here is how to achieve the above using 11g analytical window functions:
CREATE VIEW Sessionized_Events as
SELECT Time_Stamp, User_ID,
       Sum(Session_Increment) over (partition by User_ID order by Time_Stamp asc) Session_ID
FROM (
  SELECT Time_Stamp, User_ID,
         CASE WHEN (Time_Stamp - Lag(Time_Stamp) over (partition by User_ID order by Time_Stamp asc)) < 10
              THEN 0 ELSE 1
         END Session_Increment
  FROM Events
);
SELECT User_ID, Min(Time_Stamp) Start_Time, Count(*) No_Of_Events, Max(Time_Stamp) - Min(Time_Stamp) Duration
FROM Sessionized_Events
GROUP BY User_ID, Session_ID;
As you can see the 11g approach using window functions ( SUM() OVER(PARTITION BY…) ) is a little more complex to understand but it produces the same output - i.e. our initial sessionized data set.
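If you do not have an Oracle instance to hand, the window-function technique can still be tried out on a laptop. The sketch below runs the same LAG plus running SUM idea against an in-memory SQLite database from Python. It is only an illustration under several assumptions: the tiny `events` table and its values are made up, timestamps are plain integers standing in for seconds, and it relies on the SQLite build bundled with Python supporting window functions (SQLite 3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, time_stamp INTEGER)")
# Hypothetical events: time_stamp is a number of seconds.
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", 0), ("u1", 4), ("u1", 9), ("u1", 40), ("u1", 45), ("u2", 3)],
)

query = """
SELECT user_id,
       session_id,
       MIN(time_stamp)                   AS start_time,
       COUNT(*)                          AS no_of_events,
       MAX(time_stamp) - MIN(time_stamp) AS duration
FROM (
    -- The running total of the increments yields a session id per row.
    SELECT user_id, time_stamp,
           SUM(session_increment) OVER (
               PARTITION BY user_id ORDER BY time_stamp
           ) AS session_id
    FROM (
        -- 1 marks the first row of a session, 0 a continuation row.
        SELECT user_id, time_stamp,
               CASE WHEN time_stamp - LAG(time_stamp) OVER (
                             PARTITION BY user_id ORDER BY time_stamp
                         ) < 10
                    THEN 0 ELSE 1 END AS session_increment
        FROM events
    )
)
GROUP BY user_id, session_id
ORDER BY user_id, session_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The first row of each partition has no previous row, so LAG returns NULL, the comparison is not true and the CASE falls through to 1, which is what starts session 1 for every user.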
Part 3 - However, to get business value from this derived data set we need to do some additional processing. Typically, with this kind of analysis the business value within the data emerges only after aggregation, which in this case needs to be by session. We need to reduce the data set to a single tuple, or row, per session along with some derived attributes, such as:
To do this with Database 12c we can use the MATCH_RECOGNIZE clause to determine how many events are captured within each session. There are actually two ways to do this: 1) we can compare the current record to the previous record, i.e. peek backwards or 2) we can compare the current record to the next record, i.e. peek forwards.
Here is code based on using the PREV() function to compare the current record against the previous record:
select count(*)
from (
  select /* mr_sessionize_prev */ *
  from (
    select o_pbykey, session_id, start_time, no_of_events, duration
    from orders_v
    MATCH_RECOGNIZE (
      PARTITION BY o_pbykey
      ORDER BY O_custkey, O_Orderdate
      MEASURES match_number() session_id,
               count(*) as no_of_events,
               first(o_orderdate) start_time,
               last(o_orderdate) - first(o_orderdate) duration
      PATTERN (b s*)
      DEFINE s as (s.O_Orderdate - prev(O_Orderdate) <= 100)
    )
  )
  where No_Of_Events >= 20
);
Here is code based on using the NEXT() function to compare the current record against the next record:
select count(*)
from (
  select /* mr_sessionize_next */ *
  from (
    select o_pbykey, session_id, start_time, no_of_events, duration
    from orders_v
    MATCH_RECOGNIZE (
      PARTITION BY o_pbykey
      ORDER BY O_custkey, O_Orderdate
      MEASURES match_number() session_id,
               count(*) as no_of_events,
               first(o_orderdate) start_time,
               last(o_orderdate) - first(o_orderdate) duration
      PATTERN (s* e)
      DEFINE s as (next(s.O_Orderdate) - s.O_Orderdate <= 100)
    )
  )
  where No_Of_Events >= 20
);
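To see why peeking backwards and peeking forwards describe the same sessions, here is a small Python sketch of the two formulations. It is an illustration only: the function names and the sample values are hypothetical, and the values are plain numbers already sorted within one partition. In the backward version a row joins the current session if it is close to the previous row (PATTERN (b s*)); in the forward version a row is followed by more session rows if the next row is close, and the row that fails that test closes the session (PATTERN (s* e)).

```python
def sessions_looking_back(values, threshold):
    """Backward peek: compare each row with the previous row, like PREV()."""
    sessions = []
    for i, value in enumerate(values):
        if i > 0 and value - values[i - 1] <= threshold:
            sessions[-1].append(value)   # row plays the role of s
        else:
            sessions.append([value])     # row plays the role of b
    return sessions


def sessions_looking_forward(values, threshold):
    """Forward peek: compare each row with the next row, like NEXT()."""
    sessions, current = [], []
    for i, value in enumerate(values):
        current.append(value)
        is_last = i == len(values) - 1
        # The last row has no next row, so it can only close a session.
        if is_last or values[i + 1] - value > threshold:
            sessions.append(current)     # row plays the role of e
            current = []
    return sessions


# Hypothetical ordered values within a single partition.
values = [0, 50, 120, 400, 450, 900]
back = sessions_looking_back(values, threshold=100)
forward = sessions_looking_forward(values, threshold=100)
print(back)
print(forward)
```

Both functions split the list at the same places, so any difference between the two SQL statements is one of execution cost rather than of results.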
Finally we can compare the 12c MATCH_RECOGNIZE code to the 11g code which uses window functions (which in my opinion is a lot more complex):
select count(*)
from (
  select /* wf */ *
  from (
    select O_pbykey, Session_ID,
           min(O_Orderdate) Start_Time,
           count(*) No_Of_Events,
           (max(O_Orderdate) - min(O_Orderdate)) Duration
    from (
      select O_Orderdate, O_Custkey, o_pbykey,
             sum(Session_Increment) over (partition by o_pbykey order by O_custkey, O_Orderdate) Session_ID
      from (
        select O_Custkey, O_Orderdate, O_pbykey,
               case when (O_Orderdate - Lag(O_Orderdate) over (partition by o_pbykey order by O_custkey, O_Orderdate)) <= 100 -- Threshold
                    then 0 else 1
               end Session_Increment
        from orders_v
      )
    )
    group by o_pbykey, Session_ID
  )
  where No_Of_Events >= 20
);
The final output generated by both sets of code (11g using window functions and 12c using MATCH_RECOGNIZE clause ) would look something like this:
Part 4 - The performance results for these three approaches (11g window functions vs. MATCH_RECOGNIZE using PREV() vs. MATCH_RECOGNIZE using NEXT()) are shown below. Please note that on the graph the X-axis shows the number of partitions within each test-run and the Y-axis shows the time taken to run each test. There are three key points to note from this graph:
The first is that, in general, the 12c MATCH_RECOGNIZE code is between 1.5x and 1.9x faster compared to using window functions, which is good news if you are looking for a reason to upgrade to Database 12c.
Secondly, it is clear from the X-axis that as the number of partitions increases the MATCH_RECOGNIZE clause offers excellent scalability and continues to deliver excellent performance. So it performs well and scales well as your data volumes increase.
However, it is important to remember that the 11g window function code shows similar attributes of excellent scalability and excellent performance. If you are using 11g at the moment and you have not considered using Oracle Database to run your sessionization analysis then it is definitely worth pulling that transformation code back inside the database and using window functions to run those sessionization transformations. If you need a reason to upgrade to Database 12c then MATCH_RECOGNIZE does offer significant performance benefits if you are doing pattern matching operations either inside the Oracle Database 11g or using an external processing engine.
Lastly, when you are designing your own MATCH_RECOGNIZE implementations and you are using the NEXT() or PREV() functions it is worth investigating if using the alternate function offers any significant performance benefits. Obviously, much will depend on the nature of the comparison you are trying to formulate but it is an interesting area and we would welcome feedback on this specific point based on your own data sets.
In general, if you are using the MATCH_RECOGNIZE clause then I would love to hear about your use case and experiences in developing your code. You can contact me directly either via this blog or via email (firstname.lastname@example.org).
Have fun with MATCH_RECOGNIZE….
Thank you to everyone who attended the SQL pattern matching session during yesterday's OTN Virtual Developer Day event. We had a great crowd of people join our live workshop session. I hope everyone enjoyed using the amazing platform which the OTN team put together to host the event.
The great news is that all the content from the event is now available for download and you can watch all the on-demand videos from the four tracks (Big Data DBA, Big Data Developer, Database DBA and Database Developer).
The link to the fantastic OTN VDD platform is here: https://oracle.6connex.com/portal/database2014/login?langR=en_US&mcc=aceinvite and this is what the landing page looks like:
This page will give you access to the keynote session by Tom Kyte and Jonathan Lewis which covered the landscape of Oracle DB technology evolution and adoption. The content looks at what's next for Oracle Database 12c looking at the high value technologies and techniques that are driving greater database efficiencies and innovation.
You will be able to access the videos, slides from each presentation and a huge range of technical hands-on labs covering big data and database technologies, including my SQL Pattern Matching workshop. If you want to download the Virtualbox image for the Database tracks it is available here: http://www.oracle.com/technetwork/database/enterprise-edition/databaseappdev-vm-161299.html (this contains everything you need to run my SQL Pattern Matching workshop).
While you are doing the workshop, if you have any questions then please feel free to email me - email@example.com.
Make sure you are free on Tuesday February 4 because the OTN team are hosting another of their virtual developer day events. Most importantly it is FREE. Even more important is the fact that I will be running a 12c pattern matching workshop at 11:45am Pacific Time. Of course there are lots of other sessions that you can attend relating to big data and Oracle Database 12c and the OTN team has created two streams to help you learn about these two important areas:
You can sign-up for the event and pick your tracks and sessions via this link: https://oracle.6connex.com/portal/database2014/login?langR=en_US&mcc=aceinvite.
My pattern matching session is included in the Oracle 12c DBA section of the application development track and the workshop will cover the following topics:
As my session is only 45 minutes long I am only going to cover the first three topics and leave you to work through the last three topics in your own time. During the 45 minute workshop I will be available to answer any questions via the live Q&A chat feature.
There is a link to the full agenda on the invitation page. The OTN team will be providing a Database 12c Virtualbox VM that you will be able to download later this week. For the pattern matching session I will be providing the scripts to install our sample schema, the slides from the webcast and the workshop files which include a whole series of exercises that will help you learn about pattern matching and test your SQL skills.
The big data team has kindly included my pattern matching content inside their Virtualbox image so if you want to focus on the sessions offered on the big data tracks but still want to work on the pattern matching exercises after the event then you will have everything you need already installed and ready to go!
Don't forget to register as soon as possible and I hope you have a great day…Let me know if you have any questions or comments.
One of the overlooked capabilities of SQL as an analysis engine, because we all just take it for granted, is that you can mix and match analytical features to create some amazing mash-ups. As we move into the exciting world of big data these mash-ups can really deliver those "wow, I never knew that" moments.
While Java is an incredibly flexible and powerful framework for managing big data there are some significant challenges in using Java and MapReduce to drive your analysis to create these "wow" discoveries. One of these "wow" moments was demonstrated at this year's OpenWorld during Andy Mendelsohn's general keynote session.
Here is the scenario - we are looking for fraudulent activities in our big data stream and in this case we are identifying potentially fraudulent activities by looking for specific patterns. We are using geospatial tagging of each transaction so we can create a real-time fraud-map for our business users.
Where we start to move towards a "wow" moment is to extend this basic use of spatial and pattern matching, as shown in the above dashboard screen, to incorporate spatial analytics within the SQL pattern matching clause. This will allow us to compute the distance between transactions. Apologies for the quality of this screenshot….hopefully below you see where we have extended our SQL pattern matching clause to use location of each transaction and to calculate the distance between each transaction:
This allows us to compare the time of the last transaction with the time of the current transaction and see if the distance between the two points is possible given the time frame. Obviously if I buy something in Florida from my favourite bike store (maybe a new carbon saddle for my Trek) and then 5 minutes later the system sees my credit card details being used in Arizona there is a high probability that this transaction in Arizona is actually fraudulent (I am fast on my Trek but not that fast!) and we can flag this up in real-time on our dashboard.
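The distance-versus-time check itself is simple arithmetic, and a small Python sketch may help make it concrete. This is not the code from the keynote demo, which used Oracle spatial analytics inside the SQL pattern matching clause. Everything here is an assumption made for illustration: the function names, the approximate city coordinates, the haversine great-circle formula as the distance measure, and the 900 km/h speed limit (roughly an airliner's cruising speed).

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 900.0  # hypothetical limit, roughly airliner cruising speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def is_suspicious(prev_txn, cur_txn, max_kmh=MAX_PLAUSIBLE_KMH):
    """Flag cur_txn if reaching it from prev_txn implies an impossible speed.

    Each transaction is a (latitude, longitude, time_in_minutes) tuple.
    """
    distance_km = haversine_km(prev_txn[0], prev_txn[1], cur_txn[0], cur_txn[1])
    hours = (cur_txn[2] - prev_txn[2]) / 60.0
    if hours <= 0:
        # Same instant (or out of order): any real distance is impossible.
        return distance_km > 0
    return distance_km / hours > max_kmh


# Hypothetical transactions; coordinates are approximate city centres.
florida_store = (28.54, -81.38, 0)    # Orlando, Florida at t = 0 minutes
arizona_store = (33.45, -112.07, 5)   # Phoenix, Arizona 5 minutes later
nearby_store = (28.55, -81.38, 5)     # about 1 km away, 5 minutes later

print(is_suspicious(florida_store, arizona_store))
print(is_suspicious(florida_store, nearby_store))
```

In the SQL version the same comparison would sit inside the DEFINE clause, with the previous row's location and time taken from PREV().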
In this post I have used the term "real-time" a couple of times and this is an important point and one of the key reasons why SQL really is the only language to use if you want to analyse big data. One of the most important questions that comes up in every big data project is: how do we do analysis? Many enlightened customers are now realising that using Java-MapReduce to deliver analysis does not result in "wow" moments. These "wow" moments only come with SQL because it offers a much richer environment, it is simpler to use and it is faster - which makes it possible to deliver real-time "Wow!". Below is a slide from Andy's session showing the results of a comparison of Java-MapReduce vs. SQL pattern matching to deliver our "wow" moment during our live demo.
You can watch our analytical mash-up "Wow" demo that compares the power of 12c SQL pattern matching + spatial analytics vs. Java-MapReduce here:
You can get more information about SQL Pattern Matching on our SQL Analytics home page on OTN, see here http://www.oracle.com/technetwork/database/bi-datawarehousing/sql-analytics-index-1984365.html.
You can get more information about our spatial analytics here: http://www.oracle.com/technetwork/database-options/spatialandgraph/overview/index.html
If you would like to watch the full Database 12c OOW presentation see here: http://medianetwork.oracle.com/video/player/2686974264001
The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.