Supermodels, ETL and Data Mining
By Jean-Pierre Dijcks on Mar 31, 2009
I figured, since we proclaim data warehousing to be such a broad area, that I would venture out and look at some of the data mining functionality in the database. I'm only scratching the surface of that wonderful world here, but I figured it is worthwhile.
We use a couple of things here: one is OWB (Oracle Warehouse Builder), the other is the Data Mining option in Oracle Database 11g.
Supermodels; you are kidding, right?
Well actually we are not and that is of course why you are reading this. So what are we talking about here, really? We are not going to discuss how you can load cool pictures from your favorite supermodel into a CLOB in the Oracle database. Nope, we will discuss data mining models and their latest attributes and how you can apply these in ETL scenarios.
What is a supermodel?
Before that question can be answered we need to discuss a bit of background on data mining and what you run data mining models on. To do data mining and understand things that are not obvious in your data, you first need to consolidate and prepare your data set. Depending on the algorithms you use, these steps can vary in complexity. Some of the data preparation can be considered pure ETL: moving data from various sources into a single consolidated data set (a table or a view). On top of that you may need to massage the data to fit the algorithms, and that is where some more data preparation comes into play. A supermodel in data mining performs automatic data preparation while the model is being built. A supermodel therefore adds more functionality to the model creation and gives a superior out-of-the-box experience to the user. Instead of doing all sorts of data preparation outside the model, Oracle’s supermodels deploy the data preparation with the model.
Therefore the supermodel does not just predict and perform your desired data mining tasks; it also prepares your data for the required algorithms. With Oracle 11g – so don’t go looking around in the 10g database – the supermodels are now at your disposal.
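For the curious, a supermodel is built with the regular DBMS_DATA_MINING.CREATE_MODEL call; the automatic data preparation is switched on through the PREP_AUTO setting in the model's settings table. A minimal sketch (the model name, settings table name and source table layout here are ours, reusing the column names from the example later in this article):

```sql
-- Settings table: pick Naive Bayes and turn on Automatic Data Preparation
CREATE TABLE nb_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  INSERT INTO nb_settings VALUES
    (DBMS_DATA_MINING.ALGO_NAME, DBMS_DATA_MINING.ALGO_NAIVE_BAYES);
  INSERT INTO nb_settings VALUES
    (DBMS_DATA_MINING.PREP_AUTO, DBMS_DATA_MINING.PREP_AUTO_ON);

  -- Build the model; the data preparation is embedded with the model
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'NB_SUPERMODEL',
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'customer_age',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    settings_table_name => 'nb_settings');
END;
/
```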
ETL and Data Mining and ETL
As we said earlier in the article, before you start to do data mining you will have to do some data consolidation. With Oracle 11g, Oracle also packages Oracle Warehouse Builder, which can be used to do all the ETL needed for the data mining preparation. However we are not going to discuss that so much here.
Instead we will focus on leveraging the Data Mining models within the ETL to augment for example a data warehouse load with more relevant information for the business.
Who does what job?
First things first: are we implying here that ETL developers now need to start learning how to develop data mining models? No. The beauty of Oracle’s architecture is that both the ETL and the data mining are done in the database. The data mining expert, who knows how to build the models, will build the model and deploy it into the database. From there the APIs are such – they are PL/SQL – that the ETL developer can call the models directly from within the ETL process, which is created in PL/SQL with Oracle Warehouse Builder.
The idea is therefore very much to leverage each individual’s role to enhance the data accessible for the entire enterprise.
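In SQL terms, calling a deployed model from ETL code can be as simple as a scoring expression in a query. A minimal sketch, assuming a classification model named CUST_MODEL has already been built and deployed by the data mining expert (the model and table names here are ours):

```sql
-- Score each customer with a deployed model; PREDICTION and
-- PREDICTION_PROBABILITY are built-in SQL functions since Oracle 10g Release 2.
SELECT cust_id,
       PREDICTION(cust_model USING *)             AS predicted_value,
       PREDICTION_PROBABILITY(cust_model USING *) AS prediction_prob
FROM   customer_age;
```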
Why should I use Data Mining within the ETL process?
The simple answer is: because you get more complete information for everyone with access to the resulting target system. But let’s look at some examples out of the press these days.
Relevant information on the risk banks run in various business units is a hot topic. For example, would it not be nice to load your data warehouse with the trades of the day and, while you load, analyze trends and mark those trades as likely to be high risk, or likely to cause large exposures?
Or, when you load your daily mortgage sales into the data warehouse, marking loans with information qualifying the customer as high value, low value or risky business would give a wide range of analysts insight into the exposure of a financial institution. Of course we are oversimplifying things a bit here, but hopefully you understand what the added value can be.
Show me how this works!
Going with the assumption that we poor ETL developers know nothing about data mining, let’s start with the basics, and with some good news. What we are going to discuss here was introduced in Oracle 10g and is something you can use today (while you are waiting to upgrade to 11g for the supermodels).
The DBMS_PREDICTIVE_ANALYTICS package provides an important tool that makes data mining possible for a wider audience of users, in particular, ETL developers and business analysts.
This package automates the entire data mining process from data preprocessing through model building to scoring new data. The package contains two procedures: EXPLAIN, which identifies the attributes that are important in explaining the variation in a target column, and PREDICT, which predicts the values of a target column.
The PREDICT procedure is the most useful for us ETL developers. This can be used to predict values of a specific column. The input consists of a table and a target column, the target column containing the values to predict. The input data must contain some cases in which the target value is known (that is, is not NULL). Cases where the target values are known are used to train models.
PREDICT returns a predicted value for every case, including those where the value is known. In the example below, the PREDICT procedure is used to predict the customer response to an affinity card program using a classifier based on the Naïve Bayes algorithm.
DECLARE
  v_accuracy NUMBER;
BEGIN
  DBMS_PREDICTIVE_ANALYTICS.PREDICT(
    accuracy            => v_accuracy,
    data_table_name     => 'customer_AGE',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    result_table_name   => 'predict_result');
  DBMS_OUTPUT.PUT_LINE('Accuracy = ' || v_accuracy);
END;
/
Within Warehouse Builder you create a custom procedure to capture the calls to the package.
Note that we went very basic here: all of the values are hard-coded. In real life you obviously want more flexibility, and should at least make the data and result table names parameters to the procedure.
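A minimal sketch of such a parameterized wrapper (the procedure name is ours; the column names follow the earlier example):

```sql
CREATE OR REPLACE PROCEDURE predict_customer_value (
  p_data_table   IN VARCHAR2,   -- source table to score
  p_result_table IN VARCHAR2    -- table receiving the predictions
) IS
  v_accuracy NUMBER;
BEGIN
  DBMS_PREDICTIVE_ANALYTICS.PREDICT(
    accuracy            => v_accuracy,
    data_table_name     => p_data_table,
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    result_table_name   => p_result_table);
END predict_customer_value;
/
```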
In this example, the model is used to determine the value for the high-value customer flag; that is, is this particular customer a high-value customer, and with what probability did we determine this?
The basic inputs define the following:
· Source table
· Case key to uniquely identify each record in the input table
· Name of the column containing the value to be predicted
· Name of the target table
The output from the call is written to the table PREDICT_RESULT. This output table has a predefined format, as shown below:
CASE_ID_COLUMN_NAME - Each of the cases identified in the case_id column. This is the same as the name that was passed in. The data type is the same as input case_id type.
PREDICTION - The predicted value of the target column for the given case. The data type is the same as the input target_column_name type.
PROBABILITY - For classification (categorical target), the probability of the prediction. For regression problems (numerical target), this column contains NULL.
Make sure you have this table in the database and make sure it is imported into the Warehouse Builder metadata repository, as we will leverage this table in the following step.
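For reference, a sketch of the DDL for this example's result table (the column data types are assumptions that match the CUSTOMER_AGE source table, following the type rules above):

```sql
CREATE TABLE predict_result (
  cust_id     NUMBER,   -- same type as the input case_id column
  prediction  NUMBER,   -- same type as the target column (affinity_card)
  probability NUMBER    -- prediction probability; NULL for regression
);
```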
Above you see the mapping that does the work for you. In the first step, a so-called pre-mapping process, Warehouse Builder executes the procedure we created to predict whether a customer is a high-value customer. That procedure fills the PREDICT_RESULT table, which we join back to our initial customer table CUSTOMER_AGE. The result is loaded into a target table called PREDICTED_CUSTOMERS.
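The SQL that this mapping boils down to looks something like the following sketch (the PREDICT_RESULT column names follow the format described above; the layout of the target table is an assumption):

```sql
INSERT INTO predicted_customers
SELECT c.*,
       p.prediction,        -- the predicted high-value flag
       p.probability        -- confidence of the prediction
FROM   customer_age   c
JOIN   predict_result p ON p.cust_id = c.cust_id;
```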
Applying data mining in ETL is really that simple. Now you may say: didn’t you just show me an even cooler model than the supermodels you described? Not really; the above is more of a data mining model for dummies. The process is fully automated, with all the normal inputs to the model defaulted to the most common values. To gain more control over this type of process, you want to move to the next stage of data mining and build the supermodel in Oracle Data Miner.
Back to Supermodels
So now that you have a basic understanding of some useful data mining algorithms and how to easily use them in your ETL to augment the data, let’s return to our favorite topic: supermodels.
For certain algorithms, like decision trees, you as an ETL developer do not need to recode values into an expected format. For others, like support vector machines, you will need to do some recoding. Now, did you know that? Do you want to know? Most likely you do not, and that is where the supermodels remove almost any barrier to the usage of the models.
Leveraging the supermodel, feeding data into it becomes a purely logical activity. You consolidate and find the correct data, then give that data to the model. The model, if required, will do the data preparation, and you get the correct results. That sounds pretty super to me.
Now, this was first published as an article in the ODTUG magazine, which brings me to the ODTUG conference. Sign up for cool stuff like this article! Last but not least, Keith Laker actually wrote some of the original work for this, so he is blogging too...
For more on data mining, look here.