Oracle Data Mining a Star Schema: Telco Churn Case Study
By Charlie Berger-Oracle on Dec 09, 2010
There is a complete and detailed Telco Churn case study "How to" Blog Series just posted by Ari Mozes, ODM Dev. Manager. In it, Ari provides detailed guidance in how to leverage various strengths of Oracle Data Mining including the ability to:
- mine Star Schemas and join tables and views together to obtain a complete 360 degree view of a customer
- combine transactional data e.g. call record detail (CDR) data, etc.
- define complex data transformation, model build and model deploy analytical methodologies inside the Database
His blog is posted in a multi-part series. Below are some opening excerpts for the first 3 blog entries. This is an excellent resource for any novice to skilled data miner who wants to gain competitive advantage by mining their data inside the Oracle Database.
Many thanks Ari!
Churn is a critical problem in the telecommunications industry, and companies go to great lengths to reduce the churn of their customer base. One case study1 describes a telecommunications scenario involving understanding, and identification of, churn, where the underlying data is present in a star schema. That case study is a good example for demonstrating just how natural it is for Oracle Data Mining to analyze a star schema, so it will be used as the basis for this series of posts.
1) Handling missing values for call data records
The CDR_T table records the number of phone minutes used by a customer per month and per call type (tariff). For example, the table may contain one record corresponding to the number of peak (call type) minutes in January for a specific customer, and another record associated with international calls in March for the same customer. This table is likely to be fairly dense (most type-month combinations for a given customer will be present) due to the coarse level of aggregation, but there may be some missing values. Missing entries may occur for a number of reasons: the customer made no calls of a particular type in a particular month, the customer switched providers during the timeframe, or perhaps there is a data entry problem. In the first situation, the correct interpretation of a missing entry would be to assume that the number of minutes for the type-month combination is zero. In the other situations, it is not appropriate to assume zero, but rather derive some representative value to replace the missing entries. The referenced case study takes the latter approach. The data is segmented by customer and call type, and within a given customer-call type combination, an average number of minutes is computed and used as a replacement value.
In SQL, we need to generate additional rows for the missing entries and populate those rows with appropriate values. To generate the missing rows, Oracle's partition outer join feature is a perfect fit.
from cdr_t cdr partition by (cust_id) right outer join
(select distinct tariff, month from cdr_t) cdre
on (cdr.month = cdre.month and cdr.tariff = cdre.tariff);
The case study suggests that separate models be built for different customer segments (high, medium, low, and very low value customer groups). To reduce the data to a single segment, a filter can be applied:
select * from churn_prep where value_band = 'HIGH';
It is simple to take a quick look at the predictive aspects of the data on a univariate basis. While this does not capture the more complex multi-variate effects as would occur with the full-blown data mining algorithms, it can give a quick feel as to the predictive aspects of the data as well as validate the data preparation steps. Oracle Data Mining includes a predictive analytics package which enables quick analysis.
select * from expl_churn_tab where rank <= 5 order by rank;
-------------------- ----------------- ----------------- ----------
LOS_BAND .069167052 1
MINS_PER_TARIFF_MON PEAK-5 .034881648 2
REV_PER_MON REV-5 .034527798 3
DROPPED_CALLS .028110322 4
MINS_PER_TARIFF_MON PEAK-4 .024698149 5
From the above results, it is clear that some predictors do contain information to help identify churn (explanatory value > 0). The strongest uni-variate predictor of churn appears to be the customer's (binned) length of service. The second strongest churn indicator appears to be the number of peak minutes used in the most recent month. The subname column contains the interior piece of the DM_NESTED_NUMERICALS column described in the previous post. By using the object relational approach, many related predictors are included within a single top-level column.