By Charlie Berger, Advanced Analytics-Oracle on Jan 18, 2010
Here is a quick and simple application for fraud and anomaly detection. To replicate this on your own computer, download and install the Oracle Database 11g Release 1 or 2. (See http://www.oracle.com/technology/products/bi/odm/odm_education.html for more information). This small application uses the Automatic Data Preparation (ADP) feature that we added in Oracle Data Mining 11g. Click here to download the CLAIMS data table. [Download the .7z file and save it somwhere, unzip to a .csv file and then use SQL Developer data import wizard to import the claims.csv file into a table in the Oracle Database.]
First, we instantiate the ODM settings table to override the defaults. The default value for Classification data mining function is to use our Naive Bayes algorithm, but since this is a different problem, looking for anomalous records amongst a larger data population, we want to change that to SUPPORT_VECTOR_MACHINES. Also, as the 1-Class SVM does not rely on a Target field, we have to change that parameter to "null". See http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28129/anomalies.htm for detailed Documentation on ODM's anomaly detection.
drop table CLAIMS_SET;
create table CLAIMS_SET (setting_name varchar2(30), setting_value varchar2(4000));
insert into CLAIMS_SET values ('ALGO_NAME','ALGO_SUPPORT_VECTOR_MACHINES');
insert into CLAIMS_SET values ('PREP_AUTO','ON');
Then, we run the dbms_data_mining.create_model function and let the in-database Oracle Data Mining algorithm run through the data, find patterns and relationships within the CLAIMS data, and infer a CLAIMS data mining model from the data.
'CLAIMS', 'POLICYNUMBER', null, 'CLAIMS_SET');
After that, we can use the CLAIMS data mining model to "score" all customer auto insurance policies, sort them by our prediction_probability and select the top 5 most unusual claims.
-- Top 5 most suspicious fraud policy holder claims
select * from
(select POLICYNUMBER, round(prob_fraud*100,2) percent_fraud,
rank() over (order by prob_fraud desc) rnk from
(select POLICYNUMBER, prediction_probability(CLAIMSMODEL, '0' using *) prob_fraud
where PASTNUMBEROFCLAIMS in ('2 to 4', 'more than 4')))
where rnk <= 5
order by percent_fraud desc;