Wednesday Feb 26, 2014

How to generate training and test datasets using the SQL Query node in Data Miner

Overview

In Data Miner, the Classification and Regression Build nodes include a process that internally splits the input dataset into training and test datasets, which are then used by the model build and test processes within the nodes. This internal data split removes the need to split the data externally and then wire the split datasets into separate build and test processes, as required in other competitive products. However, there are times when you may want to perform an external data split. For example, you may want to generate a single pair of training and test datasets and reuse them in multiple workflows. The generation of training and test datasets can be done easily via the SQL Query node.

Stratified Split

The stratified split is used internally by the Classification Build node because this technique preserves the categorical target distribution in the resulting training and test datasets, which is important for the classification model build. The following SQL statements are essentially what the Classification Build node uses to produce the training and test datasets internally:

SQL statement for Training dataset

SELECT
    v1.*
FROM
    (
     -- randomly divide members of the population into subgroups based on target classes
     SELECT a.*,
            row_number() OVER (PARTITION BY {target column} ORDER BY ORA_HASH({case id column})) "_partition_caseid"
     FROM {input data} a
    ) v1,
    (
     -- get the count of subgroups based on target classes
     SELECT {target column},
            COUNT(*) "_partition_target_cnt"
     FROM {input data} GROUP BY {target column}
    ) v2
WHERE v1.{target column} = v2.{target column}
-- random sample subgroups based on target classes in respect to the sample size
AND ORA_HASH(v1."_partition_caseid", v2."_partition_target_cnt"-1, 0) <= (v2."_partition_target_cnt" * {percent of training dataset} / 100)


SQL statement for Test dataset

SELECT
    v1.*
FROM
    (
     -- randomly divide members of the population into subgroups based on target classes
     SELECT a.*,
            row_number() OVER (PARTITION BY {target column} ORDER BY ORA_HASH({case id column})) "_partition_caseid"
     FROM {input data} a
    ) v1,
    (
     -- get the count of subgroups based on target classes
     SELECT {target column},
            COUNT(*) "_partition_target_cnt"
     FROM {input data} GROUP BY {target column}
    ) v2
WHERE v1.{target column} = v2.{target column}
-- random sample subgroups based on target classes in respect to the sample size
AND ORA_HASH(v1."_partition_caseid", v2."_partition_target_cnt"-1, 0) > (v2."_partition_target_cnt" * {percent of training dataset} / 100)

The following describes the placeholders used in the SQL statements:

{target column} - the target column. It must be a categorical type.

{case id column} - the case id column. It must contain unique numbers that identify the rows.

{input data} - the input dataset.

{percent of training dataset} - the percentage of the input dataset to use as the training dataset. For example, if you want 60% of the input dataset in the training dataset, use the value 60. The test dataset will then contain the remaining 100% - 60% = 40% of the input dataset. The training and test datasets are mutually exclusive.
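To make the arithmetic concrete, suppose one target class has 1,000 rows and {percent of training dataset} is 60. For that class, "_partition_target_cnt" is 1000, row_number() assigns each row a "_partition_caseid" between 1 and 1000, and ORA_HASH maps that value to a bucket between 0 and 999. Rows whose bucket is at most 1000 * 60 / 100 = 600 fall into the training dataset (roughly 60% of the class), and the remaining rows fall into the test dataset; the same proportion is applied to every target class, which is what preserves the target distribution.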

Random Split

The random split is used internally by the Regression Build node because the target is usually a numerical type. The following SQL statements are essentially what the Regression Build node uses to produce the training and test datasets:

SQL statement for Training dataset

SELECT
    v1.*
FROM
    {input data} v1
WHERE ORA_HASH({case id column}, 99, 0) <= {percent of training dataset}

SQL statement for Test dataset

SELECT
    v1.*
FROM
    {input data} v1
WHERE ORA_HASH({case id column}, 99, 0) > {percent of training dataset}

The following describes the placeholders used in the SQL statements:

{case id column} - the case id column. It must contain unique numbers that identify the rows.

{input data} - the input dataset.

{percent of training dataset} - the percentage of the input dataset to use as the training dataset. For example, if you want 60% of the input dataset in the training dataset, use the value 60. The test dataset will then contain the remaining 100% - 60% = 40% of the input dataset. The training and test datasets are mutually exclusive.
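For instance, assuming the same demo data used later in this post (input table INSUR_CUST_LTV_SAMPLE, case id CUSTOMER_ID) and a 60% training split, the random split statement for the training dataset would look like this sketch:

-- training dataset: roughly 60% of the rows, selected by hashing the case id
SELECT
    v1.*
FROM
    "INSUR_CUST_LTV_SAMPLE" v1
WHERE ORA_HASH("CUSTOMER_ID", 99, 0) <= 60

The test dataset uses the same statement with the final comparison changed to > 60, so the two results remain mutually exclusive.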

Use the SQL Query node to create training and test datasets

Assume you want to create the training and test datasets out of the demo INSUR_CUST_LTV_SAMPLE dataset using the stratified split technique. You can create the following workflow, which uses SQL Query nodes to execute the above split SQL statements and generate the datasets, and then uses Create Table nodes to persist the resulting datasets.

Assume the case id is CUSTOMER_ID, the target is BUY_INSURANCE, and the training dataset is 60% of the input dataset. You can enter the following SQL statement to create the training dataset in the “SQL Query Stratified Training” SQL Query node:

SELECT
    v1.*
FROM
    (
     -- randomly divide members of the population into subgroups based on target classes
     SELECT a.*,
            row_number() OVER (PARTITION BY "BUY_INSURANCE" ORDER BY ORA_HASH("CUSTOMER_ID")) "_partition_caseid"
     FROM "INSUR_CUST_LTV_SAMPLE_N$10009" a
    ) v1,
    (
     -- get the count of subgroups based on target classes
     SELECT "BUY_INSURANCE",
            COUNT(*) "_partition_target_cnt"
     FROM "INSUR_CUST_LTV_SAMPLE_N$10009" GROUP BY "BUY_INSURANCE"
    ) v2
WHERE v1."BUY_INSURANCE" = v2."BUY_INSURANCE"
-- random sample subgroups based on target classes in respect to the sample size
AND ORA_HASH(v1."_partition_caseid", v2."_partition_target_cnt"-1, 0) <= (v2."_partition_target_cnt" * 60 / 100)



Likewise, you can enter the following SQL statement to create the test dataset in the “SQL Query Stratified Test” SQL Query node:

SELECT
    v1.*
FROM
    (
     -- randomly divide members of the population into subgroups based on target classes
     SELECT a.*,
            row_number() OVER (PARTITION BY "BUY_INSURANCE" ORDER BY ORA_HASH("CUSTOMER_ID")) "_partition_caseid"
     FROM "INSUR_CUST_LTV_SAMPLE_N$10009" a
    ) v1,
    (
     -- get the count of subgroups based on target classes
     SELECT "BUY_INSURANCE",
            COUNT(*) "_partition_target_cnt"
     FROM "INSUR_CUST_LTV_SAMPLE_N$10009" GROUP BY "BUY_INSURANCE"
    ) v2
WHERE v1."BUY_INSURANCE" = v2."BUY_INSURANCE"
-- random sample subgroups based on target classes in respect to the sample size
AND ORA_HASH(v1."_partition_caseid", v2."_partition_target_cnt"-1, 0) > (v2."_partition_target_cnt" * 60 / 100)

Now run the workflow to create the training and test datasets. You can find the table names of the persisted datasets in the associated Create Table nodes.
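As a quick sanity check, you can compare the target class distributions of the two persisted tables to confirm that the stratified split preserved them. This is only a sketch; the table names TRAIN_DATA and TEST_DATA below are placeholders for whatever names your Create Table nodes actually generated.

-- compare BUY_INSURANCE class counts and percentages in the persisted tables
SELECT 'TRAIN' src, "BUY_INSURANCE", COUNT(*) cnt,
       ROUND(RATIO_TO_REPORT(COUNT(*)) OVER () * 100, 1) pct
FROM "TRAIN_DATA"
GROUP BY "BUY_INSURANCE"
UNION ALL
SELECT 'TEST', "BUY_INSURANCE", COUNT(*),
       ROUND(RATIO_TO_REPORT(COUNT(*)) OVER () * 100, 1)
FROM "TEST_DATA"
GROUP BY "BUY_INSURANCE"
ORDER BY 1, 2

If the split worked as expected, the pct values for each BUY_INSURANCE class should be nearly identical in the two tables.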


Conclusion

This blog shows how easy it is to create training and test datasets using the stratified split SQL statements via SQL Query nodes. Similarly, you can generate the training and test datasets using the random split technique by replacing the SQL statements in the SQL Query nodes in the above workflow with the random split SQL statements. If a large dataset (tens of millions of rows) is used in multiple model build nodes, it may be a good idea to split the data ahead of time to optimize the overall processing time (avoiding multiple internal data splits inside the model build nodes).

Friday Feb 14, 2014

dunnhumby Accelerates Complex Segmentation Queries from Weeks to Minutes—Gains Competitive Advantage

See original story on http://www.oracle.com/us/corporate/customers/customersearch/dunnhumby-1-exadata-ss-2137635.html


dunnhumby is the world’s leading customer-science company. It analyzes customer data and applies insights from more than 400 million customers across the globe to create better customer experiences and build loyalty. With its unique analytical capabilities, dunnhumby helps retailers better serve customers, create a competitive advantage, and enjoy sustained growth.


A word from dunnhumby Ltd.

“Oracle Exadata Database Machine is helping us to transform our business and improve our competitive edge. We can now complete queries that took weeks in just minutes—driving new product offerings, more competitive bids, and more accurate analyses based on 100% of data instead of just a sampling.” – Chris Wones, Director of Data Solutions, dunnhumby USA

Challenges

  • Expand breadth of services to maintain a competitive advantage in the customer-science industry
  • Provide clients, including major retail organizations in the United Kingdom and North America, with expanded historical and real-time insight into customer behavior, buying tendencies, and response to promotional campaigns and product offerings
  • Ensure competitive pricing for the company’s customer-analysis services while delivering added value to a growing client base
  • Analyze growing volumes of data rapidly and comprehensively
  • Ensure the security of sensitive information, including protected personal information to reduce risk and support compliance
  • Protect against data loss and reduce the backup and recovery window, as data is crucial to the competitive advantage and success of the business
  • Optimize IT investment and performance across the technology-intensive business
  • Reduce licensing and maintenance costs of previous analytical and data warehouse software

Solutions

  • Deployed Oracle Exadata Database Machine and accelerated queries that previously took two-to-three weeks to just minutes, enabling the company to bid on more complex, custom analyses and gain a competitive advantage
  • Achieved 4x to 30x more data compression using Hybrid Columnar Compression and Oracle Advanced Compression across data sets—reducing storage requirements, increasing analysis and backup performance, and optimizing IT investment
  • Consolidated data marts securely with data warehouse schemas on Oracle Exadata, enabling significantly faster presummarization of large volumes of data
  • Accelerated analytic capabilities to near real time using Oracle Advanced Analytics and third-party tools, enabling analysis of unstructured big data from emerging sources, like smart phones
  • Accelerated segmentation and customer-loyalty analysis from one week to just four hours—enabling the company to deliver more timely information as well as finer-grained analysis
  • Improved analysts’ productivity and focus as they can now run queries and complete analysis without having to wait hours or days for a query to process
  • Generated more accurate business insights and marketing recommendations with the ability to analyze 100% of data—including years of historical data—instead of just a small sample
  • Improved accuracy of marketing recommendations by analyzing larger sample sizes and predicting the market’s reception to new product ideas and strategies
  • Improved secure processing and management of 60 TB of data, growing at a rate of 500 million customer records a week, including information from clients’ customer loyalty programs 
  • Ensured data security and compliance with requirements for safeguarding protected personal information and reduced risk with Oracle Advanced Security, Oracle Directory Services Plus, and Oracle Enterprise Single Sign-On Suite Plus
  • Gained high-performance identity virtualization, storage, and synchronization services that meet the needs of the company’s high-volume environment
  • Ensured performance scalability even with concurrent queries with Oracle Exadata, which demonstrated higher throughput than competing solutions under such conditions
  • Deployed integrated backup and recovery using Oracle’s Sun ZFS Backup Appliance—to support high performance and continuous availability and act as a staging area for both inbound and outbound extract, transform, and load processes

Why Oracle

dunnhumby considered Teradata, IBM Netezza, and other solutions, and selected Oracle Exadata for its ability to sustain high performance and throughput even during concurrent queries. “We needed to see how the system performed when scaled to multiple concurrent queries, and Oracle Exadata’s throughput was much higher than competitive offerings,” said Chris Wones, director of data solutions, dunnhumby, USA.

Implementation Process

dunnhumby began its Oracle Exadata implementation in September 2012 and went live in April 2013. It has installed four Oracle Exadata machine units in the United States and four in the United Kingdom. The company is using three of the four machines in each country as production environments and one machine in each country for development and testing. dunnhumby runs an active-active environment across its Oracle Exadata clusters to ensure high availability.

Monday Feb 03, 2014

How to generate Scatterplot Matrices using R script in Data Miner

Data Miner provides the Explorer node, which produces descriptive statistics and histograms that allow an analyst to examine input data columns individually. Often an analyst is also interested in the relationships among the data columns, so that he can choose the columns that are closely correlated with the target column for model build purposes. To examine relationships among data columns, he can create scatter plots using the Graph node.

For example, an analyst may want to build a regression model that predicts customer LTV (long term value) using the INSUR_CUST_LTV_SAMPLE demo data. Before building the model, he can create the following workflow with a Graph node to examine the relationships between the data columns of interest and the LTV target column.

In the Graph node editor, create a scatter plot of a data column of interest (X axis) against the LTV target column (Y axis). For the demo, let’s create three scatter plots using these data columns: HOUSE_OWNERSHIP, N_MORTGAGES, and MORTGAGE_AMOUNT.

Here are the scatter plots generated by the Graph node. As you can see, HOUSE_OWNERSHIP and N_MORTGAGES are quite positively correlated with the LTV target column. However, MORTGAGE_AMOUNT seems less correlated with the LTV target column.

The problem with the above approach is that it is laborious to create scatter plots one by one, and you cannot examine the relationships among those data columns themselves. To solve this, we can create a scatterplot matrix graph as follows:

This is a 4 x 4 scatterplot matrix of the data columns LTV, HOUSE_OWNERSHIP, N_MORTGAGES, and MORTGAGE_AMOUNT. In the top row, you can examine the relationships of HOUSE_OWNERSHIP, N_MORTGAGES, and MORTGAGE_AMOUNT against the LTV target column. In the second row, you can examine the relationships of LTV, N_MORTGAGES, and MORTGAGE_AMOUNT against the HOUSE_OWNERSHIP column. In the third and fourth rows, you can examine the relationships of the other columns against N_MORTGAGES and MORTGAGE_AMOUNT, respectively.

To generate this scatterplot matrix, we invoke the readily available R script RQG$pairs in Oracle R Enterprise (via the SQL Query node). Please refer to http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/index.html?ssSourceSiteId=ocomen for Oracle R Enterprise installation.

Let’s create the following workflow with a SQL Query node to invoke the R script. Note: for a large dataset, a Sample node may be needed to sample the data down (e.g., to 1,000 rows) before it is used for charting.
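If you prefer to do the sampling directly in SQL rather than with a Sample node, a minimal sketch (assuming an approximate 10% sample is small enough for charting; the seed value is arbitrary and only makes the sample repeatable) uses Oracle’s SAMPLE clause:

-- draw an approximate, repeatable 10% random sample of the demo table for charting
SELECT *
FROM "INSUR_CUST_LTV_SAMPLE" SAMPLE (10) SEED (42)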

Enter the following SQL statement in the SQL Query editor. rqTableEval is an R SQL table function that allows the user to invoke an R script from the SQL side. The first SELECT statement (cursor) within the function specifies the input data (LTV, HOUSE_OWNERSHIP, N_MORTGAGES, and MORTGAGE_AMOUNT). The second SELECT statement specifies the optional parameters for the R script; here we define the graph title “Scatterplot Matrices”. The output of the function is an XML document with the graph data embedded in it.

SELECT VALUE FROM TABLE
(
  rqTableEval(
    cursor(select "INSUR_CUST_LTV_SAMPLE_N$10001"."LTV",
                  "INSUR_CUST_LTV_SAMPLE_N$10001"."HOUSE_OWNERSHIP",
                  "INSUR_CUST_LTV_SAMPLE_N$10001"."N_MORTGAGES",
                  "INSUR_CUST_LTV_SAMPLE_N$10001"."MORTGAGE_AMOUNT"
           from "INSUR_CUST_LTV_SAMPLE_N$10001"),            -- Input Cursor
    cursor(select 'Scatterplot Matrices' as MAIN from DUAL),  -- Param Cursor
    'XML',                                                    -- Output Definition
    'RQG$pairs'                                               -- R Script
  )
)

You can see which R scripts are available by default in the R Scripts tab. This tab is visible only when an Oracle R Enterprise installation is detected.

Click the button in the toolbar to invoke the R script to produce the Scatterplot matrix below.

You can copy the scatterplot matrix image to the clipboard or save it to an image file (PNG) for reporting purposes. To do so, right-click on the graph to bring up the pop-up menu below.

The Scatterplot matrix is also available in the Data Viewer of the SQL Query node. To open the Data Viewer, select the “View Data” item in the pop-up menu of the node.

The returned XML data is displayed in the Data Viewer as shown below. To view the scatterplot matrix embedded in the data, click on the XML data to bring up the icon at the far right of the cell, and then click on the icon to open the viewer.

