How to generate training and test dataset using SQL Query node in Data Miner

Overview

In Data Miner, the Classification and Regression Build nodes include a process that splits the input dataset into training and test dataset internally, which are then used by the model build and test processes within the nodes. This internal data split feature alleviates user from performing external data split, and then tie the split dataset into a build and test process separately as found in other competitive products. However, there are times user may want to perform an external data split. For example, user may want to generate a single training and test dataset, and reuse them in multiple workflows. The generation of training and test dataset can be done easily via the SQL Query node.

Stratified Split

The stratified split is used internally by the Classification Build node, because this technique can preserve the categorical target distribution in the resulting training and test dataset, which is important for the classification model build. The following shows the SQL statements that are essentially used by the Classification Build node to produce the training and test dataset internally:

SQL statement for Training dataset

SELECT

v1.*

FROM

(

-- randomly divide members of the population into subgroups based on target classes

SELECT a.*,

row_number() OVER (partition by {target column} ORDER BY ORA_HASH({case id column})) "_partition_caseid"

FROM {input data} a

) v1,

(

-- get the count of subgroups based on target classes

SELECT {target column},

COUNT(*) "_partition_target_cnt"

FROM {input data} GROUP BY {target column}

) v2

WHERE v1. {target column} = v2. {target column}

-- random sample subgroups based on target classes in respect to the sample size

AND ORA_HASH(v1."_partition_caseid", v2."_partition_target_cnt"-1, 0) <= (v2."_partition_target_cnt" * {percent of training dataset} / 100)


SQL statement for Test dataset

SELECT

v1.*

FROM

(

-- randomly divide members of the population into subgroups based on target classes

SELECT a.*,

row_number() OVER (partition by {target column} ORDER BY ORA_HASH({case id column})) "_partition_caseid"

FROM {input data} a

) v1,

(

-- get the count of subgroups based on target classes

SELECT {target column},

COUNT(*) "_partition_target_cnt"

FROM {input data} GROUP BY {target column}

) v2

WHERE v1. {target column} = v2. {target column}

-- random sample subgroups based on target classes in respect to the sample size

AND ORA_HASH(v1."_partition_caseid", v2."_partition_target_cnt"-1, 0) > (v2."_partition_target_cnt" * {percent of training dataset} / 100)

The followings describe the placeholders used in the SQL statements:

{target column} - target column. It must be categorical type.

{case id column} - case id column. It must contain unique numbers that identify the rows.

{input data} - input data set.

{percent of training dataset} - percent of training dataset. For example, if you want to split 60% of input dataset into training dataset, use the value 60. The test dataset will contain 100%-60% = 40% of the input dataset. The training and test dataset are mutually exclusive.

Random Split

The random split is used internally by the Regression Build node because the target is usually numerical type. The following shows the SQL statements that are essentially used by the Regression Build node to produce the training and test dataset:

SQL statement for Training dataset

SELECT

v1.*

FROM

{input data} v1

WHERE ORA_HASH({case id column}, 99, 0) <= {percent of training dataset}

SQL statement for Test dataset

SELECT

    v1.*

FROM

{input data} v1

WHERE ORA_HASH({case id column}, 99, 0) > {percent of training dataset}

The followings describe the placeholders used in the SQL statements:

{case id column} - case id column. It must contain unique numbers that identify the rows.

{input data} - input data set.

{percent of training dataset} - percent of training dataset. For example, if you want to split 60% of input dataset into training dataset, use the value 60. The test dataset will contain 100%-60% = 40% of the input dataset. The training and test dataset are mutually exclusive.

Use SQL Query node to create training and test dataset

Assume you want to create the training and test dataset out of the demo INSUR_CUST_LTV_SAMPLE dataset using the stratified split technique; you can create the following workflow to utilize the SQL Query nodes to execute the above split SQL statements to generate the dataset, and then use the Create Table nodes to persist the resulting dataset.

Assume the case id is CUSTOMER_ID, target is BUY_INSURANCE, and the training dataset is 60% of the input dataset. You can enter the following SQL statement to create the training dataset in the “SQL Query Stratified Training” SQL Query node:

SELECT

v1.*

FROM

(

-- randomly divide members of the population into subgroups based on target classes

SELECT a.*,

row_number() OVER (partition by "BUY_INSURANCE" ORDER BY ORA_HASH("CUSTOMER_ID")) "_partition_caseid"

FROM "INSUR_CUST_LTV_SAMPLE_N$10009" a

) v1,

(

-- get the count of subgroups based on target classes

SELECT "BUY_INSURANCE",

COUNT(*) "_partition_target_cnt"

FROM "INSUR_CUST_LTV_SAMPLE_N$10009" GROUP BY "BUY_INSURANCE"

) v2

WHERE v1."BUY_INSURANCE" = v2."BUY_INSURANCE"

-- random sample subgroups based on target classes in respect to the sample size

AND ORA_HASH(v1."_partition_caseid", v2."_partition_target_cnt"-1, 0) <= (v2."_partition_target_cnt" * 60 / 100)



Likewise, you can enter the following SQL statement to create the test dataset in the “SQL Query Stratified Test” SQL Query node:

SELECT

v1.*

FROM

(

-- randomly divide members of the population into subgroups based on target classes

SELECT a.*,

row_number() OVER (partition by "BUY_INSURANCE" ORDER BY ORA_HASH("CUSTOMER_ID")) "_partition_caseid"

FROM "INSUR_CUST_LTV_SAMPLE_N$10009" a

) v1,

(

-- get the count of subgroups based on target classes

SELECT "BUY_INSURANCE",

COUNT(*) "_partition_target_cnt"

FROM "INSUR_CUST_LTV_SAMPLE_N$10009" GROUP BY "BUY_INSURANCE"

) v2

WHERE v1."BUY_INSURANCE" = v2."BUY_INSURANCE"

-- random sample subgroups based on target classes in respect to the sample size

AND ORA_HASH(v1."_partition_caseid", v2."_partition_target_cnt"-1, 0) > (v2."_partition_target_cnt" * 60 / 100)

Now run the workflow to create the training and test dataset. You can find the table names of the persisted dataset in the associated Create Table nodes.


Conclusion

This blog shows how easily to create the training and test dataset using the stratified split SQL statements via the SQL Query nodes. Similarly, you can generate the training and test dataset using the random split technique by replacing SQL statements with the random split SQL statements in the SQL Query nodes in the above workflow. If a large dataset (tens of millions of rows) is used in multiple model build nodes, it may be a good idea to split the data ahead of time to optimize the overall processing time (avoid multiple internal data splits inside the model build nodes).

Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

Everything about Oracle Data Mining, a component of the Oracle Advanced Analytics Option - News, Technical Information, Opinions, Tips & Tricks. All in One Place

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today