
  • February 26, 2014

How to generate training and test dataset using SQL Query node in Data Miner

Overview

In Data Miner, the Classification and Regression Build nodes
include a process that internally splits the input dataset into training and
test datasets, which are then used by the model build and test processes within
the nodes. This internal data split feature
relieves the user from performing an external data split and then tying the
split datasets into separate build and test processes, as required in other
competitive products. However, there are times a user
may want to perform an external data split. For example, a user may want to generate a single training and test dataset
and reuse them in multiple workflows. The training and test datasets can be generated easily via the
SQL Query node.

Stratified Split

The stratified split is used internally by the Classification
Build node because this technique preserves the categorical target
distribution in the resulting training and test datasets, which is important for
the classification model build. The
following shows the SQL statements that the Classification
Build node essentially uses to produce the training and test datasets internally:

SQL statement for Training dataset

SELECT v1.*
FROM
(
    -- randomly divide members of the population into subgroups based on target classes
    SELECT a.*,
           row_number() OVER (PARTITION BY {target column} ORDER BY ORA_HASH({case id column})) "_partition_caseid"
    FROM {input data} a
) v1,
(
    -- get the count of subgroups based on target classes
    SELECT {target column},
           COUNT(*) "_partition_target_cnt"
    FROM {input data}
    GROUP BY {target column}
) v2
WHERE v1.{target column} = v2.{target column}
-- random sample subgroups based on target classes with respect to the sample size
AND ORA_HASH(v1."_partition_caseid", v2."_partition_target_cnt"-1, 0) <= (v2."_partition_target_cnt" * {percent of training dataset} / 100)


SQL statement for Test dataset

SELECT v1.*
FROM
(
    -- randomly divide members of the population into subgroups based on target classes
    SELECT a.*,
           row_number() OVER (PARTITION BY {target column} ORDER BY ORA_HASH({case id column})) "_partition_caseid"
    FROM {input data} a
) v1,
(
    -- get the count of subgroups based on target classes
    SELECT {target column},
           COUNT(*) "_partition_target_cnt"
    FROM {input data}
    GROUP BY {target column}
) v2
WHERE v1.{target column} = v2.{target column}
-- random sample subgroups based on target classes with respect to the sample size
AND ORA_HASH(v1."_partition_caseid", v2."_partition_target_cnt"-1, 0) > (v2."_partition_target_cnt" * {percent of training dataset} / 100)

The following describes the placeholders used in the SQL
statements:

{target column} - the target column. It
must be a categorical type.

{case id column} - the case id column. It
must contain unique values that identify the rows.

{input data} - the input dataset.

{percent of training dataset} - the percentage of the input dataset to use as the training
dataset. For example, to place 60% of the input
dataset in the training dataset, use the value 60. The test dataset will then contain the remaining
100% - 60% = 40% of the input dataset. The training and
test datasets are mutually exclusive.

Random Split

The random split is used internally by the Regression Build
node because the target is usually a numerical type. The following shows the SQL statements that
the Regression Build node essentially uses to produce the training and test
datasets:

SQL statement for Training dataset

SELECT v1.*
FROM {input data} v1
WHERE ORA_HASH({case id column}, 99, 0) <= {percent of training dataset}

SQL statement for Test dataset

SELECT v1.*
FROM {input data} v1
WHERE ORA_HASH({case id column}, 99, 0) > {percent of training dataset}

The following describes the placeholders used in the SQL
statements:

{case id column} - the case id column. It
must contain unique values that identify the rows.

{input data} - the input dataset.

{percent of training dataset} - the percentage of the input dataset to use as the training
dataset. For example, to place 60% of the input
dataset in the training dataset, use the value 60. The test dataset will then contain the remaining
100% - 60% = 40% of the input dataset. The training and
test datasets are mutually exclusive.
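For instance, a random 60/40 split might look like the following. This is a sketch that assumes the demo INSUR_CUST_LTV_SAMPLE_N$10009 table and the CUSTOMER_ID case id (introduced in the stratified example below); each statement would go in its own SQL Query node:

```sql
-- Training dataset: roughly 60% of the rows
-- (sketch; table name and case id assumed from the demo data)
SELECT v1.*
FROM "INSUR_CUST_LTV_SAMPLE_N$10009" v1
WHERE ORA_HASH("CUSTOMER_ID", 99, 0) <= 60

-- Test dataset: the complementary rows
SELECT v1.*
FROM "INSUR_CUST_LTV_SAMPLE_N$10009" v1
WHERE ORA_HASH("CUSTOMER_ID", 99, 0) > 60
```

Because ORA_HASH is deterministic for a given input, the two queries always partition the rows into the same mutually exclusive subsets.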

Use the SQL Query node to create the training and test datasets

Assume you want to create the training and test datasets from
the demo INSUR_CUST_LTV_SAMPLE dataset using the stratified split technique.
You can create the following workflow, which uses SQL Query nodes to execute
the above split SQL statements and Create Table nodes to persist the resulting
datasets.

Assume the case id is CUSTOMER_ID, the target is BUY_INSURANCE, and
the training dataset is 60% of the input dataset. You can enter the following SQL statement to
create the training dataset in the “SQL Query Stratified Training” SQL Query
node:

SELECT v1.*
FROM
(
    -- randomly divide members of the population into subgroups based on target classes
    SELECT a.*,
           row_number() OVER (PARTITION BY "BUY_INSURANCE" ORDER BY ORA_HASH("CUSTOMER_ID")) "_partition_caseid"
    FROM "INSUR_CUST_LTV_SAMPLE_N$10009" a
) v1,
(
    -- get the count of subgroups based on target classes
    SELECT "BUY_INSURANCE",
           COUNT(*) "_partition_target_cnt"
    FROM "INSUR_CUST_LTV_SAMPLE_N$10009"
    GROUP BY "BUY_INSURANCE"
) v2
WHERE v1."BUY_INSURANCE" = v2."BUY_INSURANCE"
-- random sample subgroups based on target classes with respect to the sample size
AND ORA_HASH(v1."_partition_caseid", v2."_partition_target_cnt"-1, 0) <= (v2."_partition_target_cnt" * 60 / 100)



Likewise, you can enter the following SQL statement to
create the test dataset in the “SQL Query Stratified Test” SQL Query node:

SELECT v1.*
FROM
(
    -- randomly divide members of the population into subgroups based on target classes
    SELECT a.*,
           row_number() OVER (PARTITION BY "BUY_INSURANCE" ORDER BY ORA_HASH("CUSTOMER_ID")) "_partition_caseid"
    FROM "INSUR_CUST_LTV_SAMPLE_N$10009" a
) v1,
(
    -- get the count of subgroups based on target classes
    SELECT "BUY_INSURANCE",
           COUNT(*) "_partition_target_cnt"
    FROM "INSUR_CUST_LTV_SAMPLE_N$10009"
    GROUP BY "BUY_INSURANCE"
) v2
WHERE v1."BUY_INSURANCE" = v2."BUY_INSURANCE"
-- random sample subgroups based on target classes with respect to the sample size
AND ORA_HASH(v1."_partition_caseid", v2."_partition_target_cnt"-1, 0) > (v2."_partition_target_cnt" * 60 / 100)

Now run the workflow to create the training and test datasets. You can find the table names of the persisted
datasets in the associated Create Table nodes.
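To sanity-check that the stratified split preserved the target distribution, you can compare the BUY_INSURANCE class counts of the two persisted tables. The table names TRAINING_TABLE and TEST_TABLE below are hypothetical placeholders; substitute the names reported by your Create Table nodes:

```sql
-- Compare class counts across the two persisted tables
-- (TRAINING_TABLE / TEST_TABLE are placeholders, not real names)
SELECT 'TRAINING' dataset_name, "BUY_INSURANCE", COUNT(*) row_cnt
FROM TRAINING_TABLE
GROUP BY "BUY_INSURANCE"
UNION ALL
SELECT 'TEST' dataset_name, "BUY_INSURANCE", COUNT(*) row_cnt
FROM TEST_TABLE
GROUP BY "BUY_INSURANCE"
ORDER BY 2, 1
```

For a 60/40 stratified split, the per-class counts in the two tables should be roughly in a 60:40 ratio for every target class.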


Conclusion

This blog shows how easy it is to create the training and test
datasets using the stratified split SQL statements in the SQL Query nodes. Similarly, you can generate the training and
test datasets using the random split technique by substituting the random split
SQL statements in the SQL Query nodes in the above workflow. If a large dataset (tens of millions of rows)
is used in multiple model build nodes, it may be a good idea to split the data ahead
of time to optimize the overall processing time (avoiding multiple internal data
splits inside the model build nodes).
