To sample or not to sample... Part 4

This post continues "To sample or not to sample..." Part 1, Part 2, and Part 3.  In Part 1, we looked at the general motivation for sampling. In Part 2, we looked at simple random sampling without replacement, the most common sampling technique. In Part 3, we looked at stratified sampling, which helps to ensure a representative sample. In this post, we focus on simple random sampling with replacement, which has the following characteristics:
  • records are selected at random, each with equal probability
  • each selected record is immediately available to be selected again in the same sample
  • a record can be selected multiple times
  • the population of records remains constant, i.e., none are excluded
One data mining technique that uses simple random sampling with replacement is bootstrap aggregating, also known as bagging. The technique involves combining the predictions of multiple models usually from the same algorithm within either classification or regression, e.g., classification decision trees. Each model is built from a different sample of the original dataset using simple random sampling with replacement. These samples are called bootstrap samples.

A model composed of multiple other models is often referred to as an ensemble model. For regression models, the predictions of each component model are averaged. For classification models, voting is used to select the prediction agreed upon by the majority of models. Bagging often increases model accuracy and stability by reducing variance and avoiding overfitting.  This, however,  is a topic for another post.

There are many possible ways to produce a random sample with replacement. One approach is to identify each row by its row number and then to use a hash function to select the records for our sample.  The following script samples 1200 records from the CUSTOMERS table. We first create a view appending the row_number column using the rownum pseudo column. Iterating for the number of records in our sample, we populate the table CUSTOMERS_MAP with the hash value, which is based on the iterator i and ranges from 1 to the number of records in CUSTOMERS. The final step is to create the sample view by joining the CUSTOMERS_MAP table to the CUSTOMERS_V table using the sample_number and row_number in the join condition.

CREATE VIEW CUSTOMERS_V AS
  select rownum row_number, o.* from CUSTOMERS o;

CREATE TABLE CUSTOMERS_MAP (sample_number NUMBER);

DECLARE
  v_row_count NUMBER;
  c_sample_size NUMBER := 1200;
BEGIN
  SELECT count(*) INTO v_row_count FROM CUSTOMERS;
  FOR i IN 1..c_sample_size+1 LOOP
    INSERT INTO CUSTOMERS_MAP (sample_number) VALUES (ORA_HASH(i, v_row_count, 12345));
  END LOOP;
  COMMIT;
END;

CREATE VIEW CUSTOMERS_SAMPLE_V AS
SELECT v.*
FROM CUSTOMERS_V v, CUSTOMERS_MAP m
WHERE v.row_number = m.sample_number;


To obtain a different sample, change the rows returned by modifying the seed value provided to ORA_HASH. Note that the sample view is dependent on the contents of CUSTOMERS_MAP. Materializing the sample view ensures it will not change if CUSTOMERS_MAP is reused or if the order of rows in the original table is not maintained.

To conveniently produce multiple samples, we can extend the CUSTOMERS_MAP table as follows:

CREATE TABLE CUSTOMERS_MAP (sample_number_1 NUMBER, sample_number_2 NUMBER, sample_number_3 NUMBER);

The INSERT statement populates each of these sample numbers using a different seed value for each ORA_HASH invocation:

INSERT INTO CUSTOMERS_MAP (sample_number_1, sample_number_2, sample_number_3)
VALUES (ORA_HASH(i, v_row_count, 12345),
ORA_HASH(i, v_row_count, 23456),ORA_HASH(i, v_row_count, 34567));

Lastly, the sample views can be created each using a different column from the CUSTOMERS_MAP table. Here is an example for sample 2:

CREATE VIEW CUSTOMERS_SAMPLE_2_V AS
SELECT v.*
FROM CUSTOMERS_V v, CUSTOMERS_MAP m
WHERE v.row_number = m.sample_number_2;


This extension avoids having to maintain multiple map tables or materializing each of the sample tables. Of course, as stated above, since the rows in the initial CUSTOMER table are not guaranteed to maintain the same order, either the CUSTOMERS_V view or each of the sample views must be materialized to guarantee reproducible results.
Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

Everything about Oracle Data Mining, a component of the Oracle Advanced Analytics Option - News, Technical Information, Opinions, Tips & Tricks. All in One Place

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today