In my previous post To sample or not to sample, we discussed how sampling can be used to scale down a large dataset for exploratory mining, as well
as to produce train and test datasets.
One type of sampling is called simple random sampling, which means that each record in the table has an equal probability of being selected. Simple random sampling can be with replacement, or without replacement. Without replacement means that each record will be selected at most once. With replacement means that selected records are returned to the population so they can be reselected.
Simple random sampling without replacement is applicable to most situations where the data is reasonably distributed, e.g, a binary target containing a reasonable split between the target values such as 50-50, 30-70. We consider two methods for simple random sampling without replacement: the SAMPLE SQL clause, and ORA_HASH.
SAMPLE SQL Clause
Oracle Database provides the SAMPLE clause that can be issued with a SELECT statement over a table. In the following query, we're randomly selecting
records from the CUSTOMERS table with a 20% probability.
This means that each record has a 20% chance of being selected. Perhaps contrary to expectations, this will normally not produce a result with 20% of the records from the CUSTOMERS table. This approximate sample size is quite adequate for most situations.
A variant of the SAMPLE clause is SAMPLE BLOCK, where each block of records has the same chance of being selected, 20% in our example. Since records are selected at the block level, this offers a performance improvement for large tables and should not adversely impact the randomness of the sample.
SAMPLE and SAMPLE BLOCK allow the sample percent to range from .000001 to, but not including, 100. An optional second parameter is the seed
value, used to help ensure repeatability between executions. The seed value can range from 0 to 4294967295.
SAMPLE BLOCK (20, 8621);
Repeatability is important if, e.g., you want to compare models or test results across model builds, where the only variation is model settings.
While SAMPLE is built into the SQL syntax, it has limitations when applied to views or complex joins. SAMPLE relies on the existence of primary keys. If the underlying tables have the necessary primary keys, you may be able to use a view, but if the primary keys are absent, the SAMPLE clause may not work.
To address this limitation, we can use a technique that uses ORA_HASH.
The ORA_HASH technique can be used with both tables and views, whether or not primary keys are specified. The Oracle Data Miner user interface uses this technique to support sampling of both tables and views.
The ORA_HASH function computes a hash value for a given expression. The following example produces a sample of 60% of the CUSTOMERS table.
WITH row_count AS (SELECT count(*) count FROM customers )
FROM customers c
ORA_HASH(CUST_ID,(SELECT count FROM ROW_COUNT)-1,12345) <
(SELECT count FROM ROW_COUNT) * 60 / 100
Note that the first argument to ORA_HASH is the expression on which to compute a hash value, in this case, the unique identifier CUST_ID. The second argument determines the maximum value returned by the hash function. The third argument is an optional seed value, similar to that used on the SAMPLE clause.
The WHERE clause specifies to select records where the hash value is less than 60% of the number of records. As with the SAMPLE clause, the number of records returned is likely not to be 60% of the total records, since each record has a 60% probability of being selected.
In my next post, we'll look at stratified sampling.