X

An Oracle blog about Oracle’s Machine Learning and Advanced Analytics

  • June 22, 2010

To sample or not to sample... Part 2

Mark Hornick
Director, Advanced Analytics and Machine Learning
In my previous post To sample or not to sample, we discussed how sampling can be
used to scale down a large dataset for exploratory mining, as well
as to produce train and test datasets.

One type of sampling is called simple random sampling, which means that each record in the table
has an equal probability of being selected. Simple random sampling can be with replacement, or without replacement. Without replacement
means that each record will be selected at most once.  With replacement means that selected records are returned to the population so they can be reselected.

Simple random
sampling without replacement is applicable to most situations where the
data is reasonably distributed, e.g, a binary target containing a reasonable split between the target values such as 50-50, 30-70. We consider two methods for simple random sampling without replacement: the SAMPLE SQL clause, and ORA_HASH.



SAMPLE SQL Clause



Oracle Database provides the SAMPLE clause that can be issued with a
SELECT statement over a table. In the following query, we're randomly selecting
records from the CUSTOMERS table with a 20% probability.



SELECT *

FROM customers

SAMPLE (20);




This means
that each record has a 20% chance of being selected. Perhaps contrary to expectations, this will
normally not produce a result with 20% of the records from
the CUSTOMERS table. This approximate sample size is quite adequate for most situations.



A variant of the SAMPLE clause is SAMPLE BLOCK, where each block of
records
has the same chance of being selected, 20% in our example. Since records are selected at the block level, this offers a performance improvement for
large tables and should not adversely impact the randomness of the sample.



SAMPLE and SAMPLE BLOCK allow the sample percent to range from .000001
to, but not including, 100. An optional second parameter is the seed
value, used to help ensure repeatability between executions. The seed
value can range from 0 to 4294967295.



SELECT *

FROM customers

SAMPLE BLOCK (20, 8621);




Repeatability is important if, e.g., you want to compare models or test
results across model builds, where the only variation is model settings.



While SAMPLE is built into the SQL syntax, it has limitations when
applied to views or complex joins. SAMPLE relies on the existence of primary keys. If the underlying tables have
the necessary primary keys, you may be able to use a view, but if the
primary keys are absent, the SAMPLE clause may not work.



To address this limitation, we can use a technique that uses ORA_HASH.



ORA_HASH


The ORA_HASH technique can be used with both tables and views, whether or not primary keys are specified. The Oracle Data Miner user interface uses this technique to support sampling of both tables and
views.

The ORA_HASH
function computes a hash value for a given expression. The following
example produces a sample of 60% of the CUSTOMERS table.



WITH row_count AS (SELECT count(*) count FROM customers )

SELECT c.*

FROM
customers c

WHERE
ORA_HASH(CUST_ID,(SELECT count FROM ROW_COUNT)-1,12345) <
     (SELECT count FROM ROW_COUNT) * 60 / 100


 


Note that the first argument to ORA_HASH is the expression on which to compute a
hash value, in this case, the unique identifier CUST_ID. The second
argument determines the maximum value returned by the hash function.
The third argument is an optional seed value, similar to that used on
the SAMPLE clause.



The WHERE clause specifies to select records where the hash value is
less than 60% of the number of records. As with the SAMPLE clause, the
number of records returned is likely not to be 60% of the total records, since each record has a 60% probability of being selected.



In my next post, we'll look at stratified sampling.

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha