Ideally, we would know the exact answer to every question. How many people support presidential candidate A vs. B? How many people suffer from H1N1 in a given state? Does this batch of manufactured widgets have any defective parts? Knowing exact answers is expensive in terms of time and money and, in most cases, is impractical if not impossible. Consider asking every person in a region for their candidate preference, testing every person with flu symptoms for H1N1 (assuming every person reported when they had flu symptoms), or destructively testing widgets to determine if they are "good" (leaving no product to sell).
Knowing exact answers, fortunately, isn't necessary or even useful in many situations. Understanding the direction of a trend or statistically significant results may be sufficient to answer the underlying question: who is likely to win the election, have we likely reached a critical threshold for flu, or is this batch of widgets good enough to ship? Statistics helps us answer these questions with a certain degree of confidence. In statistics, the focus is on how we collect data.
In machine learning, we focus on the use of data, that is, data that has already been collected. In some cases, we may have all the data (all purchases made by all customers); in others, the data may have been collected using sampling (voters, their demographics, and candidate choice).
Building machine learning models on all of your data can be expensive in terms of time and compute resources. Consider a company with 40 million customers. Do we need to mine all 40 million customers to get useful data mining models? The quality of models built on all data may be no better than models built on a relatively small, but representative sample. Determining how much is a reasonable amount of data involves experimentation.
When starting the model building process on large data sets, it is often more efficient to begin with a small sample, perhaps 1,000 to 10,000 cases (records), depending on the algorithm, source data, and hardware. This lets you see quickly what issues might arise with the choice of algorithm, algorithm settings, data quality, and the need for further data preparation, instead of waiting for a model to build on a large data set only to find that the results don't meet expectations. Once you are satisfied with the results on the initial sample, you can take a larger sample to see if model quality improves and to get a sense of how the algorithm scales to the particular data set. If model accuracy or quality continues to improve, consider increasing the sample size further.
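The idea of growing the sample until quality levels off can be sketched in a few lines of plain Python. This is only an illustration, not Oracle R Enterprise code: the function name progressive_sampling, the build_and_score callback (which is assumed to build a model on a sample and return a quality score where higher is better), the default sample sizes, and the min_gain threshold are all hypothetical choices made for this example.

```python
import random


def progressive_sampling(records, build_and_score,
                         sizes=(1_000, 10_000, 100_000),
                         min_gain=0.005, seed=42):
    """Build on increasingly large random samples until quality stops improving.

    build_and_score is a caller-supplied (hypothetical) function that builds a
    model on the sample and returns a quality score, higher being better.
    Returns a list of (sample_size, score) pairs, one per sample tried.
    """
    rng = random.Random(seed)
    history = []
    best = None
    for size in sizes:
        n = min(size, len(records))          # never ask for more than we have
        sample = rng.sample(records, n)      # simple random sample, no replacement
        score = build_and_score(sample)
        history.append((n, score))
        # Stop when the larger sample no longer buys a meaningful improvement.
        if best is not None and score - best < min_gain:
            break
        best = score if best is None else max(best, score)
        if n == len(records):                # all of the data has been used
            break
    return history
```

In practice you would pass a function that trains your chosen algorithm and evaluates it on a fixed held-aside set, so that scores from different sample sizes are comparable.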
Sampling in data mining is also used to produce a held-aside or test data set for assessing classification and regression model accuracy. Here, we reserve some of the build data (data that includes known target values) to be used for an honest estimate of model error using data the model has not seen before. This sampling transformation is often called a split because the build data is split into two randomly selected sets, often with 60% of the records being used for model building and 40% for testing.
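A split like this is straightforward to sketch. The following plain Python example is a minimal illustration, not the Oracle R Enterprise approach: the function name split_build_test and its parameters are assumptions made for this example, and the 60% default simply mirrors the proportion mentioned above.

```python
import random


def split_build_test(records, build_fraction=0.6, seed=42):
    """Randomly split records into a build set and a held-aside test set."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(records)       # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * build_fraction)
    return shuffled[:cut], shuffled[cut:]


build, test = split_build_test(range(100))
print(len(build), len(test))
```

Because every record lands in exactly one of the two sets, the test set contains only cases the model has not seen during building.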
Sampling must be performed with care, as it can adversely affect model quality and usability. Even a truly random sample doesn't guarantee that all values of a given categorical attribute are represented. This is particularly troublesome when the attribute with omitted values is the classification target attribute. A classification model that has not seen any examples of a particular target value can never predict that target value! Further, for predictor attributes, the sampled values may consist of a single value (a constant attribute) or all unique values (an identifier attribute), either of which may cause the attribute to be excluded during mining. Values of categorical predictor attributes that did not appear in the training data are not used when testing or scoring data sets. Some algorithms may fail if presented with values not seen in the training data.
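A quick sanity check on a sample can catch these problems before any model is built. The sketch below, in plain Python, assumes records are dictionaries keyed by attribute name; the function name check_sample and the returned keys are hypothetical. Note that the "all unique values" test is a rough heuristic meant for categorical attributes, since a continuous numeric attribute may also be unique in every record.

```python
def check_sample(full, sample, target):
    """Flag common sampling problems.

    full and sample are non-empty lists of dicts (one dict per record);
    target is the name of the classification target attribute.
    """
    # Target values present in the full data but absent from the sample.
    missing_targets = {r[target] for r in full} - {r[target] for r in sample}

    constant, identifier = [], []
    for attr in sample[0]:
        if attr == target:
            continue
        distinct = len({r[attr] for r in sample})
        if distinct == 1:
            constant.append(attr)        # single value: carries no information
        elif distinct == len(sample):
            identifier.append(attr)      # all unique: behaves like an ID column
    return {
        "missing_targets": missing_targets,
        "constant": constant,
        "identifier": identifier,
    }
```

If missing_targets is not empty, a larger sample or a sample drawn separately within each target value is probably needed before building a classification model.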
In subsequent posts, we'll talk about three sampling techniques using Oracle R Enterprise with Oracle Database: simple random sampling without replacement, stratified sampling, and simple random sampling with replacement.