In my previous post To sample or not to sample, we discussed some of the issues involved in sampling data for use in machine learning. In this post, we look at using the Oracle R Enterprise transparency layer to perform a few types of sampling: simple random sampling, with and without replacement, and stratified sampling. When your data is too large to fit in memory, you're left with a paradox: you need to sample the data so it fits in memory, but you need to load it into memory...

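
To ground the terminology, here is a plain-R sketch of the three schemes on a small in-memory data frame. The data and column names are invented for illustration; the post itself performs the equivalent in-database through ORE.

```r
set.seed(42)

# Toy data set: 1,000 rows with a stratification column (invented for illustration)
df <- data.frame(id    = 1:1000,
                 group = sample(c("A", "B", "C"), 1000, replace = TRUE))

# Simple random sampling without replacement: 100 distinct rows
srs_wo <- df[sample(nrow(df), 100, replace = FALSE), ]

# Simple random sampling with replacement: the same row may appear more than once
srs_w <- df[sample(nrow(df), 100, replace = TRUE), ]

# Stratified sampling: take 10% of the rows within each level of 'group',
# so every stratum is represented in proportion to its size
strat <- do.call(rbind, lapply(split(df, df$group), function(d)
  d[sample(nrow(d), ceiling(0.1 * nrow(d))), ]))
```

The stratified variant matters when a rare subgroup would likely be missed by simple random sampling.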
Ideally, we would know the exact answer to every question. How many people support presidential candidate A vs. B? How many people suffer from H1N1 in a given state? Does this batch of manufactured widgets have any defective parts? Knowing exact answers is expensive in terms of time and money and, in most cases, is impractical if not impossible. Consider asking every person in a region for their candidate preference, testing every person with flu symptoms for H1N1 (assuming...

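
Sampling earns its keep because a modest random sample estimates the population answer closely. A quick sketch in plain R; the population size, support rate, and poll size are invented for illustration.

```r
set.seed(1)
# Simulated "population" of one million voters, 52% supporting candidate A
population <- rbinom(1e6, 1, 0.52)

# Poll a random sample of 1,500 people instead of asking all one million
poll  <- sample(population, 1500)
p_hat <- mean(poll)

# Approximate 95% margin of error for an estimated proportion
moe <- 1.96 * sqrt(p_hat * (1 - p_hat) / length(poll))
c(estimate = p_hat, lower = p_hat - moe, upper = p_hat + moe)
```

With only 1,500 respondents, the estimate typically lands within a few percentage points of the true 52%.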
Data sets come in many shapes and sizes. Some are tall and thin, others are short and wide. Some take on the form of dense data, a.k.a., single-record case, where each row represents one entity, such as a customer or vehicle. Others take on the form of sparse data, a.k.a., transactional data, where each row typically consists of an identifier, variable name, and value, and a single "case" is represented by multiple rows sharing the same identifier. R provides a variety of...

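
To make the two shapes concrete, here is a small plain-R sketch (the column names and values are invented for illustration) converting transactional data to single-record form with the base reshape function:

```r
# Transactional ("sparse") form: identifier, attribute name, value,
# with one case spread across several rows sharing the same id
long <- data.frame(id   = c(1, 1, 2, 2, 2),
                   attr = c("age", "income", "age", "income", "region"),
                   val  = c("34", "50000", "41", "72000", "West"),
                   stringsAsFactors = FALSE)

# Single-record ("dense") form: one row per entity, one column per attribute
wide <- reshape(long, idvar = "id", timevar = "attr", direction = "wide")
names(wide) <- sub("^val\\.", "", names(wide))
wide
```

Attributes missing for a case (here, id 1 has no region row) simply become NA in the dense form.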
Among clustering algorithms, K-means is the most popular. It is fast, scalable, and easy to interpret. Therefore, it is often the default choice when data scientists want to cluster data and get insight into the inner structure of a dataset. A good introduction to this method can be found in the datascience.com blog. However, there is one key parameter for K-means clustering that needs to be selected appropriately. That is K, the number of clusters for the algorithm to...

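
As one common sketch of choosing K, the classic elbow method runs kmeans over a range of K values and looks for the point where the total within-cluster sum of squares stops dropping sharply (synthetic two-cluster data, invented for illustration):

```r
set.seed(7)
# Two well-separated groups of 2-D points (synthetic data for illustration)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

# Elbow method: total within-cluster sum of squares for K = 1..6
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

wss  # the steep drop from K = 1 to K = 2, then a plateau, suggests K = 2
```

Other selection criteria exist (silhouette width, gap statistic), but the elbow plot is the quickest first look.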
In my previous post, Explicit Semantic Analysis (ESA) for Text Analytics, we explored the basics of the ESA algorithm and how to use it in Oracle R Enterprise to build a model from scratch and use that model to score new text. While creating your own domain-specific model may be necessary in many situations, others may benefit from a pre-built model based on millions of Wikipedia articles reduced to 200,000 topics. This model is downloadable here with details of how to...

There are many approaches for improving model accuracy - anything from enriching or cleansing the data you start with to optimizing algorithm parameters or creating ensemble models. One technique that Oracle R Enterprise users sometimes employ is to partition data based on the distinct values of one or more columns and build a model for each partition. Building a model on each partition forms a kind of ensemble and can yield better accuracy. The embedded R...

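
As a plain-R sketch of the partition-and-model idea (ORE's embedded R execution runs this pattern in-database; here we use the built-in mtcars data, partitioned by cylinder count):

```r
# Partition mtcars by the distinct values of 'cyl' and fit one model per partition
models <- lapply(split(mtcars, mtcars$cyl), function(d)
  lm(mpg ~ wt + hp, data = d))

# One fitted model per distinct value of the partitioning column: "4", "6", "8"
names(models)

# Each partition's model can fit its local structure better than one global model
sapply(models, function(m) summary(m)$r.squared)
```

Scoring a new row then means routing it to the model for its partition value.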
Data scientists and other users of machine learning and predictive analytics technology often have their favorite algorithm for solving particular problems. If they are using a tool like Oracle Advanced Analytics -- with Oracle R Enterprise and Oracle Data Mining -- there's a desire to use these algorithms within that tool's framework. Using ORE's embedded R execution, users can already use third-party R packages in combination with Oracle Database for execution at the...

In a variety of machine learning applications, there are often requirements for training multiple models. For example, in the internet of things (IoT) industry, a unique model needs to be built for each household with installed sensors that measure temperature, light, or power consumption. Another example can be found in the online advertising industry. To serve personalized online advertisements or recommendations, a huge number of individualized models have to be built and...

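
The many-models pattern can be sketched in plain R with the base parallel package; the household sensor data below is simulated for illustration (ORE's embedded R execution is the in-database counterpart of this approach):

```r
library(parallel)

set.seed(3)
# Simulated sensor readings for 50 "households" (synthetic data for illustration)
readings <- data.frame(household = rep(1:50, each = 48),
                       hour      = rep(0:47, times = 50),
                       temp      = rnorm(2400, mean = 20, sd = 2))

# Fit one model per household in parallel; mclapply forks worker processes
# on Unix-alikes (use mc.cores = 1 on Windows)
fit_one <- function(d) lm(temp ~ hour, data = d)
models  <- mclapply(split(readings, readings$household), fit_one,
                    mc.cores = 2)

length(models)  # one fitted model per household
```

The same split-then-fit loop scales from dozens of models here to the millions needed for personalized recommendations.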
SVD, or Singular Value Decomposition, is one of several techniques that can be used to reduce the dimensionality, i.e., the number of columns, of a data set. Why would we want to reduce the number of dimensions? In predictive analytics, more columns normally mean more time required to build models and score data. If some columns have no predictive value, they waste time or, worse, contribute noise to the model and reduce model quality or predictive...

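
A minimal sketch of SVD-based reduction in plain R, using a synthetic matrix whose 10 columns carry only 2 real directions of signal (all data here is invented for illustration):

```r
set.seed(11)
# 100 observations of 10 columns, but only 2 underlying directions of signal
signal <- matrix(rnorm(200), ncol = 2)
X <- signal %*% matrix(rnorm(20), nrow = 2) +
     matrix(rnorm(1000, sd = 0.01), ncol = 10)   # small noise

# Singular Value Decomposition: X = U D V'
s <- svd(X)

# The singular values collapse after the first two, revealing the true rank
round(s$d, 2)

# Project onto the first k = 2 right singular vectors: 10 columns reduced to 2
k <- 2
X_reduced <- X %*% s$v[, 1:k]
dim(X_reduced)  # now 100 rows by 2 columns
```

Model building then proceeds on X_reduced, trading 8 near-noise columns for faster training with little loss of information.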