Jie Liu, PhD | Data Scientist
Phone: +17816850132 | Mobile: +15748550695 | Oracle Advanced Analytics and Machine Learning | 10 Van de Graaff, Burlington, MA 01803
Metrics for Regression The goal of a regression task is to build a model that uses features to predict a target quantity, that is, a numeric value. After a regression model is applied to a test set, the next step is to evaluate model performance by measuring the error between the regression output and the true value. A standard set of metrics is used to evaluate regression models, such as mean squared error (MSE), root mean squared error (RMSE), mean absolute...
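As a concrete illustration of the metrics named above, here is a minimal pure-Python sketch (not the OML4Py API; the function name is illustrative, but the formulas are the standard definitions):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute standard regression error metrics for paired lists of numbers."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)    # mean squared error
    rmse = math.sqrt(mse)                             # root mean squared error
    mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
    return {"MSE": mse, "RMSE": rmse, "MAE": mae}

# Example: true target values vs. model predictions
m = regression_metrics([3.0, 5.0, 2.0], [2.5, 5.5, 2.0])
```

Note that RMSE is in the same units as the target, which often makes it easier to explain to stakeholders than MSE.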
Cross validation is a widely used model validation approach. As we discussed in Part I, it has the benefit of providing a more complete picture of model performance. However, cross validation can be costly on large datasets, since training and testing must be repeated K times if K-fold cross validation is chosen. By leveraging embedded Python execution in OML4Py, we can parallelize the K train/test processes, which is suitable for the use case when the user has...
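The K independent train/test runs that make K-fold cross validation expensive are also what makes it parallelizable: each fold can be dispatched to its own worker. A minimal pure-Python sketch of the fold construction (illustrative only; OML4Py's embedded Python execution handles the server-side parallelism itself):

```python
def kfold_indices(n, k):
    """Split row indices 0..n-1 into k folds; each fold yields a
    (train_indices, test_indices) pair that can be processed in parallel."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds

# 10 rows split into 3 folds of sizes 4, 3, 3
splits = kfold_indices(10, 3)
```

Each `(train, test)` pair is independent of the others, so the K model fits can run concurrently with no shared state.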
Among clustering algorithms, K-means is the most popular. It is fast, scalable, and easy to interpret. Therefore, it is almost the default first choice when data scientists want to cluster data and gain insight into the inner structure of a dataset. A good introduction to this method can be found in the Oracle datascience blog. However, there is one key parameter for K-means clustering that needs to be selected appropriately: K, the number of clusters for the algorithm...
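A common way to choose K is the elbow heuristic: run K-means for several values of K and watch the within-cluster sum of squares (WSS) flatten out. A toy pure-Python sketch on 1-D data (a naive Lloyd's algorithm with deterministic initialization; real projects would use the in-database K-means rather than this):

```python
def kmeans_1d(points, k, iters=50):
    """Naive Lloyd's algorithm on 1-D data, deterministic init from sorted data."""
    pts = sorted(points)
    centroids = [pts[i * len(pts) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in pts:
            # assign each point to its nearest centroid
            j = min(range(k), key=lambda c: (x - centroids[c]) ** 2)
            clusters[j].append(x)
        # recompute centroids; keep the old one if a cluster went empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def wss(points, k):
    """Total within-cluster sum of squares for a k-cluster solution."""
    centroids, clusters = kmeans_1d(points, k)
    return sum((x - centroids[i]) ** 2
               for i, c in enumerate(clusters) for x in c)

# Three obvious groups near 1, 5, and 9: the WSS curve should flatten at K=3
data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
curve = [(k, wss(data, k)) for k in (1, 2, 3, 4)]
```

On this data the WSS drops sharply from K=1 to K=3 and gains little after, which is the "elbow" one looks for.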
In part I, we discussed popular metrics such as accuracy, confusion matrix, precision, recall and the F1 score. The common characteristic for those metrics is that they rely on a given threshold for producing the ultimate prediction. In most cases, a classification model originally produces a probability score. In order to arrive at a prediction, one needs to come up with a threshold: a case is predicted as positive when the probability score is greater than the threshold and vice...
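The threshold dependence is easy to see in code. A minimal pure-Python sketch (illustrative function names, not a library API) showing how precision and recall shift as the threshold moves:

```python
def predictions_at_threshold(scores, threshold):
    """Turn probability scores into hard 0/1 predictions at a given cutoff."""
    return [1 if s > threshold else 0 for s in scores]

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.9, 0.7, 0.4, 0.2]
labels = [1, 1, 0, 1]

# A strict threshold trades recall for precision; a loose one does the opposite
strict = precision_recall(labels, predictions_at_threshold(scores, 0.5))
loose = precision_recall(labels, predictions_at_threshold(scores, 0.1))
```

The same scores yield different confusion-matrix-based metrics at each threshold, which is exactly why threshold-free metrics like AUC are useful for comparing models.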
Metrics for Classification Think about the following scenario. As a seasoned data scientist, you spent a lot of time and effort tackling a challenging dataset. Finally, you built a good model, with creative feature engineering and smart data cleaning. The screen displays a great AUC score (0.9, unbelievable!). How exciting! Then the stakeholder comes in and asks a question: How does such a high AUC score help me in targeting customers? Sometimes there is a gap between...
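One way to make AUC concrete for a stakeholder: it equals the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case. That Mann-Whitney formulation can be computed directly, as in this minimal pure-Python sketch (illustrative, not a library API):

```python
def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative (Mann-Whitney U formulation); ties count as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 3 of the 4 positive/negative pairs are ranked correctly -> AUC 0.75
a = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])
```

Phrased this way ("in 90% of random customer pairs, the model ranks the buyer above the non-buyer"), an AUC of 0.9 is easier to connect to a targeting use case.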
Why Cross Validation Cross validation is a widely used model validation approach. After a machine learning model is built, it is essential to measure its performance before deployment. Cross validation allows us to understand model performance from the perspective of both bias and variance, which makes it a sound and reliable approach. In many cases, to get a quick sense of the model performance, you just divide the dataset into train and test sets once,...
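The quick single-split alternative mentioned above can be sketched in a few lines of pure Python (illustrative helper, not a library API; a fixed seed keeps the split reproducible):

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Single random holdout split: the fast, one-shot alternative to
    full K-fold cross validation."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(len(rows) * (1 - test_fraction))
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test

train, test = train_test_split(list(range(100)))
```

A single split gives only one estimate of test error; cross validation repeats this over K disjoint test sets, which is what exposes the variance of that estimate.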
Weight of evidence (WOE) is a powerful tool for feature representation and evaluation in data science. In a previous blog, we explained the importance and application of WOE and its byproduct, Information Value (IV). One important challenge in applying this powerful tool is the scalability of the computation, especially when the dataset grows large. In that blog, we presented a scalable approach to computing those values by leveraging the transparency layer provided in OML4R,...
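For readers new to these quantities, the standard definitions are WOE_i = ln((good_i/total_good) / (bad_i/total_bad)) per bin, and IV = Σ (good%_i − bad%_i) × WOE_i. A minimal pure-Python sketch of the small-data computation (illustrative only; the blog's point is doing this scalably in-database, which this sketch does not attempt):

```python
import math

def woe_iv(bins):
    """bins: list of (n_good, n_bad) event counts per feature bin.
    Returns per-bin WOE values and the total Information Value."""
    total_good = sum(g for g, b in bins)
    total_bad = sum(b for g, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        pg = g / total_good          # share of all "good" cases in this bin
        pb = b / total_bad           # share of all "bad" cases in this bin
        w = math.log(pg / pb)        # assumes no empty bins; real code smooths
        woes.append(w)
        iv += (pg - pb) * w          # each IV term is non-negative
    return woes, iv

# Three bins of a feature: good/bad counts per bin
woes, iv = woe_iv([(50, 10), (30, 30), (20, 60)])
```

A bin with equal good/bad shares gets WOE 0 (no evidence either way), and the IV sums how strongly the feature separates the two classes overall.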
In this tips and tricks blog, we share some techniques from our own use of Oracle R Enterprise in data science projects that you may find useful in your projects. This time, we focus on the automated process of returning the data frame schema from the output of embedded R execution runs. Embedded R Execution ORE embedded R execution provides a powerful and convenient way to execute custom R scripts at the database server, from either R or SQL. It also enables running...