Machine learning may seem like a bit more sizzle than steak to the average person, but data scientists and visualization programmers are agog over the notion that a computer system can program itself to grow, change, and develop as new information is introduced.
Let's say you are researching an upcoming ice fishing trip. A machine learning algorithm could craft a personalized offer so that, while you are web surfing, your initial advertisement for fishing gear transforms into a suggestion to buy boots, hats, gloves, and freeze-proof storage containers, based on the pages you visited last. Machine learning is especially important for business analytics and data visualization, as the insights can be adjusted simply by swapping out related datasets, with a few modifications.
In this blog post, we will discuss the related datasets produced by machine learning algorithms in Oracle Data Visualization.
Related datasets are generated when you train or create a machine learning model in Oracle Data Visualization (present in version 12.2.4.0, or v4.0 for short). These datasets contain details about the model, such as prediction rules, accuracy metrics, the confusion matrix, and key drivers for prediction, depending on the type of algorithm.
Related datasets can be found in the Inspect Model menu: Inspect Model --> Related tab.
These datasets are useful in more ways than one. They let users examine and understand the rules the model uses for prediction and classification, which in turn helps in fine-tuning the model to get better results. Related datasets are also useful for comparing models and determining which one is better at solving the same problem.
Here is a pictorial representation of the related datasets generated by the different out-of-the-box machine learning algorithms in Oracle Data Visualization v4.0:
Different machine learning algorithms generate similar related datasets, and all of them can be grouped into eight datasets. Individual parameters and column names may change depending on the type of algorithm, but the purpose of each dataset remains the same. For example, the columns in the Statistics dataset may differ between Linear Regression and Logistic Regression, but in both cases the Statistics dataset contains the accuracy metrics of the model.
Here is a brief description of each of these datasets:
- Drivers: This dataset gives information on the columns that are key determinants (drivers) of the target column value. The Train/Create model step performs linear regression and identifies the columns that take part in predicting the values of the target column. Each identified column is assigned a coefficient and a correlation value. The coefficient indicates the weight given to that column in determining the target column value, and the correlation indicates the direction of the relationship with the target column (i.e., whether the target value increases or decreases with a corresponding change in the driver column).
- Residuals: This dataset gives information on the quality of the model's predictions. Residuals are the differences between the measured values and the values predicted by a regression model. This dataset gives an aggregated (summed) value of the absolute difference between actual and predicted values for all the columns in the dataset. It is visualized as a bar graph in the Quality tab of the Linear Regression model's Inspect menu.
- CARTree: This dataset is a tabular representation of the decision tree computed to predict the target column values. It contains columns that represent the conditions in the decision tree and the criteria for those conditions, the prediction for each group, and the prediction confidence. The built-in Tree Diagram visualization can be used to visualize this decision tree.
- Confusion Matrix: A confusion matrix, also known as an error matrix, is a specific table (pivot) layout that allows the performance of an algorithm to be visualized. Each row of the matrix represents instances of a predicted class, while each column represents instances of an actual class. The table reports the number of false positives, false negatives, true positives, and true negatives, from which the precision, recall, and F1 accuracy metrics are computed.
- Hitmap: This dataset contains information on the leaf nodes of the decision tree. Each row in the table represents a leaf node and contains the criteria (branch segment) that the leaf node represents, the Segment Size, the Confidence, and the Expected # of rows, i.e., the expected number of correct predictions = Segment Size * Confidence.
- Classification Report: This dataset is a tabular representation of the accuracy metrics for each distinct value of the target column. For example, if the target column can have two distinct values, 'Yes' and 'No', this dataset shows accuracy metrics such as F1, Precision, Recall, and Support (the number of rows in the training dataset with this value) for each distinct value of the target column.
- Summary: This dataset contains a summary of the input and optional parameters specified during model creation, along with details such as the target name and model name.
- Statistics: This dataset contains metrics that quantify model accuracy. The metrics present in the dataset vary depending on the algorithm/model that generates it. Here is a list of metrics based on the model:
  - Linear Regression, CART numeric, Elastic Net Linear:
    - R-Square, R-Square Adjusted, Mean Absolute Error (MAE), Mean Squared Error (MSE), Relative Absolute Error (RAE), Relative Squared Error (RSE), Root Mean Squared Error (RMSE)
  - CART (Classification And Regression Trees), Naive Bayes Classification, Neural Network, Support Vector Machine (SVM), Random Forest, Logistic Regression:
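To make the metrics named in the Statistics, Confusion Matrix, and Classification Report descriptions more concrete, here is a minimal Python sketch of how such metrics are conventionally computed from actual and predicted values. It uses the common textbook definitions; the function names, the `n_features` parameter, and the sample values are illustrative assumptions, and nothing here is confirmed to match how Oracle Data Visualization computes these datasets internally.

```python
import math


def regression_statistics(actual, predicted, n_features=1):
    """Textbook regression accuracy metrics (hypothetical helper, not an Oracle API)."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(a - p) for a, p in zip(actual, predicted))
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Errors of a naive model that always predicts the mean of the actual values.
    abs_dev = sum(abs(a - mean_actual) for a in actual)
    sq_dev = sum((a - mean_actual) ** 2 for a in actual)

    mse = sq_err / n
    rse = sq_err / sq_dev          # Relative Squared Error
    r_square = 1 - rse
    return {
        "MAE": abs_err / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "RAE": abs_err / abs_dev,  # Relative Absolute Error
        "RSE": rse,
        "R-Square": r_square,
        # Assumes n_features predictor columns; requires n > n_features + 1.
        "R-Square Adjusted": 1 - (1 - r_square) * (n - 1) / (n - n_features - 1),
    }


def classification_metrics(actual, predicted, positive):
    """Confusion-matrix counts and derived metrics for one target value."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
        "Support": tp + fn,  # rows whose actual value is `positive`
    }


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    print(regression_statistics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0]))
    actual = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"]
    predicted = ["Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"]
    for value in ("Yes", "No"):
        print(value, classification_metrics(actual, predicted, value))
```

Calling `classification_metrics` once per distinct target value mirrors the per-value layout of the Classification Report, while the four counts correspond to the cells of a two-class confusion matrix.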
Now you know what related datasets are and how they can be useful for fine-tuning your machine learning model or for comparing two different models.
Of course, you can't take advantage of these features if you are not using Oracle Data Visualization—Download your free trial today.