John Peach

Principal Data Scientist

During the exploratory data analysis (EDA) phase, a data scientist wants to understand the relationships among the features because these correlations offer a great deal of insight. Generally, you want to make your model as parsimonious as possible, which means determining which features are highly correlated and removing the redundant ones.

While some models, such as decision trees, are not sensitive to correlated variables, others, such as ordinary least squares regression, are. You may also want to remove correlated variables to reduce the cost of collecting and processing the data and to improve model performance. Let's look at how to use three different correlation methods provided by the Accelerated Data Science (ADS) library and how to generate correlation tables and plots.

By the way, these correlation methods are part of the ADS feature type system, and if you are new to it, then you might want to check out the post How feature types improve your data science workflow. In summary, the feature type system helps codify what your data represents. Doing this makes the time-consuming validation process for each new dataset faster and more reliable. It also has tools to compute custom statistics, visualize information, check accuracy via validator and warning systems, and select columns based on feature types.

The EDA features in ADS speed up your analysis by providing methods to compute different types of correlations. You can choose among several different correlation techniques depending on your use case. Further, there are two sets of methods, one to return a dataframe with the correlation information and a partner method that generates a plot.

Which correlation technique you use depends on the type of data that you are working with. When using these correlation techniques, you will need to slice your dataframe so that the calculation uses only the appropriate feature types. Here's a summary of the different correlation techniques and the data they use:

`pearson`
: The Pearson correlation coefficient is a normalized measure of the covariance between two sets of data. In essence, it measures the linear correlation between the datasets. This method is used when both datasets consist of continuous values.

`correlation_ratio`
: The correlation ratio measures the extent to which a distribution is spread out within individual categories relative to the spread of the entire population. This metric is used to compare categorical variables to continuous values.

`cramersv`
: Cramér's V provides a measure of the degree of association between two categorical/nominal datasets.

The Pearson correlation coefficient is known by several names: Pearson's r, Pearson product-moment correlation coefficient, bivariate correlation, or simply the correlation coefficient. It has a range of [-1, 1], where 1 means that the two datasets are perfectly correlated and -1 means that they are perfectly anti-correlated: when one dataset is increasing, the other is decreasing.

The Pearson correlation coefficient is a normalized value of the covariance between the continuous datasets X and Y. It is normalized by the product of the standard deviations of X and Y and is given by the following formula:
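Written out, for paired samples $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$, this is the standard definition:

```latex
r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
       = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```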

To compute Pearson correlation coefficients, use feature type selection to choose only the features that are continuous in nature. Now that you have a dataframe with only continuous features, call the `.ads.pearson()` method. This returns a dataframe with each row representing a pair of features and the correlation between them. Notice that each pair is listed twice. In the following table, you can see that both the tuples (`YearsInIndustry, Age`) and (`Age, YearsInIndustry`) are listed. This makes it easier to extract the data that you need.

To see this same information as a heatmap, use the `.ads.pearson_plot()` method.
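To make the shape of that table concrete, here is a plain-pandas sketch that builds the same kind of long-format output. The data and the column names are illustrative, not ADS's exact schema, and this is not ADS's implementation; it only shows the idea.

```python
import pandas as pd

# Toy dataframe with two continuous features (names are illustrative).
df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38, 29],
    "YearsInIndustry": [2, 7, 20, 25, 12, 4],
})

# Pairwise Pearson correlations, reshaped into a long format with one row
# per (feature, feature) pair, similar to what .ads.pearson() returns.
corr = (
    df.corr(method="pearson")
      .stack()
      .reset_index()
)
corr.columns = ["Column 1", "Column 2", "Value"]

# Each unordered pair appears twice, e.g. (Age, YearsInIndustry) and
# (YearsInIndustry, Age), mirroring the ADS output described above.
print(corr)
```

Because the correlation matrix is symmetric, the two rows for any pair of features carry the same value, so you can filter on either column to extract what you need.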

Statistical dispersion, or scatter, is a measure of the spread of a distribution, with variance being a common metric. The correlation ratio compares the dispersion within individual categories to the dispersion across the entire dataset: it is the weighted variance of the category means over the variance of all samples. It is given by the following formulas:
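In symbols, for categories $x$ with $n_x$ observations $y_{xi}$ each, category means $\bar{y}_x$, and overall mean $\bar{y}$, the standard definition is:

```latex
\eta^2 = \frac{\sum_x n_x \,(\bar{y}_x - \bar{y})^2}{\sum_{x,i} (y_{xi} - \bar{y})^2},
\qquad \eta = \sqrt{\eta^2}
```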

To compute correlation ratios, use feature type selection to choose only the features that are categorical or continuous in nature. Now that you have a dataframe with only categorical or continuous features, call the `.ads.correlation_ratio()` method. This returns a dataframe with each row representing a pair of features and the correlation ratio between them. Notice that each pair is listed twice. In the following table, you can see that both the tuples (`JobFunction, Age`) and (`Age, JobFunction`) are listed. This makes it easier to extract the data that you need.
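The weighted-variance definition above is short enough to implement directly. The following is a from-scratch sketch of the statistic, not ADS's implementation, and the `JobFunction`/`Age` data is made up for illustration.

```python
import numpy as np
import pandas as pd

def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Correlation ratio (eta) between a categorical and a continuous series.

    A textbook sketch of the statistic that .ads.correlation_ratio()
    reports; ADS's own implementation may differ in details.
    """
    overall_mean = values.mean()
    # Numerator: weighted variance of the category means.
    between = sum(
        len(group) * (group.mean() - overall_mean) ** 2
        for _, group in values.groupby(categories)
    )
    # Denominator: total dispersion across all samples.
    total = ((values - overall_mean) ** 2).sum()
    return float(np.sqrt(between / total)) if total > 0 else 0.0

# Toy data: a categorical feature vs. a continuous one. The two job
# functions have well-separated age distributions, so eta is close to 1.
df = pd.DataFrame({
    "JobFunction": ["Eng", "Eng", "Eng", "Sales", "Sales", "Sales"],
    "Age": [30, 32, 31, 45, 47, 46],
})
eta = correlation_ratio(df["JobFunction"], df["Age"])
```

When the category means are all equal, the numerator vanishes and eta is 0; when each category's values are tightly clustered around distinct means, eta approaches 1.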

While a table of correlation ratio values is handy, quite often a data scientist would prefer to look at a plot of the values. The `.ads.correlation_ratio_plot()` method presents the same data as the `.ads.correlation_ratio()` method, but as a heatmap.

Cramér's V is used to measure the amount of association between two categorical/nominal variables. A value of 0 means that there is no association between the two variables, and a value of 1 means that there is complete association.

Just like the other correlation tests, it is important that the dataframe you are using contains only categorical values. The easiest way to do this is to use feature type selection to filter out all features that are not categorical. Then call `.ads.cramersv()` to compute the Cramér's V value for each pair of categorical features. As with the other correlation methods, each pair is listed twice. In the following table, you can see that both the tuples (`EducationField, EducationLevel`) and (`EducationLevel, EducationField`) are listed and have the same value.
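Cramér's V can be computed from the chi-square statistic of the contingency table of the two features. The sketch below is the textbook formulation, not ADS's implementation (ADS may, for example, apply a bias correction), and the education data is invented for illustration.

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (textbook formulation)."""
    table = pd.crosstab(x, y).to_numpy().astype(float)
    n = table.sum()
    # Expected counts under independence: outer product of the margins.
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / n
    # Pearson chi-square statistic from the contingency table.
    chi2 = ((table - expected) ** 2 / expected).sum()
    # Normalize by the smaller table dimension so V lies in [0, 1].
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

# Toy data: two perfectly associated categorical features, so V is 1.
df = pd.DataFrame({
    "EducationField": ["STEM", "STEM", "Arts", "Arts"],
    "EducationLevel": ["MS", "MS", "BA", "BA"],
})
v = cramers_v(df["EducationField"], df["EducationLevel"])
```

Knowing one feature here fully determines the other, which is what "complete association" means for this statistic.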

The `.ads.cramersv_plot()` method produces a heatmap of the correlation data. It returns a Seaborn object that you can customize.

Understanding the relationships among features is a key part of the EDA process. The ADS feature type system lets you analyze correlations based on the type of data. It supports the Pearson correlation coefficient (`.ads.pearson()`), the correlation ratio (`.ads.correlation_ratio()`), and Cramér's V (`.ads.cramersv()`). The output can be a dataframe where the first two columns give the pair of features whose correlation is being computed and the last column is the correlation value. If you prefer to see a heatmap of the correlation data, use the plotting counterparts: `.ads.pearson_plot()`, `.ads.correlation_ratio_plot()`, and `.ads.cramersv_plot()`.

- How feature types improve your data science workflow
- Faster exploratory data analysis with feature type stats
- How to create custom feature types for exploratory data analysis
- Customize your feature type plots for better visualizations

Try Oracle Cloud Free Tier! A 30-day trial with US$300 in free credits gives you access to OCI Data Science service.

Ready to learn more about the OCI Data Science service?

- Configure your OCI tenancy with these setup instructions and start using OCI Data Science.
- Star and clone our new GitHub repo! We’ve included notebook tutorials and code samples.
- Visit our service documentation.
- Watch our tutorials on our YouTube playlist.
- Subscribe to our Twitter feed.
- Visit the Oracle Accelerated Data Science Python SDK Documentation.
- Try one of our LiveLabs. Search for "data science".

A modern polymath, John holds advanced degrees in mechanical engineering, kinesiology and data science, with a focus on solving novel and ambiguous problems. As a senior applied data scientist at Amazon, John worked closely with engineering to create machine learning models to arbitrate among chatbot skills, entity resolution, search, and personalization.

As a principal data scientist for Oracle Cloud Infrastructure, he is now defining tooling for data science at scale. John frequently gives talks on best practices and reproducible research. To that end, he has developed an approach to improve validation and reliability by using data unit tests, and has pioneered Data Science Design Thinking. He also coordinates SoCal RUG, the largest R meetup group in Southern California.