Faster exploratory data analysis with feature type stats

August 24, 2021 | 9 minute read
John Peach
Principal Data Scientist

Before you can do any effective machine learning, you need to understand the shape and nature of your data. During the exploratory data analysis (EDA) phase, a data scientist generates many exploratory plots and summary statistics, mainly to understand the data’s distribution: Is it unimodal, bimodal or multimodal? Is the data symmetric, or skewed to the left or right? Are the classes balanced? On all but the smallest of datasets, you'll need summary statistics to answer these questions.

 

This is where the Oracle Accelerated Data Science (ADS) feature type statistics can be a powerful tool. If you are new to the feature type system, you might want to read the overview of the feature type system first. In short, the feature type system helps codify what your data represents. Doing this speeds up the onerous validation process each time you work with a new dataset. It also provides tools for custom statistics, data visualization, data validation through validators and a warning system, and column selection based on feature types.

Save your code and use it again

How do feature type statistics help with this problem? Generally, data scientists write code to compute the summary statistics that they are interested in. Unfortunately, they then discard that code, as there is no infrastructure to manage it; that means starting from scratch the next time they are looking at similar features. The feature type system lets you capture and reuse that code in future analysis. This reusability not only encourages data scientists to carefully consider the kinds of summary statistics they want, it empowers them to move beyond the default output of the Pandas .describe() method and generate relevant summaries for specific feature types. There is a tendency to default to tools like .describe() that may be suboptimal for the data you are working with, simply because any extra effort that you put in is going to be thrown away. By contrast, the feature type statistics system captures that effort so that you can use it again and again.

 

Let’s start by examining the .feature_count() and .feature_stat() methods, which are your primary tools for generating summary statistics. Other blog posts in this series discuss how to quickly create your own feature types and customize the statistics that they produce. These statistics are complemented by the feature type visualizations.

The power of defining your own feature types

Each column in a Pandas dataframe is associated with at least one feature type. This would be the default feature type and it is determined by the Pandas dtype. However, the power of the feature type system comes from the fact that you can quickly define your own feature types and assign them to a column. In the feature type system, a column of data is referred to as a feature.

Each feature can itself have multiple feature types through a system of multiple inheritance. For example, an organization that sells cars might have data that represents the wholesale price of a car. This feature could have the feature types wholesale_price, car_price, USD, and continuous. In this system, continuous is the default feature type: it has no parent feature type and is based on the Pandas dtype. The feature type wholesale_price is the primary feature type because it does not have any children.

In the multiple inheritance system that the feature type system uses, you can think of the inheritance as a chain. The search starts with the primary feature type, wholesale_price, and looks for the desired property, say a method to compute summary statistics. If that feature type does not define statistics, the search moves to the next parent feature type, which in this example is car_price. If car_price defines summary statistics, the system dispatches on that. If not, it repeats the process with the next parent until it finds one. It will always find the property because the default feature types define a complete set of properties. Some parts of the system, such as the feature type warning system, always traverse the entire feature type set, but that is beyond the scope of this article.
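The chain lookup described above can be sketched with ordinary Python classes. This is only an analogy and does not use the ADS library: the class names mirror the car example, the resolve helper is hypothetical, and the statistics are invented for illustration.

```python
class Continuous:
    """Stand-in for the default feature type; assumed to define every property."""

    @staticmethod
    def feature_stat(values):
        return {"count": len(values), "mean": sum(values) / len(values)}


class USD(Continuous):
    """Hypothetical currency feature type with no statistics of its own."""


class CarPrice(USD):
    """Hypothetical feature type that defines its own statistics."""

    @staticmethod
    def feature_stat(values):
        return {"count": len(values), "min": min(values), "max": max(values)}


class WholesalePrice(CarPrice):
    """Primary feature type; it defines no statistics, so the lookup moves on."""


def resolve(primary, name):
    """Return the first class in the chain that defines `name` itself."""
    for feature_type in primary.__mro__:
        if name in vars(feature_type):
            return feature_type
    raise AttributeError(name)


# Walk the chain from the primary feature type toward the default one.
owner = resolve(WholesalePrice, "feature_stat")
print(owner.__name__)
print(owner.feature_stat([18500.0, 21000.0, 19750.0]))
```

Here the lookup skips WholesalePrice, stops at CarPrice, and never reaches Continuous. Starting the same lookup from USD would fall through to Continuous, which plays the role of the default feature type.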

Use feature count to understand the nature of your data

After you load your data, use the .ads.feature_type property to assign feature types to each feature. While the Oracle Accelerated Data Science (ADS) SDK comes with several predefined feature types, you will generally want to create your own. How to do that is beyond the scope of this blog post.

The next image shows the code needed to load in the orcl_attrition dataset. The feature types for the selected columns are assigned and the top of the dataframe is displayed.

pastedGraphic.png

The .ads.feature_type property is used to assign feature types to each feature. You can also use it to return a dictionary that lists the feature types assigned to each feature, shown below. Notice that the Attrition feature has the feature types boolean, category, and string associated with it, even though the example above specifies only boolean and category. That is because the feature type system automatically appends the feature type string based on the Pandas dtype.

pastedGraphic_1.png

 

Once the data is loaded and the feature types are assigned, it is a best practice to review all the feature types that you have. This can be important when you are deciding which model classes to consider. For example, if you have quite a few categorical variables with many levels in each one, then a linear regression is probably not a good idea, especially if your dataset is relatively small. A better approach might be to use a random forest, which tends to be slightly less sensitive to this higher dimensionality.

The .feature_count() method is a good way of getting an overview of the feature types being used. It returns a dataframe that provides a summary of the number of feature types in a dataframe, with each row representing a feature type. It provides a count of the number of times that feature type is used in the dataframe as well as a count of the number of times that the feature type was the primary feature type.

The .feature_count() method is called on the dataframe below. The output dataframe has one row for each feature type represented in the dataframe, listed in the Feature Type column. The next column lists the number of times the feature type appears in any of the columns. For example, the category feature type appears in the Attrition, TravelForWork, and JobFunction columns. Therefore, it has a count of three. The Primary column is the count of the number of times that the feature type is listed as the primary feature type. For the category feature type, the value is two as TravelForWork and JobFunction have this as their primary feature type.

pastedGraphic_2.png
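The counting logic can be reproduced in a few lines of plain Python. This is a sketch of the idea rather than the ADS implementation: it covers only the three columns named in the text, count_feature_types is a hypothetical helper, and the column labels are assumed to match the output described above.

```python
from collections import Counter

# Feature types per column, listed from the primary type to the default type.
feature_types = {
    "Attrition": ["boolean", "category", "string"],
    "TravelForWork": ["category", "string"],
    "JobFunction": ["category", "string"],
}


def count_feature_types(types_by_column):
    """Count how often each feature type appears and how often it is primary."""
    total = Counter()
    primary = Counter()
    for types in types_by_column.values():
        total.update(types)
        primary[types[0]] += 1  # the first entry is the primary feature type
    return [
        {"Feature Type": name, "Count": total[name], "Primary": primary[name]}
        for name in sorted(total)
    ]


for row in count_feature_types(feature_types):
    print(row)
```

For the category feature type this gives a count of three and a primary count of two, matching the walkthrough above.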

Generate summary statistics

One of the main goals of EDA is to gain an understanding of the nature of your data. The goal of the .feature_stat() method is to produce relevant summary statistics for the feature set.

This method outputs a Pandas dataframe where each row represents a summary statistic. Reported statistics depend on the multiple inheritance of the feature types. The feature type framework will iterate from the primary feature type to the default feature type, looking for a feature type that has the .feature_stat() method defined, and then will dispatch on that.

Below, .feature_stat() is run on a feature with the integer feature type. It returns the count of the observations, the mean value, the standard deviation, and Tukey's five-number summary (sample minimum, lower quartile, median, upper quartile, and sample maximum).

pastedGraphic_3.png
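For comparison, the same set of statistics can be computed with plain Pandas. This sketch does not call ADS: integer_stat is a hypothetical helper, the metric labels and the Age values are invented, and the quartiles use Pandas' default linear interpolation, which may differ slightly from Tukey's hinges on some samples.

```python
import pandas as pd


def integer_stat(series: pd.Series) -> pd.DataFrame:
    """Return count, mean, standard deviation, and a five-number summary."""
    values = series.dropna()
    metrics = {
        "count": values.count(),
        "mean": values.mean(),
        "standard deviation": values.std(),
        "sample minimum": values.min(),
        "lower quartile": values.quantile(0.25),
        "median": values.median(),
        "upper quartile": values.quantile(0.75),
        "sample maximum": values.max(),
    }
    # One row per statistic, in the Metric/Value layout described in this post.
    return pd.DataFrame(
        {"Metric": list(metrics), "Value": [float(v) for v in metrics.values()]}
    )


ages = pd.Series([25, 32, 41, 38, 29], name="Age")  # made-up sample data
print(integer_stat(ages))
```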

The summary statistics that are created depend on the feature type. For example, the JobFunction column is categorical, so it produces a count of the number of observations and the number of unique categories.

pastedGraphic_4.png

This may not be an ideal summary for the JobFunction feature. Instead, you may want to know the number of observations in each job function. You can do this by creating a new feature type with its own .feature_stat() method. Below, we create a new feature type called JobFunction. It overrides the .feature_stat() method to produce a count of each job function in the data. This feature type is then registered, and the JobFunction column is updated so that it now inherits from the JobFunction feature type. Finally, the code prints the feature summary statistics for the JobFunction column.

pastedGraphic_5.png
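The statistic itself is a per-category count, which can be sketched in plain Pandas as follows. This is not the ADS feature type from the image: job_function_stat is a hypothetical standalone function, and the job function values are invented sample data.

```python
import pandas as pd


def job_function_stat(series: pd.Series) -> pd.DataFrame:
    """Count the observations in each job function, one row per category."""
    counts = series.value_counts().sort_index()  # sorted for a stable row order
    return pd.DataFrame({"Metric": list(counts.index), "Value": counts.tolist()})


# Made-up job functions for illustration only.
job_function = pd.Series(
    ["Engineering", "Sales", "Engineering", "Support", "Sales", "Engineering"],
    name="JobFunction",
)
print(job_function_stat(job_function))
```

Returning the counts in the Metric and Value layout keeps the output compatible with the other summary statistics, so it can be combined with them at the dataframe level.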

The .feature_stat() method also works at the dataframe level. It will produce a similar output to the output for the series, except it will have an additional column that lists the column name where the metric was computed.

pastedGraphic_6.png

The .feature_stat() method outputs its data in row-dominant format because that format is easy to work with. However, there are times when column-dominant format makes the data easier to understand. This is often the case when the features all have similar summary statistics. You can convert between the two using the Pandas .pivot_table() method. Any statistic that a feature does not report appears as NaN.

 

pastedGraphic_7.png
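A minimal sketch of that conversion, using a hand-built dataframe in place of real .feature_stat() output. The column names Column, Metric, and Value and all of the numbers are assumptions made for illustration.

```python
import pandas as pd

# Row-dominant layout: one row per (column, metric) pair.
long_stats = pd.DataFrame(
    {
        "Column": ["Age", "Age", "Age", "JobFunction", "JobFunction"],
        "Metric": ["count", "mean", "median", "count", "unique"],
        "Value": [5.0, 33.0, 32.0, 5.0, 3.0],
    }
)

# Column-dominant layout: one row per feature, one column per metric.
# Each (Column, Metric) pair occurs once, so the default mean aggregation
# leaves the values unchanged.
wide_stats = long_stats.pivot_table(index="Column", columns="Metric", values="Value")
print(wide_stats)
```

Metrics that exist for only some features, such as mean for Age or unique for JobFunction in this sample, show up as NaN in the rows where they were not computed.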

Speed up your EDA work

We've now seen how to use the .feature_count() method to summarize which feature types are being used, how common they are, and which ones are primary feature types. We also demonstrated how the .feature_stat() method returns a dataframe where each row holds a summary statistic and its numerical value, which helps you glean useful statistical information about your dataset. This method can be used on a series to get information on one feature, or on a dataframe to summarize every feature in a single Pandas dataframe.

While it was beyond the scope of this article to show you in detail how to create a feature type, we did create one to demonstrate a custom .feature_stat() method. This method takes a series and returns a dataframe with the columns Metric and Value. Each row is a metric and the value of the metric. There is no minimum or maximum number of rows that can be returned.

The feature type system is designed to help speed up your EDA work and make it repeatable across the analysis. The feature type statistics allow you to create summary statistics that are specific to each feature type in your data set. Check out my other blog posts for more EDA features or details on how to customize your own feature types.


John Peach

Principal Data Scientist

A modern polymath, John holds advanced degrees in mechanical engineering, kinesiology and data science, with a focus on solving novel and ambiguous problems. As a senior applied data scientist at Amazon, John worked closely with engineering to create machine learning models to arbitrate among chatbot skills, entity resolution, search, and personalization.

 

As a principal data scientist for Oracle Cloud Infrastructure, he is now defining tooling for data science at scale. John frequently gives talks on best practices and reproducible research. To that end, he has developed an approach to improve validation and reliability by using data unit tests, and has pioneered Data Science Design Thinking. He also coordinates SoCal RUG, the largest R meetup group in Southern California.

