Customize your feature type plots for better visualizations

October 22, 2021 | 10 minute read
John Peach
Principal Data Scientist

If you understand the distribution of your data, you position yourself to build powerful models. That’s why data scientists spend so much time plotting their data during the exploratory data analysis (EDA) phase. For continuous data, you can often create a box-and-whisker plot or a histogram. The box-and-whisker plot tends to hide many details of the distribution’s shape. A histogram can show you the shape, but its appearance varies with the bin sizes. For long-tailed distributions, you might want bins that are uneven in width.
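To make the point about bin sizes concrete, here is a minimal sketch that uses only the Python standard library. It draws a long-tailed sample and bins it twice: once with equal-width bins and once with bins that widen geometrically. The sample, the bin count, and the helper name histogram are all assumptions made for illustration; they aren't part of ADS.

```python
import random

random.seed(0)
# Long-tailed sample: log-normal values (an illustrative choice)
values = [random.lognormvariate(0, 1) for _ in range(1000)]

def histogram(data, edges):
    """Count how many values fall in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in data:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

lo = min(values)
hi = max(values) + 1e-9  # nudge the top edge so the maximum lands in the last bin
n_bins = 8

# Equal-width bins: most of the sample piles into the first bin or two
equal_edges = [lo + (hi - lo) * i / n_bins for i in range(n_bins + 1)]
# Uneven bins: each bin is wider than the previous one by a constant factor
log_edges = [lo * (hi / lo) ** (i / n_bins) for i in range(n_bins + 1)]

equal_counts = histogram(values, equal_edges)
log_counts = histogram(values, log_edges)

print("equal-width bins:", equal_counts)
print("uneven bins:     ", log_counts)
```

With equal-width bins, the tail stretches the range so far that the bulk of the data collapses into the leftmost bins. The uneven bins spread the same values more evenly, which is usually closer to what you want to see for this kind of distribution.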

So, these plots can require some tuning to get a good representation of the data. For non-continuous data, histograms don’t work. Creating univariate plots that capture the nature of the data can be challenging. The EDA process is time-consuming and often results in throwaway work. But it doesn’t have to be this way.

The feature type system in Oracle Accelerated Data Science (ADS) allows you to define feature types specific to the features that you’re working with. These types aren’t tied to the dtype, which is how the data is stored in the computer. Creating univariate plots can be time-consuming because you often have many plots to make, and each different data type requires a different plot. To address this, the system includes a concept called feature type plots, which allow you to capture the code that you develop for creating your univariate plots. You can then reuse and share them across your team.

In this post, we show you a tool that understands the type of data that you have and creates a plot that is relevant to the data. The tool returns a plot object, built with Seaborn, that you can customize to meet your specific needs. This post also introduces feature type plots as part of the feature type system in ADS. If you’re new to the feature type system, check out the post How feature types improve your data science workflow.

The feature type system can help you codify what your data represents. This process can speed up and improve the time-consuming data validation that you repeat each time you work with a new data set. The system also has tools for computing custom statistics, visualizing data, validating data with validators and a warning system, and selecting columns based on their feature types.

Feature plot

Visualization of a dataset is a quick way to gain insights into the distribution of values. The feature type system in ADS provides plots for all ADS-supported feature types, including the default feature type. So, every feature has a default plot. Calling .ads.feature_plot() on a Pandas series produces a univariate plot. The following segment produces a bar chart that counts employees by how often they travel for work:


      import ads
      import pandas as pd

      # attrition_path must point to your copy of the attrition data set (CSV)
      df = pd.read_csv(attrition_path,
                       usecols=['Attrition', 'TravelForWork', 'JobFunction',
                                'TrainingTimesLastYear'])
      # Assign a feature type to each column
      df.ads.feature_type = {'Attrition': ['category'],
                             'TravelForWork': ['category'],
                             'JobFunction': ['category'],
                             'TrainingTimesLastYear': ['continuous']}
      df['TravelForWork'].ads.feature_plot()

A graphic depicting the bar chart for work travel

The .ads.feature_plot() method on a Pandas series returns a matplotlib Axes object. In the next code snippet, the plot is stored in the variable travel_plot, and then the .set_title() method, part of matplotlib, is used to add a title to the plot. This approach allows you to further customize a feature plot.


      travel_plot = df['TravelForWork'].ads.feature_plot()
      travel_plot.set_title("Count of the Number of Employees and How Much They Travel")

A graphic depicting the updated bar chart with the title "Count of the Number of Employees and How Much They Travel"

It’s often faster to produce the feature plots for all the features in a dataframe at once. You can take this shortcut by calling .ads.feature_plot() on the dataframe. The command returns a dataframe where each row represents a feature. The dataframe has two columns: Column, which is the name of the column, and plot, which is the plot object. This approach enables you to perform bulk operations on the plot objects because they exist in one data structure.

Graphics depicting the updated bar and line graphs with the added information.
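As a rough illustration of such bulk operations, the following sketch loops over a dataframe of plot objects and applies the same customization to each one. So that it runs without ADS, the dataframe is built by hand with plain matplotlib. The helper make_axes, the placeholder bar values, and the exact column names (taken from the description above) are assumptions; check them against what .ads.feature_plot() returns in your version of ADS.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def make_axes():
    """Hypothetical stand-in for a feature plot: a small bar chart on its own figure."""
    fig, ax = plt.subplots()
    ax.bar(["a", "b"], [1, 2])  # placeholder data, not real feature values
    return ax

# Hand-built stand-in for the dataframe described above: one row per feature.
# With ADS, this would come from calling .ads.feature_plot() on a dataframe.
features = ["Attrition", "TravelForWork", "JobFunction"]
plots = pd.DataFrame({"Column": features,
                      "plot": [make_axes() for _ in features]})

# Bulk operation: give every plot a title and a shared y-axis label
for _, row in plots.iterrows():
    ax = row["plot"]
    ax.set_title(f"Distribution of {row['Column']}")
    ax.set_ylabel("Count")
```

Because every plot lives in one data structure, house styling such as titles, axis labels, and tick rotation can be applied in a single loop instead of plot by plot.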

Creating custom feature type plots

So far, we’ve looked at the default feature type plots that ADS provides. They’re generic fallback plots that handle most cases. However, the power of the feature type system is that you can define a feature type that’s customized to your data and then reuse the code that you created.

Each feature can have multiple feature types through a system of multiple inheritance. For example, an organization that sells cars might have a set of data that represents the purchase price of a car (the wholesale price). This feature can have the feature types wholesale_price, car_price, USD, and continuous. So far, we’ve only looked at features that use the ADS-defined feature types.
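The idea of a feature carrying several feature types, from most specific to most generic, can be sketched in plain Python. The following isn't how ADS is implemented. It's a minimal illustration, under the assumption that the most specific type is listed first and that the first type defining a plot wins. All class names and the resolve_plot helper are hypothetical.

```python
class FeatureTypeSketch:
    """Hypothetical base class; a stand-in, not the ADS FeatureType class."""

class Continuous(FeatureTypeSketch):
    @staticmethod
    def feature_plot(values):
        # Generic fallback for any continuous feature
        return f"histogram of {len(values)} values"

class USD(FeatureTypeSketch):
    pass  # defines no plot of its own

class CarPrice(FeatureTypeSketch):
    @staticmethod
    def feature_plot(values):
        # Domain-specific plot for car prices
        return f"price-band bar chart of {len(values)} values"

class WholesalePrice(FeatureTypeSketch):
    pass  # defines no plot of its own

def resolve_plot(feature_types):
    """Return the first type, in the order given, that defines its own feature_plot."""
    for feature_type in feature_types:
        if "feature_plot" in vars(feature_type):
            return feature_type
    return None

# Most specific type first, most generic last
types = [WholesalePrice, CarPrice, USD, Continuous]
chosen = resolve_plot(types)
description = chosen.feature_plot([21000, 18500, 32000])
print(chosen.__name__, "->", description)
```

In this sketch, WholesalePrice has no plot of its own, so the CarPrice plot is used. If only USD and Continuous were listed, the generic continuous plot would act as the fallback.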

To get the most out of the feature type system, you need to create custom feature types, which is a simple process. You can quickly create these feature types and share them across your organization. Create a custom feature type by defining a class derived from the FeatureType class or one of its subclasses. This class comes with several attributes and methods that you can override so that it meets your specific needs. However, you don’t need to override any property of the base class. If you’re not familiar with creating a custom feature type, check out How to create custom feature types for exploratory data analysis.

Let’s create an example where the data is a collection of strings that represent credit card numbers. We want to produce a plot that gives us summary information about the data that we have. The default behavior treats these numbers as categorical data and produces a plot of the count of each unique card. For most cases, this result isn’t very informative.

Create a CustomCreditCard class where the .feature_plot() method produces a count of cards by issuer, where the issuers are companies such as American Express, Visa, MasterCard, and others. Because all data is dirty, you might also be interested in knowing how many credit cards are invalid or don’t belong to the issuers that you’re interested in. Further, records might be missing values, and you want to know how many records are affected.

We can create the CustomCreditCard class by inheriting from the FeatureType class. We can override several methods and attributes, but in this example, you only need to override the .feature_plot() method. This method takes a Pandas series and returns a matplotlib Axes object. Use @staticmethod when declaring the feature_plot method.

Before you create the class, you need a helper function to do the heavy lifting of determining who issued a credit card based on its number. The card_identify class in the ADS library has a method, .identify_issue_network(), to help you.


      import ads
      import matplotlib.pyplot as plt
      import numpy as np
      import pandas as pd
      import seaborn as sns

      from ads.common.card_identifier import card_identify
      from ads.feature_engineering import feature_type_manager, FeatureType

      def assign_issuer(cardnumber):
          """ Identifies the credit card type """
          if pd.isnull(cardnumber):
              return "missing"
          else:
              return card_identify().identify_issue_network(cardnumber)


With this helper function defined, we can define the CustomCreditCard class.


      class CustomCreditCard(FeatureType):
          """ Type representing custom credit card numbers.

          Methods
          --------
          feature_plot(x: pd.Series) -> plt.Axes
              Generates plot of the count of cards by issuer.
          """
          @staticmethod
          def feature_plot(x: pd.Series) -> plt.Axes:
              """ Generate a plot of the count of cards by issuer. """
              # Map each card number to its issuer, then count cards per issuer
              card_types = x.apply(assign_issuer)
              df = card_types.value_counts().to_frame()
              if len(df.index):
                  ax = sns.barplot(x=df.index, y=list(df.iloc[:, 0]))
                  ax.set(xlabel="Issuing Financial Institution")
                  ax.set(ylabel="Count")
                  return ax

The .feature_plot() method accepts a series of credit card numbers. It then converts the credit card numbers into a list of financial institutions that issued the card. Then it produces a count of cards for each issuer. This information is then used to create a custom plot.
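To see the data transformation behind the plot without installing ADS or drawing anything, here is a simplified sketch that uses only the standard library. The function assign_issuer_sketch is a hypothetical stand-in for the ADS helper. It looks only at well-known leading digits (4 for Visa, 51 to 55 for MasterCard, 34 and 37 for American Express) and performs no checksum or length validation. The labels it returns are illustrative and might differ from what .identify_issue_network() returns.

```python
from collections import Counter

def assign_issuer_sketch(cardnumber):
    """Hypothetical, simplified stand-in for the ADS helper: prefix rules only."""
    # Missing values: None or NaN (NaN is the only value not equal to itself)
    if cardnumber is None or cardnumber != cardnumber:
        return "missing"
    number = str(cardnumber)
    if not number.isdigit():
        return "unknown"
    if number.startswith("4"):
        return "Visa"
    if number[:2] in {"51", "52", "53", "54", "55"}:
        return "MasterCard"
    if number[:2] in {"34", "37"}:
        return "American Express"
    return "unknown"

cards = ["4532640527811543", "5406644374892259", "371025944923273",
         float("nan"), None, "", "111", "0"]

# Step 1: map each card number to an issuer label.
# Step 2: count the cards per label; these counts are the bar heights.
issuer_counts = Counter(assign_issuer_sketch(c) for c in cards)
print(issuer_counts)
```

The counts per label are what the bar chart displays, which is why missing and unrecognized numbers show up as their own bars instead of silently disappearing.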

Before you can use the custom feature type, register it with ADS using the following command:


      feature_type_manager.feature_type_register(CustomCreditCard)

You’re now ready to start using your custom feature type and the feature type plot. In the following code snippet, you create some valid and invalid credit card numbers. You could put them in a Pandas series, but in this example, they go into a dataframe, where your data normally exists. Before you can create your plot, tell ADS that the credit_card column is a CustomCreditCard feature type.


      creditcard_numbers = ["4532640527811543", "4556929308150929", 
                       "4539944650919740", "4485348152450846", 
                       "4556593717607190", "5406644374892259",      
                       "5440870983256218", "5446977909157877", 
                       "5125379319578503", "5558757254105711",
                       "371025944923273", "374745112042294", 
                       "340984902710890", "375767928645325", 
                       "370720852891659", np.nan, None, "", "111", "0"]
      df = pd.DataFrame({'credit_card': creditcard_numbers})
      df['credit_card'].ads.feature_type = [CustomCreditCard]

Now that you have a dataframe with credit card numbers and the column is a CustomCreditCard feature type, the following command generates the feature plot:


      df['credit_card'].ads.feature_plot()

The resulting bar graph, showing the issuing financial institutions and credit card count.

The power of the feature type system is that you can reuse this code. Generally, data scientists examine the same type of data repeatedly. After the custom feature type is defined, you can store it in a shared library, for example on GitHub, so that you can reuse and share it.

Summary

The plotting capabilities in feature plots allow you to create univariate plots that are relevant for your feature types. They also allow you to modify the plots so that the visualization is ideal for your audience.

Explore OCI Data Science

Try Oracle Cloud Free Tier! A 30-day trial with US$300 in free credits gives you access to OCI Data Science service.

John Peach

Principal Data Scientist

A modern polymath, John holds advanced degrees in mechanical engineering, kinesiology and data science, with a focus on solving novel and ambiguous problems. As a senior applied data scientist at Amazon, John worked closely with engineering to create machine learning models to arbitrate among chatbot skills, entity resolution, search, and personalization.

As a principal data scientist for Oracle Cloud Infrastructure, he is now defining tooling for data science at scale. John frequently gives talks on best practices and reproducible research. To that end, he has developed an approach to improve validation and reliability by using data unit tests, and has pioneered Data Science Design Thinking. He also coordinates SoCal RUG, the largest R meetup group in Southern California.

