How to create custom feature types for exploratory data analysis

October 1, 2021 | 9 minute read
John Peach
Principal Data Scientist

Separating how your data is stored (its dtype) from what the feature represents is a powerful concept in that you can customize the behavior of your tools to fit the data. The Feature Type system in the Accelerated Data Science (ADS) SDK empowers you to define what exactly a feature represents. You can build custom statistics and univariate plots, and define tools to validate your data and warn you when there is a problem. While ADS comes with a collection of common feature types, the power of the system is that your organization can create custom feature types and then share them with your team.

You only need to understand a few concepts to create custom feature types within ADS. If you are new to the feature type system, you might want to check out How feature types improve your data science workflow. In summary, the feature type system helps you codify what your data represents. Doing this speeds up and improves the data validation process each time you work with a new data set. The system also provides tools for custom statistics, data visualization, data validation through validators and a warning system, and column selection based on feature types.

Multiple inheritance increases code reuse

Each feature can have multiple feature types through a system of multiple inheritance. For example, an organization that sells cars might have a set of data that represents the purchase price of a car (the wholesale price). This feature could have the feature types wholesale_price, car_price, USD, and continuous. You can quickly create these feature types and share them across your organization. A custom feature type is created by defining a class that is derived from the FeatureType class or one of its subclasses. This class comes with several attributes and methods that can be overridden to meet your specific needs. However, there is no requirement to override any property of the base class.

A feature type has the following attributes that can be overridden:
description: A description of the feature type.
name: The name of the feature type.

If you wish to create custom summary statistics for a feature type, override the .feature_stat() method. To create a custom summary plot, override the .feature_plot() method. These are the two methods that you would typically override to customize a feature type. However, the multiple inheritance system means that you only have to override a method when no other feature type in the inheritance chain already produces the output that you want. This reduces code duplication, so you spend less time defining custom feature types and more time on your analysis.

Custom feature types also have a warning and validator system where you can register handlers. These will not be discussed in detail in this blog post.

Create a custom feature type

To create a custom feature type, create a class that inherits from the FeatureType class and register it with ADS. In the next code block, the custom feature type, CustomCreditCard, inherits from the FeatureType base class. The class overrides the name attribute to set a custom name for the class. If it is not overridden, the name is automatically derived from the class name, converted to snake case. The name is used to register and unregister a class, to identify the class in several outputs, and to refer to the class when assigning a feature type to a series.

The description attribute lets you provide detailed information about a feature type. If not overridden, the description will default to “Base Feature Type.”

from ads.feature_engineering import FeatureType


class CustomCreditCard(FeatureType):
    """Type representing custom credit card numbers.

    Attributes
    ----------
    description: str
        The feature type description.
    name: str
        The feature type name.

    Methods
    -------
    feature_stat(x: pd.Series) -> pd.DataFrame
        Generates feature statistics.
    feature_plot(x: pd.Series) -> plt.Axes
        Generates plot object.
    """

    description = "This is an example of a custom credit card feature type."
    name = "Custom Credit Card"
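The default naming rule described above can be pictured with a small plain-Python sketch. It does not use ADS: FeatureTypeSketch, to_snake_case, and resolved_name are hypothetical names, and the regular expression is only one plausible way to convert a class name to snake case, not necessarily the one ADS uses.

```python
import re


def to_snake_case(class_name: str) -> str:
    """Insert an underscore before each interior capital letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


class FeatureTypeSketch:
    """Hypothetical stand-in for a feature type base class (not the ADS class)."""

    description = "Base Feature Type"
    name = None  # subclasses may override this

    @classmethod
    def resolved_name(cls) -> str:
        # Only a name set on this exact class counts as an override.
        own_name = cls.__dict__.get("name")
        return own_name if own_name else to_snake_case(cls.__name__)


class WholesalePrice(FeatureTypeSketch):
    pass  # no override: the name falls back to the snake-cased class name


class CustomCreditCard(FeatureTypeSketch):
    name = "Custom Credit Card"  # explicit override


print(WholesalePrice.resolved_name())
print(CustomCreditCard.resolved_name())
```

In this sketch, WholesalePrice resolves to a name derived from its class name, while CustomCreditCard keeps the name it set explicitly.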

The .feature_stat() method provides summary statistics for a feature type. It takes a Pandas series and returns a dataframe with two columns. The first is the metric, such as mean or count, and the second is the value of the metric. The purpose of this method is to generate metrics that are meaningful to the feature type. For example, for a credit card feature, we would probably not be interested in the mean credit card number but may be interested in a count of cards for each issuer.
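As a rough illustration of issuer counts as metric and value pairs, here is a dependency-free sketch. It is not the ADS implementation: issuer_of and feature_stat_rows are hypothetical helpers, the prefix table is deliberately incomplete, and plain tuples stand in for the two-column dataframe that a real .feature_stat() would return.

```python
from collections import Counter


def issuer_of(card_number):
    """Hypothetical, incomplete lookup of the card network from leading digits."""
    if card_number is None:
        return "Missing"
    digits = str(card_number)
    if digits.startswith("4"):
        return "Visa"
    if digits[:2] in {"34", "37"}:
        return "American Express"
    if digits[:2] in {"51", "52", "53", "54", "55"}:
        return "Mastercard"
    return "Unknown"


def feature_stat_rows(card_numbers):
    """Return (metric, value) rows: the total count, then one count per issuer."""
    counts = Counter(issuer_of(number) for number in card_numbers)
    rows = [("count", len(card_numbers))]
    rows += [(f"count_{issuer}", n) for issuer, n in sorted(counts.items())]
    return rows


# Well-known test card numbers, plus one missing value.
cards = ["4111111111111111", "5500000000000004", "340000000000009", None,
         "4012888888881881"]
for metric, value in feature_stat_rows(cards):
    print(metric, value)
```

Each row pairs a metric name with its value, which mirrors the two-column layout described above.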

The .feature_plot() method returns a univariate plot that represents the distribution of the data. The idea is that you create the plot that best summarizes your data. For example, if you have geo coordinate data, a scatter plot overlaid on a map may be appropriate.

These methods do not have to be defined in each feature type class. Suppose that .feature_stat() is called. If the method is not overridden on a specific feature type, the multiple inheritance mechanism is used. For example, suppose we create a feature type called wholesale_price and assign it to a Pandas series called car_price, so that the series has the following feature types: wholesale_price, car_price, USD, and continuous. Assuming that .feature_stat() is not overridden in the wholesale_price class, the system attempts to generate the feature statistics using the .feature_stat() method defined in its parent feature type, car_price. This continues until it reaches a feature type that defines .feature_stat(). If none of the custom feature type classes implement .feature_stat(), the call falls through to the default feature type, which always implements this method. In this case, that is the continuous default feature type.
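The fall-through behavior described above can be sketched in plain Python. This is only an illustration of the idea, not how ADS resolves methods internally: the classes and the resolve helper are hypothetical, and plain lists stand in for Pandas series.

```python
class Continuous:
    """Default type: always implements feature_stat."""

    @staticmethod
    def feature_stat(values):
        return [("count", len(values)), ("mean", sum(values) / len(values))]


class USD:
    pass  # no feature_stat of its own


class CarPrice:
    pass  # no feature_stat of its own


class WholesalePrice:
    pass  # no feature_stat of its own


class WholesalePriceWithStat:
    """Variant that does override feature_stat."""

    @staticmethod
    def feature_stat(values):
        return [("max_wholesale_price", max(values))]


def resolve(feature_types, method_name):
    """Walk the ordered feature types and return the first own definition."""
    for feature_type in feature_types:
        if method_name in vars(feature_type):
            return getattr(feature_type, method_name)
    raise AttributeError(f"no feature type implements {method_name}")


prices = [10.0, 20.0]

# No custom type defines feature_stat, so the default type answers.
stat = resolve([WholesalePrice, CarPrice, USD, Continuous], "feature_stat")
print(stat(prices))

# The first type in the chain overrides it, so the search stops there.
stat = resolve([WholesalePriceWithStat, CarPrice, USD, Continuous], "feature_stat")
print(stat(prices))
```

The order of the feature types matters in this sketch: the first type that defines the method wins, and later types are never consulted.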

The important rule is this: all methods of the feature type must be static or class-level methods and should take a Pandas series as the first argument.

Creating custom methods and attributes

As with all classes, you can add your own helpful attributes and methods. The following code snippet implements a method called .issuer() that returns the name of the bank that issued the credit card number. Since we are creating a custom credit card class, this is a handy piece of information.


import pandas as pd
from ads.feature_engineering import FeatureType

# Assumption: card_identify is the ADS helper that maps a card number to its
# issuing network; adjust this import path to match your ADS version.
from ads.common.card_identifier import card_identify


class CustomCreditCard(FeatureType):
    @staticmethod
    def issuer(series: pd.Series) -> pd.Series:
        """Identifies the credit card type.

        Parameters
        ----------
        series: pd.Series
            The data to process.

        Returns
        -------
        pd.Series: The result of processing data.
        """

        def assign_issuer(cardnumber):
            """Identifies the credit card type."""
            if pd.isnull(cardnumber):
                return "Missing"
            return card_identify().identify_issue_network(cardnumber)

        # Apply the helper to every card number in the series.
        return series.apply(assign_issuer)

You can access the custom method by calling .issuer() on a series using the feature type class. The next code snippet uses this approach. It has the advantage that the series does not have to have the feature type CustomCreditCard.

from ads.feature_engineering import feature_type_manager
CustomCreditCard = feature_type_manager.feature_type_object('Custom Credit Card')
CustomCreditCard.issuer(df['credit_card'])

The more common approach is to call .issuer() on the Pandas series itself. Before doing this, the series needs to be associated with the feature type ‘Custom Credit Card’. The next code snippet assigns the feature type ‘Custom Credit Card’ to df['credit_card'] and then returns the issuer of each credit card number.

 

df['credit_card'].ads.feature_type = ['Custom Credit Card']
df['credit_card'].ads.issuer()

Registering and unregistering a custom feature type

Once the feature type class has been created, it must be registered with ADS using the feature_type_manager.feature_type_register() method. The next code snippet registers the CustomCreditCard using the feature type object that is obtained from the .feature_type_object() method. The feature type name, ‘Custom Credit Card’, is used to identify the custom feature type.

from ads.feature_engineering import feature_type_manager
CustomCreditCard = feature_type_manager.feature_type_object('Custom Credit Card')

feature_type_manager.feature_type_register(CustomCreditCard)

To see a list of all the registered feature types, use the following command:

feature_type_manager.feature_type_registered()

It returns a dataframe where each row is a feature type, with the columns Class, Name, and Description. These values are set in the class definition.

To unregister a custom feature type, use the feature_type_manager.feature_type_unregister() method. This method accepts either a feature type object, as in feature_type_manager.feature_type_unregister(CustomCreditCard), or its name, as in feature_type_manager.feature_type_unregister("Custom Credit Card").
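To picture how registering a class and unregistering it by class or by name fit together, here is a minimal plain-Python sketch of a name-keyed registry. It is not how ADS implements feature_type_manager; RegistrySketch and its methods are hypothetical names used only for illustration.

```python
class RegistrySketch:
    """Hypothetical name-keyed registry of feature type classes."""

    def __init__(self):
        self._types = {}

    def register(self, feature_type):
        # The class-level name is the lookup key.
        self._types[feature_type.name] = feature_type

    def unregister(self, type_or_name):
        # Accept either the class itself or its registered name.
        if isinstance(type_or_name, str):
            name = type_or_name
        else:
            name = type_or_name.name
        return self._types.pop(name)

    def registered(self):
        # One (Class, Name, Description) row per registered type.
        return [(t.__name__, t.name, t.description) for t in self._types.values()]


class CustomCreditCard:
    name = "Custom Credit Card"
    description = "This is an example of a custom credit card feature type."


registry = RegistrySketch()
registry.register(CustomCreditCard)
print(registry.registered())

registry.unregister("Custom Credit Card")  # by name
registry.register(CustomCreditCard)
registry.unregister(CustomCreditCard)      # by class
print(registry.registered())
```

Keying the registry on the name attribute is what lets the same entry be removed through either the class or the name string.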

Custom feature types are a smart way to improve your EDA

When you create a custom feature type, you improve the speed and reliability of your data analysis. It allows you to tailor your exploratory data analysis (EDA) to fit the nature of the data. Creating a custom feature type is as simple as creating a class and overriding the attributes and methods that are specific to the feature type. There is no need to define anything extra if it is not needed, as smart default values are used.

This blog post is part of a series that discusses the feature type system in ADS. This post focused on creating a basic custom feature type. It briefly mentioned feature type plots, validators, and warnings. Upcoming posts will cover in detail how to create those to customize the quality checks you normally perform on your data.


A modern polymath, John holds advanced degrees in mechanical engineering, kinesiology and data science, with a focus on solving novel and ambiguous problems. As a senior applied data scientist at Amazon, John worked closely with engineering to create machine learning models to arbitrate among chatbot skills, entity resolution, search, and personalization.

 

As a principal data scientist for Oracle Cloud Infrastructure, he is now defining tooling for data science at scale. John frequently gives talks on best practices and reproducible research. To that end, he has developed an approach to improve validation and reliability by using data unit tests, and has pioneered Data Science Design Thinking. He also coordinates SoCal RUG, the largest R meetup group in Southern California.

