How feature types improve your data science workflow

August 17, 2021 | 8 minute read
John Peach
Principal Data Scientist

There is a new concept in the Oracle Accelerated Data Science (ADS) SDK that will greatly improve your workflow: feature types. A feature type is a property of the data that defines what it is. If the data is a customer ID, its data type might be int64, while its feature type is a customer ID. By codifying what your data represents, you can speed up the validation process each time you work with a new data set.

The feature type system comes with tools so you can compute custom statistics, do visualizations, use a validator and a warning system, and select columns based on the feature types. In this series of blog posts, we will go over each of these components. To start, let us look at what feature types are and how to assign them to your data.

The feature type system: What, where, why

Feature types classify data based on what they represent, not how they are stored in memory. Each set of data can have multiple feature types through a system of multiple inheritance. As a concrete example, an organization that sells cars might have a set of data that represents the wholesale price of a car. That data could have the feature types wholesale_price, car_price, USD, and continuous. This multiple inheritance allows a data scientist to create feature type warnings and feature type handlers for each feature type.

Feature type warnings help you do rapid validation. For example, wholesale_price might have a method that ensures that the value is a positive number because you cannot purchase a car with negative money. The car_price feature type may have a check to ensure that the value is within a reasonable price range. USD can check that the value represents a valid US dollar amount that is not below one cent. The default continuous feature type represents the way the data is stored internally.

The feature type handlers are a set of is_* methods, where * is generally the name of the feature type. For example, the method .is_wholesale_price() can create a Boolean Pandas series that indicates which values meet the validation criteria. This lets you quickly identify which values need to be filtered or where you might have problems in the data pipeline. Feature type handlers can be complex, too. For example, they might take a client ID and call an API to validate that each one is active.
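As a rough illustration of what such a handler produces, the sketch below uses plain pandas rather than the ADS feature type classes. The function name mirrors the handler mentioned above, but its body, the sample prices, and the positive-number rule are assumptions made for this example.

```python
import pandas as pd


def is_wholesale_price(series: pd.Series) -> pd.Series:
    """Return a Boolean series that is True where a value passes the check.

    Assumed rule for this sketch: a wholesale price must be a positive number.
    """
    # Values that cannot be parsed as numbers become NaN.
    numeric = pd.to_numeric(series, errors="coerce")
    # NaN compares as False, so unparseable values fail the check.
    return numeric > 0


prices = pd.Series([18500.0, -200.0, 0.0, 23999.99], name="wholesale_price")
valid = is_wholesale_price(prices)

# Use the Boolean mask to pull out the values that need attention.
print(prices[~valid])
```

The same mask can be used the other way around, as prices[valid], to keep only the values that passed.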

The feature type system improves the exploratory data analysis (EDA) process. There are also several built-in statistics that work across the different columns in a dataframe. With feature type statistics, you create summary statistics that are relevant to the feature type. This way you get a meaningful summary. There are also correlation statistics that allow you to examine the relationships between the different features in the dataset.

The feature type graphs allow you to create custom univariate graphs that meaningfully describe the feature type. You get a Seaborn object back so that you can do any additional customization that may be needed for your target audience.

How inheritance helps

Pandas dtypes are physical data types that indicate how data are stored. You can call .dtypes on a Pandas dataframe, or .dtype on a series, to inspect the physical types. Feature types are the logical types that define how the data should be interpreted by the end user, and they also categorize features from the machine learning perspective. Different feature types can share the same physical type. For example, both categorical and ordinal features can have an integer dtype. The difference between them is that ordinal features have an ordering while categorical features do not.
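The distinction between physical and logical types can be seen with plain pandas. In the sketch below, the column names, the sample values, and the logical_types dictionary are made up for illustration and are not part of ADS.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "store_id": [101, 205, 101, 330],  # categorical: labels with no order
        "rating": [3, 5, 1, 4],            # ordinal: 1 < 2 < ... < 5
    }
)

# A dataframe exposes one dtype per column through .dtypes ...
print(df.dtypes)
# ... while a single series exposes its dtype through .dtype.
print(df["rating"].dtype)

# Both columns share the same physical type.
same_physical_type = df["store_id"].dtype == df["rating"].dtype

# Hypothetical logical labels: only the ordinal column has a meaningful order,
# so taking its minimum makes sense while the minimum store_id does not.
logical_types = {"store_id": "categorical", "rating": "ordinal"}
worst_rating = df["rating"].min()
```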

ADS allows a set of data to have multiple feature types through a system of inheritance. For example, a hospital may have a medical record number for each patient. That data might have the feature types of patient_id, id, and integer. The patient_id is the child feature type with id being its parent. The integer is the parent of the id feature type. It is also the last feature type in the inheritance chain and is called the default feature type.
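One way to picture the chain is as an ordered list that is searched from child to parent. The sketch below is a plain Python stand-in, not the ADS implementation: the classes, the summarize method, and the resolve helper are all hypothetical, and the lookup rule (use the first type in the chain that defines the method) is an assumption made for this example.

```python
class Integer:
    name = "integer"

    @staticmethod
    def summarize(values):
        return {"count": len(values), "min": min(values), "max": max(values)}


class Id:
    name = "id"

    @staticmethod
    def summarize(values):
        return {"count": len(values), "unique": len(set(values))}


class PatientId:
    name = "patient_id"  # defines no summary of its own


def resolve(feature_types, method_name):
    """Return the first feature type in the chain that defines method_name."""
    for feature_type in feature_types:
        if method_name in vars(feature_type):
            return feature_type
    raise LookupError(f"no feature type defines {method_name!r}")


# Ordered from child to parent; the default type comes last.
chain = [PatientId, Id, Integer]
record_numbers = [1001, 1002, 1002, 1003]

owner = resolve(chain, "summarize")
print(owner.name, owner.summarize(record_numbers))
```

Because patient_id defines no summary in this sketch, the lookup falls through to id, which is the kind of reuse that a chain of feature types is meant to provide.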

In addition to the regular feature types, there are two special versions. The default type is based on the Pandas dtype and cannot be changed without changing the Pandas dtype. There is no need to set it because it is always the last feature type in the inheritance chain. The tag feature type supports neither feature type warnings nor handlers. It is designed to allow you to tag data with extra information but not have to go through the work of creating a feature type class.

Calling feature_type_manager.registered_type() gives you an overview of all the registered feature types. ADS comes with various common feature types, but the idea is that you create feature types that explicitly define your data.

 

 

Setting feature types

The .feature_type property is used to store the feature types that are to be associated with a dataset. It accepts an ordered list of the feature types associated with the dataset. The next image shows a series of credit card numbers and then defines the feature type to be the built-in credit_card and string feature types.

 

Series

To assign feature types to a Pandas series, use the .ads.feature_type property on the series. In the preceding image, .feature_type accepted the name of the class, credit_card, as a string. It is also possible to pass the class itself. This gives you flexibility for how you want to define the feature types.

The feature type manager can be used to get a FeatureType object that is based on the credit_card feature type. You do this with the .get_type() method. The .get_type() method takes a class name and returns an object based on FeatureType. This object represents the feature type. For example, string is the class name of the feature type String.

You can repeat the preceding example by replacing credit_card with CreditCard and obtain the same results.
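The interchangeability of names and classes can be sketched with a small registry. Everything below is a hypothetical stand-in written for this example: the FeatureType base class, the REGISTRY dictionary, and the normalize helper are not ADS code, and this get_type returns a class from a local dictionary rather than calling the feature type manager.

```python
class FeatureType:
    """Minimal stand-in base class for this sketch."""

    name = "feature_type"


class String(FeatureType):
    name = "string"


class CreditCard(FeatureType):
    name = "credit_card"


# Hypothetical registry keyed by feature type name.
REGISTRY = {cls.name: cls for cls in (String, CreditCard)}


def get_type(name):
    """Look up a feature type class by its registered name."""
    return REGISTRY[name]


def normalize(feature_types):
    """Accept names or classes and return a list of classes, order preserved."""
    return [get_type(ft) if isinstance(ft, str) else ft for ft in feature_types]


by_name = normalize(["credit_card", "string"])
by_class = normalize([CreditCard, String])
mixed = normalize([get_type("credit_card"), "string"])
```

All three calls produce the same ordered list, which is why either form can be used when assigning feature types.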

 

Dataframe

Like a Pandas series, .feature_type can be used on a dataframe to set the feature types for the columns in the dataframe. The property accepts a dictionary where the key in the dictionary is the column name, and the value is a list of feature types associated with that column. For example, assume you have a dataframe with the columns Attrition, TravelForWork, JobFunction, EducationalLevel. See the next image for an example of how to set the feature types.
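The shape of that dictionary can be sketched in plain Python. The column names come from the example above, but the feature type names assigned to each column are placeholders chosen for this sketch, and the two helper functions are hypothetical checks rather than anything ADS provides.

```python
columns = ["Attrition", "TravelForWork", "JobFunction", "EducationalLevel"]

# Keys are column names; values are ordered lists of feature type names.
# The feature type names below are placeholders chosen for this sketch.
feature_types = {
    "Attrition": ["boolean", "category"],
    "TravelForWork": ["category"],
    "JobFunction": ["category"],
    "EducationalLevel": ["category"],
}


def unknown_columns(mapping, columns):
    """Return mapping keys that do not match any column name."""
    return sorted(set(mapping) - set(columns))


def unassigned_columns(mapping, columns):
    """Return columns that have no feature types assigned, in column order."""
    return [col for col in columns if not mapping.get(col)]


print(unknown_columns(feature_types, columns))
print(unassigned_columns(feature_types, columns))
```

A check like unknown_columns is a cheap way to catch a misspelled column name before the dictionary is assigned.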

 

Defining the default feature type

There is a special feature type called the default feature type based on the Pandas dtype. It does not have to be set by the user, but it can be.

Feature types allow for multiple inheritance, and the default feature type is an ancestor to all other feature types, except for tags. Each series has only one default feature type. It cannot be muted or removed unless the underlying Pandas dtype has changed. For example, suppose you have a Pandas series called series that has a dtype of string, so its default feature type is string. If you change the type by calling series = series.astype('category'), then the default feature type becomes category instead of string.

You can use the .default_type attribute to determine the default feature type. For example, series.ads.default_type returns one of the following values: boolean, date_time, category, string, continuous, integer, or object.
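The link between the Pandas dtype and the default feature type can be approximated with plain pandas. The function below is a simplified stand-in, not the ADS logic: default_type_sketch and its dtype-to-name mapping are assumptions made for this example, using only the type names listed above.

```python
import pandas as pd


def default_type_sketch(series: pd.Series) -> str:
    """Map a pandas dtype to a default feature type name (assumed mapping)."""
    dtype = series.dtype
    if isinstance(dtype, pd.CategoricalDtype):
        return "category"
    if pd.api.types.is_bool_dtype(dtype):
        return "boolean"
    if pd.api.types.is_datetime64_any_dtype(dtype):
        return "date_time"
    if pd.api.types.is_integer_dtype(dtype):
        return "integer"
    if pd.api.types.is_float_dtype(dtype):
        return "continuous"
    if pd.api.types.is_string_dtype(dtype):
        return "string"
    # Fallback for any dtype not matched above.
    return "object"


series = pd.Series(["red", "green", "red"], dtype="string")
before = default_type_sketch(series)

# Changing the physical type changes the default feature type with it.
series = series.astype("category")
after = default_type_sketch(series)

print(before, after)
```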

Optional tags for additional semantics

A non-tag feature type must have a Python class defined and registered with ADS. However, it is often convenient to tag a dataset with additional information without the need to create a feature type class. This is the role of the Tag, which allows you to create a feature type without having to explicitly define and register a class. The tradeoff is that a tag cannot have feature type warnings or feature type handlers. Tags are semantic and provide more context about the actual meaning of a feature, which can directly affect how the information is interpreted. Tags are optional for any dataset.

The process of creating your tag is the same as setting the feature types because it is a feature type. You use the .feature_type property to create tags.

The following image demonstrates how to create a set of credit card numbers, set the feature type to credit_card, and tag the dataset as being inactive cards from North American financial institutions. You can put any text you want in Tag() because no underlying feature type class has to exist.

You can obtain a list of tags on a series using the tags attribute.
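How tags can sit alongside regular feature types is sketched below in plain Python. The Tag class here is a local stand-in written for this example, not the class that ships with ADS, and the two helper functions and the tag texts are hypothetical.

```python
class Tag:
    """Free-text label; needs no registered feature type class."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return f"Tag({self.text!r})"


# Feature type names and tags share one ordered list.
feature_types = [
    "credit_card",
    Tag("Inactive Card"),
    Tag("North American"),
    "string",
]


def tags_of(feature_types):
    """Return only the tag texts, in the order they were assigned."""
    return [ft.text for ft in feature_types if isinstance(ft, Tag)]


def validated_types_of(feature_types):
    """Return the entries that are backed by a feature type class."""
    return [ft for ft in feature_types if not isinstance(ft, Tag)]


print(tags_of(feature_types))
print(validated_types_of(feature_types))
```

Keeping the two kinds of entry separable reflects the tradeoff described above: tags carry meaning for the reader, while only the class-backed feature types can offer warnings and handlers.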

Create feature types that are specific to your data

The feature type system uses multiple inheritance to let you define the properties of a feature. Each column in a dataframe is a feature, and it can have feature types assigned to it using the .feature_type property. There are several built-in feature types, but the idea is that you will create feature types that are specific to your data. You can also create tag feature types. While tags do not let you use the statistics, plotting, validator, or warning systems, they are a quick way to descriptively label your data.

John Peach

Principal Data Scientist

A modern polymath, John holds advanced degrees in mechanical engineering, kinesiology and data science, with a focus on solving novel and ambiguous problems. As a senior applied data scientist at Amazon, John worked closely with engineering to create machine learning models to arbitrate among chatbot skills, entity resolution, search, and personalization.

 

As a principal data scientist for Oracle Cloud Infrastructure, he is now defining tooling for data science at scale. John frequently gives talks on best practices and reproducible research. To that end, he has developed an approach to improve validation and reliability by using data unit tests, and has pioneered Data Science Design Thinking. He also coordinates SoCal RUG, the largest R meetup group in Southern California.

