Feature stores are emerging as a fundamental component of machine learning (ML) platforms. As data-driven organizations increasingly incorporate AI into their operations, they are starting to centralize the management of features within a dedicated repository, known as a feature store. This specialized storage solution, often referred to as a feature store service, tackles one of the most significant and complex challenges in the ML life cycle: working with data or, more specifically, the process of feature engineering.
Throughout the ML life cycle, features and models constitute the most precious assets generated, and a feature store essentially acts as a specialized data warehouse tailored for data science needs. Importantly, features differ from raw data. They're derived from raw data and are crafted for predictive or causal analysis. The capacity to produce high-quality, predictive features is a distinguishing factor between a good and an exceptional data scientist.
Crafting such potent features demands domain expertise in the specific business problem at hand, experience, and a thorough comprehension of the constraints within the model production environment. Feature engineering, a resource-intensive process, greatly benefits from a cohesive framework where features are meticulously documented, shared, stored, and served in a streamlined manner. According to Anaconda's 2022 State of Data Science report, about 51% of a data scientist's time is spent on data-related tasks.
In this blog post, we detail the challenges data scientists face in feature engineering, such as inconsistency between training and serving features and redundancy in feature computation. We then introduce the OCI Data Science feature store as a centralized repository to manage, store, and serve features, ensuring consistency and saving computational resources. We illustrate how to get started with the OCI Data Science feature store using the flights on-time performance dataset as an example of a use case where a feature store streamlines the feature engineering process and facilitates collaboration among data science teams. The post concludes with best practices for setting up and managing a feature store, emphasizing the importance of versioning, monitoring, and establishing a feedback loop for continuous improvement.
OCI Data Science feature store is a one-stop shop for the entire life-cycle management of features with the following key benefits:
Centralized feature management: One of the most important advantages of a feature store is that it provides a centralized repository for features. In the world of Big Data, where data scientists and engineers are working with diverse, distributed datasets, having a single source of truth is essential. A feature store organizes features, ensures that they're consistent, and makes them easily accessible to different teams and models.
Feature consistency and reproducibility: ML models are only as good as the data they're trained on. Without a feature store, features can be inconsistently calculated, leading to inaccurate and unreliable models. Feature stores ensure that features are computed consistently and can be reused across different models, significantly improving the reproducibility of ML projects.
Real-time and batch features: Modern ML applications often require features to be available in real time. Feature stores enable seamless management and serving of both batch and real-time features, catering to a wide array of ML applications, from fraud detection to customer recommendations.
Accelerating time to production: Feature stores significantly reduce the time needed to take a model from development to production. They allow for smooth and efficient collaboration between data scientists and engineers because the former can easily access the processed features they need, while the latter can focus on maintaining the data pipeline.
Reducing costs and improving resource efficiency: Feature duplication and redundant computation are major issues without a centralized feature store. By using a feature store, organizations can reduce storage and computation costs because features are computed once and reused rather than recalculated for every new model.
Collaboration and governance: Feature stores encourage collaboration between teams by providing a shared, centralized repository of features. They also include tools for monitoring, validation, and version control, which are critical for governance and compliance requirements.
Programmatic and declarative programming support: The OCI Data Science feature store supports programmatic interfaces in SQL, Python, and PySpark, as well as declarative definitions in YAML.
A feature store has different components, as shown in the above logical architecture diagram. The diagram also shows the following personas:
Feature generators create features through the client interfaces, compute them with the feature engine, write feature metadata and definitions to the feature registry, and store the features in the storage tier that matches their latency and data-refresh needs. Meanwhile, feature consumers access datasets that contain the curated features, build ML models, and deploy them to the target environments. Let's expand on each of these components.
| Component | Description |
|---|---|
| Client interfaces | Software development kits (SDKs), a web UI, and a CLI. Users extract data from different data sources into the feature store through these interfaces, and data connectors ingest customer data into a pandas or Spark data frame. |
| Feature registry | A catalog-like metadata repository that stores feature definitions, metadata, and computation logic. It allows users to discover, document, and share features across teams, enables versioning of feature definitions, and ensures that data scientists and ML engineers can efficiently reuse features across different projects. |
| Feature engine | The set of tools, processes, and frameworks designed to facilitate feature extraction, transformation, and creation; essentially the backbone that powers the generation and management of features used in ML workflows. |
| Feature storage | The underlying infrastructure, mechanisms, and protocols used to persistently save and manage ML features, in both raw and processed forms. This storage is essential for ensuring consistent and rapid access to features for training and inference. Storage is two-tiered, with offline and online options; offline storage, where features and the associated metadata are kept, is powered by OCI Object Storage. |
Maintaining feature consistency becomes paramount as ML projects grow in complexity and scale. For newcomers and seasoned professionals alike, understanding the intricacies of a feature store can be a game-changer. In this section, we offer a step-by-step walkthrough to get started with the OCI Data Science feature store. We take a flight on-time performance dataset, create a feature store instance, create entities, perform transformations on the targeted features, create feature groups and datasets, and, acting as a feature consumer, run a query to extract the dataset for an ML model. Let's look into each of these steps in detail.
Before getting started, we must set up policies and authentication methods. See the documentation for setting up the following prerequisites:
We take the dataset from the US Department of Transportation's (DOT) Bureau of Transportation Statistics, which tracks the on-time performance of domestic flights operated by large air carriers.
First, we load the flight dataset into a pandas data frame.
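As a minimal sketch (the file name and column names here are illustrative assumptions, not the dataset's official schema), loading the CSV into pandas might look like this:

```python
import io
import pandas as pd

# Illustrative sample of the DOT on-time performance data. In practice you
# would read the downloaded CSV, for example pd.read_csv("flights.csv");
# the file name and column names are assumptions for this sketch.
csv_data = io.StringIO(
    "FL_DATE,OP_CARRIER,ORIGIN,DEST,DEP_DELAY,ARR_DELAY\n"
    "2023-01-01,AA,JFK,LAX,5.0,7.0\n"
    "2023-01-01,DL,ATL,ORD,42.0,55.0\n"
)
flights_df = pd.read_csv(csv_data)
print(flights_df.shape)  # → (2, 6)
```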
The Oracle Accelerated Data Science SDK (ADS) controls the authentication mechanism in the notebook session. To set up authentication, use `ads.set_auth("resource_principal")` or `ads.set_auth("api_key")`.
Refer to the authentication policies in the Quickstart section of the feature store documentation.
To create and run a feature store, we need to specify `<compartment_id>` and `<metastore_id>` for the offline feature store.
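For example (the OCIDs below are placeholders in an illustrative format; substitute the real values from your own tenancy and metastore):

```python
# Placeholder OCIDs -- replace with the real values from your tenancy.
compartment_id = "ocid1.compartment.oc1..exampleuniqueid"
metastore_id = "ocid1.datacatalogmetastore.oc1..exampleuniqueid"
```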
The first part of the scenario consists of creating a feature store instance. The feature store is the top-level entity for the feature store service. Run the following command:
An entity is a group of semantically related features. To create entities, use the following command:
Transformations in a feature store refer to the operations and processes applied to raw data to create, modify, or derive new features that can be used as inputs for ML models.
The following Python functions define two pandas transformations used for feature engineering: impute_columns, which replaces missing (NaN) values with the column's mode, and transform_flight_df, which selects specific columns from the imputed data frame.
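A minimal sketch of those two transformations, assuming illustrative flight column names (not the dataset's actual schema), could look like this:

```python
import pandas as pd


def impute_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Replace missing (NaN) values in each column with that column's mode."""
    imputed = df.copy()
    for col in imputed.columns:
        mode = imputed[col].mode(dropna=True)
        if not mode.empty:
            imputed[col] = imputed[col].fillna(mode.iloc[0])
    return imputed


def transform_flight_df(df: pd.DataFrame) -> pd.DataFrame:
    """Select a subset of columns from the imputed frame (names are illustrative)."""
    columns = ["FL_DATE", "OP_CARRIER", "ORIGIN", "DEST", "DEP_DELAY", "ARR_DELAY"]
    return impute_columns(df)[columns]
```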
A feature group is an object that represents a logical group of time-series feature data as it is found in a data source. Run the following command:
The `show()` method displays the feature group after creation.
This method doesn't persist any metadata or feature data in the feature store on its own. To persist the feature group and save the feature data along with the metadata in the feature store, call the `materialise()` method with a DataFrame, as in the following command:
We can call the `select()` method of the FeatureGroup instance to return a `Query` interface, which can be used to join and filter the feature group.
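Conceptually, the query behaves like a relational join and filter over feature groups. As a local analogue in plain pandas (synthetic data and column names, not the ADS API itself), joining two feature groups on a key and filtering might look like this:

```python
import pandas as pd

# Two toy "feature groups" sharing a flight identifier key (illustrative data).
flights = pd.DataFrame({
    "flight_id": [1, 2, 3],
    "dep_delay": [5.0, 42.0, -3.0],
})
carriers = pd.DataFrame({
    "flight_id": [1, 2, 3],
    "op_carrier": ["AA", "DL", "AA"],
})

# Join the feature groups on the key, then filter to delayed flights only.
joined = flights.merge(carriers, on="flight_id", how="inner")
delayed = joined[joined["dep_delay"] > 0]
print(delayed["flight_id"].tolist())  # → [1, 2]
```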
A dataset is a collection of features used to train a model or perform model inference. In the ADS feature store module, you can use either the Python API or YAML to define a dataset. With the following method, you can define a dataset and give it a name.
Calling the `create()` method creates the flights dataset.
We can use these datasets to train ML models.
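As a minimal sketch of that handoff (plain pandas with synthetic values; the column names and the 15-minute delay threshold are assumptions, and a real workflow would pull the materialized dataset and train a proper model), a dataset can be split and used to fit even a trivial baseline:

```python
import pandas as pd

# Illustrative dataset as it might be extracted from the feature store.
df = pd.DataFrame({
    "dep_delay": [5.0, 42.0, -3.0, 0.0, 15.0, 1.0],
    "arr_delay": [7.0, 55.0, -5.0, 2.0, 20.0, 3.0],
})
# Label: was the arrival delayed by more than 15 minutes?
df["delayed"] = (df["arr_delay"] > 15.0).astype(int)

# Deterministic split: first four rows train, last two rows test.
train, test = df.iloc[:4], df.iloc[4:]

# Trivial baseline model: always predict the majority class seen in training.
majority = int(train["delayed"].mode().iloc[0])
accuracy = (test["delayed"] == majority).mean()
print(accuracy)  # → 0.5
```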
We can call the `sql()` method of the FeatureStore instance to query the feature store.
Using the selected dataset from the feature store, we can now train an ML model to capture patterns and insights, ensuring consistent, high-quality features that can drive the model's predictive accuracy. In subsequent blogs, we showcase how you can use datasets to train ML models and store the models in the model catalog.
We covered how to set up a feature store instance and create entities, transformations, feature groups, and datasets. But the features and functionalities of the OCI Data Science feature store don't end there. It also offers the following other key features:
The world of data science and machine learning is complex, with various components working in tandem to produce actionable insights. At the heart of this ecosystem lies the concept of a feature store. Feature stores serve as centralized repositories for features meticulously crafted from raw data to be used in machine learning models. They aim to streamline the workflow of data scientists by ensuring that features are consistent, easily accessible, and primed for both real-time and batch processing.
In this blog, we covered some of the concepts related to feature stores, such as entities, feature groups, datasets, and creating datasets that are reusable in ML models. In this process, we created the fundamental building blocks of an OCI Data Science feature store.
In summary, a feature store isn’t merely a data repository, but a sophisticated platform that bridges the gap between raw data and insightful machine learning models. Feature stores are shaping the future of how data science operations are conducted by emphasizing organization, validation, transparency, and visualization, offering a blend of efficiency and assurance.
To get started on the feature store, you can try the sample notebooks or watch the demos.
Try Oracle Cloud Free Trial for yourself! A 30-day trial with US$300 in free credits gives you access to OCI Data Science service. For more information, see the following resources:
Srikanta (Sri) is a Principal Product Manager in OCI Data Science. He leads efforts related to experiment tracking with oci-mlflow, the OCI Data Science feature store, and model catalog capabilities in the OCI Data Science portfolio. Sri brings 19+ years of work experience across industry verticals such as aviation and aerospace, semiconductor manufacturing, and print and media. He holds a master's degree from the National University of Singapore and an MBA from the University of North Carolina.
As a senior member of technical staff at OCI, I work on the AI and ML platform that enables data scientists and engineers to build and deploy scalable and reliable machine learning models. I am responsible for designing, developing, and testing the platform features, as well as mentoring and guiding the junior developers on the team.
Prior to joining Oracle, I was a Software Development Engineer at InMobi, where I worked on various projects related to machine learning, creative ingestion, and creative approval. I developed a service from scratch in Scala that ingested images and videos in the system at scale, and integrated it with various exchanges to allow submission of ads. I also contributed to the creative selector project that segregated ads based on clicks and installs.
I have a strong background in computer science, having graduated with a Bachelor's degree from Vellore Institute of Technology in 2016. I participated in the ACM ICPC Kolkata Regionals and secured a position in the top 60. I also obtained certifications in IBM networking and Introduction to R from DataCamp.
I am passionate about learning new technologies and solving challenging problems. I enjoy working with a diverse and collaborative team that values innovation and quality. My goal is to leverage my skills and experience to create impactful and user-friendly solutions that can make a difference in the world.
I am an enthusiastic engineer and love to solve challenging tasks. I'm currently working at Oracle India Private Ltd. in the OCI Data Science domain as a Member of Technical Staff. My role involves taking on technical challenges, designing solutions to problems, and developing and testing code. I currently work with Java, Dropwizard, Oracle DB, OCI services, Docker, and Python.
Najiya is currently working as a Senior Engineer on the AI Platform team in Oracle Cloud. She is an experienced software engineer with a demonstrated history of work in product development, skilled in requirements analysis, system design, Java, Python, and data structures and algorithms. She also has knowledge of machine learning and deep learning, and is passionate about learning new technologies.