Introduction
In the rapidly evolving landscape of machine learning (ML) and artificial intelligence (AI), embeddings have become the backbone of many state-of-the-art systems, including large language models (LLMs). These dense vector representations encapsulate complex data modalities, from words and sentences to entire documents, images, and more. Feature stores, designed to streamline the ML pipeline by providing consistent and efficient access to preprocessed features, are now facing the challenge of effectively handling these embeddings. But the plot thickens with the rise of LLMs.
With their unparalleled capability to generate and interpret embeddings, LLMs have significantly reshaped the realm of natural language processing and understanding. This blog explores the necessity of incorporating embedding support into feature stores, highlighting the significant influence and possibilities unlocked by LLMs. In this exploration, we uncover the interconnectedness among embeddings, feature stores, and LLMs, delving into the future possibilities of this combination in contemporary AI applications.
The power of embeddings and the feature store
Managing a plethora of datasets, data sources, and transformations for machine learning, including word embeddings, can be complex and costly. Issues such as insufficient data cleaning, inconsistencies, transformation errors, data drift, and disparities between training and serving stages can delay model development and reduce its effectiveness, particularly when handling word embeddings. A feature store emerges as an all-encompassing solution in this scenario, providing a unified platform for data transformation and access throughout the training and serving phases. It ensures a steadfast pipeline for the ingestion and querying of data, including word embeddings, enhancing their usability and performance. This blog showcases how to use a feature store to create and store word embeddings and, in the process, highlights how reusability and efficient storage methods improve the embedding workflow. A feature store helps in the following areas:
- Efficient storage for text embeddings: A fundamental role of the embedding feature store is housing pretrained word embeddings, such as Word2Vec, GloVe, FastText, or BERT. Originating from vast text datasets, these embeddings capture word semantics, proving beneficial for numerous natural language processing (NLP) tasks, including text categorization, entity identification, and sentiment interpretation, among others.
- Support for tailor-made and custom embeddings: Some firms choose to craft their embeddings based on specific domain data or modify existing pretrained embeddings to align more closely with their distinct needs. An embedding feature store caters to these custom embeddings, making them available for diverse projects or model applications. Within the feature store, users can save these tailored embeddings with the transformation feature.
- Swift access: The architecture of embedding feature stores is geared towards the rapid and effective extraction of embeddings. This efficiency is paramount, given the potentially large dimensionality of embeddings and the computational demand of their calculation. With a dedicated storage system, recalculating the same embeddings becomes redundant.
- Uniformity and versioning: Version management and control are integral to an embedding feature store. Even as new embedding versions emerge, models can consistently use the same embeddings, preserving reproducibility and ensuring stable model operations over time.
- Handling vast data volumes: For entities grappling with colossal data volumes, the streamlined storage and extraction of embeddings can be daunting. However, a feature store powered by the Spark processing engine is adept at managing large-scale embeddings, offering tools for distributed data storage and access.
BERT and DistilBERT
DistilBERT, a lighter and faster variant of the BERT model, has shown promise in various NLP tasks without compromising much on performance. Its compact design, achieved through knowledge distillation, makes it particularly suitable for storing embeddings efficiently in a feature store. Feature stores, which serve as centralized repositories for machine learning features, need to manage storage effectively to ensure quick retrieval and usability of features. Embeddings from large models can consume significant storage space and increase retrieval times, affecting the efficiency of downstream tasks. By using DistilBERT embeddings, data scientists can strike a balance between the quality of the embeddings and storage efficiency, ensuring that features are both relevant for model training and promptly accessible within the feature store. Without further delay, let’s get started.
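To make the size difference concrete, the following snippet compares the parameter counts of the two models. This is a quick, illustrative check, assuming the transformers library and TensorFlow are installed; it downloads both checkpoints on first run:
from transformers import TFAutoModel

# Load both checkpoints and compare their parameter counts.
bert = TFAutoModel.from_pretrained("bert-base-uncased")
distilbert = TFAutoModel.from_pretrained("distilbert-base-uncased")

print(f"BERT parameters:       {bert.count_params():,}")        # roughly 110M
print(f"DistilBERT parameters: {distilbert.count_params():,}")  # roughly 66M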
Summary of steps
The following steps summarize the activities taking place:
- Set up the feature store instance with the following prerequisites:
  - Policies
  - Authentication and authorization
  - Variables declaration
- Import feature store-dependent Python libraries into the notebook.
- Load the SQuAD dataset.
- Create the feature store construct.
- Create entities and transformations on the word embeddings.
- Create feature groups.
- Create datasets.
- Train an ML model using the DistilBERT algorithm.
Prerequisites
Policies
Before getting started, set up the required policies and authentication methods. See the feature store documentation for details on configuring these prerequisites.
For this blog, we use the Stanford Question Answering Dataset (SQuAD). SQuAD is a reading comprehension dataset widely used for training and evaluating ML models on the task of question answering. Each example in the dataset consists of a question posed by a crowd worker based on a given passage from Wikipedia. The answer to every question is a segment, or span, of the passage.
Authentication
Install ADS with the following command:
python -m pip install --pre oracle-ads==2.9.0rc0
You can set the authentication to resource principal and point the client to the feature store API gateway service endpoint with the following command:
import ads
ads.set_auth(auth="resource_principal", client_kwargs={"service_endpoint": "<api_gateway>"})
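If you’re working outside an OCI resource, for example on a local machine, API key authentication is an alternative. The following is a minimal sketch, assuming a standard OCI configuration file at ~/.oci/config and the same API gateway endpoint placeholder as above:
import ads

# Alternative: authenticate with API keys from the default OCI config file.
ads.set_auth(auth="api_key", client_kwargs={"service_endpoint": "<api_gateway>"})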
Variables
The feature store keeps its metadata in a Hive metastore, so you need to provide the compartment OCID and the metastore OCID.
import os

# In an OCI Data Science notebook session, the compartment OCID is available in
# this environment variable; replace it with your compartment OCID elsewhere.
compartment_id = os.environ["NB_SESSION_COMPARTMENT_OCID"]
metastore_id = "<metastore_id>"
Import the required libraries
By default, the PySpark 3.2, Feature Store and Data Flow conda environment includes preinstalled Python libraries for feature validation and monitoring, such as great-expectations. The joining functionality is heavily inspired by the APIs used by Pandas to merge, join, or filter DataFrames. The APIs let you specify which features to select from which feature group, how to join them, and which features to use in join conditions (see the sketch after the imports below):
import warnings
warnings.filterwarnings("ignore", message="iteritems is deprecated")
warnings.filterwarnings("ignore", category=DeprecationWarning)
import pandas as pd
from ads.feature_store.feature_store import FeatureStore
from ads.feature_store.feature_group import FeatureGroup
from ads.feature_store.model_details import ModelDetails
from ads.feature_store.dataset import Dataset
from ads.feature_store.common.enums import DatasetIngestionMode
from ads.feature_store.feature_group_expectation import ExpectationType
from great_expectations.core import ExpectationSuite, ExpectationConfiguration
from ads.feature_store.feature_store_registrar import FeatureStoreRegistrar
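As an illustration of that joining style, a query across two feature groups might look like the following sketch. Note that fg_questions, fg_contexts, and the join signature shown here are hypothetical, used only to show the shape of the API; consult the feature store documentation for the exact method names:
# Hypothetical sketch of the Pandas-inspired join API: select features from one
# feature group and join them with another on a shared key. These feature
# groups are illustrative and are not created in this blog.
df = (
    fg_questions.select(["question_id", "question_embedding"])
    .join(fg_contexts.select(["question_id", "context_embedding"]))
    .read()
)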
Load the SQuAD Dataset
Now load the SQuAD dataset. The following fields are important:
- answers: The starting location of the answer token and the answer text.
- context: Background information from which the model needs to extract the answer.
- question: The question a model should answer.
from datasets import load_dataset
squad = load_dataset("squad", split="train[:5000]")
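To see these fields in practice, you can inspect a single record; the exact passage and answer vary by row:
# Peek at one training example to see the fields described above.
print(squad[0]["question"])
print(squad[0]["answers"])  # for example: {'text': ['...'], 'answer_start': [515]}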
Create the feature store construct
If you have an existing feature store, you can reference it by its OCID. You can also follow the documentation or refer to the “Setting up Feature Store” section of our previous blog post. The feature store is the top-level construct of the service. Use the following steps:
- Call FeatureStore() to define the feature store.
- To materialize it, call the .create() command.
feature_store_resource = (
    FeatureStore()
    .with_description("SQUAD Dataset Feature Store")
    .with_compartment_id(compartment_id)
    .with_display_name("SQUAD details")
    .with_offline_config(metastore_id=metastore_id)
)

# Use .create() to materialize the feature store
feature_store = feature_store_resource.create()
feature_store
The following image shows the feature store construct:

Create entities and transformations
An entity is a logical grouping of semantically related features within the feature store. In the context of a feature store, transformations refer to operations or a series of operations applied to raw data to derive features that train machine learning models. These transformations are critical because raw data often isn’t in a format that’s directly usable or optimal for ML tasks.
To create entities, use the following command:
entity = feature_store.create_entity(
    display_name="Squad Entity",
    description="description for Squad entity details"
)
entity

Now, we prepare the embedding inputs by using the transformation construct within the feature store. The transformation batches the examples, tokenizes the questions and contexts, and labels each answer’s start and end token positions.
def squad_embedding_transformation(df):
    from datasets import Dataset
    from transformers import AutoTokenizer
    import json
    import numpy as np

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    def preprocess_function(examples):
        questions = [q.strip() for q in examples["question"]]
        inputs = tokenizer(
            questions,
            examples["context"],
            max_length=384,
            truncation="only_second",
            return_offsets_mapping=True,
            padding="max_length",
        )
        offset_mapping = inputs.pop("offset_mapping")
        answers = examples["answers"]
        start_positions = []
        end_positions = []
        for i, offset in enumerate(offset_mapping):
            answer = answers[i]
            start_char = answer["answer_start"][0]
            end_char = answer["answer_start"][0] + len(answer["text"][0])
            sequence_ids = inputs.sequence_ids(i)

            # Find the start and end of the context
            idx = 0
            while sequence_ids[idx] != 1:
                idx += 1
            context_start = idx
            while sequence_ids[idx] == 1:
                idx += 1
            context_end = idx - 1

            # If the answer is not fully inside the context, label it (0, 0)
            if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
                start_positions.append(0)
                end_positions.append(0)
            else:
                # Otherwise it's the start and end token positions
                idx = context_start
                while idx <= context_end and offset[idx][0] <= start_char:
                    idx += 1
                start_positions.append(idx - 1)
                idx = context_end
                while idx >= context_start and offset[idx][1] >= end_char:
                    idx -= 1
                end_positions.append(idx + 1)

        inputs["start_positions"] = start_positions
        inputs["end_positions"] = end_positions
        return inputs

    df['answers'] = df['answers'].apply(json.loads)
    # Convert lists back to NumPy arrays within dictionaries
    df['answers'] = df['answers'].apply(lambda x: {k: np.array(v) for k, v in x.items()})
    dataset = Dataset.from_pandas(df)
    dataset = dataset.train_test_split(test_size=0.2)
    dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)
    return dataset["train"].to_pandas()
Materialize this transformation with the following code:
from ads.feature_store.transformation import TransformationMode

squad_transformation = feature_store.create_transformation(
    transformation_mode=TransformationMode.PANDAS,
    source_code_func=squad_embedding_transformation,
    display_name="squad_embedding_transformation",
)
squad_transformation
Feature groups
Now, create a feature group to store the embeddings created from the SQuAD dataset.
import json

squad_df = squad.to_pandas()

# Convert NumPy arrays within dictionaries to lists
squad_df['answers'] = squad_df['answers'].apply(lambda x: {k: v.tolist() for k, v in x.items()})
# Serialize the 'answers' column to JSON strings for storage; the
# transformation deserializes it back into dictionaries of NumPy arrays
squad_df['answers'] = squad_df['answers'].apply(json.dumps)

squad_embedding_feature_group = (
    FeatureGroup()
    .with_feature_store_id(feature_store.id)
    .with_primary_keys([])
    .with_name("squad_embedding_feature_group")
    .with_entity_id(entity.id)
    .with_compartment_id(compartment_id)
    .with_schema_details_from_dataframe(squad_df)
    .with_transformation_id(squad_transformation.id)
)
Materialize the SQuAD feature group with the following command:
squad_embedding_feature_group.create()

Show the feature group with the following command:
squad_embedding_feature_group.show()

View the history of the feature group versions
Using the .history().show() function, you can view how the feature group has evolved over time across its versions.
squad_embedding_feature_group.history().show()

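The step summary also lists creating datasets. A dataset in the feature store is a materialized snapshot, typically built by querying one or more feature groups. The following is a minimal sketch, assuming the Dataset builder imported earlier; the display name, description, and query string are illustrative:
# Minimal sketch: materialize a dataset from the feature group (illustrative
# names; the query selects everything from the transformed feature group).
squad_dataset = (
    Dataset()
    .with_description("Dataset built from the SQuAD embedding feature group")
    .with_compartment_id(compartment_id)
    .with_name("squad_embedding_dataset")
    .with_entity_id(entity.id)
    .with_feature_store_id(feature_store.id)
    .with_query(f'SELECT * FROM `{entity.id}`.`squad_embedding_feature_group`')
)
squad_dataset.create()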
Transform the feature groups to Pandas dataframes
You can also read a feature group back as a Pandas dataframe.
training_df = squad_embedding_feature_group.select().read().toPandas()
training_df
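Before training, it can help to sanity-check that the transformed rows contain tokenized model inputs rather than raw text; the column names mentioned in the comment below are the expected output of the transformation:
# Expect tokenized inputs such as input_ids, attention_mask,
# start_positions, and end_positions rather than raw text columns.
print(training_df.columns.tolist())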
You can now use this dataframe to train a DistilBERT model.
Train a DistilBERT model
You’re now ready to start training your model! First, set up the optimizer, learning-rate schedule, and training step count with create_optimizer():
from transformers import create_optimizer

batch_size = 16
num_epochs = 2
total_train_steps = (len(training_df) // batch_size) * num_epochs
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=total_train_steps,
)
The TFAutoModelForQuestionAnswering class in the Hugging Face transformers library is designed for TensorFlow users and provides a simple way to automatically load a pretrained model suitable for question-answering (QA) tasks.
from transformers import TFAutoModelForQuestionAnswering
model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
In Hugging Face’s transformers library, the DefaultDataCollator serves a key role in the data preparation pipeline for model training. It converts a list of samples into a format suitable for model training, especially when training with mini-batches.
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator(return_tensors="tf")
from datasets import Dataset

tf_train_set = model.prepare_tf_dataset(
    Dataset.from_pandas(training_df),
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
Import TensorFlow, then compile the model with the optimizer using the .compile() command. Hugging Face models come with a suitable default loss for question answering, so you don’t need to pass one. Finally, train the model with the .fit() function:
import tensorflow as tf

model.compile(optimizer=optimizer)
model.fit(x=tf_train_set, epochs=num_epochs)
Once the model is trained and evaluated, you can deploy it through Oracle Cloud Infrastructure (OCI) Data Science model deployment.
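As a rough sketch of what that deployment can look like with the ADS model framework, the following assumes the TensorFlowModel wrapper and an illustrative inference conda environment slug; treat the names and arguments as placeholders rather than a definitive recipe:
import tempfile
from ads.model.framework.tensorflow_model import TensorFlowModel

# Wrap the trained Keras model, prepare the artifact, register it in the
# model catalog, and create a deployment. The conda environment slug is
# illustrative; pick one that matches your training environment.
tf_model = TensorFlowModel(estimator=model, artifact_dir=tempfile.mkdtemp())
tf_model.prepare(inference_conda_env="tensorflow28_p38_cpu_v1", force_overwrite=True)
tf_model.save()
deployment = tf_model.deploy()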
Conclusion
In this blog, we saw how text embeddings can be stored efficiently and reused to train an ML model. The synergy between embeddings, especially those from LLMs, and feature stores is undeniable. As we continue to push the boundaries of what’s possible in machine learning, ensuring that our tools and infrastructure evolve in tandem is critical. The age of embeddings in feature stores, powered by the might of LLMs, is just beginning, and it promises a future of unparalleled innovation and discovery. For a detailed walkthrough on integrating a feature store with your workflows, refer to our demo Jupyter notebooks.
References
To get started on the feature store, you can try the sample notebooks or watch the demos. Try Oracle Cloud Free Trial for yourself! A 30-day trial with US$300 in free credits gives you access to the OCI Data Science service. For more information, see the following resources:
- Full sample including all files in the OCI Data Science sample repository on GitHub.
- Visit our OCI Data Science service documentation.
- Configure your OCI tenancy with these setup instructions and start using OCI Data Science.
- Star and clone our new GitHub repo! We included notebook tutorials and code samples.
- Try one of our LiveLabs. Search for “data science.”