Modern machine learning systems aren’t just judged by accuracy anymore. These models also need to be fast, scalable, and cost-effective. For example, in high-throughput use cases like fraud detection, models must make decisions in milliseconds while processing massive volumes of transactions.
While larger, more complex models often perform better, they’re also slower and more expensive to run, which makes them harder to use in real-world systems. So the challenge is: how do you keep the intelligence of these models while making them efficient enough for production?
In this blog, we explore how to solve this problem using model distillation and demonstrate how distillation enables us to achieve strong model performance with significantly reduced latency, model size, and cost.
In the next blog, we will build the complete pipeline on OCI using the OCI Data Science service and Object Storage.
What Is Model Distillation?
Model distillation, or knowledge distillation, is a way to transfer what a large, complex model has learned into a smaller, faster one. The smaller model (called the student) is trained to mimic the larger model (the teacher).
Instead of learning only from hard labels like “fraud” or “not fraud,” the student also learns from the teacher’s probability outputs. These probabilities show how confident the teacher is and how it distinguishes between classes. This helps the student pick up not just the final decision, but also the subtle patterns the teacher has learned.
The result is a model that keeps most of the accuracy of the larger model, while being much faster, lighter, and cheaper to run.
The distillation process usually follows a few simple steps:
- First, train a high-performing teacher model on your dataset.
- Use this model to generate soft labels (probability scores), which provide richer learning signals than hard labels.
- Train the student model using both the soft labels and the original ground truth data, so it can mimic the teacher.
- Finally, evaluate and compare both models in terms of accuracy, latency, and overall efficiency.
This approach helps strike a good balance between performance and efficiency, making the student model better suited for real-world deployment.

Building the Distillation Pipeline on OCI
Build and Train the Teacher Model
The first step is to build a strong teacher model that focuses on accuracy. Instead of using a single algorithm, we use an ensemble of models—each bringing a different strength.
- LightGBM (LGBMClassifier) and XGBoost (XGBClassifier): These gradient boosting models are great at capturing complex patterns in tabular data. They work especially well for fraud detection, where data is often imbalanced and relationships between features can be subtle.
- Random Forest: This helps make predictions more stable by reducing variance and handling noisy data better.
- K-Nearest Neighbors (KNN): A simpler model, but useful for spotting local patterns and rare or unusual behavior in the data.
Each model is trained separately, and their outputs are combined into an ensemble. The final prediction is based on the average (or weighted average) of their probabilities. This way, we get the benefit of multiple approaches instead of relying on just one.
The result is a much stronger teacher model. Boosting models capture complex global patterns, KNN adds local insights, and Random Forest improves stability. Together, they produce richer and more reliable probability outputs.
These high-quality probability scores are important—they become the “soft labels” we use in the next step to train a smaller, more efficient student model.
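To make this concrete, here is a minimal sketch of what such a teacher ensemble could look like using scikit-learn’s soft-voting wrapper. It assumes preprocessed training data in `X_train`/`y_train`; the hyperparameters are illustrative placeholders, not tuned values from our experiments.

```python
# Minimal teacher ensemble sketch; hyperparameters are illustrative.
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier

teacher = VotingClassifier(
    estimators=[
        ("lgbm", LGBMClassifier(n_estimators=300)),
        ("xgb", XGBClassifier(n_estimators=300, eval_metric="logloss")),
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("knn", KNeighborsClassifier(n_neighbors=25)),
    ],
    voting="soft",  # average each model's predicted probabilities
)
teacher.fit(X_train, y_train)  # X_train, y_train: preprocessed data (assumed)

# Soft labels: the teacher's fraud probability for each training example
soft_labels = teacher.predict_proba(X_train)[:, 1]
```

A weighted average can be achieved by passing `weights=[...]` to `VotingClassifier` if some models deserve more influence than others.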


Train the Student Model
Once the teacher model has generated soft labels (probability outputs), the next step is to train the student model to mimic this behavior in a much more efficient form.
For the student, we intentionally chose a simpler and lighter model—in this case, a HistGradientBoostingRegressor. Compared to the teacher ensemble, this model is:
- Computationally cheaper
- Faster at inference
- Smaller in size
- Capable of capturing non-linear relationships in tabular data
This choice strikes a balance: instead of using an overly simplistic model (like linear regression), we use a moderately expressive model that can still learn complex patterns, but without the overhead of maintaining multiple models in an ensemble. The goal of distillation is not just simplification, but efficient approximation of the teacher’s behavior.
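As a sketch of how this step could look, continuing from the teacher sketch above: the student regresses on a target that blends the teacher’s soft labels with the ground truth, as described in the steps earlier. The blend weight `alpha` and the held-out split `X_test` are illustrative assumptions, not values from our setup.

```python
# Student training sketch; `alpha` (teacher weight) is an assumption.
from sklearn.ensemble import HistGradientBoostingRegressor

alpha = 0.7  # how much the student listens to the teacher vs. hard labels
blended_target = alpha * soft_labels + (1 - alpha) * y_train

student = HistGradientBoostingRegressor(max_iter=300)
student.fit(X_train, blended_target)

# The student's predictions are fraud scores (roughly in [0, 1])
# that can be thresholded like probabilities at inference time.
student_scores = student.predict(X_test)
```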

Compare Teacher vs Student Models
Once both models are trained, the next step is to compare them side by side. The goal isn’t just to check accuracy—it’s to understand the trade-off between performance and efficiency, which is what distillation is all about.
We are going to focus on four key metrics when comparing the models:
1. ROC-AUC (Overall Performance)
ROC-AUC is a standard, widely accepted metric for classification quality. It measures how well the model separates fraudulent from legitimate transactions across all decision thresholds.

The ROC curve shows that both the teacher and student models perform very well, with AUC scores of 0.9969 and 0.9920, respectively. The student closely tracks the teacher across most thresholds, which means it has learned the teacher’s behavior with only a small drop in performance.
There’s a slight difference at very low false positive rates, where the teacher does a bit better, but the gap is minimal. Overall, this confirms that distillation has preserved most of the model’s predictive power while making it more efficient and easier to deploy.
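For reference, this comparison takes only a couple of lines, assuming the held-out `X_test`/`y_test` and the model objects from the sketches above:

```python
# ROC-AUC comparison sketch; assumes a held-out test split (X_test, y_test)
from sklearn.metrics import roc_auc_score

teacher_scores = teacher.predict_proba(X_test)[:, 1]
print(f"Teacher ROC-AUC: {roc_auc_score(y_test, teacher_scores):.4f}")
print(f"Student ROC-AUC: {roc_auc_score(y_test, student_scores):.4f}")
```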
2. Recall @ Fixed Precision (Business Impact)
This measures how much fraud the model detects while keeping precision high (for example, at 90%).

This chart shows the business trade-off between the teacher and student models.
At 90% precision, the teacher achieves ~90.6% recall, while the student reaches ~75.4%, meaning the teacher catches more fraud under strict conditions.
However, these results come from a demo setup and aren’t fully optimized. In real-world scenarios, performance will depend on data quality, feature engineering, and tuning. With further optimization, this gap can be reduced.
The key point is that distillation still produces an efficient model with strong performance, which can be improved based on real production needs.
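One way to compute this metric, continuing the earlier sketches (the helper `recall_at_precision` is our own illustration, not a library function):

```python
# Recall at a fixed precision target, via the precision-recall curve
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true, scores, min_precision=0.90):
    precision, recall, _ = precision_recall_curve(y_true, scores)
    ok = precision >= min_precision  # operating points meeting the bar
    return recall[ok].max() if ok.any() else 0.0

print(f"Teacher recall @ 90% precision: {recall_at_precision(y_test, teacher_scores):.3f}")
print(f"Student recall @ 90% precision: {recall_at_precision(y_test, student_scores):.3f}")
```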
3. Inference Latency (Speed)
This measures how long the model takes to make predictions, and it’s where the power of distillation really shines: the student model is dramatically faster than the teacher.


These results clearly show the efficiency gains from distillation. For a single request, the student model is about 27× faster than the teacher, reducing latency from ~90 ms to ~3 ms. With larger batches (10 requests), the gap increases further, with the student becoming over 50× faster.
This highlights a key benefit of distillation: while ensemble models are powerful, they come with higher inference cost. The student model captures that intelligence in a single, lightweight model, delivering fast and consistent predictions—making it much better suited for real-time use cases like fraud detection.
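A simple way to measure this yourself is to average wall-clock time over repeated predict calls, as in the sketch below. Absolute numbers will vary with hardware; the figures quoted above came from our demo environment.

```python
# Latency benchmark sketch: mean time per call over repeated runs
import time

def avg_latency_ms(model, batch, n_runs=100, use_proba=False):
    predict = model.predict_proba if use_proba else model.predict
    predict(batch)  # warm-up call, excluded from timing
    start = time.perf_counter()
    for _ in range(n_runs):
        predict(batch)
    return (time.perf_counter() - start) / n_runs * 1000

single = X_test[:1]  # a single "request"
print(f"Teacher: {avg_latency_ms(teacher, single, use_proba=True):.2f} ms/request")
print(f"Student: {avg_latency_ms(student, single):.2f} ms/request")
```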
4. Model Size (Efficiency)
This compares how large the models are in terms of memory. Smaller models are easier to deploy, faster to load, and scale better.

The model size comparison further highlights the efficiency gains from distillation. The teacher model, being an ensemble, has a much larger footprint, while the student model is far more compact.
This smaller size makes the student easier to store, quicker to load, and better suited for deployment in constrained environments. In practice, it also means lower infrastructure costs and better scalability.
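Serialized artifact size is a reasonable proxy for this footprint. A quick way to check it, as a sketch using pickle (the sizes you see will depend on your trained models):

```python
# Model size sketch: bytes of the pickled artifact as a footprint proxy
import pickle

def model_size_mb(model):
    return len(pickle.dumps(model)) / 1e6

print(f"Teacher size: {model_size_mb(teacher):.2f} MB")
print(f"Student size: {model_size_mb(student):.2f} MB")
```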
Key Takeaways
Model distillation is a practical way to balance accuracy and efficiency in real-world systems like fraud detection. By transferring knowledge from a strong ensemble teacher model to a smaller student model, you can retain most of the predictive performance while significantly improving latency, model size, and cost. This shows that with the right approach, it’s possible to build models that are both accurate and production-ready. OCI Data Science further simplifies this process by providing a secure, scalable environment for experimentation and deployment without added infrastructure complexity.
What’s Next
In this first part, we focused on understanding and implementing the distillation process—building the teacher-student setup and validating it through benchmarking. In the next part, we’ll take this further by building a fully automated, production-ready pipeline on OCI. We’ll use OCI Data Science features like Jobs, Pipelines, and Model Catalog to operationalize the entire workflow end to end.
