Unsupervised Anomaly Detection with MySQL HeatWave

May 9, 2023 | 5 minute read
Salil Pradhan
Principal Product Manager
Text Size 100%:

Anomaly detection is a technique to find abnormalities in the data. It has found applications in a wide variety of fields such as fraud detection, network intrusion, detecting life-threatening medical conditions, quality control etc.

Here are two examples: The first is an example of fraud detection with credit card usage. If a customer typically uses credit card in their home country and multiple large purchases were made using the credit card in a foreign country, then this is a deviation from the usual pattern. This could indicate that the card may have been stolen and used without owner’s knowledge. The second is an example related to automobile manufacturing – identifying defective part.  This identification can indicate a quality problem which should be fixed before it becomes much more expensive to correct it at later stages of assembly.

In the above examples, anomaly detection can sense abnormal behavior and can flag a warning so that corrective actions can be taken.

Anomaly detection is particularly challenging because of the issues such as lack of labelled data, rare occurrences of anomalies, inability to address various types of anomalies with one solution etc.  

Anomaly detection literature has multiple distance-based machine learning algorithms, which use nearest-neighbor distances to detect anomalies. However, these algorithms are often designed to detect specific anomaly types, so choosing an algorithm can be problematic if the user does not know which anomaly types are in the dataset.

There are three typical types of anomalies as illustrated below:

  1. Global anomaly

An anomaly that is far from all normal points. For instance, - A credit card transaction happening in a distant country whereas all other transactions are taking place in a specific country or region.

  1. Local anomaly

An anomaly that is relatively far from the closest normal points, but possibly the distance between it and the closest normal points is comparable to the distance between other normal points. For instance, Incorrect amount charged on a credit card.

  1. Cluster anomaly

An anomaly that is far from closest normal points but close to some other anomalies. For instance, Multiple consecutive maximum-allowed cash withdrawals from an ATM on the same day.

The example shown below has two normal clusters and illustrates distance based local, global and cluster anomalies.

 

 

HeatWave AutoML detects anomalies in unlabeled data in an automated fashion using a novel and patented technique called Generalized kth Nearest Neighbors (GkNN) which is based on an ensemble algorithm that does not require tuning of hyperparameters. It identifies common types of anomalies such as local, global, and clustered which typically requires separate algorithms to detect. This algorithm provides high performance on a set of anomaly detection data sets, out-performing some of the most widely utilized algorithms such as k-th Nearest Neighbor (kNN) and Local Outlier Factor (LOF).

 

 

HeatWave AutoML offer one of the best AUC (Area under the ROC Curve) score compared to other anomaly detection techniques. AUC score is a measure of model efficiency. The higher the AUC, the better the model’s performance at distinguishing between the positive and negative classes.

 

 

None of the competing products such as Google BigQuery ML, Redshift ML or Snowflake offer a fully automated solution for anomaly detection as HeatWave AutoML does. HeatWave is much faster and accurate compared to Redshift ML on a set of anomaly detection data sets.

 

 

HeatWave AutoML Example

HeatWave AutoML routines to train a model and to make predictions are provided below. Note that we only request the user to provide a contamination factor, which indicates the % anomalies in the data set.  The default value is 1%. Please refer to MySQL HeatWave AutoML documentation for the details.

Training

CALL sys.ML_TRAIN ('table_name', NULL, options, model_handle);

options: {

    JSON_OBJECT('key','value'[,'key','value'] ...)

        'key','value':

        |'task', {|'anomaly_detection'}|NULL

        | ‘contamination’, float
       }

Running the ML_TRAIN routine on training dataset produces a trained machine learning model.

Predictions

CALL sys.ML_MODEL_LOAD(model_handle, user);

The ML_MODEL_LOAD routine loads a model from the model catalog. A model remains loaded until the model is unloaded using the ML_MODEL_UNLOAD routine or until HeatWave ML is restarted by a HeatWave Cluster restart.

CALL sys.ML_PREDICT_TABLE(table_name, model_handle, output_table_name, options);

options: {

    JSON_OBJECT('key','value'[,'key','value'] ...)

        'key','value':

        |'threshold', float

        | ‘topk’, int
       }

ML_PREDICT_TABLE generates predictions for an entire table of unlabeled data and saves the results to an output table. 

CALL sys.ML_PREDICT_TABLE(table_name, model_handle, output_table_name, options);

options: {

    JSON_OBJECT('key','value'[,'key','value'] ...)

        'key','value':

        |'threshold', float

        | ‘topk’, int
       }

ML_PREDICT_ROW generates predictions for one or more rows of unlabeled data.

SELECT ML_PREDICT_ROW(input_data, model_handle, options);

options: {

    JSON_OBJECT('key','value'[,'key','value'] ...)

        'key','value':

        | ‘threshold’, float
       }

In summary, anomaly detection is an effective technique for finding abnormalities in data. HeatWave AutoML offers an automated machine learning pipeline for anomaly detection based on patented algorithm that can detect various types of anomalies and provides better accuracy compared to other anomaly detection techniques.

Salil Pradhan

Principal Product Manager

Salil Pradhan is a Product Manager in MySQL HeatWave team. His interests include distributed data processing, machine learning, cloud computing, middleware technologies as well as application areas such as Marketing Automation and Supply Chain Management.


Previous Post

MySQL at FOSSASIA 2023 Report

Lenka Kasparova | 4 min read

Next Post


How to Migrate from MariaDB to MySQL HeatWave

Frederic Descamps | 15 min read
Oracle Chatbot
Disconnected