This post was previously published on Towards Data Science.
IoT sensors can provide information about systems health, but they can also hide valuable early warnings related to incoming failure that could be avoided with predictive maintenance.
The Multivariate State Estimation Technique — Sequential Probability Ratio Test (MSET-SPRT), a machine learning algorithm used for decades in nuclear plants and critical infrastructures, is able to detect these key warnings hidden by the signal noise, and it will be released with Oracle Database 20c.
In this post, I’ll show how to apply a real machine learning test case for MSET-SPRT in predictive maintenance.
Subscribe to the Oracle AI & Data Science Newsletter to get the latest posts sent to your inbox!
According to McKinsey’s study “Visualizing the uses and potential impact of AI and other analytics”, 2018, the estimated impact of artificial intelligence and other analytics on all industries regarding anomaly detection is between $1.0T and $1.4T.
Anomaly detection is the critical success factor in predictive maintenance, which tries to anticipate when maintenance is required. This differs from the classical preventive approach, in which activities are planned on a regularly scheduled basis, or condition-based maintenance activities, in which assets are monitored through IoT sensors.
The problem of predictive maintenance can be explained with the following figure:
The Goldilocks problem of maintenance
Applying anomaly detection algorithms based on machine learning, it’s possible to perform prognostics to estimate the condition of a system or a component and its remaining useful life (RUL), in order to predict an incoming failure.
One of the most famous algorithms is the MSET-SPRT, well-described with a use case in this blog post: “Machine Learning Use Case: Real-Time Support for Engineered Systems." Originally developed at the US DOE’s Argonne National Laboratory in the 1990’s, it has been used for decades to monitor nuclear plants or air fleets, in which “Failure is not an option." It will be included into the next release of Oracle Database, the 20c, as an implementation improved by Kenny Gross at Oracle Labs.
Oracle Database 20c is now available as a preview on Oracle Cloud Infrastructure to test its performance.
The MSET-SPRT algorithm monitors critical processes and detects subtle anomalies. MSET is a nonlinear, nonparametric anomaly detection machine learning technique that calibrates the expected behavior of a system based on historical data from the normal operational sequence of monitored signals.
It works as shown in the following figure:
The MSET-SPRT algorithm
The MSET model is trained on a series of M observations, time-based, coming from N sensors, and it is used to estimate signal values based on new values in input.
To form a hypothesis about the overall health of the system, these functions calculate the difference between the estimated and the actual signal values (residual) and use SPRT calculations to determine whether any of the signals have become degraded. An output value is 1 for a normal signal in input, and 0 for an anomaly detected — several other pieces of information are provided in order to describe the scoring. The assumption is that the characteristics of the data being monitored do not change over time.
The goal is to detect a coming failure as soon as possible, extracting information from signals in order to issue an alarm at an early stage.
To experiment with this anomaly detection algorithm, I’ve chosen a dataset coming from NASA Ames Prognostics Data Repository named “Bearing Data Set," provided by the Center for Intelligent Maintenance Systems (IMS), University of Cincinnati [¹].
It has data collected from a mechanical test plant with 4 bearings — we will call them B1, B2, B3, and B4 — installed on a shaft rotating at 2000 RPM with a motor connected to the shaft, running for days without stops.
On each bearing were installed 2 accelerometers, so we have 8 signals, sampled each time for one second at 20 KHz. Each sampling has 20480 points, and it is repeated every 10 minutes.
There are several set of tests available, but I’ve used data collected from October 22, 2003 12:06:24 to November 25, 2003 23:39:56 (Set №1). In total, we have more than 44 Ml records.
At the end of this test, the bearing B3 was damaged, as well as the bearing B4.
To create a model that can detect an incoming defect as soon as possible, I’ve worked with an Oracle Database 20c instance with a Zeppelin Notebook configured to use PL/SQL as interpreter. All information about how to reproduce this test is at the end of this post.
For a general understanding about Oracle Machine Learning in PL/SQL on Oracle Database, please refer to my post “Machine Learning on Autonomous Database: A Practical Example”. In this post, I’ve talked about Autonomous Database, but using Zeppelin is the same approach from the programming point of view.
Check out additional blog posts by Corrado:
First of all, I’ve extracted from the 44 Ml records dataset the first timeframe that we can use as an example of normal behaviour of the system, on which must be trained the MSET (from 22-Oct to 1-Nov). The remaining part of dataset has been used to evaluate the accuracy of the algorithm trained.
An important note for dataset preparation: the field with a timestamp of sensor values record must be of DATE or TIMESTAMP type. In our case, I’ve choose the TIMESTAMP that allows to store fractions of second, as shown in this code:
CREATE TABLE MSET_TRAINING_SH_DATA ( "TIME_ID" TIMESTAMP (6) NOT NULL ENABLE, "B11" NUMBER, "B12" NUMBER, "B21" NUMBER, "B22" NUMBER, "B31" NUMBER, "B32" NUMBER, "B41" NUMBER, "B42" NUMBER );
Just to have a visual representation between the normal condition of B3.1 sensor vs near-failure, I’ve plotted the following graph:
B3.1 sensor in normal condition compared with failure signals
To have a more valuable data visualization, I’ve calculated the Root Mean Square for sensors attached to Bearing B3 and B4 observations for each sample of one second at 20 KHz. As you can see, there is no evidence for B3 of incoming damage until 24 November, when the situation worsens.
For B4, there are some minimal fluctuations on 21 Nov.
RMS signals from the 4 accelerators attached to the bearings are going to be damaged
Monitoring only these parameters, you have no time to plan a maintenance activity to substitute the bearing 3 and 4 and avoid the failure of entire system per time.
As a training sample, MSET-SPRT requires a more limited number of records compared the training set size (10 Ml), so I’ve queried the first second sampled: 22–10–2003 12:06:25.005000, corresponding to a 20480 records in the early stage of system test running, on 22 October.
MSET-SPRT implementations requires several parameters, but I’ve set only these ones:
It must be understood that, unlike classification algorithms that have no memory during scoring, the scoring in MSET-SPRT is done considering a set of sequential samples in time order (MSET_ALERT_WINDOW) and how many samples (MSET_ALERT_COUNT) have passed the threshold to fire an alert.
For calling the create_model() function, we don’t need to set the target, because the algorithm is trained on normal condition observations, but we need the specify timestamp field.
BEGIN DBMS_DATA_MINING.DROP_MODEL('MSET_MODEL1'); DBMS_DATA_MINING.create_model( model_name => 'MSET_MODEL1', mining_function => 'CLASSIFICATION', data_table_name => 'sample', case_id_column_name => 'time_id', target_column_name => '', settings_table_name => 'MSET_SH_SETTINGS'); END;
After that, I’ve run a prediction on the full test dataset, composed by more than 33 Ml records to evaluate how many days before the model could be helpful to predict an incoming failure.
To do this, I’ve written a Stored Procedure test_model(startDay,stopDay):
%osql CREATE OR REPLACE PROCEDURE test_model (startDay NUMBER,stopDay NUMBER) AS str1 varchar2(4000); str2 varchar2(4000); i varchar2(2); ii varchar2(2); BEGIN str1 := q'[ insert into pred_stat_temp select dt,count(pred) as fail_num from ( SELECT to_char(time_id,'DD') as dt,b11,b31, PREDICTION(mset_model1 using *) OVER (ORDER BY time_id) pred FROM mset_test_sh_data where to_char(time_id,'DD')=']'; str2 := q'[') where pred=0 group by dt order by dt]'; for i in startDay..stopDay loop if i<10 then ii:= '0'||to_char(i); else ii:= to_char(i); end if; execute immediate str1|| ii ||str2; end loop; END;
To execute a long-running procedure like this, I’ve used the Oracle Database Job as follows:
%osql DECLARE X NUMBER; begin dbms_job.submit( X, 'BEGIN test_model(1,25); END;', SYSDATE ); end;
And monitor the status with:
select job_name,job_action,start_date,end_date from USER_SCHEDULER_JOBS
This approach has allowed me to test several combinations of hyper-parameters in order to find the most suitable combination on algorithm settings.
The following figure is a plot of number of anomalies per day that have been fired, starting from November 10th.
Number of anomalies detected per day in last period before failure
On the 18th, we have detected 47 anomalies (on 2,375,680 records), compared with 282 anomalies on the 19th (on 1.474.560 records). The normalized graph represents 866% of growth, enough to trigger an alarm 5 days before the failure.
But is this indicator useful compared to a normal range of signals on the same days?
To have an idea, I’ve calculated the RMS, Min, Max and Variance of B3.1 signal on Nov. 18th and on Nov. 19th, the latter when the model start to increase significantly the number of alerts.
No indicator expresses better than the growth of 866% of anomalies detected by the model if we compare the two days.
You can use a threshold tuned on number of anomalies triggered on the 19th to have an alarm to solicit a maintenance activity.
If you want to test by yourself the algorithm and this use case, you have to set up the environment as follows:
 J. Lee, H. Qiu, G. Yu, J. Lin, and Rexnord Technical Services (2007). IMS, University of Cincinnati. “Bearing Data Set”, NASA Ames Prognostics Data Repository (http://ti.arc.nasa.gov/project/prognostic-data-repository), NASA Ames Research Center, Moffett Field, CA
The views expressed on this paper are my own and do not necessarily reflect the views of Oracle.