What is Anomaly Detection
In data science, anomaly detection is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.
In the following figure, the anomaly is a spike (shown in red). However, if the same spike occurs at regular intervals, it is not an anomaly.
There are three types of machine learning techniques: supervised, semi-supervised, and unsupervised.
Refer to https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/ for more details.
Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal by looking for instances that seem to fit least to the remainder of the data set.
We need unsupervised anomaly detection when we don’t have labelled data, i.e. when the data carries no labels indicating when an anomaly occurred.
Different types of Anomaly detection techniques are described below.
A safe bet is to use the wisdom of the crowds by running multiple methods as an ensemble. We can then combine them through a majority vote, or through the union or intersection of the individual algorithms’ verdicts.
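The three combination rules can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name combine_verdicts and the sample verdicts are hypothetical, and each detector is assumed to return one True/False flag per data point (True meaning anomaly).

```python
def combine_verdicts(verdicts, rule="majority"):
    """Combine per-detector anomaly flags into one verdict per data point.

    verdicts: list of lists, one inner list per detector, True = anomaly.
    rule: "majority", "union" or "intersection".
    """
    n_detectors = len(verdicts)
    # Number of detectors that flagged each data point.
    counts = [sum(column) for column in zip(*verdicts)]
    if rule == "union":
        return [c >= 1 for c in counts]            # any detector says anomaly
    if rule == "intersection":
        return [c == n_detectors for c in counts]  # all detectors agree
    if rule == "majority":
        return [c > n_detectors / 2 for c in counts]
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical verdicts from three detectors on four data points.
verdicts = [
    [True, False, True, False],
    [True, False, False, False],
    [True, True, False, False],
]

majority = combine_verdicts(verdicts, "majority")
union = combine_verdicts(verdicts, "union")
intersection = combine_verdicts(verdicts, "intersection")
```

The union flags the most points (fewer missed anomalies, more false alarms), the intersection flags the fewest, and the majority vote sits in between.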
It is a clustering-based anomaly detection technique.
There are other methods as well, such as the probability-based multivariate Gaussian distribution, PCA, and t-SNE.
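As an illustration of the multivariate Gaussian approach, the sketch below fits a mean and covariance to the data and flags the points with the lowest density. The synthetic data, the helper name gaussian_log_density and the 1% threshold are all assumptions made for this example. They do not come from the notebook.

```python
import numpy as np


def gaussian_log_density(X, mean, cov):
    """Log density of each row of X under a multivariate Gaussian."""
    d = X.shape[1]
    diff = X - mean
    # Squared Mahalanobis distance of each point from the mean.
    solved = np.linalg.solve(cov, diff.T).T
    mahalanobis_sq = np.sum(diff * solved, axis=1)
    _, log_det = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + log_det + mahalanobis_sq)


# Synthetic data: 500 normal points plus one point far from the rest.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(500, 2)), [[8.0, 8.0]]])

# Fit the Gaussian to the data.
mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)

# Points whose log density falls below the threshold are flagged.
log_p = gaussian_log_density(X, mean, cov)
threshold = np.quantile(log_p, 0.01)  # assumed cut-off: lowest 1%
is_anomaly = log_p < threshold
```

In practice the threshold is usually tuned on a small labelled validation set, if one is available, rather than fixed in advance.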
Feel free to walk through my IPython notebook: https://github.com/meenavyas/Misc/blob/master/AnomalyDetection.ipynb
In this notebook, I have tried IsolationForest and LOF (Local Outlier Factor). As you can see in the plots given below, the points that received high anomaly scores from these algorithms are anomalies.
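For readers who want a quick starting point, here is a minimal sketch of both algorithms using scikit-learn. It is not the notebook's code: the synthetic data, the contamination value and the other parameters are assumptions chosen for illustration. Note that scikit-learn's own scores are lower for more abnormal points, so they are negated here to make a high score mean "more anomalous".

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: 200 normal points around the origin plus 2 obvious outliers.
rng = np.random.default_rng(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([inliers, outliers])

# Isolation Forest: anomalies are isolated with fewer random splits.
iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
iso_labels = iso.fit_predict(X)     # -1 = anomaly, 1 = normal
iso_scores = -iso.score_samples(X)  # negated: higher = more anomalous

# Local Outlier Factor: compares each point's local density to its neighbours'.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
lof_labels = lof.fit_predict(X)     # -1 = anomaly, 1 = normal
lof_scores = -lof.negative_outlier_factor_  # negated: higher = more anomalous

print("IsolationForest flagged:", np.where(iso_labels == -1)[0])
print("LOF flagged:", np.where(lof_labels == -1)[0])
```

The contamination parameter is the assumed fraction of anomalies in the data. It sets the score threshold, so it directly controls how many points get flagged.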
To run anomaly detection automatically on streaming data, we may need infrastructure such as Apache Spark.
This post is also published on my personal blog.