Fraud affects all of us. As noted by the Coalition Against Insurance Fraud, “Honest consumers like you pay. Lives, families, businesses, and careers are wrecked. … Your money is stolen. … Your insurance premiums rise.” The statistics are sobering. Insurance fraud costs American consumers more than $308.6 billion per year. Ten percent of property-casualty insurance losses involve fraud. And this is just one industry.
Healthcare is another industry inundated with fraud. In 2018, AARP reported that Medicare fraud is estimated to cost $60 billion per year, which was 10% of Medicare’s budget.
Workers’ compensation premium fraud is estimated to cost $25 billion per year – increasing costs for legitimate businesses, making them uncompetitive and unable to hire covered workers.
Machine learning offers technology to help detect potential fraud. With scalable algorithms in Oracle Database and Autonomous Database, you can start exploring anomalies in your enterprise data that may point to potential fraud.
Challenges to detecting fraud
Detecting fraud is often challenging due to several factors:
- Examples of fraud are often rare relative to the broader base of normal transactions. This imbalance in the data makes it difficult to use classification algorithms since there isn’t enough “signal” in the data to produce an accurate model.
- Patterns of fraud change quickly. Using historical data, even if you have enough examples, may not be helpful, as fraudsters constantly change tactics; a model that learned patterns of past fraud well may still not be able to detect future instances of fraud. More importantly, the ability to rebuild models on the latest data and redeploy those models within hours, not days, weeks, or months, enables detecting these new patterns of fraud while the patterns are still relevant.
- Missing, inaccurate, or incomplete data can limit machine learning model accuracy. Fraudsters may omit data or provide false information to “confuse” algorithms. In other cases, not having well-integrated customer demographic data, transaction data, and unstructured data like text, voice, and video can limit fraud detection effectiveness.
- Understanding why a transaction is identified as fraud may be critical to open an investigation. With some machine learning algorithms, such as deep learning neural networks, it may be difficult to understand why a given transaction was flagged.
- Too many false positives, where records are identified as potential fraud but they are not fraudulent, can have significant negative side effects. If too many credit card transactions are automatically denied due to suspected fraud, sales and revenue may be adversely impacted. If each case of suspected fraud needs to be manually investigated, such investigation is expensive and time-consuming, potentially allowing other real cases of fraud to go unchecked.
Despite these challenges, there are machine learning techniques that can help identify potential fraud, either individually or in combination. We say “potential” fraud because machine learning models aren’t perfect; they can make mistakes. In fact, we don’t want predictive algorithms to memorize their input data, but instead we need them to generalize so they can more effectively handle data they haven’t seen before.
Given the imperfect performance of machine learning algorithms, when a transaction is flagged as “fraud” it may only merit further investigation. Assigning a probability to potential fraud can also help prioritize which cases to investigate first. Of course, with applications involving millions of real-time transactions, some degree of error needs to be tolerated. Have you ever made a big purchase at a “big box” store only to have to speak to a credit card company representative to verify that your purchase isn’t an instance of fraud?
Multiple Machine Learning Techniques for Detecting Fraud
A few of the common machine learning techniques for identifying potential fraud include Anomaly Detection, Classification, and Clustering.
Anomaly Detection
Anomaly detection identifies unusual cases in data that, examined in isolation, may appear normal. Ideally, the data provided to build or train the anomaly detection model would consist of only “normal” or non-fraud cases. The One-Class Support Vector Machine algorithm can be used to learn these normal cases and then, when scoring new data, will flag instances that sufficiently deviate from the normal patterns. An outlier rate can be specified to limit the number of records flagged as anomalies. Using One-Class SVM, even if you don’t have purely “normal” data, you can still get useful results, or add a feedback loop to purify the input training data over time and improve model accuracy.
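The idea can be sketched in a few lines of Python. This is only an illustration using scikit-learn’s OneClassSVM on synthetic data as a stand-in for the in-database algorithm; the feature names and the `nu` parameter (which roughly corresponds to the outlier rate mentioned above) are illustrative assumptions.

```python
# Sketch: one-class anomaly detection with scikit-learn's OneClassSVM
# as a stand-in for the in-database One-Class SVM described above.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
# Mostly "normal" transactions: two illustrative features,
# e.g., transaction amount and weekly frequency
normal = rng.normal(loc=[100.0, 5.0], scale=[10.0, 1.0], size=(500, 2))

# nu roughly plays the role of the expected outlier rate (here 5%)
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal)

# Score new transactions: +1 = normal, -1 = anomaly
new_txns = np.array([[102.0, 5.2],    # close to the learned pattern
                     [400.0, 30.0]])  # far from the learned pattern
flags = model.predict(new_txns)
```

Records flagged with -1 deviate enough from the learned normal pattern to merit investigation.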
Classification
Classification is a supervised learning technique that requires labeled data, that is, data that is known to be fraud or not. This is usually provided in a binary target variable (column of data), e.g., FRAUD_FOUND, with values 0/1 or NO/YES. Many algorithms support classification. Some offer greater transparency, like decision trees, which produce human-interpretable rules, while others offer greater accuracy, like neural networks. Other classification algorithms include Generalized Linear Model, Support Vector Machine, Naive Bayes, XGBoost, etc. Classification works best if you have enough positive (fraudulent) examples, e.g., 10-20%, on which to build the machine learning model.
Clustering
Clustering is an unsupervised learning technique that doesn’t require labeled data. With clustering, we’re looking for examples that don’t fit any of the identified clusters well (a type of outlier), or that are part of a particularly small cluster. Algorithms like K-Means and Expectation Maximization produce models consisting of clusters; then distance metrics or the probability of belonging to a given cluster can be used to identify potentially fraudulent records. Determining the right number of clusters is handled by some algorithms automatically, while others, like K-Means, require some experimentation or model evaluation techniques.
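The distance-based approach can be sketched as follows, again using scikit-learn on synthetic data as a stand-in; the choice of K-Means, the number of clusters, and the top-1% distance threshold are all illustrative assumptions.

```python
# Sketch: flag records far from their nearest K-Means centroid as
# potential outliers (scikit-learn stand-in for in-database clustering).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two dense groups of "normal" records plus one far-away point
data = np.vstack([rng.normal([0.0, 0.0], 0.5, (200, 2)),
                  rng.normal([10.0, 10.0], 0.5, (200, 2)),
                  [[5.0, 20.0]]])              # potential outlier (index 400)

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(data)

# transform() gives distances to every centroid; take the nearest
dists = np.min(km.transform(data), axis=1)

# Flag, e.g., the top 1% most distant records as potential fraud
threshold = np.quantile(dists, 0.99)
outliers = np.where(dists > threshold)[0]
```

With probabilistic algorithms like Expectation Maximization, a low membership probability across all clusters would play the same role as a large distance here.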
Hybrid solutions for detecting fraud
Each of these techniques by themselves may produce useful results but may result in too many false positives as discussed above. To help mitigate this problem, we can combine multiple machine learning techniques and models to improve the accuracy and robustness of fraud detection systems.
Clustering with Anomaly Detection
In this scenario, we cluster the data first to get a “first order” grouping of similar cases. Then we can build an anomaly detection model on the records assigned to each cluster. This allows finding unusual cases within each cluster. The Oracle Machine Learning partitioned model feature simplifies this approach. After building the clustering model in the database and assigning each record to a cluster using a SQL query (adding a column to the data table, e.g., called CLUSTER), we can build a One-Class SVM model specifying the CLUSTER column as the “partition” column. OML then builds one model per partition and provides a single model to the user.
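Outside the database, the same two-step idea can be emulated in plain Python. This sketch uses scikit-learn and a dictionary of per-cluster models to mimic what OML’s partitioned models do in one step; the data and parameters are illustrative.

```python
# Sketch emulating the partitioned-model approach: cluster first,
# then train one One-Class SVM per cluster (scikit-learn stand-in).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(7)
data = np.vstack([rng.normal([0.0, 0.0], 1.0, (300, 2)),
                  rng.normal([20.0, 20.0], 1.0, (300, 2))])

# Step 1: "first order" grouping, like the CLUSTER column in the article
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=7).fit_predict(data)

# Step 2: one anomaly model per cluster
# (OML builds these in one pass via partitioned models)
models = {c: OneClassSVM(nu=0.05, gamma="scale").fit(data[cluster_ids == c])
          for c in np.unique(cluster_ids)}

def score(record, cluster_id):
    """Return +1 (normal) or -1 (anomaly) using that cluster's model."""
    return int(models[cluster_id].predict(record.reshape(1, -1))[0])
```

A record that looks ordinary globally can still be anomalous relative to its own cluster, which is exactly what this layering surfaces.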
Model ensemble
Applying the notion that “more heads are better than one,” ensemble techniques, such as a “panel of experts,” can help identify those cases that one or more models identify as fraudulent. For example, we may want to cast a wide net, such that if any model identifies potential fraud, we flag it as such. Alternatively, we may want to limit the set of flagged cases to those where multiple approaches “agree” on potential fraud.
Using the techniques above, we may build one or more each of anomaly detection, classification, or clustering models. Then we choose how many models must agree before flagging potential fraud. This could be as few as two, or as many as all the models.
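The voting logic itself is simple. In this sketch, the per-model flags are hypothetical hard-coded values standing in for the outputs of real anomaly detection, classification, and clustering models.

```python
# Sketch: "panel of experts" voting over per-model fraud flags.
# Rows are models, columns are records; 1 = flagged, 0 = not flagged.
import numpy as np

# Hypothetical flags from three models (anomaly, classification, clustering)
flags = np.array([[1, 0, 1, 1],   # model A
                  [1, 0, 0, 1],   # model B
                  [0, 0, 1, 1]])  # model C

votes = flags.sum(axis=0)         # number of models flagging each record

wide_net = votes >= 1             # cast a wide net: any model flags it
consensus = votes >= 2            # at least two models agree
unanimous = votes == flags.shape[0]  # all models agree
```

Tightening the required vote count trades recall for fewer false positives, directly addressing the false-positive concern raised earlier.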
Feature engineering
So far, we’ve focused on the machine learning modeling side of fraud detection. However, feature engineering can play an important role in improving our ability to detect fraud as well. That’s a topic for another article.
For more information
To learn more about Oracle Machine Learning, see these resources:
- OML Webpage: https://oracle.com/machine-learning
- OML Blog: https://bit.ly/omlblogs
- OML GitHub Repository: https://bit.ly/omlgithub
- OML Office Hours: https://bit.ly/omlofficehours
- Try on Oracle LiveLabs
- Overview: https://bit.ly/omlfundamentalshol
- OML4Py: https://bit.ly/oml4pyhol
- All OML: https://bit.ly/omllivelabs
- OML Documentation: https://docs.oracle.com/en/database/oracle/machine-learning/index.html
