The Fusion Adaptive Intelligence team (DataFox) continuously extracts detailed data on more than 2.8 million public and private businesses while adding 1.2 million businesses annually.
Our data is used by customers to prioritize accounts, enrich leads, refresh and harmonize CRM data, and identify new prospects. We enhance Oracle Cloud Applications with an extensive set of trusted company-level data and signals, enabling customers to reach even better decisions and business outcomes.
The DataFox Data Cleaning team is responsible for augmenting, enriching, and cleaning company data with 40+ data points that allow our customers to get a truly unique, more complete view of their addressable market.
The major workflow followed by the DataFox Data Cleaning team during FY20 is the integration of two large datasets by matching key firmographic data points: company name, company URL, and company location.
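To illustrate matching on these firmographic fields, here is a minimal sketch of name and URL normalization. The function names, suffix list, and matching rule are assumptions for illustration only, not DataFox's actual matching logic:

```python
import re
from urllib.parse import urlparse

def normalize_name(name):
    """Lowercase a company name, strip punctuation and common legal suffixes."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    suffixes = {"inc", "llc", "ltd", "corp", "co"}
    return " ".join(t for t in name.split() if t not in suffixes)

def normalize_url(url):
    """Reduce a URL to its bare registered host for comparison."""
    host = urlparse(url if "//" in url else "//" + url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def is_candidate_match(row, profile):
    """Treat a row and an admin profile as match candidates if a key field agrees."""
    return (
        normalize_name(row["name"]) == normalize_name(profile["name"])
        or normalize_url(row["url"]) == normalize_url(profile["url"])
    )
```

In practice, location would be compared the same way as a third signal; the sketch keeps only name and URL for brevity.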
The first dataset (input data) is immutable and consists of static combinations of data points (so-called input rows). The input data is loaded into a tool called Otocyon, where:
Rows that are not automatched are processed manually: the team creates new admin profiles, or fixes existing ones, so that they can be integrated with the input data.
During integration, the team takes a specific action on each input row–admin profile pair, following a complex set of instructions. The three main types of actions are:
The goal of building predictive models for data integration is to forecast which type of action will be taken on an input row without processing it manually.
To build the predictive model, we used historical data consisting of input rows that had already been processed (an action taken). The gathered data covered a period of one month and included 28,726 records overall.
Our hypothesis was that the action taken on an input row could be predicted from a combination of useful variables:
The following categories of variables were excluded from the data:
The useful variables (all strings) were encoded as new flag variables with values {1; 0}. The predicted action types were encoded as match (0), bad input (1), and skip (2).
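A minimal sketch of this encoding step in pandas; the column names below are hypothetical stand-ins, since the real feature set covers 40+ data points:

```python
import pandas as pd

# Hypothetical flag-source columns for illustration.
rows = pd.DataFrame({
    "name_match": ["exact", "fuzzy", "none"],
    "url_match": ["exact", "none", "none"],
    "action": ["match", "bad input", "skip"],
})

# String variables become {1;0} flag columns, one flag per category value.
X = pd.get_dummies(rows[["name_match", "url_match"]]).astype(int)

# Predicted actions encoded as match (0), bad input (1), skip (2).
action_codes = {"match": 0, "bad input": 1, "skip": 2}
y = rows["action"].map(action_codes)
```

`X` and `y` in this shape feed directly into a scikit-learn classifier.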
The modeling dataset included 32,776 rows, and the test dataset included 8,194 rows.
A gradient boosting model was trained and evaluated:

# scikit-learn's gradient boosting classifier
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.7, random_state=0)
gb.fit(X_train, y_train)
predictions = gb.predict(X_test)
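The reported metrics can be produced with scikit-learn's evaluation utilities. The snippet below is a self-contained sketch using synthetic stand-in data (the real features are the encoded flag variables), so the numbers it prints will differ from those reported here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in for the encoded flag variables.
X, y = make_classification(n_samples=400, n_informative=6, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.7, random_state=0)
gb.fit(X_train, y_train)
predictions = gb.predict(X_test)

print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(f"Accuracy score (training): {gb.score(X_train, y_train):.3f}")
print(f"Accuracy score (validation): {accuracy_score(y_test, predictions):.3f}")
```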
The following results were achieved:
Confusion Matrix:
[[4397   10    0]
 [  22 1011    1]
 [  15   23 2715]]

Classification Report
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      4407
           1       0.97      0.98      0.97      1034
           2       1.00      0.99      0.99      2753

    accuracy                           0.99      8194
   macro avg       0.99      0.99      0.99      8194
weighted avg       0.99      0.99      0.99      8194
Accuracy score (training): 0.990
Accuracy score (validation): 0.991
The accuracy achieved for classes 1 (bad input) and 2 (skip) allows us to use this model to automate the "Human Action – Bad Input" and "Human Action – Skip" action types.
Automating actions on thousands of rows reduces manual work by 120 hours, the equivalent of 3 annotators working full-time on the project for one week.
Rows that the model classifies as "Human Action – Matched" should be prioritized for manual processing, since they carry the highest predicted matching yield.
Because the model was created recently, it requires larger sets of historical data to refine its variable combinations and make better predictions on new input files. In other words, before the model is applied to a new dataset, a sample of manually processed rows from that dataset should be added to the modeling dataset; this ensures more accurate predictions.
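That retraining step can be sketched as follows; the helper name and the shapes of the frames are hypothetical, not part of the production pipeline:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def retrain_with_new_rows(model, X_hist, y_hist, X_new, y_new):
    """Fold manually processed rows from a new input file into the
    historical training data and refit before scoring the rest of the file."""
    X = pd.concat([X_hist, X_new], ignore_index=True)
    y = pd.concat([y_hist, y_new], ignore_index=True)
    model.fit(X, y)
    return model
```

After refitting, the model scores the remaining rows of the new file with features it has actually seen examples of.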
To learn more about machine learning and AI, check out the Oracle AI page. You can also try Oracle Cloud for free!
By Tetiana Ierokhina, Manager for Data Research at Fusion Adaptive Intelligence (DataFox)
tetiana.ierokhina@oracle.com