Oracle AI & Data Science Blog
Learn AI, ML, and data science best practices

Predictive Modeling for Data Integration

The Fusion Adaptive Intelligence team (DataFox) continuously extracts detailed data on more than 2.8 million public and private businesses while adding 1.2 million businesses annually.

Our data is used by customers to prioritize accounts, enrich leads, refresh and harmonize CRM data, and identify new prospects. We enhance Oracle Cloud Applications with an extensive set of trusted company-level data and signals, enabling customers to reach even better decisions and business outcomes.

The DataFox Data Cleaning team is responsible for augmenting, enriching, and cleaning company data with 40+ data points that allow our customers to get a truly unique, more complete view of their addressable market. 


The Goal of Predictive Modeling

During FY20, the main workflow of the DataFox Data Cleaning team was the integration of two large datasets by matching key firmographic data points: company name, company URL, and company location.


Firmographic Data
The first dataset (the input data) is fixed: each record, called an input row, is a static combination of data points. Input data is loaded into a tool called Otocyon, where:

  • Input data is automatically matched to existing records in the database (admin profiles)
  • If automatching is not possible, the input data is bucketed according to labels that Otocyon adds to identify the problems preventing automatching

Rows that are not automatched go to manual processing, where the team creates new admin profiles or fixes existing ones so that they can be integrated with the input data.

While integrating, the team takes a specific action on each pair of input row and admin profile, following detailed instructions. The three main types of action are:

  • Match - when an existing or newly created admin profile is integrated with the input row
  • Mark as bad input - when one or more data points in the input row are invalid and prevent integration
  • Skip - when the input row requires reprocessing

The goal of building predictive models for data integration is to forecast which type of action will be taken on the input row without processing the rows manually. 


Building A Predictive Model

The predictive model was built on historical data: input rows that had already been processed (action taken). The data covered a period of one month and included 28,726 records overall.

The working hypothesis was that the action taken on an input row can be predicted from a combination of the following useful variables:

  • Labels added by the tool when input data is loaded into Otocyon: row_flags, detail, status
  • Input data characteristics: special symbols in the input name, special symbols in the input URL
  • Levenshtein distance between the input name/URL and the Top Potential Match name/URL (the Top Potential Match is the match suggested by the tool)
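
The two computed variable groups above can be illustrated with a short sketch. It is not the team's actual feature code: the definition of a "special symbol" (anything other than ASCII letters, digits, whitespace, and, for URLs, common URL punctuation), the feature names, and the `build_features` helper are all assumptions made for illustration.

```python
import re

# Assumption: "special symbols" means anything other than ASCII letters, digits,
# and whitespace; for URLs, the usual URL punctuation is also allowed.
NAME_SPECIAL = re.compile(r"[^A-Za-z0-9\s]")
URL_SPECIAL = re.compile(r"[^A-Za-z0-9\s./:\-]")


def has_special(text, pattern):
    """Return 1 if the text contains at least one special symbol, else 0."""
    return int(bool(pattern.search(text)))


def levenshtein(a, b):
    """Edit distance counting single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def build_features(input_name, input_url, match_name, match_url):
    """Hypothetical helper: one feature dict per input row / suggested match pair."""
    return {
        "name_has_special": has_special(input_name, NAME_SPECIAL),
        "url_has_special": has_special(input_url, URL_SPECIAL),
        "name_distance": levenshtein(input_name.lower(), match_name.lower()),
        "url_distance": levenshtein(input_url.lower(), match_url.lower()),
    }
```

A small distance suggests that the input row and the suggested profile describe the same company, while a large one hints at a mismatch or invalid input.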

The following categories of variables were excluded from the data:

  • Unknown variables (variables known only after manual action is taken)
  • Useless variables (variables with no effect on the type of action)

The useful variables, which are strings, were encoded as new flag variables with values {1;0}. The predicted action types were encoded as match (0), bad input (1), or skip (2).
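
As a rough illustration of this encoding step, the sketch below turns string columns into 0/1 flag columns and maps the action names to the codes given above. The label values ("no_url", "pending", and so on) and the `one_hot` helper are hypothetical; the source does not list the actual values produced by Otocyon.

```python
# Action codes as stated in the text.
ACTION_CODES = {"match": 0, "bad input": 1, "skip": 2}


def one_hot(rows, columns):
    """Turn string columns into 0/1 flag columns, one flag per observed value."""
    # Collect the distinct values of each column in a stable, sorted order.
    values = {c: sorted({row[c] for row in rows}) for c in columns}
    encoded = []
    for row in rows:
        flags = {}
        for c in columns:
            for v in values[c]:
                flags[f"{c}={v}"] = int(row[c] == v)
        encoded.append(flags)
    return encoded


# Hypothetical rows; the real label values are not given in the source.
rows = [
    {"row_flags": "no_url", "status": "pending", "action": "bad input"},
    {"row_flags": "name_mismatch", "status": "pending", "action": "skip"},
    {"row_flags": "no_url", "status": "review", "action": "match"},
]
X = one_hot(rows, ["row_flags", "status"])
y = [ACTION_CODES[row["action"]] for row in rows]
```

Each distinct string value becomes its own flag, so the model never treats an arbitrary label ordering as meaningful.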

The training dataset included 32,776 rows, and the test dataset included 8,194 rows.

A gradient boosting model was trained and evaluated:
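
The two sizes correspond to an 80/20 holdout split of 40,970 rows. The source does not say how the split was made, so the following is only a minimal sketch of a shuffled holdout split; the `holdout_split` helper and the seed are assumptions.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows reproducibly and hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Row indices stand in for the encoded rows here.
train, test = holdout_split(range(40970))
print(len(train), len(test))  # 32776 8194
```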

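from sklearn.ensemble import GradientBoostingClassifier

# Fit on the encoded training rows, then predict the action type (0, 1, or 2) for the test rows
gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.7, random_state=0)
gb.fit(X_train, y_train)
predictions = gb.predict(X_test)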

The following results were obtained:

Confusion Matrix:
[[4397   10    0]
 [  22 1011    1]
 [  15   23 2715]]

Classification Report
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      4407
           1       0.97      0.98      0.97      1034
           2       1.00      0.99      0.99      2753

    accuracy                           0.99      8194
   macro avg       0.99      0.99      0.99      8194
weighted avg       0.99      0.99      0.99      8194

Accuracy score (training): 0.990
Accuracy score (validation): 0.991
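
The per-class figures in the report can be recomputed from the confusion matrix alone. The sketch below assumes the usual scikit-learn layout (rows are true classes, columns are predicted classes), which agrees with the support column above; the helper names are illustrative.

```python
# Confusion matrix from the results above: rows = true class, columns = predicted class.
confusion = [
    [4397, 10, 0],
    [22, 1011, 1],
    [15, 23, 2715],
]


def per_class_metrics(cm):
    """Return {class: (precision, recall, f1, support)} for a square confusion matrix."""
    n = len(cm)
    metrics = {}
    for k in range(n):
        true_positive = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum (support)
        precision = true_positive / predicted_k
        recall = true_positive / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        metrics[k] = (precision, recall, f1, actual_k)
    return metrics


def accuracy(cm):
    """Share of rows on the diagonal, i.e. correctly predicted."""
    return sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))


for label, (p, r, f1, support) in per_class_metrics(confusion).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
print(f"accuracy={accuracy(confusion):.3f}")
```

For example, class 1 precision is 1011 / (10 + 1011 + 23), which rounds to 0.97, and the overall accuracy is 8123 / 8194, which rounds to 0.991.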



The accuracy achieved for action types 0 (match) and 1 (bad input) allows us to use this model to automate the action types “Human Action – Bad Input” and “Human Action – Matched.”

Automating the action on thousands of rows reduces manual work by 120 hours, the equivalent of 3 annotators working full-time on the project for one week.

Rows that the tool classifies as Human Action – Matched should be prioritized for manual processing because they have the highest predicted matching yield.


Applying Machine Learning 

Because the model is new, it needs large sets of historical data to refine the combination of variables and make better predictions on new input files. In other words, before the model is applied to a new dataset, manually processed rows from that dataset should be added to the training data to ensure more accurate predictions.
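
One way to picture this retraining step is the sketch below, which appends the manually processed rows from a new file to the historical training data and refits the classifier. It assumes both sets are already encoded with the same feature columns; the `refit_with_new_rows` helper is hypothetical, and its default hyperparameters simply mirror the ones shown earlier.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def refit_with_new_rows(X_hist, y_hist, X_new, y_new,
                        n_estimators=500, learning_rate=0.7, random_state=0):
    """Refit on historical rows plus manually processed rows from the new file.

    Assumes X_hist and X_new share the same encoded feature columns.
    """
    X = np.vstack([X_hist, X_new])
    y = np.concatenate([y_hist, y_new])
    model = GradientBoostingClassifier(
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        random_state=random_state,
    )
    model.fit(X, y)
    return model
```

The refitted model can then be used to predict the action type for the remaining, unprocessed rows of the new file.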

To learn more about machine learning and AI, check out the Oracle AI page. You can also try Oracle Cloud for free!


By Tetiana Ierokhina, Manager for Data Research at Fusion Adaptive Intelligence (DataFox)
