Automated document classification and key-value extraction using OCI Document Understanding and OCI Data Labeling service

January 27, 2023 | 7 minute read
Rekha Mathew
Cloud Solution Architect | A-Team

Introduction

Automating the classification of documents and their further processing using AI helps to reduce manual effort and errors in handling business documents, especially when there are a large number of documents of various categories.

In this blog, we will see how you can train custom AI models for document classification and key-value extraction using the OCI Document Understanding and OCI Data Labeling services. These AI models, combined with other OCI native services, automate document processing tasks.

OCI Document Understanding comes with pre-trained models for a set of document types which are listed here. If your organization has other types of documents to classify and process or the document structure differs from those recognized by pre-trained models, you can use the architecture in this blog.

Architecture

This reference architecture describes how you can automate classification and key-value extraction of documents using OCI services.

Architecture Diagram

The following diagram illustrates the reference architecture.

Architecture Diagram

 

Components

This architecture has the following components.

  • OCI Document Understanding is an AI service that enables developers to extract text, tables, and other key data from document files through APIs and command-line interface tools. It has pre-built models and also supports training custom models to suit your specific needs. A labeled dataset is a key prerequisite for training a custom model. You can use OCI Data Labeling to label documents and create the dataset for a custom model.
  • In addition, this architecture uses OCI Functions, OCI Events, and OCI Object Storage.

Prerequisites

  1. Collect a set of training documents and label them using OCI Data Labeling. Later use these labeled datasets in OCI Document Understanding for creating custom models. These steps are given in the Label Data and Create Custom Models section of the blog.
  2. Create 3 Object Storage buckets:

       – incoming-documents, for storing incoming documents.
       – classified-documents, for storing documents after classification.
       – sdk-results, for storing the model inference results. Both the classification model and key-value extraction model inference results are stored here.

     Make sure that the Emit Object Events property is set to True for the incoming-documents and classified-documents buckets.

  3. Create 2 OCI Functions, classify-document and extract-key-values.
  4. Create 2 OCI Events rules to trigger the Functions when documents are added to the buckets.

Object creation in the incoming-documents bucket invokes the classify-document Function. The rule is shown below.

 

Event Rule1

 

Object creation in the classified-documents bucket invokes the extract-key-values Function. The rule is shown below.

 

Event Rule2

Note - Make sure you have defined the required security policies for all the OCI services.
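The two event rules above can also be written down as Events rule conditions in JSON. The following is a minimal sketch that builds them in Python. The event type string and the bucketName attribute filter follow the format commonly generated for Object Storage events, but treat the exact condition as an assumption to verify in your own tenancy. The helper name build_rule_condition is illustrative and not part of any OCI SDK.

```python
import json

# Event type emitted by Object Storage when an object is created.
OBJECT_CREATE_EVENT = "com.oraclecloud.objectstorage.createobject"


def build_rule_condition(bucket_name: str) -> str:
    """Return a rule condition (as JSON text) that matches object creation in one bucket.

    Hypothetical helper for illustration only.
    """
    condition = {
        "eventType": [OBJECT_CREATE_EVENT],
        "data": {"additionalDetails": {"bucketName": [bucket_name]}},
    }
    return json.dumps(condition, indent=2)


# Rule 1 is meant to trigger classify-document, rule 2 extract-key-values.
rule1_condition = build_rule_condition("incoming-documents")
rule2_condition = build_rule_condition("classified-documents")

if __name__ == "__main__":
    print(rule1_condition)
    print(rule2_condition)
```

Each condition would be paired with an action of type Functions that points at the corresponding Function.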

Flow

  1. Incoming documents are added to an Object Storage bucket, incoming-documents.
  2. The object create event in the incoming-documents bucket triggers an Event rule, which calls a Function, classify-document.
  3. The classify-document Function invokes the OCI Document Understanding SDK for document classification, using the custom classification model. The SDK returns the detected document type in its response. Based on the detected type, documents are moved to different subfolders in the classified-documents bucket. The subfolders are named document_type1 and document_type2.
  4. Creation of documents in the classified-documents bucket triggers the second Event rule, which calls a Function, extract-key-values.
  5. The extract-key-values Function reads the subfolder name from the document's object name in the bucket. The subfolder name identifies which key-value extraction model id to use. For example, if the subfolder name is document_type1, then the key-value extraction model id for documents of type1 is used. The model id is passed to the OCI Document Understanding SDK to extract key values from the document. Once key values are extracted, you get a response JSON. You can use the values from this JSON to call the target application APIs for further processing of the documents, for example to automatically attach a document to an existing record or to automatically create records in a target application.
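The routing logic in steps 2 and 3 can be sketched as below. This is not the Function code from the blog, only an illustration under stated assumptions: the event payload is assumed to carry the object name in data.resourceName and the bucket and namespace in data.additionalDetails, and the classification result is assumed to be available as a list of entries with documentType and confidence fields. All helper names are hypothetical. The actual Document Understanding SDK call and the Object Storage copy are left out.

```python
from typing import Dict, List, Optional, Tuple


def parse_object_event(event: Dict) -> Tuple[str, str, str]:
    """Extract (namespace, bucket, object name) from an Object Storage event.

    Assumes the payload shape described in the lead-in.
    """
    data = event["data"]
    details = data["additionalDetails"]
    return details["namespace"], details["bucketName"], data["resourceName"]


def pick_document_type(detected_types: List[Dict]) -> Optional[str]:
    """Return the document type with the highest confidence, or None if the list is empty."""
    if not detected_types:
        return None
    best = max(detected_types, key=lambda entry: entry["confidence"])
    return best["documentType"]


def classified_object_name(source_name: str, document_type: str) -> str:
    """Build the destination object name: <document_type>/<file name>."""
    file_name = source_name.rsplit("/", 1)[-1]
    return f"{document_type}/{file_name}"


if __name__ == "__main__":
    # Sample values are made up for illustration.
    sample_event = {
        "eventType": "com.oraclecloud.objectstorage.createobject",
        "data": {
            "resourceName": "invoice_001.pdf",
            "additionalDetails": {
                "bucketName": "incoming-documents",
                "namespace": "example-namespace",
            },
        },
    }
    sample_result = [
        {"documentType": "document_type1", "confidence": 0.93},
        {"documentType": "document_type2", "confidence": 0.07},
    ]
    namespace, bucket, name = parse_object_event(sample_event)
    doc_type = pick_document_type(sample_result)
    if doc_type is not None:
        print(classified_object_name(name, doc_type))
```

Keeping the destination name as a pure function of the source name and detected type makes the subfolder convention easy to test separately from the SDK calls.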

Assumptions

I have used two types of documents in this solution, document_type1 and document_type2. You can easily extend the solution by adding any number of document types, based on your business needs.

Label Data and Create Custom Models

Custom models need a labeled dataset for training.

Label Data in OCI Data Labeling

OCI Data Labeling can be used to label your training documents. Data labeling is the process of identifying properties (labels) of documents and annotating (labeling) them with those properties.

The first step is to collect a sample set of the different types of documents you want to automate. Once you have collected a training document set, you can create a dataset and label the documents in it.

Let’s see how a dataset is created for the classification model.

  • Go to OCI Data Labeling service in the OCI console, choose Create dataset, and select Dataset format as Documents and the Annotation class as Single Label.

 

Dataset

 

 

  • Upload all your training documents and enter labels in the Add labels section. These labels will be used later to annotate the documents in the dataset. Review and create the dataset.

 

Add Label

 

  • Once the dataset is created, you can label documents. Navigate to each document in the data records and assign it the correct label.

 

For key-value extraction models, you should create separate datasets for each document type.

  • In this case, choose Key Value as the Annotation class and add a label for each key that you want to extract from the document type.

Dataset2

 

  • While labeling the key value dataset, navigate to each document, mark the areas you want to extract, and assign labels to them.

 

Labeling

 

Create and Train Models in OCI Document Understanding

 

  1. Go to the OCI Document Understanding service and create a project under the Custom models section.

Project

  2. Go to the created project and select the Create and train model option. You need to create one model for document classification and a separate key-value extraction model for each document type.

     For document classification, choose Document classification as the model type to train.

Classification Model

  3. While creating the model, choose the corresponding dataset created using the OCI Data Labeling service.

Choose Dataset

  4. Train the model and check its metrics. If you are satisfied with them, you can use the model for automation.
  5. Repeat the above steps for the key-value models, choosing Key value extraction as the model type to train.
  6. Note the model OCIDs. We will use them as configuration parameter values in the Function Application.
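Once the model OCIDs are stored as configuration parameters, the extract-key-values Function needs to map a subfolder name to the right model. The sketch below shows one possible way to do this, assuming the configuration parameters are readable as environment variables inside the Function. The key naming scheme KV_MODEL_<SUBFOLDER> and the helper names are hypothetical choices made for this example, not names used by the blog's Function code.

```python
import os
from typing import Mapping, Optional


def subfolder_of(object_name: str) -> Optional[str]:
    """Return the first path segment of an object name, or None for a root-level object."""
    if "/" not in object_name:
        return None
    return object_name.split("/", 1)[0]


def key_value_model_id(object_name: str, config: Mapping[str, str] = os.environ) -> str:
    """Look up the key-value extraction model OCID for the object's subfolder.

    Uses a hypothetical key scheme: KV_MODEL_<SUBFOLDER NAME IN UPPER CASE>.
    """
    subfolder = subfolder_of(object_name)
    if subfolder is None:
        raise ValueError(f"{object_name!r} is not inside a document type subfolder")
    key = f"KV_MODEL_{subfolder.upper()}"
    if key not in config:
        raise KeyError(f"No model configured for {subfolder!r}; set {key}")
    return config[key]


if __name__ == "__main__":
    # Placeholder values; real entries would be the model OCIDs noted above.
    sample_config = {
        "KV_MODEL_DOCUMENT_TYPE1": "<model OCID for document_type1>",
        "KV_MODEL_DOCUMENT_TYPE2": "<model OCID for document_type2>",
    }
    print(key_value_model_id("document_type1/invoice_001.pdf", sample_config))
```

With a scheme like this, supporting a new document type only requires training its model and adding one more configuration parameter, with no code change.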

OCI Function code samples

Refer to this blog for the Function code.

Conclusion

I hope this blog has given you an idea of how OCI AI and native services can replace slow, error-prone manual document processing, saving your employees time and improving data accuracy. Please feel free to contact me at rekha.mathew@oracle.com if you have an Oracle Fusion SaaS business use case that can be solved using this reference architecture.

 

 
