Automation of Laboratory Reports is Essential to Personalized Medicine

Anwaruddin Mohammad, and Pankaj Tripathi

Personalized medicine is rapidly being adopted as a key component of standard healthcare in many countries. The advancement of science and the availability of data from many sources demand faster, easier data processing, automation, and visualization. One capability critical to integrating personalized medicine into everyday practice is the ability to process laboratory reports, both genomic and clinical.

Every day, thousands of laboratory reports are generated for patients, coming from internal labs as well as from external service providers. In most cases these reports are in PDF format. Extracting the discrete information from these reports and uploading it to an EMR or data warehouse, or using it for direct analysis, is a challenge because the industry lacks standardization and few reports comply with the HL7 or FHIR standards. This makes it harder for healthcare data analysts to automate the processing of these reports. Automated processing of laboratory reports can yield significant amounts of useful data, which on its own or in combination with other data can provide meaningful insights into an individual's health and reveal hidden or emerging issues that might otherwise go unnoticed.

For example:

  • Review of a laboratory report with normal Aspartate Aminotransferase (AST), when combined with historical results extracted from other laboratory reports, could show a progressive increase in the AST level within the normal range, which may indicate worsening liver function.
  • Similarly, progressively increasing laboratory result values of Fasting Blood Sugar (FBS), which are still within normal range, if combined with body weight and social determinants data could help in timely identification of potential diabetes or predisposition towards it, as well as possible etiological factors for this condition, thereby assisting in timely intervention.
  • Laboratory reports with borderline-high Blood Urea Nitrogen (BUN) may be interpreted as incidental and normal. But if such values are extracted for a large number of individuals and combined with their medication and genomic data, borderline BUN elevations could be correlated with certain medications, aiding timely identification of possible side effects in individuals with specific genetic profiles.
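The trend-within-normal-range idea in the AST example above can be sketched in a few lines. This is a minimal illustration, not a clinical rule; the function name and thresholds are hypothetical.

```python
def rising_within_range(values, low, high):
    """True when every result falls inside the reference range [low, high]
    but the series is strictly increasing over time -- the pattern the AST
    example describes. Thresholds and logic are illustrative only."""
    in_range = all(low <= v <= high for v in values)
    rising = all(a < b for a, b in zip(values, values[1:]))
    return in_range and rising
```

For instance, AST results of 18, 24, 31, 38 U/L against a 10-40 U/L reference range would all be reported as "normal" individually, yet the series flags a steady climb worth a clinician's attention.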

The uses exemplified above become feasible when discrete, meaningful data can be automatically extracted from large numbers of laboratory reports originating from diverse sources over time.

Extracting Discrete Data Using Natural Language Processing

Discrete data can be extracted from laboratory reports using Natural Language Processing (NLP) with a simple rule-based approach, without complex machine learning or AI algorithms. The following provides a workflow example:

Extract Text from PDF report

The first step is to extract the text of the report, which is then processed further to identify entities. For scanned reports this can be done using OCR technology and image processing; reports with an embedded text layer can be parsed directly. [Image: portion of a sample report with PHI data masked.]
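A minimal sketch of this step is shown below. It assumes the report has a text layer and uses pdfminer.six (an assumed third-party dependency; scanned reports would need an OCR tool such as Tesseract instead). The helper that normalizes whitespace is illustrative but important: downstream regex rules are much easier to write against a stable layout.

```python
import re

# pdfminer.six is an assumed third-party dependency (pip install pdfminer.six).
try:
    from pdfminer.high_level import extract_text
except ImportError:
    extract_text = None

def normalize_report_text(text: str) -> str:
    """Collapse runs of spaces/tabs so downstream regex rules see a
    stable layout regardless of how the PDF laid out its columns."""
    return re.sub(r"[ \t]+", " ", text).strip()

def pdf_to_text(path: str) -> str:
    """Extract and normalize the text layer of a PDF lab report."""
    if extract_text is None:
        raise RuntimeError("pdfminer.six is required for PDF text extraction")
    return normalize_report_text(extract_text(path))
```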

Entity Identification

Manually identify the entities you want to extract from the report: identifiers such as patient ID and specimen ID, dates, the test name, test results, the reference range for each test, and other details.

Rules for each Entity

Create a rule for each entity, using a pattern that uniquely identifies it. For example, for "Specimen ID: ST0000001" you can write a regular expression-based rule that matches the literal prefix followed by the next seven digits to uniquely identify the specimen identifier.
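Such a rule set can be expressed as a mapping from entity name to regular expression, each with one capture group. The patterns and entity names below are hypothetical and would need tuning per lab's report format; only the specimen-ID pattern comes from the example above.

```python
import re

# Illustrative rule set: one compiled regex with a single capture group
# per entity. Real reports will need patterns tuned to each lab's layout.
ENTITY_RULES = {
    "specimen_id": re.compile(r"Specimen ID:\s*(ST\d{7})"),
    "patient_id":  re.compile(r"Patient ID:\s*(\w+)"),
    "test_name":   re.compile(r"Test:\s*(.+)"),
    "result":      re.compile(r"Result:\s*([\d.]+)\s*(?:mg/dL|U/L)?"),
    "ref_range":   re.compile(r"Reference Range:\s*([\d.]+\s*-\s*[\d.]+)"),
}

def extract_entities(text):
    """Apply every rule to the report text; entities whose pattern
    does not match come back as None."""
    return {name: (m.group(1) if (m := rule.search(text)) else None)
            for name, rule in ENTITY_RULES.items()}
```

Keeping the rules in a plain dictionary makes it easy to add, remove, or rewrite a single entity's pattern without touching the extraction logic.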

Testing and validation

Once you have created rules for every required entity, test them on a larger dataset of reports and quantify their performance. If required, update the rules based on accuracy and other metrics.
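Validation can be as simple as comparing extracted values against a hand-labeled gold set and reporting per-entity accuracy. The sketch below assumes each report's extraction result and gold label are plain dicts keyed by entity name; the metric shown (exact-match accuracy) is illustrative, and precision/recall per entity would be computed analogously.

```python
def evaluate_rules(extracted_batch, gold_batch):
    """Per-entity exact-match accuracy over a labeled validation set.
    Both arguments are parallel lists of {entity: value} dicts."""
    counts = {}  # entity -> (hits, total)
    for extracted, gold in zip(extracted_batch, gold_batch):
        for entity, truth in gold.items():
            hit, total = counts.get(entity, (0, 0))
            counts[entity] = (hit + int(extracted.get(entity) == truth),
                              total + 1)
    return {e: hit / total for e, (hit, total) in counts.items()}
```

Entities scoring well below 1.0 on the validation set are exactly the ones whose rules need another look.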


Once you are confident in the accuracy of the rules, you can run them on any new report and extract discrete data in whatever format you need, such as JSON or even FHIR, and enable ETL jobs to load the data into centralized data marts.
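As a sketch of that output step, the extracted entities can be mapped onto a minimal FHIR-style Observation. The field choices below are a simplified illustration loosely modeled on the FHIR R4 Observation resource, not a validated resource; entity names are the hypothetical ones used earlier.

```python
import json

def to_fhir_observation(entities):
    """Map an {entity: value} dict to a minimal FHIR Observation-style
    structure. Simplified sketch -- not a validated FHIR resource."""
    result = entities.get("result")
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"text": entities.get("test_name")},
        "subject": {"identifier": {"value": entities.get("patient_id")}},
        "valueQuantity": {"value": float(result)} if result else None,
    }

def to_json(entities):
    """Serialize the mapped observation for an ETL landing zone."""
    return json.dumps(to_fhir_observation(entities), indent=2)
```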

The entities extracted from the report can then be assembled into a simple named-entity dataframe, one row per entity with its extracted value. [Image: example named-entity dataframe extracted from the report.]

This simple yet efficient approach can achieve very high accuracy, because you have full control over the logic and can easily identify the entities that need additional rule-writing effort. That said, it is good practice to keep running the tests on each new batch of incoming reports, and to adjust the rules whenever the way reports are generated, or their format, changes. This step is very important, because you do not want to misread or omit entities when formats evolve.
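One lightweight way to catch such format changes is to track per-entity extraction coverage on every new batch: a sudden drop for an entity usually means its pattern stopped matching a redesigned report. A minimal sketch, with hypothetical entity names:

```python
def coverage_report(extracted_batch, required=("specimen_id", "patient_id", "result")):
    """Fraction of reports in a batch where each required entity was found.
    A drop relative to previous batches signals a likely format change."""
    n = len(extracted_batch)
    if n == 0:
        return {e: 0.0 for e in required}
    return {e: sum(1 for x in extracted_batch if x.get(e) is not None) / n
            for e in required}
```

Wiring this into the ingestion pipeline, with an alert when coverage falls below a chosen threshold, turns a silent extraction failure into an actionable signal.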


Personalized medicine is changing the decision-making process of clinicians, affording individual patients improved quality of care. Clinicians now rely more on cognitive tools as accuracy in predictions has improved significantly, allowing timely interventions and treatment options. Essential to this effort is the automated extraction of unstructured or semi-structured data from laboratory reports, which aids timely decision-making and further accelerates the use of personalized medicine.

Oracle’s Healthcare Translational Research Notebook includes the Parallel Graph Analytix (PGX) data studio, giving users the ability to implement these kinds of use cases with their existing database and enabling a unified workflow without having to manage multiple applications.

Learn more about the Oracle Healthcare Foundation, a unified healthcare analytics platform.

Contact us for a conversation.
