Personalized medicine is rapidly being adopted as a key component of standard healthcare in many countries. Advances in science and the availability of data from diverse sources demand faster, easier support for data processing, automation, and visualization. One capability critical to integrating personalized medicine seamlessly into everyday practice is the ability to process laboratory reports, both genomic and clinical.
Every day, thousands of laboratory reports are generated for patients, both by internal labs and by external service providers. In most cases these reports are in PDF format. Extracting the discrete information from these reports and uploading it to an EMR or data warehouse, or using it for direct analysis, is a challenge: the industry lacks standardization, and few reports comply with the HL7 or FHIR standards. This makes it harder for healthcare data analysts to automate the processing of these reports. Automated processing of laboratory reports can yield significant amounts of useful data, which on its own or in combination with other data can provide meaningful insights into an individual's health and reveal hidden or emerging issues that might otherwise go unnoticed.
The uses of laboratory report data described above become feasible only when discrete, meaningful data can be extracted automatically from a large number of reports originating from diverse sources over time.
Using Natural Language Processing (NLP) with a simple rule-based approach, and without complex machine learning or AI algorithms, discrete data can be extracted from laboratory reports. An example workflow follows:
The first step is to extract the text of the report, which is then processed further to extract entities. This can be done using OCR technology and image processing. Below is part of a report with PHI data masked.
Manually identify the entities you want to extract from the report: for example, identifiers such as the patient ID and specimen ID, dates, the test report name, test results, the reference range of the test, and other details.
Create a rule for each entity you want to extract, using a pattern that uniquely identifies it. For example, given "Specimen ID: ST0000001", you can write a regular-expression rule that matches the starting characters followed by the next seven digits to uniquely identify the specimen identifier.
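As a minimal sketch of this step, assuming the "Specimen ID: ST0000001" format shown above (the other field labels and value formats here are illustrative assumptions, not a fixed standard), such rules can be written with Python's built-in `re` module:

```python
import re

# Hypothetical rule set; the field labels and value formats are
# assumptions based on the sample report, not a fixed standard.
RULES = {
    "specimen_id": re.compile(r"Specimen ID:\s*(ST\d{7})"),
    "patient_id": re.compile(r"Patient ID:\s*(\w+)"),
    "report_date": re.compile(r"Date:\s*(\d{2}/\d{2}/\d{4})"),
}

def extract_entities(report_text):
    """Apply each rule to the report text and return the entities found."""
    entities = {}
    for name, pattern in RULES.items():
        match = pattern.search(report_text)
        if match:
            entities[name] = match.group(1)
    return entities

sample = "Patient ID: P12345\nSpecimen ID: ST0000001\nDate: 01/15/2021"
print(extract_entities(sample))
```

Each rule is anchored on the field label, so the same pattern works regardless of where the field appears in the report text.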
Once you have created rules for every required entity, test them on a larger dataset of reports and quantify their performance. If needed, update the rules based on accuracy and other metrics.
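That evaluation step can be sketched as follows, assuming a small hand-labeled set of report snippets and a single hypothetical specimen-ID rule (real evaluations would cover every entity and many more reports):

```python
import re

# Hypothetical single rule for illustration.
SPECIMEN_RULE = re.compile(r"Specimen ID:\s*(ST\d{7})")

def extract_specimen_id(text):
    match = SPECIMEN_RULE.search(text)
    return match.group(1) if match else None

# A tiny hand-labeled test set: (report snippet, expected value).
labeled = [
    ("Specimen ID: ST0000001 ...", "ST0000001"),
    ("Specimen ID:ST0000002 ...", "ST0000002"),
    ("Specimen No: ST0000003 ...", "ST0000003"),  # label variant the rule misses
]

correct = sum(extract_specimen_id(text) == expected
              for text, expected in labeled)
accuracy = correct / len(labeled)
print(f"accuracy: {accuracy:.2f}")
```

The deliberately included "Specimen No:" variant shows how this loop surfaces rules that need updating before the pipeline is trusted on new data.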
Once you are confident in the accuracy of the rules, you can run them on any new report and extract discrete data in whatever format you need, such as JSON or even FHIR, enabling ETLs to load the data into centralized data marts.
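As a minimal sketch of the export step, the extracted entities can be serialized to JSON with the standard library. The entity names and values below are illustrative; a real FHIR export would additionally map them onto FHIR resources such as `Observation`:

```python
import json

# Illustrative extracted entities; in practice these come from the rule engine.
entities = {
    "patient_id": "P12345",
    "specimen_id": "ST0000001",
    "test_name": "Hemoglobin",
    "result": "13.5",
    "units": "g/dL",
    "reference_range": "12.0-15.5",
}

# Serialize to JSON, ready for an ETL job to load into a data mart.
payload = json.dumps(entities, indent=2)
print(payload)
```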
The following are the entities extracted from the report, presented as a simple named-entity dataframe.
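One way to build such a named-entity dataframe, assuming pandas is available (the entity names and values here are examples; the original report's extracted values are not reproduced):

```python
import pandas as pd

# Illustrative extracted entities; names and values are examples only.
extracted = [
    {"entity": "patient_id", "value": "P12345"},
    {"entity": "specimen_id", "value": "ST0000001"},
    {"entity": "report_date", "value": "01/15/2021"},
    {"entity": "test_name", "value": "Hemoglobin"},
    {"entity": "result", "value": "13.5 g/dL"},
    {"entity": "reference_range", "value": "12.0-15.5 g/dL"},
]

df = pd.DataFrame(extracted)
print(df)
```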
This simple yet efficient approach gives you full control over the logic, and with it a high degree of accuracy: you can easily identify the entities that need additional effort to cover with a rule. That said, it is good practice to keep running the tests on each new batch of incoming reports, and to adjust the rules whenever the way reports are created or formatted changes. This step is very important, because you do not want to misread or omit new entities.
Personalized medicine is changing clinicians' decision-making, affording individual patients improved quality of care. Clinicians now rely more on cognitive tools, as prediction accuracy has improved significantly, allowing timely interventions and treatment options. Essential to this effort is the automated extraction of unstructured or semi-structured data from laboratory reports, which aids timely decision-making and further accelerates the adoption of personalized medicine.
Oracle’s Healthcare Translational Research Notebook includes the Parallel Graph Analytix (PGX) data studio, giving users the ability to implement these kinds of use cases with their existing database and enabling a unified workflow without having to manage multiple applications.
Learn more about the Oracle Healthcare Foundation, a unified healthcare analytics platform.
Contact us for a conversation.