How to easily extract text data for machine learning

August 31, 2021 | 6 minute read
Jize Zhang
Software Developer

Whether you’re mining text or building natural language processing (NLP) models, you’re working with text data. In many enterprise use cases, however, the raw text lives in PDFs or Microsoft Word files that machine learning models can’t use directly. What to do?

The Accelerated Data Science SDK (ADS) available in OCI Data Science comes with a text dataset module for converting raw text data into an ML-friendly format. Import and export are streamlined, thanks to support for local and remote storage for both source and destination locations. Let’s take a look at the ADS text dataset module and how you can use it.

PDF, DOC, or TXT

The ADS text dataset module lets you load data in .pdf, .docx, .doc, or plain text format. Non-text files are converted to text during loading. Source files can live locally or remotely in OCI Object Storage. Behind the scenes, ADS uses Apache Tika (by default) or pdfplumber (PDF only) to convert files. The loaded data can be returned as a data generator or a pandas/cuDF dataframe, or saved as plain text to a local or remote destination of your choice.

So, how can you use this module?

  • Loading data from OCI Object Storage into a notebook session.
  • Reading folders of files and extracting information (e.g., topics) from file paths for machine learning tasks.
  • Converting PDF or DOCX files to plain text and persisting them directly to OCI Object Storage.

Now, let’s examine the core functions in the ADS text dataset module and learn how to use them in these popular scenarios.

Text Dataset API

The ADS text dataset module provides the following functions for text data processing:

  • read_line: Read files line by line, where each line corresponds to a record in the collection.
  • read_text: Read files where each file corresponds to a record in the collection.
  • convert_to_text: Convert files to plain text and save them.
  • metadata_all and metadata_schema: Extract metadata from each file, when applicable (these are more involved; for details, please check the documentation).

Here, a "collection" refers to either a data generator or a dataframe, and a "record" refers to an item in a generator or a row in a dataframe.

These functions are called via a DataLoader object, which can be created by specifying, at minimum, the format of the source files. For instance,

from ads.text_dataset.dataset import TextDatasetFactory as textfactory

dl = textfactory.format('pdf')

creates a DataLoader object that can parse PDF files.

Loading data from OCI Object Storage into a notebook session as a dataframe

To load data into a dataframe, we first create a DataLoader as follows:

dl = textfactory.format('pdf').engine('pandas')

 

Here, the "engine" option specifies which engine is used to materialize the loaded data; currently "pandas" and "cudf" are supported. Setting the engine to "pandas" instructs the DataLoader to return the loaded data as a pandas dataframe. If left unspecified, the loaded data is returned as a data generator, and it is up to the user to decide when and how to materialize it.
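For a quick sense of the default behavior, here is a minimal sketch, assuming a hypothetical local folder of PDFs:

import itertools

# With no engine set, read_text returns a generator that yields one record per file.
gen = textfactory.format('pdf').read_text('data/pdf_samples/*.pdf')  # hypothetical local path

# Materialize only the first two records instead of loading everything at once.
for record in itertools.islice(gen, 2):
    print(record)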

Now that a DataLoader is created, we can simply pass the URI of the source files to the read_text function:

import oci

df = dl.read_text(
    f'oci://{bucket}@{namespace}/pdf_samples/*.pdf',
    storage_options={"config": oci.config.from_file()},
)

Here, the source data path is specified using a glob pattern, and each file that matches the pattern corresponds to one row in the returned dataframe. Note that the source location can also be a list of paths.

To load data in a different format, we can simply change the "format" option when initializing the DataLoader object. And if we would like each line in a file to correspond to a row in the dataframe, we can call "read_line" instead of "read_text", as in the sketch below.
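A sketch of line-level loading, assuming a hypothetical bucket folder of plain text files:

# Each line of every matching file becomes one row in the returned dataframe.
dl_lines = textfactory.format('txt').engine('pandas')
df_lines = dl_lines.read_line(
    f'oci://{bucket}@{namespace}/txt_samples/*.txt',  # hypothetical source path
    storage_options={"config": oci.config.from_file()},
)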

When the source data path points to cloud storage, you usually need to specify "storage_options". For OCI Object Storage, the "config" key is required, as in the example code snippet; it stores authentication information. When using resource principal in a notebook session, we can simply set "storage_options = {"config": {}}". Otherwise, we need to load the configuration from a configuration file using "oci.config.from_file()". For more information on the arguments and parameters this function accepts, please check the documentation.
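As a sketch, the two authentication paths look like this:

# Inside an OCI notebook session with resource principal enabled:
storage_options = {"config": {}}

# Anywhere else, load an API key configuration file instead
# (oci.config.from_file() reads ~/.oci/config by default):
storage_options = {"config": oci.config.from_file()}

df = dl.read_text(
    f'oci://{bucket}@{namespace}/pdf_samples/*.pdf',
    storage_options=storage_options,
)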

Reading folders of files and extracting information from paths

Sometimes data is grouped into folders by topic. For instance, resumes may be grouped by hiring position, reviews by sentiment, and news by field. As an example, the newsgroup20 dataset has a folder structure that looks like this:

20news/
    alt.atheism/
    comp.graphics/
    ...
    talk.politics.misc/
    talk.religion.misc/

The folder names here can be used as labels in a classification task. To create a dataframe with labels, we can do:

from ads.text_dataset.options import Options

dl = textfactory.format('txt').backend('tika').engine('pandas').option(Options.FILE_NAME)
df_clf = dl.read_text(
    f'oci://{bucket}@{namespace}/20news-small/**/[1-9]*',
    storage_options={"config": {}},
    df_args={'columns': ['path', 'text']},
)
df_clf['label'] = df_clf['path'].apply(lambda x: x.split('/')[-2])

Notice that in this code snippet we also specified the "backend" to be "tika". By default, plain text files need no conversion, but because the data in this case is not UTF-8 encoded, we use Tika to handle the decoding.

Additionally, we set the option "Options.FILE_NAME" on the DataLoader object. This instructs the DataLoader to capture the path of each file in addition to its contents when reading in files.
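From here, a quick sanity check on the extracted labels needs nothing beyond standard pandas:

# Confirm the folder names were captured as labels and inspect the class balance.
print(df_clf[['path', 'label']].head())
print(df_clf['label'].value_counts())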

Converting PDF or DOCX to plain text and persisting to OCI Object Storage

Sometimes you may want to persist converted data rather than using it only once in a notebook session. For instance, the same set of data may be used for multiple tasks or by multiple people on a team. To streamline this process, the DataLoader class has a "convert_to_text" function, which takes a source and a destination location, converts the source files to plain text, and writes them directly to the destination. As an example,

dl = textfactory.format('pdf')
dl.convert_to_text(
    f'oci://{bucket}@{namespace}/pdf_samples/*.pdf',
    f'oci://{dst_bucket}@{dst_namespace}/extracted/pdfs',
    storage_options={"config": oci.config.from_file()},
)

or for DOCX/DOC files,

dl = textfactory.format('docx')
dl.convert_to_text(
    [
        f'oci://{bucket}@{namespace}/docx_samples/*.docx',
        f'oci://{bucket}@{namespace}/doc_samples/*.doc',
    ],
    f'oci://{dst_bucket}@{dst_namespace}/extracted/docs',
    storage_options={"config": oci.config.from_file()},
)
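Once persisted, the converted files are plain text, so they can be loaded back with the same module. A quick sketch, assuming the destination path used above (the exact output file names may differ):

# No conversion backend is needed for files that are already plain text.
dl_txt = textfactory.format('txt').engine('pandas')
df_docs = dl_txt.read_text(
    f'oci://{dst_bucket}@{dst_namespace}/extracted/docs/*',
    storage_options={"config": oci.config.from_file()},
)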

 

Don’t worry about the format

To summarize, the ADS text dataset module aims to ease the transition from enterprise data, commonly available as PDF or Microsoft Word files, to plain text that is easier for data scientists to manipulate. For details on the API and more examples, please refer to the documentation.

You can try out the Accelerated Data Science SDK in OCI Data Science by signing up for the Oracle Cloud Free Tier and following these simple steps to configure your OCI account to use Data Science.

Jize Zhang

Software Developer

Jize is a Seattle-based software engineer at Oracle with a doctorate in applied mathematics from the University of Washington.

