As we progress further into the digital era, protecting sensitive data isn’t just a priority. It’s a business imperative. Financial services, education, retail, healthcare, and government are among the many industries grappling with the challenges of data privacy and security. Failure to protect personally identifiable information (PII) can lead to identity theft, financial loss, and reputational damage for individuals and businesses alike. To help navigate this landscape, the Accelerated Data Science (ADS) PII operator is a tool designed to identify and secure PII and protected health information (PHI).

What is PII?

PII is any information that can identify an individual, such as a name, social security number, contact information, or driver’s license number, including when it appears in financial, medical, educational, or employment records. In addition to general data privacy regulations, some industries have their own regulations for protecting sensitive information. For example, in the US the Health Insurance Portability and Accountability Act of 1996 (HIPAA) regulates how healthcare and life sciences organizations collect, store, and protect patients’ medical records and PHI. These industry-specific regulations help ensure that organizations in these sectors protect sensitive information and maintain the trust of their customers and clients.

The gap in enterprise PII protection

As enterprises across various industries continue to digitize and handle ever-increasing volumes of personal data, the risk of data breaches and noncompliance with privacy regulations also escalates. Despite advancements in cybersecurity, a significant gap remains in effectively protecting PII at the scale and complexity demanded by large organizations.


ADS PII operator: The vanguard of data protection

Crafted by the synergy of Oracle Cloud Infrastructure (OCI) AI expertise and a keen understanding of data privacy demands, OCI Data Science’s ADS PII operator isn’t merely a tool. It’s a paradigm shift that’s changing how data is protected:

  • Automated detection and classification: Using pattern matching and AI-powered models, the ADS PII operator efficiently identifies sensitive data in free-form text.

  • Intelligent coreference resolution: A standout feature of the ADS PII operator is its ability to maintain coreference entity relationships even after anonymization. This feature not only anonymizes the data but also preserves the statistical properties of the data.

  • Flexible integration: The operator seamlessly integrates with your existing systems, providing a layer of protection without disrupting workflows.

  • Simplified task definition and execution: The ADS PII operator is configured through YAML and comes with preset profiles that embody domain-specific knowledge, ready to use out of the box. It remains flexible through customizable YAML files, so you can tailor the operator to your precise specifications. The accompanying CLI tool reduces deployment and execution to a single-line command, so you can use the operator across varied environments with minimal setup.

  • Tailored to every business: Whether you’re safeguarding a small clinic’s patient records or a multinational’s client data, the ADS PII operator molds to your context. With customization at its core, it aligns with your data protection goals, however granular they are.

  • Enterprise resilience: Designed for organizations demanding the highest standard of data security, the ADS PII operator is built to scale, to withstand, and to excel, backed by Oracle’s robust Data Science and AI platform. For example, enabling batch processing within OCI Data Science jobs allows for the processing of large volumes of data in an efficient and timely manner, ensuring that projects of any size can be handled with ease.
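To make the first two ideas in the list above concrete, here is a minimal Python sketch of pattern-based detection combined with consistent replacement. It is not the operator's implementation: the phone regex, the `ConsistentAnonymizer` class, the placeholder format, and the second sample sentence are all assumptions made for illustration. It covers only phone numbers, because name detection needs a language model, and it uses exact-match reuse as a simplified stand-in for coreference resolution.

```python
import re

# Illustrative pattern for US-style phone numbers such as "(805) 555-1234".
# This regex is an assumption for the sketch, not the operator's detector.
PHONE_RE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}")


class ConsistentAnonymizer:
    """Replace each distinct sensitive value with one stable placeholder."""

    def __init__(self):
        self._mapping = {}

    def placeholder_for(self, value, label):
        # Reuse the placeholder when a value has been seen before, so
        # repeated mentions stay linked after anonymization.
        if value not in self._mapping:
            self._mapping[value] = f"<{label}_{len(self._mapping) + 1}>"
        return self._mapping[value]

    def anonymize_phones(self, text):
        return PHONE_RE.sub(
            lambda m: self.placeholder_for(m.group(0), "PHONE"), text
        )


anonymizer = ConsistentAnonymizer()
records = [
    "Hi, this is John Smith. My number is (805) 555-1234.",
    # Hypothetical second mention of the same number, for illustration only.
    "Call me back at (805) 555-1234 after lunch.",
]
redacted = [anonymizer.anonymize_phones(r) for r in records]
for line in redacted:
    print(line)
```

Both sentences end up with the same placeholder, which is the property that keeps records about one person linked after the sensitive value is gone.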

Empowering control with insightful analytics

Beyond mere defense, the ADS PII operator empowers enterprises with deep insights. Each deployment yields comprehensive analytics that detail protection levels and potential vulnerabilities, translating complex data security metrics into actionable intelligence.

Figure: ADS PII operator workflow

Start your journey with the ADS PII operator

We have the following medical records to process:

documentation_id  patient_visit_date  content
00001cee341fdb12  01/01/2012          “Hi, this is John Smith. My number is (805) 555-1234.”
00097b6214686db5  01/31/2012          “John recently got a beautiful puppy. He may be allergic to dog hair.”

These two records belong to John Smith. We redact and anonymize the PII—name and phone number—in both records.
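If you want to follow along, the two sample records can be written to a CSV file with Python's standard library. This is only a sketch: the local file name mydata.csv is chosen to match the input file name used in the configuration later in this post, and a comma-separated layout with a header row is an assumption. Uploading the file to object storage is a separate step that isn't shown here.

```python
import csv
from pathlib import Path

FIELDNAMES = ["documentation_id", "patient_visit_date", "content"]

# The two sample records from the table above.
rows = [
    {
        "documentation_id": "00001cee341fdb12",
        "patient_visit_date": "01/01/2012",
        "content": "Hi, this is John Smith. My number is (805) 555-1234.",
    },
    {
        "documentation_id": "00097b6214686db5",
        "patient_visit_date": "01/31/2012",
        "content": "John recently got a beautiful puppy. "
                   "He may be allergic to dog hair.",
    },
]

# Local path is an assumption for this sketch.
out_path = Path("mydata.csv")
with out_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm that the records round-trip unchanged.
with out_path.open(newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))
print(len(loaded), "records written to", out_path)
```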

Requirements

To use the ADS PII operator, install Accelerated Data Science (ADS) with the following command. ADS supports this feature starting with version 2.9.0.

pip install "oracle-ads>=2.9.0" -U

Configure

Set up “ads opctl” on your machine by running the following command. For more details, see Getting started with ADS operators.

ads opctl configure

Now, you’re ready to generate starter YAML configs for the operators.

ads operator init -t pii --output ~/pii/

Edit the generated pii.yaml file based on your needs.

 
 
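kind: operator
type: pii
version: v1
spec:
  # Where the processed dataset and the report are written
  output_directory:
    url: oci://my-bucket@my-tenancy/results
    name: mydata-out.csv
  report:
    report_file_name: report.html
    # Show the original sensitive values in the report (for debugging)
    show_sensitive_content: true
  # Dataset to process
  input_data:
    url: oci://my-bucket@my-tenancy/mydata.csv
  # Column that holds the free-form text
  target_column: content
  # Entity types to detect and the action applied to each match
  detectors:
    - name: spacy.en_core_web_trf.person
      action: anonymize
    - name: default.phone
      action: anonymize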

Run

After you have written your pii.yaml, you can run the PII operator:

ads operator run -f ~/pii/pii.yaml

Interpret the results

The ADS PII operator produces two output files, mydata-out.csv and report.html, under the given output_directory.

  • mydata-out.csv: This file contains the processed dataset.

    documentation_id  patient_visit_date  content
    00001cee341fdb12  01/01/2012          “Hi, this is David Doe. My number is (123) 456-7890.”
    00097b6214686db5  01/31/2012          “David recently got a beautiful puppy. He may be allergic to dog hair.”

  • report.html: The report.html file is customized based on the report parameters in the configuration YAML. It contains summary statistics, a plot of the entity distribution, details of the resolved entities, and details about any model used. By default, sensitive information isn’t shown in the report, but for debugging purposes you can show it by setting show_sensitive_content to true. The report also includes a copy of the YAML file, providing a fully detailed version of the original specification.

Figure: Sample report
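After a run, a simple sanity check is to confirm that values you know were sensitive no longer appear in the processed text. The sketch below is one way to do that and isn't part of the operator: the `find_leaks` helper is hypothetical, and the inline CSV is a hand-made copy of the example output above, assuming a comma-separated layout with a header row. In practice you would read mydata-out.csv from your output directory instead.

```python
import csv
import io

# Sensitive values known to be present in the original sample records.
KNOWN_PII = ["John Smith", "John", "(805) 555-1234"]


def find_leaks(rows, column, known_values):
    """Return (row_index, value) pairs where a known value still appears."""
    leaks = []
    for i, row in enumerate(rows):
        for value in known_values:
            if value in row[column]:
                leaks.append((i, value))
    return leaks


# Inline copy of the processed example output (layout is an assumption).
processed_csv = (
    "documentation_id,patient_visit_date,content\n"
    '00001cee341fdb12,01/01/2012,'
    '"Hi, this is David Doe. My number is (123) 456-7890."\n'
    '00097b6214686db5,01/31/2012,'
    '"David recently got a beautiful puppy. He may be allergic to dog hair."\n'
)

processed_rows = list(csv.DictReader(io.StringIO(processed_csv)))
leaks = find_leaks(processed_rows, "content", KNOWN_PII)
print("leaks found:", leaks)
```

A check like this catches only the values you list, so it complements the report rather than replacing it.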

What’s next?

For more information and examples, check out the ADS documentation and download Oracle ADS from PyPI.

Stay tuned for our next blog post, where we delve into the nuts and bolts of customizing the ADS PII operator. Witness firsthand the might of this tool as it stands guard over the sanctity of your data. With Oracle’s cutting-edge technology, discover peace of mind in an otherwise tumultuous realm of data threats.

Protecting PII isn’t just an operational task—it’s a mission. The ADS PII operator is your ally in this mission. Join us as we step into a new epoch of data security with confidence and assurance.

Try the Oracle Cloud Free Trial! A 30-day trial with US$300 in free credits gives you access to the Oracle Cloud Infrastructure Data Science service.

Got questions? Reach out to us at ask-oci-data-science_grp@oracle.com.