
Handling Personally Identifiable Information (PII) in Documents

December 3, 2021

In today's digital, data-driven era, organizations need more robust controls to safeguard sensitive data and reduce the risk of exposing personal data to the public. According to Gartner, over 60% of the world's population will have their data covered under some form of data protection legislation by 2023. Data privacy regulations are on the rise, from GDPR to the California Consumer Privacy Act. Over the next few years, stricter rules on how organizations store, view, and share personal data will become imperative.

What Is PII?

Personally identifiable information (PII) is any information that can be used to identify an individual, either directly or indirectly (e.g., name, address, social security number, biometric records, or telephone number). Sensitive PII consists of information that directly identifies a person, while non-sensitive PII refers to data that must be combined with other sources to identify someone (e.g., date of birth, gender, race, or zip code). From medical information in EHR systems and financial data held by financial services organizations to data used by insurance underwriters, personal data is critical for delivering quality services across industries.

PII Regulations

There is a network of regulations all over the world that aim to enforce PII compliance. Some of the widely accepted frameworks for protecting PII include:

  • GDPR: The General Data Protection Regulation is an EU regulation that protects data privacy both within the EU and for data transferred outside the EU, and guards against unauthorized exposure of personal information.
  • CCPA: The California Consumer Privacy Act is a benchmark policy that gives additional data privacy rights and protections to consumers covered by California law.
  • PCI DSS: The Payment Card Industry Data Security Standard is a set of standards for protecting cardholder data and guarding cardholders against financial fraud.
  • HIPAA: The Health Insurance Portability and Accountability Act was passed in 1996 to protect sensitive information relating to healthcare. It also regulates the digital transfer of such data to keep that information safe.

These are just some of the PII compliance standards worldwide, and more laws come into effect every year. Protecting PII is a top priority for governments and consumers.

PII in Document Processing

Consumers frequently enter their personal details into forms and documents such as loan applications, insurance claims, and tax returns. The information is then processed and consumed by different software applications, databases, and workflows. PII appears in many kinds of files, from the original document filled in by the user to new content derived within an application, such as patient records, insurance policies, and contracts. Additional measures are therefore required to ensure secure, authorized access to this data across its lifecycle.

Organizations need tools in their software landscape to accurately identify and classify sensitive information and efficiently redact specific details when needed. The challenge is implementing a solution that can detect PII and any associated compliance risk, identify where and how the PII content is stored, and determine the policy governing how the content is used. A frequently used approach is to obscure sensitive text in a word processor and then convert the document into a flattened PDF, which works if you are printing the document. Electronic documents, however, contain metadata and underlying text that can still expose the content even when it is visually obscured.

A more effective approach is to use an optical character recognition (OCR) solution to extract the visible text, and then use manual or automatic detectors to classify strings, files, and images that contain PII and other kinds of sensitive data. Organizations can then define business rules and controls to restrict access to this data and redact or remove it. Popular data masking and PII redaction techniques used by data platforms include encryption, shuffling, nulling, scrambling, and hashing.
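To make the detect-then-mask idea concrete, here is a minimal Python sketch that runs on text already extracted by OCR. It is an illustration under stated assumptions, not how Kanverse or any particular product works: the regular expressions cover only simple US-style formats, and the names PII_PATTERNS, detect_pii, mask_value, and redact are hypothetical. Real detectors typically combine patterns with dictionaries, checksums, and trained models.

```python
import hashlib
import re

# Hypothetical, deliberately simple patterns (US-style formats only).
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def detect_pii(text):
    """Return (label, start, end, value) tuples sorted by position."""
    findings = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.start(), match.end(), match.group()))
    return sorted(findings, key=lambda finding: finding[1])


def mask_value(label, value, strategy="redact"):
    """Apply one masking strategy to a single detected value."""
    if strategy == "null":
        # Nulling: drop the value entirely.
        return ""
    if strategy == "hash":
        # Hashing: replace the value with a truncated SHA-256 digest.
        # Assumption: an unsalted hash is used here for brevity only;
        # short values such as SSNs would need a keyed hash in practice.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
    # Default: replace the value with a labeled placeholder.
    return f"[{label} REDACTED]"


def redact(text, strategy="redact"):
    """Rebuild the text with every detected PII value masked."""
    pieces = []
    cursor = 0
    for label, start, end, value in detect_pii(text):
        if start < cursor:
            continue  # skip a match that overlaps one already masked
        pieces.append(text[cursor:start])
        pieces.append(mask_value(label, value, strategy))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Calling redact("SSN 123-45-6789, email jane.doe@example.com") returns "SSN [SSN REDACTED], email [EMAIL REDACTED]". Passing strategy="hash" or strategy="null" switches the masking technique without changing the detection step, which keeps the business rules separate from the detectors.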

You can also create role-based views to control access to restricted content while providing others with a redacted view. The extracted content from documents is encrypted in transit and at rest. All the data is removed from the platform when the extracted content moves to the relevant enterprise systems. Kanverse automates the redaction of multiple documents in a single process and removes the sensitive information and related metadata without touching the source file.
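A role-based view can be sketched in a few lines of Python. This is a simplified illustration, not a description of any product's access model: the Field class, the role names, and the view_for_role function are all hypothetical, and a production system would load roles and permissions from an access-control service rather than a hard-coded set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One extracted document field, flagged if it holds PII."""
    name: str
    value: str
    sensitive: bool = False


# Hypothetical roles that are allowed to see unredacted values.
ROLES_WITH_FULL_ACCESS = {"compliance_officer", "claims_adjuster"}


def view_for_role(fields, role):
    """Return a name-to-value mapping, redacting PII for restricted roles."""
    full_access = role in ROLES_WITH_FULL_ACCESS
    return {
        field.name: field.value if full_access or not field.sensitive else "[REDACTED]"
        for field in fields
    }
```

With fields such as Field("name", "Jane Doe", sensitive=True) and Field("policy_type", "auto"), a role in the full-access set sees both values, while any other role sees "[REDACTED]" in place of the name. The stored data is never modified; only the view changes.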

Contact us for more details.

About the Author

Kingshuk Ghosh, Product Manager,
