Deidentifying free-text patient notes: no need for tradeoffs

Dandelion Health is a provider of multimodal, longitudinal clinical data for healthcare innovators. This session shows how it built a de-identification process for free-text clinical notes, with John Snow Labs’ Healthcare NLP & LLM at its core. The process maintains patient privacy, minimizes risk for hospital systems, and preserves the bulk of each free-text note, so that researchers receive high-fidelity clinical data.

Dandelion Health partners with hospital systems, deidentifies their clinical data within each hospital's own environment, and then copies the deidentified data to the Dandelion data lake so that customers can perform research and validation within the secure Dandelion platform. To ensure HIPAA compliance, deidentification requires an expert determination to confirm that minimal protected health information (PHI) remains after the process.

Tabular data is straightforward to handle: data fields with PHI-related values, such as patient names, birth dates, addresses, or contact details, are removed or masked. Free-text patient notes are much more difficult to deidentify automatically, because PHI words and phrases must be redacted or masked wherever they occur, after which the whole patient note must be verified.
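The contrast between the two cases can be sketched in a few lines of Python. This is a minimal illustration, not Dandelion's pipeline: the column names in `PHI_FIELDS`, the helper names, and the sample record and note are all hypothetical, and in practice the PHI spans in a note would come from an NLP model rather than a string lookup.

```python
# Hypothetical column names; a real schema will differ.
PHI_FIELDS = {"patient_name", "birth_date", "address", "phone"}


def mask_record(record, phi_fields=PHI_FIELDS, mask="[REDACTED]"):
    """Return a copy of a tabular row with PHI-bearing fields masked."""
    return {k: (mask if k in phi_fields else v) for k, v in record.items()}


def find_span(text, phrase):
    """Stand-in for a PHI detector: locate one known phrase in the note."""
    start = text.index(phrase)
    return (start, start + len(phrase))


def redact_spans(text, spans, mask="[REDACTED]"):
    """Replace non-overlapping (start, end) character spans with a mask."""
    out, cursor = [], 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])
        out.append(mask)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


# Tabular case: the PHI lives in known columns.
row = {"patient_name": "Jane Doe", "birth_date": "1980-01-02", "hba1c": 6.1}
masked_row = mask_record(row)

# Free-text case: the PHI must first be located inside the note.
note = "Pt Jane Doe seen 03/04/2021 for follow-up."
spans = [find_span(note, "Jane Doe"), find_span(note, "03/04/2021")]
redacted_note = redact_spans(note, spans)
print(masked_row)
print(redacted_note)
```

The tabular step needs only the schema, while the free-text step is only as good as the detector that supplies `spans`, which is why the redacted note still has to be verified as a whole.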

Key topics of the presentation include:

1. Breaking down different note types (e.g. radiology reports, pathology reports, echo narratives, progress notes) according to level of risk, and adapting the de-id process accordingly.

2. Assessing note subtypes (e.g. radiology reports for DEXA scans, or fetal radiology reports) in order to carve out exceptions to our standard process (e.g. unique note structure, or age formats such as “27w” that need to be redacted).

3. Determining the importance of recall, precision, and PHI frequency for quasi-identifiers.

4. Applying pre-processing or enhancements such as HIPS (hiding in plain sight) to reduce risk based on the recall, precision, and frequency of PHI in free-text notes.

This presentation features real-world case studies and examples, demonstrating the power of validating clinician data-quality hypotheses with language models, using different NLP & LLM strategies for different datasets, and letting QA/QC statistics tell the story, so we know that we’re doing right by the patient.
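Three of the ideas in the topic list (redacting week-style ages such as “27w”, measuring recall and precision, and HIPS) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the regular expression, the function names, the surrogate names, and the sample note are hypothetical and are not taken from the presentation or from any John Snow Labs API.

```python
import random
import re

# Assumed pattern for week-style ages such as "27w"; it does not cover
# variants like "27w3d" or "27 weeks", which real notes would need.
WEEK_AGE = re.compile(r"\b\d{1,2}w\b")


def redact_week_ages(text, mask="[AGE]"):
    """Mask week-style ages that a standard age detector might miss."""
    return WEEK_AGE.sub(mask, text)


def precision_recall(predicted, gold):
    """Span-level precision and recall from sets of (start, end) spans."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


def hips_replace(text, spans, surrogates, seed=0):
    """Replace detected spans with realistic surrogates (HIPS)."""
    rng = random.Random(seed)
    out, cursor = [], 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])
        out.append(rng.choice(surrogates))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def span_of(text, phrase):
    start = text.index(phrase)
    return (start, start + len(phrase))


note = "Fetal scan at 27w. Reviewed by Alice Jones and Bob Lee."
gold = {span_of(note, "Alice Jones"), span_of(note, "Bob Lee")}
predicted = {span_of(note, "Alice Jones")}  # the detector missed "Bob Lee"

precision, recall = precision_recall(predicted, gold)
surrogates = ["Maria Chen", "David Park"]
hips_note = hips_replace(note, predicted, surrogates)
age_redacted = redact_week_ages(note)
print(precision, recall)
print(hips_note)
print(age_redacted)
```

In this toy case recall is 0.5 because one name was missed. After the HIPS step, the missed name sits alongside a surrogate name, so a reader cannot tell which of the two is real; that is the sense in which HIPS reduces the risk left over from imperfect recall.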

About the speaker
Ross Bierbryer

Head of Engineering at Dandelion Health

NLP-Summit

When

Online Event: September 25, 2024

 

Contact

nlpsummit@johnsnowlabs.com

Presented by

John Snow Labs