Automated de-identification of medical documents & images

One kind of noisy data that healthcare data scientists deal with is scanned documents and images: from PDF attachments of lab results, referrals, or genetic testing to DICOM files with medical imaging.

These files are challenging to de-identify because protected health information (PHI) can appear anywhere in free text – so it cannot be removed with rules or regular expressions – or be “burned” into images, so that it is not even available as digital text to begin with.
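As a rough illustration of the free-text problem, the sketch below applies a purely rule-based redactor to a made-up clinical note. The patterns, the `redact_with_rules` helper, and the sample sentence are all hypothetical and are not taken from the talk or from any John Snow Labs library; they only show that fixed patterns catch well-structured identifiers while names and locations slip through.

```python
import re

# Hypothetical patterns covering only rigidly formatted identifiers.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def redact_with_rules(text: str) -> str:
    """Replace every pattern match with a <LABEL> placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


# Invented example note; it contains no real patient data.
note = (
    "Seen on 03/14/2021. Mr. John Smith, a teacher from Boston, "
    "can be reached at 555-123-4567."
)

redacted = redact_with_rules(note)
print(redacted)
```

In this sketch the date and phone number are replaced, but the patient's name and city remain in the output, because nothing about their surface form distinguishes them from ordinary words. Text burned into image pixels would not reach such a function at all without an OCR step first.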

About the speakers
Mykola Melnyk
Senior Data Scientist at John Snow Labs

Mykola Melnyk is a senior Scala, Python, and Spark software engineer with 15 years of industry experience.

He has led teams and projects building machine learning and big data solutions in a variety of industries, and is currently the lead developer of the Spark OCR library at John Snow Labs.

Alina Petukhova
Senior Data Scientist at John Snow Labs

Alina is a senior data scientist at John Snow Labs, where she works as part of the Core team. Her recent work has focused on creating models and pipelines to process healthcare documents for named-entity recognition and de-identification. John Snow Labs is a healthcare AI company, accelerating progress in data science by providing state-of-the-art models, data, and platforms.

Before joining John Snow Labs, she worked for more than 5 years as a consultant in NLP, image recognition, and predictive modeling, and completed her Ph.D. studies in Applied Mathematics. In her free time, she enjoys hiking and watching indie movies.

Presented by