AI-Enhanced Oncology Data: Unlocking Insights from EHRs with NLP and LLMs
Here we describe the development, validation, and scaling of automated Natural Language Processing (NLP) and Large Language Model (LLM) pipelines and go on to show real‑world applications of our ‘AI‑enhanced’ data model, including matching patients to clinical trial criteria with high accuracy and defining therapy‑related adverse events.
Using John Snow Labs’ NLP frameworks, we developed a comprehensive pipeline that performed optical character recognition, entity extraction, assertion detection, and relationship mapping on physician notes and on radiology, pathology, and genetic reports. The pipeline focused on temporal, anatomical, oncological, pathological, and treatment‑related entities and the connections between them. Our dataset contained over 1.4 million physician notes and approximately 1 million PDF reports and scans. The technical implementation leveraged both classical LSTM‑based models and transformer‑based approaches (BioBERT).
Following deployment on a highly scalable Spark pool environment, we identified 113.6 million entities (averaging 80 entities per note) and 29.2 million relationships across 25 distinct relationship types (averaging 22 relationships per note). Measured against a large, manually reviewed truth set, the pipeline achieved a combined entity extraction F1 score of ~93% and a relationship extraction F1 score of 88%.
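As a minimal sketch of how an F1 score of this kind can be computed, the snippet below scores predicted entities against a reviewed truth set using exact matching. The tuple layout `(note_id, start, end, label)`, the label names, and the sample values are illustrative assumptions, not the evaluation protocol or data used in this work.

```python
def precision_recall_f1(predicted, truth):
    """Score predicted items against a truth set using exact-match comparison."""
    predicted, truth = set(predicted), set(truth)
    true_pos = len(predicted & truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical (note_id, start, end, label) tuples, for illustration only.
truth = {
    ("n1", 0, 12, "Cancer_Dx"),
    ("n1", 20, 29, "Staging"),
    ("n2", 5, 14, "Biomarker"),
}
predicted = {
    ("n1", 0, 12, "Cancer_Dx"),
    ("n2", 5, 14, "Biomarker"),
    ("n2", 30, 38, "Drug"),
}

p, r, f1 = precision_recall_f1(predicted, truth)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Exact matching is the strictest choice; a relaxed variant that accepts overlapping spans would generally report higher scores, so the matching rule should be stated alongside any reported F1.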
By integrating NLP-derived elements into our standard data model, we demonstrate an enhanced ability to match oncology patients to complex clinical trial enrollment criteria using a custom web application. Matching performance improved most when cancer staging and biomarker findings were important inclusion criteria, because these are often missing from structured EMR data but can be recovered using NLP.
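The idea of falling back on NLP-derived values when structured fields are empty can be sketched as follows. This is an illustration under assumptions: the `Patient` fields, the criteria dictionary, and the `matches` function are hypothetical names and do not describe the actual data model or web application.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Patient:
    # Hypothetical record combining structured EMR fields with NLP-derived ones.
    patient_id: str
    diagnosis: str
    structured_stage: Optional[str] = None
    nlp_stage: Optional[str] = None
    structured_biomarkers: Set[str] = field(default_factory=set)
    nlp_biomarkers: Set[str] = field(default_factory=set)

    @property
    def stage(self) -> Optional[str]:
        # Prefer the structured value; fall back to the NLP-derived one.
        return self.structured_stage or self.nlp_stage

    @property
    def biomarkers(self) -> Set[str]:
        return self.structured_biomarkers | self.nlp_biomarkers


def matches(patient: Patient, criteria: dict) -> bool:
    """Return True if the patient satisfies every inclusion criterion."""
    return (
        patient.diagnosis == criteria["diagnosis"]
        and patient.stage in criteria["allowed_stages"]
        and criteria["required_biomarkers"] <= patient.biomarkers
    )


# Hypothetical trial criteria and patients, for illustration only.
criteria = {
    "diagnosis": "NSCLC",
    "allowed_stages": {"IIIB", "IV"},
    "required_biomarkers": {"EGFR exon 19 deletion"},
}
structured_only = Patient("p1", "NSCLC")
with_nlp = Patient(
    "p1", "NSCLC", nlp_stage="IV", nlp_biomarkers={"EGFR exon 19 deletion"}
)

print(matches(structured_only, criteria), matches(with_nlp, criteria))
```

In this toy case the same patient is excluded when only structured fields are available and included once stage and biomarker status are recovered from notes, which mirrors the effect described above.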
Although NLP greatly improved our data model, it performed poorly in certain contexts. We observed that the pretrained BioBERT model identified blood clots (thrombosis) with a precision < 0.5, likely because of the many ways in which thrombosis can be described in notes and the lack of a causal drug relationship. We therefore developed a more sophisticated approach. Starting with all NLP‑predicted adverse events and their flanking text, we used a locally cached base large language model (LLM), llama3, with prompt engineering to classify the text excerpts. The NLP/LLM hybrid approach performed substantially better (precision = 0.87) and allowed us to efficiently parse > 18k clinic notes in two hours.
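The hybrid step can be sketched roughly as below: take each NLP-predicted span, cut out its flanking text, wrap it in a prompt, and let a model decide. Everything here is an assumption for illustration: the window size, the prompt wording, and the `llm` callable are hypothetical, and `stub_llm` is a trivial stand-in so the example runs without a model. It is not llama3 and not the prompt used in this work.

```python
from typing import Callable, List, Tuple

# Hypothetical prompt wording; the actual prompt is not given in the source.
PROMPT_TEMPLATE = (
    "You are reviewing an excerpt from an oncology clinic note.\n"
    "Does the excerpt describe a thrombosis (blood clot) event in this patient?\n"
    "Answer with exactly one word: YES or NO.\n\n"
    "Excerpt:\n{excerpt}\n"
)


def flanking_text(note: str, start: int, end: int, window: int = 200) -> str:
    """Return the predicted span plus up to `window` characters on each side."""
    return note[max(0, start - window): min(len(note), end + window)]


def classify_excerpts(
    note: str,
    spans: List[Tuple[int, int]],
    llm: Callable[[str], str],
    window: int = 200,
) -> List[Tuple[int, int, bool]]:
    """Ask the model about each NLP-predicted span; keep its YES/NO verdict."""
    results = []
    for start, end in spans:
        prompt = PROMPT_TEMPLATE.format(excerpt=flanking_text(note, start, end, window))
        answer = llm(prompt).strip().upper()
        results.append((start, end, answer.startswith("YES")))
    return results


def stub_llm(prompt: str) -> str:
    # Stand-in for a locally hosted model: rejects one simple negation pattern.
    excerpt = prompt.split("Excerpt:\n", 1)[1].lower()
    return "NO" if "no evidence of" in excerpt else "YES"


def find_span(note: str, phrase: str) -> Tuple[int, int]:
    start = note.index(phrase)
    return start, start + len(phrase)


# Hypothetical note text, for illustration only.
negated = "CT chest: no evidence of pulmonary embolism. Plan: continue therapy."
positive = "Patient developed a DVT in the left leg after cycle 2."

negated_result = classify_excerpts(negated, [find_span(negated, "pulmonary embolism")], stub_llm)
positive_result = classify_excerpts(positive, [find_span(positive, "DVT")], stub_llm)
print(negated_result, positive_result)
```

Passing the model in as a callable keeps the excerpting and prompting logic independent of how the model is served, so the stub can be swapped for a real local inference call without touching the rest.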
These AI‑enhanced pipelines demonstrate the power of combining NLP and LLMs to extract critical information from EHRs, thereby enabling more efficient and accurate oncology research and clinical trial matching.

Scott Newman
Dr. Scott Newman is Senior Vice President of Life Sciences at Meaningful Insights Biotech Analytics (MiBA). MiBA is a healthcare AI technology company on a mission to close the feedback loop between physicians, patients, and industry partners. The technology and data insights that MiBA generates will support pharmaceutical companies, research organizations, payers, biotech, health systems and providers to advance the quality of patient care. Scott leads the team responsible for developing and implementing data models, ensuring data accuracy, conducting real-world evidence studies, and providing crucial support to MiBA’s partner oncology networks. With over fifteen years of experience in translational research and clinical oncology, Scott brings a wealth of expertise to his role. His career encompasses significant contributions at institutions including Cancer Research UK, Emory University, St. Jude Children’s Research Hospital and GeneDx. Scott holds a PhD from the University of Cambridge and a Master’s degree in Medical Genetics, providing a strong foundation for his work in advancing precision oncology through data analytics.

Zach Liu
Dr. Zongzhi “Zach” Liu is a Senior Research Scientist in Life Sciences at Meaningful Insights Biotech Analytics (MiBA). He leads the development of a real-world data (RWD)-driven platform to enhance patient recruitment for clinical trials and spearheads the creation of AI tools for extracting insights from clinical notes, pathological reports, and radiological findings. Dr. Liu holds a Ph.D. in Genetics from the Chinese Academy of Sciences and has over a decade of experience collaborating with cancer researchers, specializing in next-generation sequencing (NGS) analyses at Yale Cancer Center. Additionally, he has six years of expertise in clinical research and data delivery for pharmaceutical companies, leveraging RWD and AI technologies. His unique skill set enables him to effectively bridge the gap between clinical researchers and data engineers, driving impactful innovations in biotech analytics.