Introducing the MultiCaRe Dataset: A Multimodal Case Report Dataset of Clinical Cases, Images, Labels and Captions

The MultiCaRe Dataset is a multimodal case report dataset built from 75,382 open-access PubMed Central articles published between 1990 and 2023. It includes 96,428 clinical cases from different medical specialties, along with 135,596 images and their corresponding labels and captions. The structure of the dataset makes it straightforward to combine these data types, which makes it a valuable resource for training or fine-tuning medical language, computer vision, or multimodal models. Besides describing the contents of the dataset, this presentation will go through how it was created, which involved data extraction and preprocessing with different resources (Biopython, Spark NLP for Healthcare and OpenCV, among others). Finally, we will learn how to create a customized subset for a specific use case. To do this, we will use the MedicalDatasetCreator class, which can filter clinical cases by patient demographics, article metadata, strings, and image labels.
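To give a feel for what building such a subset involves, here is a minimal, self-contained sketch of filtering cases by demographics, article metadata, strings, and image labels. It does not use MedicalDatasetCreator and does not reflect its actual interface: the `Case` record, its fields, the `filter_cases` function, and the sample data are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    # Hypothetical record layout, not the dataset's actual schema.
    case_id: str
    age: int
    gender: str
    year: int  # publication year of the source article
    text: str
    image_labels: list = field(default_factory=list)


def filter_cases(cases, min_age=None, max_age=None, gender=None,
                 min_year=None, keywords=(), labels=()):
    """Keep the cases that satisfy every filter that was supplied."""
    selected = []
    for case in cases:
        # Demographic filters.
        if min_age is not None and case.age < min_age:
            continue
        if max_age is not None and case.age > max_age:
            continue
        if gender is not None and case.gender != gender:
            continue
        # Article metadata filter.
        if min_year is not None and case.year < min_year:
            continue
        # String filter: at least one keyword must appear in the case text.
        if keywords and not any(k.lower() in case.text.lower() for k in keywords):
            continue
        # Label filter: at least one requested label must be attached to an image.
        if labels and not set(labels) & set(case.image_labels):
            continue
        selected.append(case)
    return selected


# Made-up sample records, for illustration only.
cases = [
    Case("c1", 67, "Female", 2019, "Chest CT showed a pulmonary nodule.", ["ct", "chest"]),
    Case("c2", 8, "Male", 2021, "Radiograph revealed a forearm fracture.", ["x_ray"]),
    Case("c3", 54, "Female", 2005, "Brain MRI demonstrated a tumor.", ["mri", "brain"]),
]

subset = filter_cases(cases, min_age=18, gender="Female", min_year=2010,
                      labels=["ct", "mri"])
print([c.case_id for c in subset])
```

Each filter is optional and the filters are combined with a logical AND, so adding a criterion can only narrow the subset. The real class may combine its filters differently.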

About the speaker
Mauro Nievas Offidani

MD, Data Scientist at National University of the South

NLP-Summit

When

Online Event: September 25, 2024

Contact

nlpsummit@johnsnowlabs.com

Presented by

John Snow Labs