Introducing the MultiCaRe Dataset: A Multimodal Case Report Dataset of Clinical Cases, Images, Labels and Captions
The MultiCaRe Dataset is a multimodal case report dataset that contains data from 75,382 open-access PubMed Central articles spanning the period from 1990 to 2023.
It includes 96,428 clinical cases from different medical specialties, along with 135,596 images and their corresponding labels and captions. The structure of the dataset allows different types of data to be combined easily, making it a valuable resource for training or fine-tuning medical language, computer vision, or multimodal models.
Beyond describing the contents of the dataset, this presentation will walk through how it was created, a process that involved data extraction and preprocessing with tools such as Biopython, Spark NLP for Healthcare, and OpenCV, among others.
Finally, we will learn how to create a customized subset based on a specific use case. To achieve this, we will leverage the MedicalDatasetCreator class, which provides the capability to filter clinical cases by patient demographics, article metadata, strings, and image labels.
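To give a rough idea of what filtering by demographics, article metadata, strings, and image labels can look like, here is a minimal Python sketch. It does not use the MedicalDatasetCreator API: the `Case` record, the `filter_cases` function, its parameters, and the three toy records are all hypothetical and made up for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Case:
    """Hypothetical, simplified stand-in for one clinical case record."""
    case_id: str
    age: Optional[int]          # patient demographics
    gender: Optional[str]       # patient demographics
    year: int                   # article metadata
    text: str                   # case narrative
    image_labels: List[str] = field(default_factory=list)


def filter_cases(cases, min_age=None, gender=None, min_year=None,
                 keywords=None, required_labels=None):
    """Keep the cases that satisfy every filter that was provided."""
    selected = []
    for case in cases:
        # Demographic filters: cases with unknown age are excluded
        # whenever an age threshold is requested.
        if min_age is not None and (case.age is None or case.age < min_age):
            continue
        if gender is not None and case.gender != gender:
            continue
        # Article metadata filter.
        if min_year is not None and case.year < min_year:
            continue
        # String filter: at least one keyword must appear in the text.
        text = case.text.lower()
        if keywords and not any(k.lower() in text for k in keywords):
            continue
        # Image label filter: at least one requested label must be present.
        if required_labels and not set(required_labels) & set(case.image_labels):
            continue
        selected.append(case)
    return selected


# Toy records, invented for this example (not taken from the dataset).
cases = [
    Case("c1", 64, "Female", 2019, "A woman presented with a brain tumor.", ["mri", "head"]),
    Case("c2", 12, "Male", 2021, "A boy with a forearm fracture.", ["x_ray"]),
    Case("c3", 47, "Male", 2005, "MRI revealed a brain abscess.", ["mri"]),
]

subset = filter_cases(cases, min_age=18, min_year=2010,
                      keywords=["brain"], required_labels=["mri"])
print([c.case_id for c in subset])
```

In this sketch the filters are combined with a logical AND, so each added criterion narrows the subset further. Whether the actual class combines filters the same way is not stated in the abstract.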
Mauro Nievas Offidani
MD, Data Scientist at National University of the South
Mauro is a Medical Doctor and Data Scientist from Argentina. He holds a Master’s Degree in Epidemiology, Health Policy and Management, and is currently pursuing his PhD in Medical Imaging. He has contributed to several AI projects in various roles including medical consultant, data annotator, annotation lead, and data scientist.