Smart Shortcuts for Bootstrapping a Healthcare NER Project
In domain-specific data science, people often struggle to start projects without decent training data. For token classification in particular, assembling annotations is a time-consuming issue. There are now many shortcuts that can give such projects a head start, such as active learning, weak supervision, few-shot learning, and cross-lingual models; in practice, however, they are rarely used.
This presentation shows how anyone can bootstrap a scarce-data project by working through a set of medical transcriptions. The transcriptions are pre-annotated using a list of available keywords, and these keyword annotations are then evaluated through data exploration in the Argilla app. We further refine our approach by applying few-shot NER with my concise-concepts package. Additionally, we compose a weak-supervision rule base, which is actively maintained by an active-learning listener running as a background process during annotation. Lastly, we combine all of these shortcuts into a Spark NLP training pipeline for NER.
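The keyword pre-annotation step described above can be sketched in plain Python. This is a minimal illustration, not the talk's actual implementation: the `KEYWORDS` gazetteer and `pre_annotate` function are hypothetical names, and in practice the resulting spans would be loaded into Argilla as suggestions for annotators to review.

```python
import re

# Hypothetical gazetteer mapping entity labels to known keywords.
KEYWORDS = {
    "MEDICATION": ["ibuprofen", "metformin"],
    "SYMPTOM": ["nausea", "headache"],
}

def pre_annotate(text):
    """Return (start, end, label) character spans for each keyword found in text."""
    spans = []
    for label, words in KEYWORDS.items():
        for word in words:
            # Word-boundary, case-insensitive match to avoid partial-token hits.
            for m in re.finditer(rf"\b{re.escape(word)}\b", text, re.IGNORECASE):
                spans.append((m.start(), m.end(), label))
    return sorted(spans)

annotated = pre_annotate("Patient reports nausea after taking Metformin.")
# Each span can then be turned into a pre-annotation suggestion in Argilla.
```

A real pipeline would typically use spaCy pattern matching instead of raw regexes, but the idea is the same: cheap, rule-based spans give annotators a head start.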
David Berenstein
Developer Advocate Engineer at Argilla
He started off with the ambition to study healthcare, moved on to industrial engineering, and then fell in love with programming. Throughout his studies, he worked on several healthcare topics, among which were using GANs for MRI at Kyoto University and using paraphrasing for intent generation for a healthcare chatbot. During his two years as a Machine Learning Engineer for private intelligence at Pandora Intelligence, David came into contact with open-source and developed several NLP packages: concise-concepts for few-shot NER, classy-classification for few-shot classification in spaCy, crosslingual-coreference for multi-lingual coreference resolution, and fast-sentence-transformers for ONNX-enabled sentence-transformers.