Disambiguation – Linking Data Science and Engineering
Disambiguation, or entity linking, is the assignment of a knowledge base identifier (for example, from Wikidata or Wikipedia) to a named entity. Our goal was to improve an MVP model by adding newly created knowledge while maintaining competitive F1 scores.
Taking an entity linking model from MVP into production in a spaCy-native pipeline architecture posed several data science and engineering challenges, such as hyperparameter estimation and knowledge enhancement. We addressed these by using Docker and Kubernetes to semi-automate training as an on-demand job.
We also discuss some of the lessons we learned and the process improvements that were needed to strike a balance between data science goals and engineering constraints, and we present our current work on improving performance through contextual similarity based on BERT embeddings.
Sebastian is a Machine Learning Engineer at Adarga. He is responsible for Adarga's entity recognition and disambiguation pipeline and actively researches ways to improve its core and auxiliary pipes. In addition, he has created components for extractive and abstractive summarization and geoparsing.
Previously, he worked as a Data Scientist and Data Developer in the financial industry. Sebastian holds a Master's degree in Computer Science (Financial Risk Management) from University College London and two Bachelor's degrees, from the University of Tübingen and the University of Vienna respectively.
Felisia Loukou is a Senior Data Scientist at Adarga, specializing in metadata mining and knowledge discovery. She previously worked at the Government Digital Service, optimizing GOV.UK, the UK's website for government services and information. There, she co-founded GOV.UK Data Labs, a multi-disciplinary team that helps GOV.UK leverage its unstructured data estate. She led GOV.UK's structured content strategy to improve user experience and established the organization's knowledge graph.
She also led the Data Science Accelerator mentorship programme, which enables analysts in the public sector to develop their data science skills. Felisia holds Master's degrees in Computer Science and Natural Language Processing from the University of Oxford and the University of Edinburgh.