The age of synthetic data and small LLMs
Recent advances in Large Language Models (LLMs) have been driven largely by improvements in pre-training datasets. New data curation strategies are being explored, including using synthetic data not only to create high-quality training samples but also to train classifiers that filter web content effectively.
In this talk, we will discuss the methodologies behind the creation of the Cosmopedia and FineWeb-Edu datasets and how they led to the development of the SmolLM models, a series of compact yet powerful LLMs.
Loubna Ben Allal
Machine Learning Engineer at Hugging Face
Loubna Ben Allal is a Machine Learning Engineer in the Science team at Hugging Face, where she leads efforts on synthetic data for pre-training and small LLMs. Previously, she worked on large language models for code and was a core member of the BigCode team behind The Stack datasets and StarCoder models for code generation. Loubna holds master’s degrees in Mathematics and Deep Learning from Ecole des Mines de Nancy and ENS Paris-Saclay.