The age of synthetic data and small LLMs
Recent advances in Large Language Models (LLMs) have been driven largely by improvements in pre-training datasets. New data curation strategies are being explored, including using synthetic data both to create high-quality samples and to develop classifiers that filter web content effectively. In this talk, we will discuss the methodologies behind the Cosmopedia and FineWeb-Edu datasets and how they led to the SmolLM models, a series of compact yet powerful LLMs.
About the speaker
Loubna Ben Allal
Machine Learning Engineer at Hugging Face