The age of synthetic data and small LLMs

Recent advances in Large Language Models (LLMs) have been driven largely by improvements in pre-training datasets. New data curation strategies are being explored, including the use of synthetic data not only to create high-quality samples but also to train classifiers that filter web content effectively. In this talk, we will discuss the methodology behind the Cosmopedia and FineWeb-Edu datasets and how they led to the development of the SmolLM models, a series of compact yet powerful LLMs.

About the speaker

Loubna Ben Allal

Machine Learning Engineer at Hugging Face

NLP-Summit

When

Online Event: September 24, 2024

 

Contact

nlpsummit@johnsnowlabs.com

Presented by

John Snow Labs