Common Corpus is one of the largest public-domain datasets for LLM training, coordinated by Pleias (a technology company) in collaboration with HuggingFace, Occiglot, Eleuther, and Nomic AI. The dataset includes public-domain books and newspapers in several languages, drawn from national libraries and archives along with other sources. It also includes language data in English, French, Dutch, Spanish, German, and Italian.
A growing observatory of examples of how open data from official sources and generative artificial intelligence (AI) are intersecting across domains and geographies.
Share your project for inclusion. We seek to learn from generative AI initiatives that use open government and research data across a Spectrum of Scenarios. More information on each scenario can be found in our report: A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI.