Observatory of Examples of How Open Data and Generative AI Intersect

A growing observatory of examples of how open data from official sources and generative artificial intelligence (AI) are intersecting across domains and geographies.

Share your project for inclusion. We seek to learn from generative AI initiatives that use open government and research data across a Spectrum of Scenarios. More information on each scenario can be found in our report: A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI.

Common Corpus

Common Corpus is one of the largest public-domain datasets for LLM training coorindated by Pleias (a technology company) in collaboration with HuggingFace, Occiglot, Eleuther, and Nomic AI. The dataset includes public domain books and newspapers in several languages from national libraries and archives along with other sources. It also includes language data in English, French, Dutch, Spanish, German and Italian.

Region

international

Sector

private_sectorcivic_tech

Scenario

pre-training

Start Date

2024

Location: Global