BLOOM is an open-access, multilingual large language model (LLM) trained on a mix of publicly available datasets, including community-selected data and filtered web-crawled data. Its training corpus, known as ROOTS, draws on open data sources such as Project Gutenberg, OpenSubtitles, and HAL (open-access scientific publications), as well as government data and open research repositories such as the Catalan Government Crawling and the United Nations Parallel Corpus. BLOOM is designed to generate human-like text in 46 natural languages and 13 programming languages, and it is available for use and further development by researchers and institutions worldwide.
A growing observatory of examples of how open data from official sources and generative artificial intelligence (AI) are intersecting across domains and geographies.
Share your project for inclusion. We seek to learn from generative AI initiatives that use open government and research data across a Spectrum of Scenarios. More information on each scenario can be found in our report, "A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI."