Observatory of Examples of How Open Data and Generative AI Intersect

A growing observatory of examples of how open data from official sources and generative artificial intelligence (AI) are intersecting across domains and geographies.

Share your project for inclusion. We seek to learn from generative AI initiatives that use open government and research data across a Spectrum of Scenarios. More information on each scenario can be found in our report: A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI.

Instruction Tuning for Low-Resource Languages: A Case Study in Kazakh

This project, developed by researchers from Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) and Cerebras Systems, builds a large-scale instruction-following dataset for the Kazakh language. It draws on open data from government and cultural sources, including Kazakhstan's e-Government portal (gov.kz) and Kazakh Wikidata, and covers Kazakhstan's governmental structure, legal frameworks, and cultural heritage. Generative AI, specifically GPT-4o, is used to help turn these government and cultural texts into instruction data, which is then used to instruction-tune language models so that they better understand and follow instructions in Kazakh. The goal is to improve language models' understanding of local governance and culture in Kazakhstan.
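As a rough illustration of the workflow described above, the following Python sketch shows how a passage from an open source such as gov.kz might be wrapped in a generation prompt and how a resulting instruction/answer pair might be stored. It is not the project's actual pipeline: the prompt wording, the function names (`build_generation_prompt`, `make_record`, `to_jsonl`), and the record fields are all assumptions made for illustration, and the call to a generative model is deliberately left out.

```python
import json

# Hypothetical prompt template. The wording is an assumption for illustration,
# not the prompt used by the project.
PROMPT_TEMPLATE = (
    "Read the following Kazakh-language text from {source}.\n"
    "Write one instruction in Kazakh that the text can answer, "
    "followed by the answer.\n\n"
    "Text:\n{text}"
)


def build_generation_prompt(text: str, source: str) -> str:
    """Fill the template with one source passage (whitespace trimmed)."""
    return PROMPT_TEMPLATE.format(source=source, text=text.strip())


def make_record(instruction: str, response: str, source: str) -> dict:
    """Package one generated pair as an instruction-tuning record.

    The field names are assumed; real datasets use varying schemas.
    """
    return {
        "instruction": instruction.strip(),
        "output": response.strip(),
        "source": source,
    }


def to_jsonl(records: list[dict]) -> str:
    """Serialize records one per line, keeping Kazakh Cyrillic readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

In a setup like this, the prompt would be sent to a generative model, the reply would be reviewed, and the accepted pair would be passed to `make_record`. Writing with `ensure_ascii=False` keeps Kazakh text legible in the output file instead of escaping it.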

Region

Asia-Pacific (APAC)

Sector

Private sector, Academia

Scenario

Pre-training

Start Date

2025

Location: United States, United Arab Emirates