DataGemma is an initiative by Google and Data Commons that seeks to improve the quality of AI output using statistical data. The team augments the Gemma model with RIG (Retrieval-Interleaved Generation) and RAG (Retrieval-Augmented Generation), drawing on data from Data Commons, and makes the model open access. Through these processes, the team aims to create LLMs for researchers and developers to use.
A growing observatory of examples of how open data from official sources and generative artificial intelligence (AI) are intersecting across domains and geographies.
Share your project for inclusion. We seek to learn from generative AI initiatives that use open government and research data across a Spectrum of Scenarios. More information on each scenario can be found in our report: A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI.
Region: international
Sector: private_sector
Scenario: pre-training
Start Date: 2024
Location: Global
Common Corpus is one of the largest public-domain datasets for LLM training, coordinated by Pleias (a technology company) in collaboration with HuggingFace, Occiglot, Eleuther, and Nomic AI. The dataset includes public domain books and newspapers in several languages from national libraries and archives along with other sources. It also includes language data in English, French, Dutch, Spanish, German and Italian.
Region: international
Sector: private_sector, civic_tech
Scenario: pre-training
Start Date: 2024
Location: Global
Alva is a generative AI chatbot based on GPT-4o mini that uses RAG to answer queries about Basel-Stadt. A key feature of the chatbot is its ability to provide attributed responses, citing the webpage or information source that each response came from. Currently, the chatbot can draw from publicly available information on the Basel-Stadt website (www.bs.ch).
Region: emea
Sector: public_sector
Scenario: adaptation
Start Date: 2024
Location: Switzerland
The AI Hub is a platform developed by the government of South Korea that aims to accelerate AI innovation using open government data in the private sector. The platform houses South Korea's AI infrastructure and open government datasets for AI development and offers several services such as data quality evaluations. To complement these efforts, the government of Seoul is experimenting with creating synthetic data from open government data. One initiative developed using the AI Hub is the TTCare initiative (an AI driven mobile application for pets) which was trained on data from the AI Hub along with other sources.
Region: apac
Sector: public_sector
Scenario: pre-training, data_augmentation
Start Date: 2024
Location: South Korea
Bayaan is a conversational tool developed by the Statistics Centre Abu Dhabi that aims to improve access to data from the Statistical Department. The tool uses generative AI to rapidly provide decision makers with data analytics, visualizations, and information that they can use in their decision making processes. The data included focuses on 7 areas and indicators: "Economy, Population, Industry, Social Statistics, Labour Force, Agriculture, and Environment."
Region: emea
Sector: public_sector
Scenario: open-ended_exploration
Start Date: 2024
Location: United Arab Emirates
Region: apac
Sector: academia
Scenario: data_augmentation
Start Date: 2024
Location: South Korea
Berufsinfomat is a generative AI-driven tool (relying on ChatGPT) introduced by the Austrian Public Employment Service for career coaching. The system, trained on the Austrian Public Employment Service's knowledge database on professions, training, and education, is intended to offer users information on those topics. The Berufsinfomat received 160,000 prompts in January 2024 and around 20,000 additional monthly inquiries. It was criticized for producing biased responses, including ones that conformed to stereotypes about men and women, and for various other problematic answers. It has been revised several times in response to these problems.
Region: emea
Sector: public_sector
Scenario: inference_and_insight_generation
Start Date: 2024
Location: Austria
Region: north_america
Sector: academia
Scenario: inference_and_insight_generation
Start Date: 2024
Location: United States
In February 2023, Brazil's Federal Court of Accounts (TCU) launched ChatTCU, which uses OpenAI's ChatGPT and data sourced from the Federal Court of Accounts system. It allows auditors to request summaries of case documents and pose technical questions about the TCU and court decisions, and it provides administrative services.
Region: latin_america_and_the_caribbean
Sector: public_sector
Scenario: open-ended_exploration_adaptation
Start Date: 2024
Location: Brazil
Citymeetings.nyc is an independent initiative that uses LLMs to synthesize information from New York City Council meetings. It uses data from Legistar, an online platform where the government posts meeting summaries and agendas.
Region: north_america
Sector: civic_tech
Scenario: inference_and_insight_generation
Start Date: 2024
Location: United States
The Data Science Campus of the United Kingdom's Office for National Statistics has developed ClassifAI, an experimental tool that uses large language models to organize text into categories (e.g. industry). It aims to improve upon existing classification methods by offering greater flexibility and potentially higher accuracy for tasks such as categorizing labor market survey responses. The code has been released as open-source. The developers note that further assessment is needed before potential use in official statistics production.
Region: international
Sector: public_sector
Scenario: inference_and_insight_generation
Start Date: 2024
Location: United Kingdom
Trained on satellite imagery and earth observation data, Clay is a generative AI foundation model designed to understand and analyze Earth's surface. It can generate mathematical representations of any location on Earth at any given time, which can be used for various tasks like creating land cover maps, detecting crop or burn scars, and tracking deforestation. The AI model is open source.
Region: international
Sector: non-profit
Scenario: open-ended_exploration
Start Date: 2024
Location: United States
Region: emea
Sector: academia, private_sector
Scenario: inference_and_insight_generation
Start Date: 2024
Location: France
Region: international
Sector: private_sector
Scenario: inference_and_insight_generation
Start Date: 2024
Location: United States
Region: north_america
Sector: academia
Scenario: adaptation
Start Date: 2024
Location: United States
Developed by DS-I Africa (a research program in the United States funded by the National Institutes of Health) and the University of KwaZulu-Natal, DataLaw.Bot is a generative AI chatbot launched in October 2024 for researchers from several countries across the African continent to use in assessing data sharing regulations for scientific research. The chatbot was adapted from ChatGPT using national-level data sharing regulations, with the goal of increasing access to research data across the continent.
Region: emea
Sector: public_sector, academia
Scenario: adaptation
Start Date: 2024
Location: Botswana, Cameroon, Ghana, Kenya, Malawi, Nigeria, Rwanda, South Africa, Tanzania, The Gambia, Uganda, and Zimbabwe
The DC Compass AI assistant is a generative AI chat interface that provides answers to user queries based on datasets from Open Data DC. The interface can provide a summary of a dataset along with supporting visualizations, graphs, and maps. Currently, this project is a pilot program running a beta test open to the public. The team notes that the quality of the output depends on the quality and breadth of the data from Open Data DC.
Region: north_america
Sector: public_sector
Scenario: inference_and_insight_generation
Start Date: 2024
Location: United States
Region: north_america
Sector: public_sector, academia
Scenario: adaptation
Start Date: 2024
Location: United States
Region: emea
Sector: public_sector
Scenario: inference_and_insight_generation
Start Date: 2024
Location: Germany
Developed by researchers at the Indraprastha Institute of Information Technology-Delhi, GeneSilico Copilot is a tool used to support oncologists. It draws on data from Drugbank Open Data, FDA drug labels, RxList, Therapeutic Target Database, Drugs.com, and Wikipedia to offer advice on treatment decisions based on observed facts about a given patient.
Region: apac
Sector: academia
Scenario: inference_and_insight_generation
Start Date: 2024
Location: India
GeoLLM-Engine, developed by researchers at CoStrategist R&D Group and Microsoft Corporation, is an interface for interacting with geospatial data. The system includes a set of tools for analyzing maps and conducting spatial research. The development team is currently focused on improving the quality of outputs and refining the user interface. GeoLLM-Engine aims to serve professionals in fields that utilize geospatial analysis, such as urban planning and environmental monitoring.
Region: north_america
Sector: private_sector
Scenario: pre-training, open-ended_exploration
Start Date: 2024
Location: United States
GoldCoin is a large language model developed for the legal domain by researchers at the Department of Computer Science and Engineering, HKUST, in Hong Kong SAR, China. It specializes in detecting violations of HIPAA privacy rules based on specific queries. The model was trained using legal data from Harvard University's Caselaw Access Project, which offers public access to United States legal decisions. The research team suggests that GoldCoin could potentially be adapted to address other privacy laws in the future.
Region: apac
Sector: academia
Scenario: inference_and_insight_generation
Start Date: 2024
Location: China
GovTech's Data Science and Artificial Intelligence Division (DSAID) has developed a system to assist in drafting parliamentary replies* using artificial intelligence. The project uses machine learning techniques to train language models on past parliamentary data, aiming to generate responses that match the style and accuracy of official replies. This tool is designed to help public servants in Singapore more efficiently prepare answers to parliamentary questions, while also exploring the broader potential of customized AI models for government applications. *Parliamentary replies are official answers given by government ministers or representatives to questions asked by members of parliament during legislative sessions.
Region: apac
Sector: public_sector
Scenario: inference_and_insight_generation
Start Date: 2024
Location: Singapore
The I14Y Interoperability Platform is Switzerland's national data catalogue, designed to improve access to data between authorities, businesses, and citizens. It provides a centralized repository for data collections, application interfaces, and government services from different levels of government. The platform offers services such as a searchable catalogue, concept definitions, news updates, and a handbook to support users in navigating and using Switzerland's data infrastructure.
Region: emea
Sector: public_sector
Scenario: data_augmentation
Start Date: 2024
Location: Switzerland
The Indiana Office of Technology and Tyler Technologies (a technology firm) launched a beta version of an AI chatbot that aims to support the public in navigating public services. The chatbot is trained on public information from several departments within the State government and housed on the Government of Indiana website. Before users open the chatbot, a clause states that the State will not be liable for any incorrect or misleading information from the chatbot.
Region: north_america
Sector: public_sector
Scenario: inference_and_insight_generation
Start Date: 2024
Location: United States
Region: apac
Sector: academia
Scenario: adaptation
Start Date: 2024
Location: China
With the support of the Indonesia Endowment Fund for Education (LPDP) of the Ministry of Finance of the Republic of Indonesia, researchers at the University of Nottingham developed KemenkeuGPT, a generative AI chatbot that aims to support policy makers within Indonesia's Ministry of Finance. The chatbot uses RAG and combines data from the Ministry of Finance, Statistics Indonesia, and the International Monetary Fund, among other sources.
Region: apac
Sector: public_sector, academia
Scenario: adaptation
Start Date: 2024
Location: Indonesia
Region: emea
Sector: public_sector
Scenario: adaptation
Start Date: 2024
Location: France
Region: north_america
Sector: academia
Scenario: open-ended_exploration
Start Date: 2024
Location: United States
Researchers at the University of Georgia and State University of New York at Albany used LLMs to analyze the transcripts of United States presidential debates. The team tested 7 debates from the last 24 years using GPT-4o and Claude 3. The team aimed to demonstrate how LLMs can be used to help minimize bias in judging.
Region: north_america
Sector: academia
Scenario: inference_and_insight_generation
Start Date: 2024
Location: United States
Region: emea
Sector: public_sector
Scenario: adaptation
Start Date: 2024
Location: Germany
LuminLab is an online platform that employs generative AI to offer information on improving building energy efficiency. The model is trained using open data from the Energy Performance Certificate dataset provided by the Sustainable Energy Authority of Ireland. The developers are currently working on enhancements, including the integration of geospatial data to generate 3D images of various areas, aiming to expand the platform's capabilities and visual representations.
Region: emea
Sector: academia
Scenario: adaptation
Start Date: 2024
Location: Ireland
Microsoft and Planet collaborated with humanitarian organizations to analyze the impact of Hurricane Beryl in Grenada. This experimental tool uses the Microsoft AI for Good Damage Assessment Visualizer to analyze satellite images from Planet, estimating damage to buildings and structures on the island of Carriacou. The tool provides visual data to support frontline workers in disaster response and logistics.
Region: international
Sector: private_sector
Scenario: inference_and_insight_generation
Start Date: 2024
Location: Grenada
Region: north_america
Sector: public
Scenario: data_augmentation
Start Date: 2024
Location: United States
Researchers at the Wangxuan Institute of Computer Technology at Peking University and the Computer Science Department of the University of California (Los Angeles) developed the Quantitative Reasoning with Data (QRData) Benchmark to assess LLMs' ability to analyze statistical data. QRData includes data from open textbooks, research papers, and other sources, combined with 411 questions. Of the LLMs tested, GPT-4 performed the best, but the researchers noted the need for improvement.
Region: international
Sector: academia
Scenario: pre-training
Start Date: 2024
Location: United States, China
Researchers in Taiwan experimented with using RAG to improve LLMs' ability to answer queries about Taiwanese Hakka culture. The team combined data from the Ministry of Education's Cultural Knowledge Base and Hakka Dictionary with other data sources focused on language and geographic locations. Through this effort, the team aimed to demonstrate the value of integrating a translation function in LLMs to support generative AI technologies that reflect minority cultures.
Region: apac
Sector: private_sector, academia
Scenario: adaptation
Start Date: 2024
Location: Taiwan
Region: emea
Sector: academia, private_sector
Scenario: pre-training
Start Date: 2024
Location: France
Region: north_america
Sector: private_sector
Scenario: inference_and_insight_generation
Start Date: 2024
Location: United States
Region: emea
Sector: public_sector, academia
Scenario: pre-training
Start Date: 2024
Location: Switzerland
Region: apac
Sector: public_sector, academia
Scenario: data_augmentation
Start Date: 2024
Location: Australia
Region: north_america
Sector: public_sector
Scenario: data_augmentation
Start Date: 2024
Location: United States
The Virtual Intelligent Chat Assistant (VICA) is an online platform by Singapore's Government Technology Agency (GovTech) that public servants from across the government can use to create their own generative AI chatbots. In a blog post published in Towards Data Science, representatives from GovTech discuss a proof of concept they developed using VICA for the Department of Statistics' data. The team created a chatbot that could respond to queries about national statistics (such as GDP) in a table format.
Region: apac
Sector: public_sector
Scenario: inference_and_insight_generation
Start Date: 2024
Location: Singapore
Region: latin_america_and_the_caribbean
Sector: non-profit
Scenario: inference_and_insight_generation
Start Date: 2024
Location: Mexico
Dolma is an open dataset created by the Allen Institute for AI, made up of academic research along with other data sources such as books, website content, and code. The dataset currently hosts 3 trillion tokens and is accompanied by a toolkit on how to source datasets for training purposes.
Region: international
Sector: civic_tech, non-profit
Scenario: pre-training
Start Date: 2023
Location: Global
Region: north_america
Sector: academia
Scenario: adaptation
Start Date: 2023
Location: United States
The City of Helsinki has adapted general purpose LLMs to improve its civic services, including urban planning and public facilities. These generative AI tools are fine-tuned using open city data, such as zoning regulations and planning documents, to facilitate civic engagement. These tools aim to enable more efficient communication with residents while enhancing the accessibility of complex information.
Region: emea
Sector: public_sector
Scenario: adaptation
Start Date: 2023
Location: Finland
ClimateQ&A is a generative AI chatbot developed from the ChatGPT API to provide responses to queries about climate change. The chatbot was created by Ekimetrics -- a data and AI firm based in France -- and uses data from reports from the Intergovernmental Panel on Climate Change (IPCC) and the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES). While its primary objective is to make climate change scientific information more accessible, it also helps to understand the types of questions people have about climate change. The team uses NLP to analyze these questions and identify where there are knowledge gaps.
Region: international
Sector: private_sector
Scenario: inference_and_insight_generation
Start Date: 2023
Location: France, Global
Region: north_america
Sector: academia
Scenario: adaptation
Start Date: 2023
Location: United States
Region: north_america
Sector: non-profit, private_sector
Scenario: open-ended_exploration
Start Date: 2023
Location: Global
Region: apac
Sector: private_sector, academia
Scenario: adaptation
Start Date: 2023
Location: Hong Kong
Representatives from Digital Green India (an NGO) and Microsoft Research (India) developed a generative AI chatbot for agricultural services. The chatbot provides farmers with text, audio, and video responses to queries about agriculture. The chatbot uses RAG and draws on research papers and other data sources. It has been implemented in Kenya, India, Ethiopia, and Nigeria thus far.
Region: international
Sector: private_sector
Scenario: adaptation
Start Date: 2023
Location: Kenya, India, Ethiopia, and Nigeria
Region: north_america
Sector: public_sector
Scenario: data_augmentation
Start Date: 2023
Location: United States
GenSpectrum is a generative AI chatbot for COVID-19 genomic sequencing data from the GISAID Data Science Initiative (an initiative focused on generating access to data related to pathogens through partnerships). The chatbot was developed by researchers at the Department of Biosystems Science and Engineering, ETH Zürich and the Swiss Institute of Bioinformatics. The team aims to support research in the medical domain. The chatbot is not yet available online.
Region: international
Sector: academia, non-profit
Scenario: inference_and_insight_generation
Start Date: 2023
Location: Switzerland
Jugalbandi is a generative AI-powered language translation tool that improves access to government programs and rights information across India. It leverages open government data related to various welfare schemes and services, using generative AI models to provide accurate translations in multiple local languages. The AI facilitates communication between citizens and the government, helping individuals understand and access services regardless of language barriers. This initiative democratizes access to official data and government resources, promoting inclusion in public services.
Region: apac
Sector: public_sector
Scenario: inference_and_insight_generation
Start Date: 2023
Location: India
Region: north_america
Sector: academia, non-profit
Scenario: adaptation
Start Date: 2023
Location: United States
Region: north_america
Sector: private
Scenario: adaptation
Start Date: 2023
Location: United States
Region: apac
Sector: academia
Scenario: pre-training
Start Date: 2023
Location: India
Region: north_america
Sector: academia
Scenario: open-ended_exploration
Start Date: 2023
Location: United States
Region: emea
Sector: academia
Scenario: pre-training, adaptation
Start Date: 2023
Location: Germany
Region: emea
Sector: civic_tech
Scenario: open-ended_exploration
Start Date: 2023
Location: Germany
Region: north_america
Sector: private
Scenario: data_augmentation
Start Date: 2023
Location: United States
Region: apac
Sector: public_sector
Scenario: inference_and_insight_generation
Start Date: 2023
Location: Singapore
To help improve the accessibility and usability of their open data platform, the International Monetary Fund (IMF) is prototyping a new generative AI tool that they are calling StatGPT. StatGPT will act as a user interface that processes natural language requests to find relevant datasets from the IMF’s repository. StatGPT will help users find indicators, visualize data in tables and charts, and generate Python code for analysis. The team is currently developing interface features and will then seek to integrate it in Excel.
Region: international
Sector: multilateral_sector
Scenario: inference_and_insight_generation
Start Date: 2023
Location: Europe and North America
Region: north_america
Sector: public_sector
Scenario: data_augmentation
Start Date: 2023
Location: Canada
Region: international
Sector: non-profit
Scenario: open-ended_exploration
Start Date: 2023
Location: United States
Region: north_america
Sector: civic_tech
Scenario: inference_and_insight_generation
Start Date: 2023
Location: Canada
Region: emea
Sector: private_sector
Scenario: inference_and_insight_generation
Start Date: 2023
Location: Spain
Region: north_america
Sector: public_sector
Scenario: pre-training
Start Date: 2023
Location: United States
Wobby is a generative AI-powered interface that can answer queries related to specific open datasets and produce summaries and visualizations of those datasets as responses. The platform is focused primarily on democratizing access to open government data, and currently hosts datasets from organizations like Statbel (Belgium’s national statistical office), Statistics Netherlands and Eurostat, as well as data from intergovernmental organizations like the World Bank. Wobby's latest update allows for automatic data updates and real-time analysis based on current information.
Region: emea
Sector: private
Scenario: inference_and_insight_generation
Start Date: 2023
Location: Belgium
Region: north_america
Sector: civic_tech
Scenario: adaptation
Start Date: 2022
Location: United States
Region: north_america
Sector: private_sector, academia
Scenario: inference_and_insight_generation
Start Date: 2022
Location: United States
Region: emea
Sector: civic_tech
Scenario: inference_and_insight_generation
Start Date: 2022
Location: Europe and North America
Region: emea
Sector: public_sector
Scenario: open-ended_exploration
Start Date: 2022
Location: Europe and North America
Region: north_america
Sector: private_sector
Scenario: inference_and_insight_generation
Start Date: 2021
Location: United States
UrbanSim is an open-source platform that uses generative AI to model urban growth and simulate land use, transportation, and demographic shifts. The platform integrates various datasets, including open-source data on land use, population demographics, and transportation infrastructure, to generate development scenarios that help city planners and researchers make informed decisions about urban growth. UrbanSim aids in visualizing the impacts of policy changes, transportation development, and housing strategies, offering a dynamic tool for sustainable urban planning. The project emphasizes the use of open research data from official sources to simulate realistic and adaptive urban environments.
Region: international
Sector: private_sector
Scenario: adaptation
Start Date: 2021
Location: Global
Region: north_america
Sector: academia
Scenario: adaptation
Start Date: 2020
Location: United States
Region: emea
Sector: public_sector
Scenario: inference_and_insight_generation
Start Date: 2020
Location: Europe and North America
Region: latin_america_and_the_caribbean
Sector: public_sector
Scenario: inference_and_insight_generation
Start Date: 2020
Location: Mexico
Region: north_america
Sector: academia
Scenario: inference_and_insight_generation
Start Date: 2019
Location: United States
Region: latin_america_and_the_caribbean
Sector: public_sector
Scenario: inference_and_insight_generation
Start Date: 2019
Location: Argentina
Region: emea
Sector: academia
Scenario: adaptation
Start Date: 2019
Location: United States
Region: international
Sector: private
Scenario: data_augmentation
Start Date: 2019
Location: United States
Virtual Singapore is a dynamic 3D digital twin model that leverages generative AI to simulate and analyze urban development scenarios. The platform integrates various open data sources, including satellite imagery, sensor data, and social media inputs, to create a real-time representation of the city. Using generative AI, the system generates scenarios for urban planning, infrastructure development, and emergency response planning. Virtual Singapore helps city planners visualize the impact of policy decisions, environmental changes, and demographic trends. The platform is built on open research data and open data from various governmental and institutional sources, supporting data-driven decision-making for sustainable urban growth.
Region: apac
Sector: public_sector
Scenario: open-ended_exploration
Start Date: 2019
Location: Singapore
Region: north_america
Sector: private_sector
Scenario: data_augmentation
Start Date: 2017
Location: United States
Region: north_america
Sector: non-profit, academia
Scenario: pre-training
Start Date: 2014
Location: United States