Observatory of Examples of How Open Data and Generative AI Intersect

A growing observatory of examples of how open data from official sources and generative artificial intelligence (AI) are intersecting across domains and geographies.

Share your project for inclusion. We seek to learn from generative AI initiatives that use open government and research data across a Spectrum of Scenarios. More information on each scenario can be found in our report: A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI.

Name
Region
Sector
Scenario
Start Date
DataGemma

DataGemma is an initiative by Google and Data Commons which seeks to improve the quality of the AI output using statistical data. The team augments the Gemma model using RIG (Retrieval-Interleaved Generation) and RAG (Retrieval-Augmented Generation) using data from its Data Commons initiative and makes the model open access. Through these processes, the team aims to create LLMs for researchers and developers to use.

Region

international

Sector

private_sector

Scenario

pre-training

Start Date

2024

Location: Global

Common Corpus

Common Corpus is one of the largest public-domain datasets for LLM training, coordinated by Pleias (a technology company) in collaboration with HuggingFace, Occiglot, Eleuther, and Nomic AI. The dataset includes public domain books and newspapers in several languages from national libraries and archives along with other sources. It also includes language data in English, French, Dutch, Spanish, German and Italian.

Region

international

Sector

private_sector, civic_tech

Scenario

pre-training

Start Date

2024

Location: Global

Alva

Alva is a generative AI chatbot based on GPT-4o mini that uses RAG to answer queries about Basel-Stadt. A key feature of the chatbot is its ability to provide attributed responses, citing the respective webpage or information source where the response came from. Currently, the chatbot can draw from publicly available information on the Basel-Stadt website (www.bs.ch).

Region

emea

Sector

public_sector

Scenario

adaptation

Start Date

2024

Location: Switzerland

South Korea's AI Hub

The AI Hub is a platform developed by the government of South Korea that aims to accelerate AI innovation in the private sector using open government data. The platform houses South Korea's AI infrastructure and open government datasets for AI development and offers several services such as data quality evaluations. To complement these efforts, the government of Seoul is experimenting with creating synthetic data from open government data. One initiative developed using the AI Hub is TTCare (an AI-driven mobile application for pets), which was trained on data from the AI Hub along with other sources.

Region

apac

Sector

public_sector

Scenario

pre-training, data_augmentation

Start Date

2024

Location: South Korea

Bayaan Platform

Bayaan is a conversational tool developed by the Statistics Centre Abu Dhabi that aims to improve access to data from the Statistical Department. The tool uses generative AI to rapidly provide decision makers with data analytics, visualizations, and information that they can use in their decision-making processes. The data included focuses on 7 areas and indicators: "Economy, Population, Industry, Social Statistics, Labour Force, Agriculture, and Environment."

Region

emea

Sector

public_sector

Scenario

open-ended_exploration

Start Date

2024

Location: United Arab Emirates

Asclepius
Asclepius is a large language model for the medical domain trained on synthetic clinical notes generated through public biomedical information. The team chose to experiment with synthetic data given privacy concerns associated with using patient data in LLMs. The authors indicate that the LLM trained on synthetic data can have similar quality outputs to those trained on patient data.

Region

apac

Sector

academia

Scenario

data_augmentation

Start Date

2024

Location: South Korea

Berufsinfomat

Berufsinfomat is a generative AI-driven tool (relying on ChatGPT) introduced by the Austrian Public Employment Service for career coaching. The system, trained on the Austrian Public Employment Service's knowledge database on professions, training, and education, is intended to offer users information on those topics. The Berufsinfomat received 160,000 prompts in January 2024 and around 20,000 additional monthly inquiries. It received criticism for producing responses that conformed to stereotypes about men and women, for bias in responses, and for producing various problematic answers. It has received several revisions in response to these problems.

Region

emea

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Austria

CensusGPT
CensusGPT is a natural language interface for United States census data, developed as part of the textSQL project. It allows users to ask questions about census information in plain English, which are then converted to queries formatted for the programming language SQL to retrieve relevant data from a census database. This tool aims to allow more individuals to analyze population statistics, demographics, and other census-related information without needing technical SQL knowledge.

Region

north_america

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

ChatTCU

In February 2023, Brazil's Federal Court of Accounts (TCU) launched ChatTCU, which uses OpenAI's ChatGPT and data sourced from the Federal Court of Accounts system. It allows auditors to request a summary of a case document and pose technical questions about the TCU and court decisions, and it provides administrative services.

Region

latin_america_and_the_caribbean

Sector

public_sector

Scenario

open-ended_exploration, adaptation

Start Date

2024

Location: Brazil

Citymeetings.nyc

Citymeetings.nyc is an independent initiative that uses LLMs to synthesize information from New York City Council meetings. It uses data from Legistar, an online platform where the government posts meeting summaries and agendas.

Region

north_america

Sector

civic_tech

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

ClassifAI

The Data Science Campus of the United Kingdom's Office for National Statistics has developed ClassifAI, an experimental tool that uses large language models to organize text into categories (e.g. industry). It aims to improve upon existing classification methods by offering greater flexibility and potentially higher accuracy for tasks such as categorizing labor market survey responses. The code has been released as open-source. The developers note that further assessment is needed before potential use in official statistics production.

Region

international

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United Kingdom

Clay

Trained on satellite imagery and earth observation data, Clay is a generative AI foundation model designed to understand and analyze Earth's surface. It can generate mathematical representations of any location on Earth at any given time, which can be used for various tasks like creating land cover maps, detecting crop or burn scars, and tracking deforestation. The AI model is open source.

Region

international

Sector

non-profit

Scenario

open-ended_exploration

Start Date

2024

Location: United States

CroissantLLM
Developed by researchers at various European and American universities as well as private technology companies, CroissantLLM is a large language model that aims to support English-French language queries. The model is trained on both web-scraped data and open government data from France. This initiative aims to improve LLMs' capability to analyze non-English data.

Region

emea

Sector

academia, private_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: France

Data Commons
Data Commons is a platform developed by Google that aggregates and standardizes public datasets from various global sources. The initiative uses AI and large language models to provide a natural language interface, enabling users to query complex data without requiring technical expertise. Through partnerships with organizations like the UN, Indian Institute of Technology Madras, and Feeding America, Data Commons offers specialized data portals on topics such as Sustainable Development Goals, India-specific information, and U.S. food security, presenting information through visualizations and analysis tools.

Region

international

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

Data Provenance Explorer
The Data Provenance Explorer is an interactive tool developed as part of an audit of AI training datasets by MIT's Center for Constructive Communication. This tool allows researchers to explore information about datasets, including their origins, licenses, creators, and other metadata. This resource aims to enhance transparency around AI datasets and promote more informed use of datasets in AI research and development.

Region

north_america

Sector

academia

Scenario

adaptation

Start Date

2024

Location: United States

DataLaw.Bot

Developed by DS-I Africa (a research program in the United States funded by the National Institutes of Health) and the University of KwaZulu-Natal, DataLaw.Bot is a generative AI chatbot launched in October 2024 that researchers from several countries across the African continent can use to assess data sharing regulations for scientific research. The chatbot was adapted from ChatGPT using national-level data sharing regulations, with the goal of increasing access to research data across the continent.

Region

emea

Sector

public_sector, academia

Scenario

adaptation

Start Date

2024

Location: Botswana, Cameroon, Ghana, Kenya, Malawi, Nigeria, Rwanda, South Africa, Tanzania, The Gambia, Uganda, and Zimbabwe

DC Compass

The DC Compass AI assistant is a generative AI chat interface that provides answers to user queries based on datasets from Open Data DC. The interface can provide a summary of a dataset, supporting visualizations, graphs, and other maps. Currently, this project is a pilot program running a beta test open to the public. The team notes that the quality of the output is impacted by the quality of the data from Open Data DC as well as the breadth of data included.

Region

north_america

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

DRG-LLaMA
Researchers from the Mayo Clinic and University of Illinois Urbana-Champaign developed DRG-LLaMA, a tool designed for healthcare professionals involved in hospital billing and coding. DRG-LLaMA is a large language model fine-tuned on clinical notes to improve the assignment of Diagnosis-Related Groups (DRGs) in the United States inpatient payment system. The model analyzes patient discharge summaries to predict DRGs and their components, achieving higher accuracy than previous methods used for this task.

Region

north_america

Sector

public_sector, academia

Scenario

adaptation

Start Date

2024

Location: United States

F13 - InnoLab-bw
The Baden-Württemberg Innovation Laboratory within the Government of Germany created F13, a generative AI chatbot to support government administrative tasks. This chatbot helps personnel summarize documents and provides research support.

Region

emea

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Germany

GeneSilico Copilot

Developed by researchers at the Indraprastha Institute of Information Technology-Delhi, GeneSilico Copilot is a tool used to support oncologists. It draws on data from Drugbank Open Data, FDA drug labels, RxList, Therapeutic Target Database, Drugs.com, and Wikipedia to offer advice on treatment decisions based on observed facts about a given patient.

Region

apac

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2024

Location: India

GeoLLM-Engine

GeoLLM-Engine, developed by researchers at CoStrategist R&D Group and Microsoft Corporation, is an interface for interacting with geospatial data. The system includes a set of tools for analyzing maps and conducting spatial research. The development team is currently focused on improving the quality of outputs and refining the user interface. GeoLLM-Engine aims to serve professionals in fields that utilize geospatial analysis, such as urban planning and environmental monitoring.

Region

north_america

Sector

private_sector

Scenario

pre-training, open-ended_exploration

Start Date

2024

Location: United States

GoldCoin

GoldCoin is a large language model developed for the legal domain by researchers at the Department of Computer Science and Engineering, HKUST, in Hong Kong SAR, China. It specializes in detecting violations of HIPAA privacy rules based on specific queries. The model was trained using legal data from Harvard University's Caselaw Access Project, which offers public access to United States legal decisions. The research team suggests that GoldCoin could potentially be adapted to address other privacy laws in the future.

Region

apac

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2024

Location: China

GovTech DSAID

GovTech's Data Science and Artificial Intelligence Division (DSAID) has developed a system to assist in drafting parliamentary replies* using artificial intelligence. The project uses machine learning techniques to train language models on past parliamentary data, aiming to generate responses that match the style and accuracy of official replies. This tool is designed to help public servants in Singapore more efficiently prepare answers to parliamentary questions, while also exploring the broader potential of customized AI models for government applications. *Parliamentary replies are official answers given by government ministers or representatives to questions asked by members of parliament during legislative sessions.

Region

apac

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Singapore

I14Y Interoperability Platform

The I14Y Interoperability Platform is Switzerland's national data catalogue, designed to improve access to data between authorities, businesses, and citizens. It provides a centralized repository for data collections, application interfaces, and government services from different levels of government. The platform offers services such as a searchable catalogue, concept definitions, news updates, and a handbook to support users in navigating and using Switzerland's data infrastructure.

Region

emea

Sector

public_sector

Scenario

data_augmentation

Start Date

2024

Location: Switzerland

IN.gov

The Indiana Office of Technology and Tyler Technologies (a technology firm) launched a beta version of an AI chatbot that aims to support the public in navigating public services. The chatbot is trained on public information from several departments within the State government and housed on the Government of Indiana website. Before users open the chatbot, a clause states that the State will not be liable for any incorrect or misleading information from the chatbot.

Region

north_america

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

Intern-LM-Law
Researchers from Shanghai AI Laboratory, Nanjing University, Eastern Institute of Technology, Ningbo, and Saarland University, Saarland Informatics Campus developed a chatbot that can address queries and analyze documents within the legal domain in China. The team fine-tuned the LLM using legal data from the Chinese National Legal Database along with other data sources. The team has made both the model and data sources publicly available.

Region

apac

Sector

academia

Scenario

adaptation

Start Date

2024

Location: China

KemenkeuGPT

With the support of the Indonesia Endowment Fund for Education (LPDP) of the Ministry of Finance of the Republic of Indonesia, researchers at the University of Nottingham developed KemenkeuGPT, a generative AI chatbot that aims to support policy makers within Indonesia's Ministry of Finance. The chatbot uses RAG and combines data from the Ministry of Finance, Statistics Indonesia, and the International Monetary Fund, among other sources.

Region

apac

Sector

public_sector, academia

Scenario

adaptation

Start Date

2024

Location: Indonesia

LLaMandement
Developed by the Government of France, LLaMandement aims to support administrative agents in analyzing and drafting summaries of legal bills developed in the French Parliament for other ministries and departments. To fine-tune the pretrained model, the team used data from SIGNALE, a platform used in the French government’s lawmaking that includes data from several ministries, such as the Ministry of Ecological Transition and Territorial Cohesion, the Ministry of Culture, and others.

Region

emea

Sector

public_sector

Scenario

adaptation

Start Date

2024

Location: France

LLM on FHIR - A Project to Demystify Health Records
Researchers at Stanford University developed a mobile application that uses artificial intelligence to help patients better understand their health records and medical information. The mobile application, called LLM on FHIR, translates complex medical data into plain language and can answer patients' health questions based on their personal medical history. While the application showed promise in making health information more accessible, the study also revealed challenges such as occasional inconsistent responses, highlighting areas for future improvement.

Region

north_america

Sector

academia

Scenario

open-ended_exploration

Start Date

2024

Location: United States

LLM-Potus Score

Researchers at the University of Georgia and State University of New York at Albany used LLMs to analyze the transcripts of United States presidential debates. The team tested 7 debates from the last 24 years using GPT-4o and Claude 3. The team aimed to demonstrate how LLMs can be used to help minimize bias in judging.

Region

north_america

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

LLMoin
LLMoin is a chatbot developed by the City of Hamburg that aims to provide administrative support to government personnel. The tool is based on the Luminous language model from Aleph Alpha, which was developed in Germany. LLMoin is currently a pilot program undergoing testing.

Region

emea

Sector

public_sector

Scenario

adaptation

Start Date

2024

Location: Germany

LuminLab

LuminLab is an online platform that employs generative AI to offer information on improving building energy efficiency. The model is trained using open data from the Energy Performance Certificate dataset provided by the Sustainable Energy Authority of Ireland. The developers are currently working on enhancements, including the integration of geospatial data to generate 3D images of various areas, aiming to expand the platform's capabilities and visual representations.

Region

emea

Sector

academia

Scenario

adaptation

Start Date

2024

Location: Ireland

Microsoft AI for Good - Damage Assessment Visualizer for Hurricane Beryl in Grenada

Microsoft and Planet collaborated with humanitarian organizations to analyze the impact of Hurricane Beryl in Grenada. This experimental tool uses the Microsoft AI for Good Damage Assessment Visualizer to analyze satellite images from Planet, estimating damage to buildings and structures on the island of Carriacou. The tool provides visual data to support frontline workers in disaster response and logistics.

Region

international

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Grenada

NextGen NHTS Origin-Destination Data
Researchers at the National Transportation Research Center are using machine learning and AI techniques to analyze truck transportation patterns across the United States. The team is combining truck trip data with population and employment statistics using advanced algorithms to model and predict truck flows between regions. This effort is helping to uncover factors influencing truck transportation, such as the nonlinear relationship between distance and truck trips, providing valuable insights for transportation planning and investment decisions.

Region

north_america

Sector

public_sector

Scenario

data_augmentation

Start Date

2024

Location: United States

Quantitative Reasoning with Data Benchmark

Researchers at the Wangxuan Institute of Computer Technology at Peking University and the Computer Science Department of the University of California, Los Angeles developed the Quantitative Reasoning with Data (QRData) Benchmark to assess LLMs' ability to analyze statistical data. QRData includes data from open textbooks, research papers, and other sources and is combined with 411 questions. Of the LLMs tested, GPT-4 performed the best, but the researchers noted the need for improvement.

Region

international

Sector

academia

Scenario

pre-training

Start Date

2024

Location: United States, China

RAG for Culturally Inclusive Hakka Chatbots

Researchers in Taiwan experimented with using RAG to improve LLMs' ability to answer queries about the Taiwanese Hakka culture. The team combined data from the Ministry of Education's Cultural Knowledge Base and Hakka Dictionary along with other data sources focused on language and geographic locations. Through this effort, the team aimed to demonstrate the value of integrating a translation function in LLMs to support generative AI technologies that reflect minority cultures.

Region

apac

Sector

private_sector, academia

Scenario

adaptation

Start Date

2024

Location: Taiwan

SaulLM-7B
Developed by researchers in Portugal and France, SaulLM-7B is a large language model that summarizes legal documents. The model is pretrained on legal texts from the United States, Europe, and Australia.

Region

emea

Sector

academia, private_sector

Scenario

pre-training

Start Date

2024

Location: France

Sidekick
Sidekick is an AI chatbot developed by mySidewalk that answers queries about public issues. The chatbot's responses are drawn from several official data sources including data from the U.S. Census Bureau, United States Department of Agriculture, and Bureau of Labor Statistics. Among its goals, it seeks to improve access to data for non-technical audiences.

Region

north_america

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

StatBot.Swiss
StatBot.Swiss is a benchmark dataset developed by Swiss researchers that can be used to test generative AI models' ability to answer queries in English and German. The dataset includes data from the OpenData.Swiss government portal. Moving forward, the team is looking into expanding it to include other languages such as French and Italian.

Region

emea

Sector

public_sector, academia

Scenario

pre-training

Start Date

2024

Location: Switzerland

Synthetic Australian Healthcare Data Using Synthea
In January 2024, researchers from the Australian e-Health Research Centre (CSIRO) and Macquarie University launched a study on using synthetic data to enhance access to healthcare information. They adapted the Synthea tool, which typically uses US census data, to incorporate Australian demographic and hospital data, creating around 117,000 synthetic health records specific to Queensland. The team used these records to analyze disease patterns, noting that while the synthetic data provides valuable access, further real-world testing is needed to ensure it accurately represents the local context.

Region

apac

Sector

public_sector, academia

Scenario

data_augmentation

Start Date

2024

Location: Australia

Synthetic Data for Official Statistics
This guide assists National Statistical Offices (NSOs) in managing data access using synthetic data while maintaining confidentiality. It is suitable for statisticians and data managers in government agencies interested in implementing synthetic data. The guide covers the creation of synthetic data, addresses privacy risks, and provides practical tips for application, including a case where the Data Science Campus of the United Kingdom's Office for National Statistics created a synthetic dataset using the U.S. Census Bureau’s income data to test the 2021 Census model.

Region

north_america

Sector

public_sector

Scenario

data_augmentation

Start Date

2024

Location: United States

The Virtual Intelligent Chat Assistant's Department of Statistics Proof of Concept

The Virtual Intelligent Chat Assistant (VICA) is an online platform by Singapore's Government Technology Agency (GovTech) that public servants from across the government can use to create their own generative AI chatbots. In a blog post published in Towards Data Science, representatives from GovTech discuss a proof of concept they developed using VICA for the Department of Statistics' data. The team created a chatbot that could respond to queries about national statistics (such as GDP) in a table format.

Region

apac

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Singapore

TitiBot
Developed by BORDE (a Mexican non-profit), TitiBot is a Spanish-language WhatsApp chatbot that helps improve access to voting records on legislative reforms. It uses data from Mexico's Congress of the Union (e.g. parliamentary voting records) from 2018 to 2024 and can provide summaries of the data.

Region

latin_america_and_the_caribbean

Sector

non-profit

Scenario

inference_and_insight_generation

Start Date

2024

Location: Mexico

Dolma

Dolma is an open dataset created by the Allen Institute for AI, made up of academic research along with other data sources such as books, website content, and code. The dataset currently hosts 3 trillion tokens and is accompanied by a toolkit on how to source datasets for training purposes.

Region

international

Sector

civic_tech, non-profit

Scenario

pre-training

Start Date

2023

Location: Global

ChatDoctor
ChatDoctor is a generative AI chatbot that can answer queries in the medical domain. The model was trained on patient conversations from an online medical platform. It also uses data from Medline Plus (a government health information website for medical practitioners) in addition to other data sources.

Region

north_america

Sector

academia

Scenario

adaptation

Start Date

2023

Location: United States

City of Helsinki's AI Register

The City of Helsinki has adapted general purpose LLMs to improve its civic services, including urban planning and public facilities. These generative AI tools are fine-tuned using open city data, such as zoning regulations and planning documents, to facilitate civic engagement. These tools aim to enable more efficient communication with residents while enhancing the accessibility of complex information.

Region

emea

Sector

public_sector

Scenario

adaptation

Start Date

2023

Location: Finland

Climate Q&A

ClimateQ&A is a generative AI chatbot developed from the ChatGPT API to provide responses to queries about climate change. The chatbot was created by Ekimetrics (a data and AI firm based in France) and uses data from reports from the Intergovernmental Panel on Climate Change (IPCC) and the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES). While its primary objective is to make scientific information on climate change more accessible, it also helps the team understand the types of questions people have about climate change. The team uses NLP to analyze these questions and identify where there are knowledge gaps.

Region

international

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2023

Location: France, Global

covLLM
Researchers at Stanford University developed covLLM, a generative AI tool to support doctors in understanding the most up-to-date COVID-19 research. The model was trained on the COVID-19 Open Research Dataset (CORD-19) and can provide summaries of research based on specific queries. Its objective is to address healthcare professionals' need to stay updated on fast-evolving topics.

Region

north_america

Sector

academia

Scenario

adaptation

Start Date

2023

Location: United States

Democratic Fine-Tuning with a Moral Graph
Democratic Fine-Tuning with a Moral Graph (DFTmg) is a new method for aligning AI language models with human values through large-scale public discussions. The project used a survey of 500 Americans' political views, which the team anonymized and made public as open research data on GitHub. This process aims to develop AI models that make better decisions by incorporating public input into the training process. This work was supported by OpenAI.

Region

north_america

Sector

non-profit, private_sector

Scenario

open-ended_exploration

Start Date

2023

Location: Global

ESGReveal
ESGReveal uses Retrieval-Augmented Generation (RAG) to adapt Environmental, Social, and Governance (ESG) data from corporate reports to help users find information from these reports when searching a database or the internet. The generative AI model was trained on ESG reports from 166 companies on the Hong Kong Stock Exchange.

Region

apac

Sector

private_sector, academia

Scenario

adaptation

Start Date

2023

Location: Hong Kong

Farmer.chat

Representatives from Digital Green India (an NGO) and Microsoft Research (India) developed a generative AI chatbot for agricultural services. The chatbot provides farmers with text, audio, and video responses to queries about agriculture. The chatbot uses RAG and draws on research papers and other data sources. It has been implemented in Kenya, India, Ethiopia, and Nigeria thus far.

Region

international

Sector

private_sector

Scenario

adaptation

Start Date

2023

Location: Kenya, India, Ethiopia, and Nigeria

Generating a Fully Synthetic Human Services Dataset
This report, produced by researchers at the Urban Institute in collaboration with Allegheny County partners, describes the process of creating a synthetic version of the county's 2021 human services dataset. The synthetic data aims to replicate statistical properties of the confidential data while protecting individual privacy, enabling wider access to detailed human services information. The document covers the data synthesis methodology, evaluation of data quality and privacy risks, and the challenges of balancing utility and confidentiality in synthetic administrative data.

Region

north_america

Sector

public_sector

Scenario

data_augmentation

Start Date

2023

Location: United States

GenSpectrum Chat

GenSpectrum is a generative AI chatbot for COVID-19 genomic sequencing data from the GISAID Data Science Initiative (an initiative focused on generating access to data related to pathogens through partnerships). The chatbot was developed by researchers at the Department of Biosystems Science and Engineering, ETH Zürich and the Swiss Institute of Bioinformatics. The team aims to support research in the medical domain. The chatbot is not yet available online.

Region

international

Sector

academia, non-profit

Scenario

inference_and_insight_generation

Start Date

2023

Location: Switzerland

Jugalbandi AI for Multilingual Access to Government Services

Jugalbandi is a generative AI-powered language translation tool that improves access to government programs and rights information across India. It leverages open government data related to various welfare schemes and services, using generative AI models to provide accurate translations in multiple local languages. The AI facilitates communication between citizens and the government, helping individuals understand and access services regardless of language barriers. This initiative democratizes access to official data and government resources, promoting inclusion in public services.

Region

apac

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2023

Location: India

Llemma
Llemma is a generative AI model fine-tuned for the mathematics domain. It was fine-tuned using the Proof-Pile-2 dataset, which combines scientific papers with other mathematics datasets. The researchers have provided public access to the models, dataset, and code to encourage future research on AI and mathematics.

Region

north_america

Sector

academia, non-profit

Scenario

adaptation

Start Date

2023

Location: United States

Med-PaLM2
Med-PaLM2 is a generative AI chatbot by Google Research which seeks to provide long-form written answers to medical questions. Med-PaLM2 is fine-tuned using “publicly available question-answering data and physician writing responses” including MedQA and MedMCQA among other datasets. Med-PaLM2 achieved 86.5% accuracy on United States Medical Licensing Examination questions.

Region

north_america

Sector

private_sector

Scenario

adaptation

Start Date

2023

Location: United States

MILDSum
Developed by researchers and legal practitioners from the Indian Institute of Technology Kharagpur, MILDSum is a research initiative that aims to bring together open data from the legal domain (i.e. case judgements) to create Hindi summaries of case judgements that can be used for training purposes.

Region

apac

Sector

academia

Scenario

pre-training

Start Date

2023

Location: India

NEPAccess
NEPAccess, developed by the University of Arizona, employs AI and data science to improve the National Environmental Policy Act (NEPA) environmental review process. The project uses generative AI to compile insights from previous projects and assist in drafting environmental impact assessments (EIAs) on specific topics. By integrating open data from federal agencies, NEPAccess provides public access to a centralized database of environmental reviews. The project was funded by the National Science Foundation (NSF) from 2021 to 2024 and is now seeking new funding to build new features into its platform.

Region

north_america

Sector

academia

Scenario

open-ended_exploration

Start Date

2023

Location: United States

OpenAssistant Conversations
Researchers have released a free, public collection of conversations called OpenAssistant Conversations to help improve AI language models. This dataset, created by over 13,500 volunteers worldwide, includes conversations in 35 languages along with quality ratings. By making this resource freely available, the researchers aim to democratize the development of more user-friendly and capable AI assistants across various fields.

Region

emea

Sector

academia

Scenario

pre-training, adaptation

Start Date

2023

Location: Germany

Parla
Parla is an AI interface in development at CityLab Berlin. It aims to enhance access to public administration data across the city for both government officials and the general public. Functioning as both a retrieval system and an analytical tool, Parla draws on over 10,000 public documents across city departments, systems, and formats to answer specific queries. However, due to challenges like poorly structured data and insufficient metadata, Parla sometimes generates inaccurate outputs. To address this, Parla includes source references in its responses, improving transparency and accountability.

Region

emea

Sector

civic_tech

Scenario

open-ended_exploration

Start Date

2023

Location: Germany

Phi-2
Phi-2 is an open-source small language model with 2.7 billion parameters that demonstrates outstanding reasoning and language understanding capabilities. Due to its small size, researchers use it to study AI model interactions, enhance safety features, and customize it for specific applications. The training data contains a mix of curated web data and synthetic data made to focus on common sense reasoning and general knowledge.

Region

north_america

Sector

private_sector

Scenario

data_augmentation

Start Date

2023

Location: United States

SELENA+
SELENA+, developed by Synapxe (a department within the Government of Singapore focused on healthtech), the National University of Singapore, and the Singapore National Eye Center, uses generative AI to detect three eye conditions: diabetic eye disease, glaucoma, and age-related macular degeneration. The tool analyzes imagery from the National Eye Center. The team plans to expand this tool to cardiovascular diseases in the future.

Region

apac

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2023

Location: Singapore

StatGPT

To improve the accessibility and usability of its open data platform, the International Monetary Fund (IMF) is prototyping a generative AI tool called StatGPT. StatGPT will act as a user interface that processes natural language requests to find relevant datasets in the IMF’s repository. It will help users find indicators, visualize data in tables and charts, and generate Python code for analysis. The team is currently developing interface features and will then seek to integrate the tool into Excel.

Region

international

Sector

multilateral_sector

Scenario

inference_and_insight_generation

Start Date

2023

Location: Europe and North America

Statistics Canada
Statistics Canada conducted a pilot program on generating synthetic data for training purposes. The team created synthetic datasets from census data that includes sensitive information. These datasets were used in two hackathons, with the condition that they could not be publicly shared. Organizers highlighted that the synthetic datasets preserved the usefulness of the original data for analysis while minimizing the risk of revealing sensitive information. Hackathon participants successfully used these datasets for training purposes.

Region

north_america

Sector

public_sector

Scenario

data_augmentation

Start Date

2023

Location: Canada

Talk to the City
Talk to the City is an open-source tool that uses advanced AI to analyze and summarize qualitative data, particularly human opinions. It aims to improve collective decision-making and enhance public discourse around policy making by clustering similar arguments and creating summaries and visualizations. Talk to the City has been used in citizens' assemblies in Taiwan as of 2023.

Region

international

Sector

non-profit

Scenario

open-ended_exploration

Start Date

2023

Location: United States

TaxGPT
TaxGPT is an independently developed generative AI chatbot that answers tax-related queries based on information from the Canada Revenue Agency website. Its goal is to make tax information at the population level more understandable. It was updated in 2024 and is currently operational.

Region

north_america

Sector

civic_tech

Scenario

inference_and_insight_generation

Start Date

2023

Location: Canada

Tendios
Tendios is a Software-as-a-Service company from Spain that developed a chatbot to support public tender analysis and bidding. The chatbot is trained on government tender documents and aims to improve public procurement processes.

Region

emea

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2023

Location: Spain

The Harmonized Landsat and Sentinel-2 (HLS) Project
The Harmonized Landsat and Sentinel-2 (HLS) project by NASA aims to create a record of Earth's surface using images from multiple satellites. The HLS dataset combines data from four NASA satellites as well as US Geological Survey (USGS) sensors around the globe. The dataset was used to train NASA and IBM’s watsonx.ai geospatial foundation model, which can be used to develop AI systems that provide maps and analytics about natural disasters and environmental changes. The latest dataset includes information from across the globe (except Antarctica). This work was a collaboration between NASA, the USGS, and several NASA research centers.

Region

north_america

Sector

public_sector

Scenario

pre-training

Start Date

2023

Location: United States

Wobby

Wobby is a generative AI-powered interface that can answer queries related to specific open datasets and respond with summaries and visualizations of those datasets. The platform is focused primarily on democratizing access to open government data, and currently hosts datasets from organizations like Statbel (Belgium’s national statistical office), Statistics Netherlands, and Eurostat, as well as data from intergovernmental organizations like the World Bank. Wobby's latest update allows for automatic data updates and real-time analysis based on current information.

Region

emea

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2023

Location: Belgium

AgricultureBERT
AgricultureBERT is a generative AI model for the agriculture domain that was developed with data from the United States National Agricultural Library. This model is used to answer questions related to agricultural knowledge such as crop growing best practices or fertilization techniques in different climates. The intention is to improve access to agricultural information and advance research in the field.

Region

north_america

Sector

civic_tech

Scenario

adaptation

Start Date

2022

Location: United States

BioGPT
BioGPT is a generative AI model that can answer queries about biomedicine. BioGPT was trained using biomedical literature from PubMed. This tool was developed by representatives of Microsoft Research and Peking University.

Region

north_america

Sector

private_sector, academia

Scenario

inference_and_insight_generation

Start Date

2022

Location: United States

CROZ RenEUwable
This project is an AI-driven application that gives users sustainability recommendations about their energy consumption in response to a specific set of queries. It was trained on open climate and energy datasets. This project won the EU Datathon 2022 in the European Green Deal category and is currently in development.

Region

emea

Sector

civic_tech

Scenario

inference_and_insight_generation

Start Date

2022

Location: Europe and North America

European Cancer Imaging Initiative
The European Cancer Imaging Initiative (part of Europe's Beating Cancer Plan) will bring together cancer-related resources and databases into a single platform for health practitioners and researchers to use. The initiative aims to improve access to information and advance cancer- and AI-related research.

Region

emea

Sector

public_sector

Scenario

open-ended_exploration

Start Date

2022

Location: Europe and North America

PubMedBERT (Biomed-NLP or BiomedBERT)
Microsoft researchers created PubMedBERT, a generative AI model pretrained on biomedical text from PubMed and research from PubMed Central. This model is used to help answer questions related to biomedical tasks. Training the LLM on medical literature (as opposed to adapting the model) helped improve the quality of the output.

Region

north_america

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2021

Location: United States

UrbanSim

UrbanSim is an open-source platform that uses generative AI to model urban growth and simulate land use, transportation, and demographic shifts. The platform integrates various datasets, including open-source data on land use, population demographics, and transportation infrastructure, to generate development scenarios that help city planners and researchers make informed decisions about urban growth. UrbanSim aids in visualizing the impacts of policy changes, transportation development, and housing strategies, offering a dynamic tool for sustainable urban planning. The project emphasizes the use of open research data from official sources to simulate realistic and adaptive urban environments.

Region

international

Sector

private_sector

Scenario

adaptation

Start Date

2021

Location: Global

ChemBERTa
ChemBERTa is designed to analyze molecules, similar to how language models read and understand text. Its goal is to help practitioners within the drug discovery and materials science domains. The authors utilized a curated dataset of chemical molecules from PubChem, maintained by the National Institutes of Health.

Region

north_america

Sector

academia

Scenario

adaptation

Start Date

2020

Location: United States

Extopia
The EXTOPIA Project, funded by the Luxembourg Ministry of Digitalisation, uses AI to analyze aerial images. EXTOPIA uses machine learning algorithms to detect changes in geographic databases (e.g. new buildings) and document them in the output.

Region

emea

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2020

Location: Europe and North America

Sam Petrino
Sam Petrino is a Spanish-language, generative AI-enabled chatbot on WhatsApp and other web platforms for citizen engagement in San Pedro Garza García (Mexico). It uses government data to answer frequently asked questions and provides a tool to make reports. During the COVID-19 pandemic, it also facilitated vaccine registrations.

Region

latin_america_and_the_caribbean

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2020

Location: Mexico

BioBERT
Researchers developed BioBERT, a generative AI model adapted to answer queries about the biomedical domain. The model was trained on biomedical literature from PubMed along with other data sources. The model aims to support research and improve access to information in biomedicine.

Region

north_america

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2019

Location: United States

Boti
Buenos Aires City's chatbot, Boti, uses generative AI to provide residents and visitors with municipal information and services related to Buenos Aires. Introduced in 2019, it was the first municipal bot on WhatsApp globally. Boti offers an array of services, from reporting civic issues to scheduling appointments and accessing cultural insights, and uses open government data to train the model. It supports multilingual interactions and facilitates mobility by offering information on parking, EcoBici stations, and subway statuses.

Region

latin_america_and_the_caribbean

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2019

Location: Argentina

FinBERT
FinBERT is a generative AI model built to analyze financial documents. The model was developed with financial texts from Reuters and the open-source "Financial Phrase Bank" dataset (from open research), which together allow the AI to dissect the meaning of different types of financial language.

Region

emea

Sector

academia

Scenario

adaptation

Start Date

2019

Location: United States

Gretel AI
Gretel is a synthetic data platform that helps developers generate artificial datasets with the same characteristics as real data, improving AI models while preserving privacy. The platform offers tools for training generative AI models, validating data quality and privacy, and generating synthetic data. Previous clients include the Government of South Australia and the United States Department of Justice.

Region

international

Sector

private_sector

Scenario

data_augmentation

Start Date

2019

Location: United States

Virtual Singapore

Virtual Singapore is a dynamic 3D digital twin model that leverages generative AI to simulate and analyze urban development scenarios. The platform integrates various open data sources, including satellite imagery, sensor data, and social media inputs, to create a real-time representation of the city. Using generative AI, the system generates scenarios for urban planning, infrastructure development, and emergency response planning. Virtual Singapore helps city planners visualize the impact of policy decisions, environmental changes, and demographic trends. The platform is built on open research data and open data from various governmental and institutional sources, supporting data-driven decision-making for sustainable urban growth.

Region

apac

Sector

public_sector

Scenario

open-ended_exploration

Start Date

2019

Location: Singapore

Mostly AI
MOSTLY AI has developed a platform that produces synthetic data for data scientists, analysts, and developers. The system uses AI models to generate artificial datasets, enabling users to create and manage data for purposes including training, test data creation, and analytics. The platform also features a generative AI chatbot that allows users to analyze synthetic data using search queries.

Region

north_america

Sector

private_sector

Scenario

data_augmentation

Start Date

2017

Location: United States

ELMo
ELMo, or Embeddings from Language Models, is an open-source model created by a team of AI researchers at the University of Washington and the Allen Institute for Artificial Intelligence. ELMo supports Natural Language Processing (NLP) systems by converting words into numbers, which are then used to train machine learning models. The original ELMo model was trained on the 1 Billion Word Benchmark, a publicly available training dataset of nearly 1 billion words for statistical language models developed by researchers at Google, the University of Edinburgh, and Cantab Research Lab.

Region

north_america

Sector

non-profit, academia

Scenario

pre-training

Start Date

2014

Location: United States