Observatory of Examples of How Open Data and Generative AI Intersect

A growing observatory of examples of how open data from official sources and generative artificial intelligence (AI) are intersecting across domains and geographies.

Share your project for inclusion. We seek to learn from generative AI initiatives that use open government and research data across a Spectrum of Scenarios. More information on each scenario can be found in our report: A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI.

Chemma

Chemma is a fine-tuned chemistry LLM designed to assist with organic chemistry synthesis. It was created by researchers at Shanghai Jiao Tong University. The model is based on LLaMA-2-7b and is trained on publicly available chemical reaction data from the Open Reaction Database and the USPTO-50k dataset, which provides 40,000 chemical reaction samples for training, 5,000 samples for model validation, and 5,000 samples for testing. Chemists can use Chemma to help with tasks like yield prediction and reaction optimization. According to the research team's report, Chemma outperforms all of the baseline models it was compared against. Chemma is free to use and accessible online.

Region

apac

Sector

academia

Scenario

adaptation

Start Date

2025

Location: China
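
The 40,000 / 5,000 / 5,000 split mentioned in the Chemma entry can be sketched as follows. This is only a minimal illustration: the `split_reactions` helper, the fixed seed, and the placeholder records are assumptions, not the research team's actual data pipeline.

```python
import random

def split_reactions(records, n_train=40_000, n_val=5_000, n_test=5_000, seed=0):
    """Shuffle records and cut them into train/validation/test subsets."""
    if len(records) < n_train + n_val + n_test:
        raise ValueError("not enough records for the requested split")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Placeholder IDs standing in for 50,000 reaction records (hypothetical data).
reactions = [f"reaction_{i}" for i in range(50_000)]
train, val, test = split_reactions(reactions)
print(len(train), len(val), len(test))
```

Keeping the validation and test subsets disjoint from the training subset is what makes the reported comparison against baseline models meaningful.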

m-KAILIN

m-KAILIN is a framework for constructing high-quality textual training data for biomedical LLMs. The m-KAILIN process uses several fine-tuned LLMs to generate and structure training content based on textual research data from PubMed. This is a method of distilling the training data for improved quality, which in turn can boost the performance of biomedical LLMs (e.g. LLaMA-3-70B).

Region

apac

Sector

academia

Scenario

pre-training

Start Date

2025

Location: China

AgroLLM

AgroLLM is a conversational tool developed by researchers from Pittsburg State University, Oxford Brookes University, and other institutions. It is designed to help farmers make better decisions by providing useful information on farming practices. AgroLLM uses generative AI to answer farmers' questions, offering advice on topics like crop management, climate impact, and pest control. The tool works by searching through a collection of agricultural resources, including agricultural textbooks, research articles, and other open agricultural datasets. AgroLLM then generates responses based on these resources to help farmers improve their practices.

Region

international

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2025

Location: United States, United Kingdom

ALIA-40b

ALIA-40b is a generative AI model developed by the Barcelona Supercomputing Center (BSC). It was trained on official open data sources, including the Norwegian Colossal Corpus, the Estonian National Corpus, the Danish Parliament Corpus, and additional multilingual open datasets. As a generative AI model, it uses large language model (LLM) technology to perform tasks such as content generation, text summarization, conversational interactions, and translation in multiple languages.

Region

emea

Sector

academia

Scenario

pre-training

Start Date

2025

Location: Spain

Case Operations Resource Assistant (CORA)

Washington, DC's Child and Family Services Agency (CFSA) is releasing an AI chatbot called CORA (Case Operations Resource Assistant), built into its new case management system, STAAND II. CORA can answer staff questions about procedures, policies, and system functionality. The model is based on ChatGPT 4.0 Turbo. According to the CFSA AI Values Alignment Report, CORA will be trained on an open dataset consisting of information from the CFSA website and the CFSA policy index. No confidential case files are included in the training set.

Region

north_america

Sector

public_sector

Scenario

adaptation

Start Date

2025

Location: United States

Climate Policy Scenario Generation for Sub-Saharan Africa

Researchers from George Washington University and the Hamoye Foundation created an AI retrieval-augmented generation (RAG) tool to simulate climate policy scenarios in Sub-Saharan Africa. The team trained a LLaMa3.2-3B model on the United Nations Climate Change Conference (COP) documents. Future work may include refining the model to address contextual biases and adding expanded datasets such as national climate reports. For now, it serves as a framework that could eventually be used by policymakers to explore future climate pathways.

Region

emea

Sector

public_sector, academia

Scenario

adaptation

Start Date

2025

Location: United States

Extract

Extract, a model built with Google's Gemini, helps United Kingdom government councils process historical public planning documents that hold important information, like site boundaries, policy zones, and technical drawings. Historically, these public planning documents can include blurry maps and handwritten annotations, which can take up to 2 hours for a planning professional to convert and digitize manually. Extract can generate the relevant digital data in 40 seconds. This tool is able to accelerate planning processes and make open planning records more usable and accessible.

Region

emea

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2025

Location: United Kingdom

Inlook.AI

Inlook.AI is a conversational tool designed to help users access and visualize statistical data. It allows users to query statistical datasets using natural language. Inlook.AI uses generative AI to retrieve and generate responses from a wide range of official statistical datasets and supports multilingual user queries. It uses open government data from official sources, including datasets from the Swiss Federal Statistical Office (FSO) and OFS-City Statistics. The tool is intended for both statistical offices and private companies to support the accessibility and analysis of statistical data.

Region

emea

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2025

Location: Switzerland

Instruction Tuning for Low-Resource Languages: A Case Study in Kazakh

This project, developed by researchers from Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) and Cerebras Systems, focuses on creating a large-scale instruction-following dataset for the Kazakh language. It uses open data from government and cultural sources, including Kazakhstan's e-Government portal (gov.kz) and cultural data from Kazakh Wikidata. The dataset covers key aspects of Kazakhstan's governmental structure, legal frameworks, and cultural heritage. The project uses generative AI, specifically GPT-4o, to help create instructional data from government and cultural texts. This data is used to help language models better understand and follow instructions in Kazakh, with the goal of improving their grasp of local governance and culture in Kazakhstan.

Region

apac

Sector

private_sector, academia

Scenario

pre-training

Start Date

2025

Location: United States, United Arab Emirates

LawPal

LawPal is a generative AI chatbot developed by researchers from Vidyavardhini’s College of Engineering and Technology in India. It is designed to make legal information more accessible by answering users' legal questions and providing insights on various topics such as case law, statutory provisions, and legal principles. LawPal uses generative AI to generate contextually relevant responses based on official legal texts. The tool is designed to assist users in understanding legal information by providing responses based on legal data such as government legal databases, Supreme Court judgments, statutes, and academic legal literature.

Region

apac

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2025

Location: India

Pubbie

Pubbie is an LLM developed by the National Research Council of Canada (NRC) to support the categorization and access of thousands of open NRC publications. Pubbie is a proof-of-concept pilot for a specific use case: matching NRC publications with a relevant NRC challenge program, a labor-intensive task typically done manually. Pubbie is a fine-tuned RoBERTa model trained on existing sorted publications from the National Research Council and is used to analyze and sort the new NRC publications yearly. Pubbie can also answer user-entered open-ended prompts about NRC's publications.

Region

north_america

Sector

public_sector, academia

Scenario

adaptation, inference_and_insight_generation

Start Date

2025

Location: Canada

SAR Image Synthesis Using Foundation Models for Multi-Scale Adaptation

Researchers at the University of Paris-Saclay developed a method of using AI to augment Synthetic Aperture Radar (SAR) satellite imagery. The team's pipeline generates airborne SAR representations (a more robust form of image data) from 15 years of archival satellite imagery from the Office National d'Études et de Recherches Aérospatiales (ONERA, the French Aerospace Lab). SAR data in the airborne configuration is scarce and often expensive, so this project focuses on leveraging the more accessible lower-resolution image data to synthetically generate the desired format. The existing data is processed and enhanced by ControlNet and Stable Diffusion XL, two AI image generation models that the team adapted for this task. This proof-of-concept pipeline is one way to expand the selection of accessible airborne SAR data for remote sensing applications.

Region

emea

Sector

academia

Scenario

data_augmentation

Start Date

2025

Location: France

Virtual Support Agent for e-Albania Portal (ARIADNE)

The National Agency for Information Society (AKSHI) of Albania commissioned a project to create a RAG chatbot agent to assist citizens with the e-Albania portal, a centralized platform for government services in Albania. AKSHI partnered with the software and technology company UBITECH to create this tool. The e-Albania agent is developed from UBITECH's chatbot, ARIADNE, and it provides citizens with context-specific support and up-to-date government information from the content of the e-Albania site. The newest version was released in January 2025.

Region

emea

Sector

public_sector, private_sector

Scenario

adaptation

Start Date

2025

Location: Albania

WeatherLab

Google's WeatherLab is based on an AI-driven weather model that aims to improve hurricane prediction. The WeatherLab is an interactive website that generates animations of predicted hurricane tracks. The model is trained on historical forecast information and weather data from sources like the Copernicus Climate Change Service Information and the International Best Track Archive for Climate Stewardship (IBTrACS), both of which provide openly available datasets. In contrast to typical physics-based models, WeatherLab uses a neural network model to predict and simulate tropical cyclone scenarios up to 15 days ahead. Google is also releasing an archive of historical cyclone track data for evaluation and research.

Region

international

Sector

public_sector, private_sector, academia

Scenario

adaptation

Start Date

2025

Location: Global

DataGemma

DataGemma is an initiative by Google and Data Commons which seeks to improve the quality of the AI output using statistical data. The team augments the Gemma model using RIG (Retrieval-Interleaved Generation) and RAG (Retrieval-Augmented Generation) using data from its Data Commons initiative and makes the model open access. Through these processes, the team aims to create LLMs for researchers and developers to use.

Region

international

Sector

private_sector

Scenario

pre-training

Start Date

2024

Location: Global
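
As a rough illustration of the retrieval-interleaved idea named in the DataGemma entry, the sketch below replaces inline query markers in a draft answer with values looked up from a trusted store. The `[[STAT: ...]]` marker syntax, the `STATS` dictionary, and its placeholder values are all hypothetical; a real system would query Data Commons rather than a local dictionary, and this is not presented as DataGemma's actual implementation.

```python
import re

# Hypothetical stand-in for a statistical store; values are placeholders, not real data.
STATS = {
    "population of exampleland": "1,234,567",
    "median age in exampleland": "38.2",
}

# Matches markers such as [[STAT: population of Exampleland]].
MARKER = re.compile(r"\[\[STAT:\s*(.+?)\]\]")

def interleave(draft: str) -> str:
    """Replace each [[STAT: question]] marker with the retrieved value."""
    def lookup(match: re.Match) -> str:
        question = match.group(1).strip().lower()
        # Leave a visible note when nothing is found rather than inventing a number.
        return STATS.get(question, "[no data found]")
    return MARKER.sub(lookup, draft)

draft = ("Exampleland has [[STAT: population of Exampleland]] residents "
         "and a median age of [[STAT: median age in Exampleland]].")
answer = interleave(draft)
print(answer)
```

The point of the design is that figures in the final text come from the data store, not from the model's own recall.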

Common Corpus

Common Corpus is one of the largest public-domain datasets for LLM training, coordinated by Pleias (a technology company) in collaboration with Hugging Face, Occiglot, Eleuther, and Nomic AI. The dataset includes public domain books and newspapers in several languages from national libraries and archives along with other sources. It also includes language data in English, French, Dutch, Spanish, German and Italian.

Region

international

Sector

private_sector, civic_tech

Scenario

pre-training

Start Date

2024

Location: Global

Alva

Alva is a generative AI chatbot based on GPT-4o mini that uses RAG to answer queries about Basel-Stadt. A key feature of the chatbot is its ability to provide attributed responses that cite the webpage or information source each response came from. Currently, the chatbot can draw from publicly available information on the Basel-Stadt website (www.bs.ch).

Region

emea

Sector

public_sector

Scenario

adaptation

Start Date

2024

Location: Switzerland
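
The attributed-response pattern described in the Alva entry can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `PAGES` snippets and `example.org` URLs are placeholders, retrieval is simple word overlap, and the retrieved passage is quoted directly where a real system would pass it to an LLM.

```python
# Hypothetical page snippets; the URLs and text are illustrative placeholders.
PAGES = [
    {"url": "https://example.org/waste",
     "text": "Household waste is collected weekly in each district."},
    {"url": "https://example.org/parking",
     "text": "Resident parking permits can be ordered online."},
]

def tokenize(text):
    """Lowercase the text and split it into a set of words without punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}

def retrieve(question, pages):
    """Return the page sharing the most words with the question, or None."""
    q = tokenize(question)
    best = max(pages, key=lambda p: len(q & tokenize(p["text"])))
    return best if q & tokenize(best["text"]) else None

def answer_with_source(question, pages=PAGES):
    """Build a reply that quotes the retrieved passage and cites its URL."""
    page = retrieve(question, pages)
    if page is None:
        return "No matching page found."
    # A real system would hand the passage to an LLM; here it is quoted as-is.
    return f'{page["text"]} (Source: {page["url"]})'

reply = answer_with_source("How do I get a parking permit?")
print(reply)
```

Returning the source URL alongside the text lets a reader verify the answer against the original page.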

South Korea's AI Hub

The AI Hub is a platform developed by the government of South Korea that aims to accelerate AI innovation using open government data in the private sector. The platform houses South Korea's AI infrastructure and open government datasets for AI development and offers several services such as data quality evaluations. To complement these efforts, the government of Seoul is experimenting with creating synthetic data from open government data. One initiative developed using the AI Hub is TTCare, an AI-driven mobile application for pets, which was trained on data from the AI Hub along with other sources.

Region

apac

Sector

public_sector

Scenario

pre-training, data_augmentation

Start Date

2024

Location: South Korea

Bayaan Platform

Bayaan is a conversational tool developed by the Statistics Centre Abu Dhabi that aims to improve access to data from the Statistical Department. The tool uses generative AI to rapidly provide decision makers with data analytics, visualizations, and information that they can use in their decision-making processes. The data included focuses on seven areas and indicators: "Economy, Population, Industry, Social Statistics, Labour Force, Agriculture, and Environment."

Region

emea

Sector

public_sector

Scenario

open-ended_exploration

Start Date

2024

Location: United Arab Emirates

Artificial Intelligence Supported Interfaces (DAA-24) – America's Datahub Consortium (ADC)

The United States Federal Government is developing a generative AI chatbot that allows users to query federal statistical data. The chatbot sources data from government agencies such as the National Center for Science and Engineering Statistics. It uses natural language processing (NLP) to interpret user queries about federal statistical data and provide relevant information from the available data. The project aims to improve public access to statistical data and support evidence-based policymaking and research.

Region

north_america

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

Asclepius

Asclepius is a large language model for the medical domain trained on synthetic clinical notes generated from public biomedical information. The team chose to experiment with synthetic data given privacy concerns associated with using patient data in LLMs. The authors indicate that an LLM trained on synthetic data can produce outputs of similar quality to those of models trained on patient data.

Region

apac

Sector

academia

Scenario

data_augmentation

Start Date

2024

Location: South Korea

Ask ReliefWeb

Ask ReliefWeb is a generative AI-powered tool developed by ReliefWeb, a humanitarian information service managed by the United Nations Office for the Coordination of Humanitarian Affairs (OCHA). The tool is powered by Amazon's Bedrock Generative AI service and the Titan Foundational Model. Ask ReliefWeb allows users to interact with ReliefWeb’s repository of humanitarian reports through a chatbot-like interface. It is designed to assist users in retrieving relevant information from specific reports by generating responses to queries, aiming to enable humanitarian workers to access the data they need in real time. Ask ReliefWeb relies on data sourced exclusively from ReliefWeb’s reports and content.

Region

international

Sector

multilateral_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Global

AuroraGPT

AuroraGPT is an AI model developed by Argonne National Laboratory to support scientific research in areas like biology, cancer studies, and climate science. It was trained on scientific papers and computational data using the Aurora supercomputer. The model aims to help researchers analyze information and generate insights more efficiently. The project is supported by Intel and other partners to develop AI tools for scientific use.

Region

international

Sector

public_sector

Scenario

pre-training

Start Date

2024

Location: United States

Berufsinfomat

Berufsinfomat is a generative AI-driven tool (relying on ChatGPT) introduced by the Austrian Public Employment Service for career coaching. The system, trained on the Austrian Public Employment Service's knowledge database on professions, training, and education, is intended to provide users with information on those same topics. The Berufsinfomat received 160,000 prompts in January 2024 and around 20,000 additional monthly inquiries. It has been criticized for responses that conformed to stereotypes about men and women, for bias in its responses, and for various other problematic answers. It has received several revisions in response to these problems.

Region

emea

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Austria

Bielik 7B v0.1

Bielik 7B v0.1 is a generative AI model developed collaboratively by SpeakLeash and the ACK Cyfronet AGH computing center in Poland. It was trained on publicly available, official open datasets, primarily the SpeakLeash dataset—an open repository of verified Polish texts, including Wikipedia, Polish parliamentary records, Polish literature, and other publicly accessible multilingual repositories such as SlimPajama. The model performs natural language processing (NLP) tasks, including text generation, sentiment analysis, and question answering, specifically in Polish and English.

Region

emea

Sector

academia, non-profit

Scenario

pre-training

Start Date

2024

Location: Poland

Catalonia's AI Legal Texts Summarization Tool

Catalonia's Center for Telecommunications and Information Technology (CTTI) developed a generative AI tool to summarize legal texts in simple language that could be understood by citizens. The goal is to increase citizens' understanding of legal processes and allow for more transparency in the Government of Catalonia. The tool was trained on published legal texts, and eventually used to summarize all 14,000 laws published by the Catalonia Publications Office.

Region

emea

Sector

public_sector

Scenario

adaptation

Start Date

2024

Location: Catalonia

CensusGPT

CensusGPT is a natural language interface for United States census data, developed as part of the textSQL project. It allows users to ask questions about census information in plain English, which are then converted into SQL queries that retrieve the relevant data from a census database. This tool aims to allow more individuals to analyze population statistics, demographics, and other census-related information without needing technical SQL knowledge.

Region

north_america

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States
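
The question-to-SQL flow described in the CensusGPT entry can be sketched with Python's built-in `sqlite3` module. Everything specific here is an assumption for illustration: the `census` table, its made-up rows, and the `question_to_sql` stub, which stands in for the language-model translation step with a fixed lookup and is not the textSQL project's code.

```python
import sqlite3

def question_to_sql(question):
    """Stub for the LLM step: map a known plain-English question to SQL."""
    templates = {
        "which city has the largest population?":
            "SELECT city FROM census ORDER BY population DESC LIMIT 1",
    }
    return templates[question.strip().lower()]

# In-memory database with a hypothetical table and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (city TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?)",
    [("Alpha", 1000), ("Beta", 2500), ("Gamma", 1800)],
)

sql = question_to_sql("Which city has the largest population?")
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

Showing the generated SQL next to the result lets a user check how the question was interpreted.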

ChatTCU

In February 2023, Brazil's Federal Court of Accounts (TCU) launched ChatTCU, which uses OpenAI's ChatGPT and data sourced from the Federal Court of Accounts system. It allows auditors to request a summary of a case document and to pose technical questions about the TCU and court decisions, and it also provides administrative services.

Region

latin_america_and_the_caribbean

Sector

public_sector

Scenario

open-ended_exploration, adaptation

Start Date

2024

Location: Brazil

Citymeetings.nyc

Citymeetings.nyc is an independent initiative that uses LLMs to synthesize information from New York City Council meetings. It uses data from Legistar, an online platform where the government posts meeting summaries and agendas.

Region

north_america

Sector

civic_tech

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

ClassifAI

The Data Science Campus of the United Kingdom's Office for National Statistics has developed ClassifAI, an experimental tool that uses large language models to organize text into categories (e.g. industry). It aims to improve upon existing classification methods by offering greater flexibility and potentially higher accuracy for tasks such as categorizing labor market survey responses. The code has been released as open-source. The developers note that further assessment is needed before potential use in official statistics production.

Region

international

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United Kingdom

Clay

Trained on satellite imagery and earth observation data, Clay is a generative AI foundation model designed to understand and analyze Earth's surface. It can generate mathematical representations of any location on Earth at any given time, which can be used for various tasks like creating land cover maps, detecting crop or burn scars, and tracking deforestation. The AI model is open source.

Region

international

Sector

non-profit

Scenario

open-ended_exploration

Start Date

2024

Location: United States
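
One common way to use the kind of mathematical representations (embeddings) mentioned in the Clay entry is to compare locations by cosine similarity. The sketch below assumes made-up three-dimensional vectors and hypothetical tile names; real embeddings would come from the model and be much longer, and nothing here uses Clay's own API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings for three hypothetical image tiles.
embeddings = {
    "tile_forest_a": [0.9, 0.1, 0.0],
    "tile_forest_b": [0.8, 0.2, 0.1],
    "tile_burn_scar": [0.1, 0.1, 0.9],
}

query = "tile_forest_a"
scores = {
    name: cosine_similarity(embeddings[query], vec)
    for name, vec in embeddings.items()
    if name != query
}
most_similar = max(scores, key=scores.get)
print(most_similar)
```

Tasks such as land cover mapping or burn scar detection can then be framed as finding tiles whose embeddings sit close to labeled examples.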

CroissantLLM

Developed by researchers at various European and American universities as well as private technology companies, CroissantLLM is a large language model that aims to support English-French language queries. The model is trained on both web-scraped data and open government data from France. This initiative aims to improve LLMs' capability to analyze non-English data.

Region

emea

Sector

academia, private_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: France

Data Commons

Data Commons is a platform developed by Google that aggregates and standardizes public datasets from various global sources. The initiative uses AI and large language models to provide a natural language interface, enabling users to query complex data without requiring technical expertise. Through partnerships with organizations like the UN, Indian Institute of Technology Madras, and Feeding America, Data Commons offers specialized data portals on topics such as Sustainable Development Goals, India-specific information, and U.S. food security, presenting information through visualizations and analysis tools.

Region

international

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

Data Provenance Explorer

The Data Provenance Explorer is an interactive tool developed as part of an audit of AI training datasets by MIT's Center for Constructive Communication. This tool allows researchers to explore information about datasets, including their origins, licenses, creators, and other metadata. This resource aims to enhance transparency around AI datasets and promote more informed use of datasets in AI research and development.

Region

north_america

Sector

academia

Scenario

adaptation

Start Date

2024

Location: United States

DataLaw.Bot

Developed by DS-I Africa (a research program in the United States funded by the National Institutes of Health) and the University of KwaZulu-Natal, DataLaw.Bot is a generative AI chatbot launched in October 2024. Researchers from several countries across the African continent can use it to assess data sharing regulations for scientific research. The chatbot was adapted from ChatGPT using national-level data sharing regulations, with the goal of increasing access to research data across the continent.

Region

emea

Sector

public_sector, academia

Scenario

adaptation

Start Date

2024

Location: Botswana, Cameroon, Ghana, Kenya, Malawi, Nigeria, Rwanda, South Africa, Tanzania, The Gambia, Uganda, and Zimbabwe

DC Compass

The DC Compass AI assistant is a generative AI chat interface that provides answers to user queries based on datasets from Open Data DC. The interface can provide a summary of a dataset, supporting visualizations, graphs, and other maps. Currently, this project is a pilot program running a beta test open to the public. The team notes that the quality of the output is impacted by the quality of the data from Open Data DC as well as the breadth of data included.

Region

north_america

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

DRG-LLaMA

Researchers from the Mayo Clinic and University of Illinois Urbana-Champaign developed DRG-LLaMA, a tool designed for healthcare professionals involved in hospital billing and coding. DRG-LLaMA is an advanced large language model fine-tuned on clinical notes to improve the assignment of Diagnosis-Related Groups (DRGs) in the United States inpatient payment system. The model improves DRG assignment by analyzing patient discharge summaries to predict DRGs and their components, achieving higher accuracy than previous methods used for this task.

Region

north_america

Sector

public_sector, academia

Scenario

adaptation

Start Date

2024

Location: United States

eLangTech

The European Commission's Directorate-General for Translation developed a collection of AI-powered language tools accessible to EU institutions, public administrations, SMEs, NGOs, academia, Digital Europe Programme projects, and European Personnel Selection Office candidates. These tools, including eTranslation, eBriefing, and eSummary, are trained on existing EU policy, legislative, and governmental documents and official translations. For accuracy and data privacy, the responses of these tools are based only on the information from user-submitted document(s), meaning the model does not directly incorporate material or context from the training data.

Region

emea

Sector

public_sector

Scenario

pre-training

Start Date

2024

Location: European Union

F13 - InnoLab-bw

The Baden-Württemberg Innovation Laboratory within the Government of Germany created F13, a generative AI chatbot to support government administrative tasks. This chatbot helps personnel summarize documents and provides research support.

Region

emea

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Germany

GENAI4LEX-B

GENAI4LEX-B is an AI-powered legislative tool created by researchers at the University of Bologna to support the Italian Chamber of Deputies in legal research and drafting bills. GENAI4LEX-B uses constitutional court decisions, parliamentary documents, and national and European legislation that have been structured in XML format in accordance with Akoma Ntoso – LegalDocML OASIS standards. The tool can identify relevant supporting legal documents and can use legal ontologies (like EuroVoc) to classify documents. In the future, this tool will be piloted by the European Commission's Directorates-General.

Region

emea

Sector

public_sector

Scenario

adaptation

Start Date

2024

Location: Italy

Generative AI Chatbot for Drilling and Production

Generative AI Chatbot for Drilling and Production integrates large language models (LLMs) with the Volve dataset, a publicly available dataset in the oil and gas industry. The Volve dataset was developed by Equinor, a Norwegian energy company. The chatbot is designed to analyze historical drilling and production reports, perform diagnostic analysis, generate structured query language (SQL), and provide recommendations for improving operations. The dataset is used to identify non-productive time, compare well performance, and diagnose root causes for poorly performing wells. The chatbot lets users ask questions about the dataset and returns insights and analysis based on the operational data.

Region

international

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

GeneSilico Copilot

Developed by researchers at the Indraprastha Institute of Information Technology-Delhi, GeneSilico Copilot is a tool used to support oncologists. It draws on data from Drugbank Open Data, FDA drug labels, RxList, Therapeutic Target Database, Drugs.com, and Wikipedia to offer advice on treatment decisions based on observed facts about a given patient.

Region

apac

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2024

Location: India

GeoLLM-Engine

GeoLLM-Engine, developed by researchers at CoStrategist R&D Group and Microsoft Corporation, is an interface for interacting with geospatial data. The system includes a set of tools for analyzing maps and conducting spatial research. The development team is currently focused on improving the quality of outputs and refining the user interface. GeoLLM-Engine aims to serve professionals in fields that utilize geospatial analysis, such as urban planning and environmental monitoring.

Region

north_america

Sector

private_sector

Scenario

pre-training, open-ended_exploration

Start Date

2024

Location: United States

GoldCoin

GoldCoin is a large language model developed for the legal domain by researchers at the Department of Computer Science and Engineering, HKUST, in Hong Kong SAR, China. It specializes in detecting violations of HIPAA privacy rules based on specific queries. The model was trained using legal data from Harvard University's Caselaw Access Project, which offers public access to United States legal decisions. The research team suggests that GoldCoin could potentially be adapted to address other privacy laws in the future.

Region

apac

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2024

Location: China

GovTech DSAID

GovTech's Data Science and Artificial Intelligence Division (DSAID) has developed a system to assist in drafting parliamentary replies* using artificial intelligence. The project uses machine learning techniques to train language models on past parliamentary data, aiming to generate responses that match the style and accuracy of official replies. This tool is designed to help public servants in Singapore more efficiently prepare answers to parliamentary questions, while also exploring the broader potential of customized AI models for government applications. *Parliamentary replies are official answers given by government ministers or representatives to questions asked by members of parliament during legislative sessions.

Region

apac

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Singapore

I14Y Interoperability Platform

The I14Y Interoperability Platform is Switzerland's national data catalogue, designed to improve access to data between authorities, businesses, and citizens. It provides a centralized repository for data collections, application interfaces, and government services from different levels of government. The platform offers services such as a searchable catalogue, concept definitions, news updates, and a handbook to support users in navigating and using Switzerland's data infrastructure.

Region

emea

Sector

public_sector

Scenario

data_augmentation

Start Date

2024

Location: Switzerland

Open in new tab Source

IN.gov

The Indiana Office of Technology and Tyler Technologies (a technology firm) launched a beta version of an AI chatbot that aims to help the public navigate public services. The chatbot is trained on public information from several departments within the State government and is hosted on the Government of Indiana website. Before users open the chatbot, a disclaimer states that the State will not be liable for any incorrect or misleading information it provides.

Region

north_america

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

Open in new tab Source

InkubaLM

InkubaLM-0.4B is a language model developed by Lelapa AI to support AI applications for African languages with limited digital resources. The model was trained using open datasets published on Zenodo, a publicly accessible repository for scientific and research data. Specifically, InkubaLM used datasets such as Inkuba-Mono and Inkuba-Instruct, covering languages like Hausa, Yoruba, Swahili, isiZulu, and isiXhosa. The model helps with tasks like text translation, sentiment analysis, and keyword recognition, aiming to make AI more inclusive and accessible for underrepresented language communities.

Region

emea

Sector

private_sector

Scenario

pre-training

Start Date

2024

Location: South Africa

Open in new tab Source

Intern-LM-Law

Researchers from Shanghai AI Laboratory, Nanjing University, Eastern Institute of Technology, Ningbo, and Saarland University, Saarland Informatics Campus developed a chatbot that can address queries and analyze documents within the legal domain in China. The team fine-tuned the LLM using legal data from the Chinese National Legal Database along with other data sources. The team has made both the model and data sources publicly available.

Region

apac

Sector

academia

Scenario

adaptation

Start Date

2024

Location: China

Open in new tab Source

Justice Folder

Spain's Ministry of Justice developed the Justice Folder, a project that uses AI to improve the accessibility of data and information in the justice domain, including over 700,000 open proceedings. Generative AI is used to create plain language summaries of judicial documents and to power an advanced search function. In the future, the Ministry of Justice is planning to create conversational tools for document classification, nominal entity extraction, document anonymization, named relationship recognition, and documentary extraction.

Region

emea

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Spain

Open in new tab Source

KemenkeuGPT

With the support of the Indonesia Endowment Fund for Education (LPDP) of the Ministry of Finance of the Republic of Indonesia, researchers at the University of Nottingham developed KemenkeuGPT, a generative AI chatbot that aims to support policy makers within Indonesia's Ministry of Finance. The chatbot uses retrieval-augmented generation (RAG) and combines data from the Ministry of Finance, Statistics Indonesia, and the International Monetary Fund, among other sources.

Region

apac

Sector

public_sector, academia

Scenario

adaptation

Start Date

2024

Location: Indonesia

Open in new tab Source

LLaMandement

Developed by the Government of France, LLaMandement aims to support administrative agents in analyzing and drafting summaries of legal bills developed in the French Parliament for other ministries and departments. To fine-tune the pretrained model, the team used data from SIGNALE, a platform used in the French government’s lawmaking process that includes data from several ministries, such as the Ministry of Ecological Transition and Territorial Cohesion and the Ministry of Culture.

Region

emea

Sector

public_sector

Scenario

adaptation

Start Date

2024

Location: France

Open in new tab Source

LLM on FHIR - A Project to Demystify Health Records

Researchers at Stanford University developed a mobile application that uses artificial intelligence to help patients better understand their health records and medical information. The mobile application, called LLM on FHIR, translates complex medical data into plain language and can answer patients' health questions based on their personal medical history. While the application showed promise in making health information more accessible, the study also revealed challenges such as occasional inconsistent responses, highlighting areas for future improvement.

Region

north_america

Sector

academia

Scenario

open-ended_exploration

Start Date

2024

Location: United States

Open in new tab Source

LLM-Potus Score

Researchers at the University of Georgia and State University of New York at Albany used LLMs to analyze the transcripts of United States presidential debates. The team tested seven debates from the last 24 years using GPT-4o and Claude 3. The team aimed to demonstrate how LLMs can be used to help minimize bias in judging.

Region

north_america

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

Open in new tab Source

LLMoin

LLMoin is a chatbot developed by the City of Hamburg that aims to provide administrative support to government personnel. The tool is based on the Luminous language model from Aleph Alpha, which was developed in Germany. LLMoin is currently a pilot program undergoing testing.

Region

emea

Sector

public_sector

Scenario

adaptation

Start Date

2024

Location: Germany

Open in new tab Source

LuminLab

LuminLab is an online platform that employs generative AI to offer information on improving building energy efficiency. The model is trained using open data from the Energy Performance Certificate dataset provided by the Sustainable Energy Authority of Ireland. The developers are currently working on enhancements, including the integration of geospatial data to generate 3D images of various areas, aiming to expand the platform's capabilities and visual representations.

Region

emea

Sector

academia

Scenario

adaptation

Start Date

2024

Location: Ireland

Open in new tab Source

Microsoft AI for Good - Damage Assessment Visualizer for Hurricane Beryl in Grenada

Microsoft and Planet collaborated with humanitarian organizations to analyze the impact of Hurricane Beryl in Grenada. This experimental tool uses the Microsoft AI for Good Damage Assessment Visualizer to analyze satellite images from Planet, estimating damage to buildings and structures on the island of Carriacou. The tool provides visual data to support frontline workers in disaster response and logistics.

Region

international

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Grenada

Open in new tab Source

NextGen NHTS Origin-Destination Data

Researchers at the National Transportation Research Center are using machine learning and AI techniques to analyze truck transportation patterns across the United States. The team is combining truck trip data with population and employment statistics using advanced algorithms to model and predict truck flows between regions. This effort is helping to uncover factors influencing truck transportation, such as the nonlinear relationship between distance and truck trips, providing valuable insights for transportation planning and investment decisions.

Region

north_america

Sector

public_sector

Scenario

data_augmentation

Start Date

2024

Location: United States

Open in new tab Source

OLMo 2

OLMo 2 is an open-source generative AI language model developed by the Allen Institute for AI (AI2). It was trained primarily on data sources such as Wikipedia and Wikibooks via Dolma 1.7, academic papers from arXiv, and additional datasets like OpenWebMath and Algebraic Stack from ProofPile II. These sources, which are available to the general public and researchers worldwide, include web pages, code repositories, and academic content. OLMo 2 aims to support tasks like instruction-following, conversational AI, and text generation.

Region

international

Sector

academia, non-profit

Scenario

pre-training

Start Date

2024

Location: United States

Open in new tab Source

Quantitative Reasoning with Data Benchmark

Researchers at the Wangxuan Institute of Computer Technology at Peking University and the Computer Science Department of the University of California, Los Angeles developed the Quantitative Reasoning with Data (QRData) Benchmark to assess LLMs' ability to analyze statistical data. QRData draws on data from open textbooks, research papers, and other sources, paired with 411 questions. Of the LLMs tested, GPT-4 performed the best, but the researchers noted the need for improvement.

Region

international

Sector

academia

Scenario

pre-training

Start Date

2024

Location: United States, China

Open in new tab Source

Queried

Queried is a research tool developed by Climate Policy Radar, a not-for-profit organization focused on advancing climate policy through open data and AI tools. Queried uses generative AI to assist users in analyzing climate law and policy documents. Using Large Language Models (LLMs), the tool allows users to query specific documents and receive responses based on the content. The tool is built on data from the Climate Change Laws of the World database, which includes laws and policies on energy, transport, land use, climate resilience, and low-carbon transitions and is continuously updated with data from official government websites and parliamentary records. Queried aims to help governments, researchers, and other stakeholders access and analyze climate policy documents, supporting their efforts to understand climate-related laws and policies.

Region

international

Sector

academia, non-profit

Scenario

inference_and_insight_generation

Start Date

2024

Location: United Kingdom

Open in new tab Source

RAG for Culturally Inclusive Hakka Chatbots

Researchers in Taiwan experimented with using RAG to improve LLMs' ability to answer queries about Taiwanese Hakka culture. The team combined data from the Ministry of Education's Cultural Knowledge Base and Hakka Dictionary along with other data sources focused on language and geographic locations. Through this effort, the team aimed to demonstrate the value of integrating a translation function in LLMs to support generative AI technologies that reflect minority cultures.

Region

apac

Sector

private_sector, academia

Scenario

adaptation

Start Date

2024

Location: Taiwan

Open in new tab Source

SatGPT

SatGPT is a conversational tool developed by the United Nations Economic and Social Commission for Asia and the Pacific (ESCAP) that integrates generative AI with Earth observation data. SatGPT generates readable descriptions from satellite imagery in an effort to make geospatial data more interpretable. Users can ask questions and receive insights for applications such as flood monitoring, agricultural assessments, and urban planning. SatGPT uses open data sources including ESA WorldCover 2020 for land cover classification, which identifies different types of surface coverage such as forests, croplands, and urban areas; Humanitarian Data Exchange (HDX) for administrative and humanitarian data; and Google Earth Engine for accessing global geospatial datasets like the Global Surface Water Mapping Layers from the European Union's Joint Research Centre (JRC).

Region

apac

Sector

multilateral_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Thailand

Open in new tab Source

SaulLM-7B

Developed by researchers in Portugal and France, SaulLM-7B is a large language model that summarizes legal documents. The model is pretrained on legal texts from the United States, Europe, and Australia.

Region

emea

Sector

academia, private_sector

Scenario

pre-training

Start Date

2024

Location: France

Open in new tab Source

Scholastic AI

ScholasticAI is a tool that uses retrieval-augmented generation (RAG) to help users extract and analyze information from documents, such as Portable Document Format (PDF) files. It allows users to upload their own files and generate responses based on the content within them, along with querying external knowledge databases. ScholasticAI is powered by the open-source Pleias-Pico language model. The model is trained on publicly available data, which includes public domain books and newspapers in multiple languages. ScholasticAI is designed to support multiple languages and improve the accuracy of information by referencing and grounding its responses in the original sources.

Region

international

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

Open in new tab Source

Sidekick

Sidekick is an AI chatbot developed by mySidewalk that answers queries about public issues. The chatbot's responses are drawn from several official data sources, including data from the U.S. Census Bureau, United States Department of Agriculture, and Bureau of Labor Statistics. Among its goals, it seeks to improve access to data for non-technical audiences.

Region

north_america

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

Open in new tab Source

StatBot.Swiss

StatBot.Swiss is a benchmark dataset developed by Swiss researchers that can be used to test generative AI models' ability to answer queries in English and German. The dataset includes data from the OpenData.Swiss government portal. Moving forward, the team is looking into expanding it to include other languages such as French and Italian.

Region

emea

Sector

public_sector, academia

Scenario

pre-training

Start Date

2024

Location: Switzerland

Open in new tab Source

Synthetic Australian Healthcare Data Using Synthea

In January 2024, researchers from the Australian e-Health Research Centre (CSIRO) and Macquarie University launched a study on using synthetic data to enhance access to healthcare information. They adapted the Synthea tool, which typically uses US census data, to incorporate Australian demographic and hospital data, creating around 117,000 synthetic health records specific to Queensland. The team used these records to analyze disease patterns, noting that while the synthetic data provides valuable access, further real-world testing is needed to ensure it accurately represents the local context.

Region

apac

Sector

public_sector, academia

Scenario

data_augmentation

Start Date

2024

Location: Australia

Open in new tab Source

Synthetic Data for Official Statistics

This guide assists National Statistical Offices (NSOs) in managing data access using synthetic data while maintaining confidentiality. It is suitable for statisticians and data managers in government agencies interested in implementing synthetic data. The guide covers the creation of synthetic data, addresses privacy risks, and provides practical tips for application, including a case in which the Office for National Statistics Data Science Campus in the United Kingdom created a synthetic dataset using the U.S. Census Bureau’s income data to test the 2021 Census model.

Region

north_america

Sector

public_sector

Scenario

data_augmentation

Start Date

2024

Location: United States

Open in new tab Source

The Virtual Intelligent Chat Assistant's Department of Statistics Proof of Concept

The Virtual Intelligent Chat Assistant (VICA) is an online platform by Singapore's Government Technology Agency (GovTech) that public servants from across the government can use to create their own generative AI chatbots. In a blog post published in Towards Data Science, representatives from GovTech discuss a proof of concept they developed using VICA for the Department of Statistics' data. The team created a chatbot that could respond to queries about national statistics (such as GDP) in a table format.

Region

apac

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Singapore

Open in new tab Source

TitiBot

Developed by BORDE (a Mexican non-profit), TitiBot is a Spanish-language WhatsApp chatbot that helps improve access to voting records on legislative reforms. It uses data from Mexico's Congress of the Union (e.g. parliamentary voting records) from 2018 to 2024 and can provide summaries of the data.

Region

latin_america_and_the_caribbean

Sector

non-profit

Scenario

inference_and_insight_generation

Start Date

2024

Location: Mexico

Open in new tab Source

USAFacts

USAFacts processes and standardizes open government data from federal, state, and local sources. It uses generative AI specifically to create written content, such as summaries and explanations, based on official government data.

Region

north_america

Sector

non-profit

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

Open in new tab Source

Dolma

Dolma is an open dataset created by the Allen Institute for AI, made up of academic research along with other data sources such as books, website content, and code. The dataset currently contains 3 trillion tokens and is accompanied by a toolkit on how to source datasets for training purposes.

Region

international

Sector

civic_tech, non-profit

Scenario

pre-training

Start Date

2023

Location: Global

Open in new tab Source

AI4Culture

AI4Culture, a public platform developed by the Digital Europe Programme of the European Union, offers a collection of deployable generative AI tools and open datasets for training AI. Components on the platform are interoperable with the Common European Data Space for cultural heritage. The platform's tools and data can be used for AI-generated translations of cultural heritage metadata, multilingual subtitle generation, and multilingual text recognition in scanned documents. Some of the open datasets include verified translations, transcriptions of scanned handwritten documents, European artwork classification data, and 950,000 hours of speech data.

Region

emea

Sector

public_sector

Scenario

adaptation

Start Date

2023

Location: European Union

Open in new tab Source

ChatDoctor

ChatDoctor is a generative AI chatbot that can answer queries in the medical domain. The model was trained on patient conversations from an online medical platform. It also uses data from Medline Plus (a government health information website for medical practitioners) in addition to other data sources.

Region

north_america

Sector

academia

Scenario

adaptation

Start Date

2023

Location: United States

Open in new tab Source

City of Helsinki's AI Register

The City of Helsinki has adapted general purpose LLMs to improve its civic services, including urban planning and public facilities. These generative AI tools are fine-tuned using open city data, such as zoning regulations and planning documents, to facilitate civic engagement. These tools aim to enable more efficient communication with residents while enhancing the accessibility of complex information.

Region

emea

Sector

public_sector

Scenario

adaptation

Start Date

2023

Location: Finland

Open in new tab Source

Climate Q&A

ClimateQ&A is a generative AI chatbot built on the ChatGPT API to provide responses to queries about climate change. The chatbot was created by Ekimetrics, a data and AI firm based in France, and uses data from reports from the Intergovernmental Panel on Climate Change (IPCC) and the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES). While its primary objective is to make scientific information on climate change more accessible, it also helps the team understand the types of questions people have about climate change. The team uses NLP to analyze these questions and identify where there are knowledge gaps.

Region

international

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2023

Location: France, Global

Open in new tab Source

covLLM

Researchers at Stanford University developed covLLM, a generative AI tool to support doctors in understanding the most up-to-date COVID-19 research. The model was trained on the COVID-19 Open Research Dataset (CORD-19) and can provide summaries of research based on specific queries. Its objective is to address healthcare professionals' need to stay updated on fast-evolving topics.

Region

north_america

Sector

academia

Scenario

adaptation

Start Date

2023

Location: United States

Open in new tab Source

Democratic Fine-Tuning with a Moral Graph

Democratic Fine-Tuning with a Moral Graph (DFTmg) is a new method for aligning AI language models with human values through large-scale public discussions. The project used a survey of 500 Americans' political views, which the team anonymized and published as open research data on GitHub. This process aims to develop AI models that make better decisions by incorporating public input into the training process. This work was supported by OpenAI.

Region

north_america

Sector

non-profit, private_sector

Scenario

open-ended_exploration

Start Date

2023

Location: Global

Open in new tab Source

Dolma Dataset

Dolma is a 3-trillion-token open dataset created by the Allen Institute for AI, made up of web pages, academic publications, code, books, and encyclopedic materials. The dataset is accompanied by a toolkit on how to source datasets for training purposes and aims to support transparency, risk mitigation, and reproducibility for responsible AI development.

Region

international

Sector

academia

Scenario

pre-training

Start Date

2023

Location: United States

Open in new tab Source

ESGReveal

ESGReveal uses retrieval-augmented generation (RAG) to adapt Environmental, Social, and Governance (ESG) data from corporate reports, helping users find information from these reports when searching a database or the internet. The generative AI model was trained on ESG reports from 166 companies on the Hong Kong Stock Exchange.

Region

apac

Sector

private_sector, academia

Scenario

adaptation

Start Date

2023

Location: Hong Kong

Open in new tab Source

Farmer.chat

Representatives from Digital Green India (an NGO) and Microsoft Research (India) developed a generative AI chatbot for agricultural services. The chatbot provides farmers with text, audio, and video responses to queries about agriculture. The chatbot uses RAG and draws on research papers and other data sources. It has been implemented in Kenya, India, Ethiopia, and Nigeria thus far.

Region

international

Sector

private_sector

Scenario

adaptation

Start Date

2023

Location: Kenya, India, Ethiopia, and Nigeria

Open in new tab Source

Generating a Fully Synthetic Human Services Dataset

This report, produced by researchers at the Urban Institute in collaboration with Allegheny County partners, describes the process of creating a synthetic version of the county's 2021 human services dataset. The synthetic data aims to replicate statistical properties of the confidential data while protecting individual privacy, enabling wider access to detailed human services information. The document covers the data synthesis methodology, evaluation of data quality and privacy risks, and the challenges of balancing utility and confidentiality in synthetic administrative data.

Region

north_america

Sector

public_sector

Scenario

data_augmentation

Start Date

2023

Location: United States

Open in new tab Source

GenSpectrum Chat

GenSpectrum is a generative AI chatbot for COVID-19 genomic sequencing data from the GISAID Data Science Initiative (an initiative focused on generating access to data related to pathogens through partnerships). The chatbot was developed by researchers at the Department of Biosystems Science and Engineering, ETH Zürich and the Swiss Institute of Bioinformatics. The team aims to support research in the medical domain. The chatbot is not yet available online.

Region

international

Sector

academia, non-profit

Scenario

inference_and_insight_generation

Start Date

2023

Location: Switzerland

Open in new tab Source

GPT-SW3

GPT-SW3 is an open-source generative AI model collaboratively developed by AI Sweden, RISE, and Wallenberg AI, Autonomous Systems, and Software Programs (WASP WARA). It was trained on datasets including Wikipedia, Wikimedia, and the Norwegian Colossal Corpus, an open dataset comprising texts from government publications, parliamentary records, newspapers, literature, and public reports. GPT-SW3 is designed to perform natural language processing tasks specifically for Nordic languages such as Swedish, Norwegian, Danish, and Icelandic, including content generation, translation, and digital assistant functions.

Region

emea

Sector

public_sector, academia, non-profit

Scenario

pre-training

Start Date

2023

Location: Sweden

Open in new tab Source

Jugalbandi AI for Multilingual Access to Government Services

Jugalbandi is a generative AI-powered language translation tool that improves access to government programs and rights information across India. It leverages open government data related to various welfare schemes and services, using generative AI models to provide accurate translations in multiple local languages. The AI facilitates communication between citizens and the government, helping individuals understand and access services regardless of language barriers. This initiative democratizes access to official data and government resources, promoting inclusion in public services.

Region

apac

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2023

Location: India

Open in new tab Source

Llemma

Llemma is a generative AI model fine-tuned for the mathematics domain. It was fine-tuned using the Proof-Pile-2 dataset, which combines scientific papers with other mathematics datasets. The researchers have provided public access to the models, dataset, and code to encourage future research on AI and mathematics.

Region

north_america

Sector

academia, non-profit

Scenario

adaptation

Start Date

2023

Location: United States

Open in new tab Source

Med-PaLM2

Med-PaLM2 is a generative AI chatbot by Google Research which seeks to provide long-form written answers to medical questions. Med-PaLM2 is fine-tuned using “publicly available question-answering data and physician writing responses” including MedQA and MedMCQA among other datasets. Med-PaLM2 achieved 86.5% accuracy on United States Medical Licensing Examination questions.

Region

north_america

Sector

private_sector

Scenario

adaptation

Start Date

2023

Location: United States

Open in new tab Source

MILDSum

Developed by researchers and legal practitioners from the Indian Institute of Technology Kharagpur, MILDSum is a research initiative that aims to bring together open data from the legal domain (i.e. case judgements) to create Hindi summaries of case judgements that can be used for training purposes.

Region

apac

Sector

academia

Scenario

pre-training

Start Date

2023

Location: India

Open in new tab Source

NEPAccess

NEPAccess, developed by the University of Arizona, employs AI and data science to improve the National Environmental Policy Act (NEPA) environmental review process. The project uses generative AI to compile insights from previous projects and assist in drafting environmental impact assessments (EIAs) on specific topics. By integrating open data from federal agencies, NEPAccess provides public access to a centralized database of environmental reviews. The project was funded by the National Science Foundation (NSF) from 2021 to 2024 and is now seeking new funding to build new features into its platform.

Region

north_america

Sector

academia

Scenario

open-ended_exploration

Start Date

2023

Location: United States

Open in new tab Source

OpenAssistant Conversations

Researchers have released a free, public collection of conversations called OpenAssistant Conversations to help improve AI language models. This dataset, created by over 13,500 volunteers worldwide, includes conversations in 35 languages along with quality ratings. By making this resource freely available, the researchers aim to democratize the development of more user-friendly and capable AI assistants across various fields.

Region

emea

Sector

academia

Scenario

pre-training, adaptation

Start Date

2023

Location: Germany

Open in new tab Source

Parla

Parla is an AI interface in development at CityLab Berlin. It aims to enhance access to public administration data across the city for both government officials and the general public. Functioning as both a retrieval system and an analytical tool, Parla accesses over 10,000 public documents from city departments, systems, and formats to answer specific queries. However, due to challenges like poorly structured data and insufficient metadata, Parla sometimes generates inaccurate outputs. To address this, Parla ensures its responses include source references, improving transparency and accountability.

Region

emea

Sector

civic_tech

Scenario

open-ended_exploration

Start Date

2023

Location: Germany

Open in new tab Source

Phi-2

Phi-2 is an open-source small language model with 2.7 billion parameters that demonstrates outstanding reasoning and language understanding capabilities. Due to its small size, researchers use it to study AI model interactions, enhance safety features, and customize it for specific applications. The training data contains a mix of curated web data and synthetic data made to focus on common sense reasoning and general knowledge.

Region

north_america

Sector

private_sector

Scenario

data_augmentation

Start Date

2023

Location: United States

Open in new tab Source

SEA-LION

SEA-LION is a family of open-source large language models developed by AI Singapore as part of the National Multi-Modal Large Language Model project. Trained on multilingual datasets from Southeast Asia, SEA-LION supports low-resource languages like Thai, Vietnamese, and Bahasa Indonesia. The models aim to improve cultural representation in AI and enhance accessibility for multilingual natural language processing (NLP) tasks, including translation, summarization, and question answering.

Region

apac

Sector

academia

Scenario

pre-training

Start Date

2023

Location: Singapore

Open in new tab Source

SELENA+

SELENA+, developed by Synapxe (a department within the Government of Singapore focused on healthtech), the National University of Singapore, and the Singapore National Eye Center, uses generative AI to detect eye conditions, specifically diabetic eye disease, glaucoma, and age-related macular degeneration. The tool analyzes imagery from the National Eye Center. The team plans to expand this tool to cardiovascular diseases in the future.

Region

apac

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2023

Location: Singapore

Open in new tab Source

StatGPT

To help improve the accessibility and usability of their open data platform, the International Monetary Fund (IMF) is prototyping a new generative AI tool that they are calling StatGPT. StatGPT will act as a user interface that processes natural language requests to find relevant datasets from the IMF’s repository. StatGPT will help users find indicators, visualize data in tables and charts, and generate Python code for analysis. The team is currently developing interface features and will then seek to integrate it in Excel.

Region

international

Sector

multilateral_sector

Scenario

inference_and_insight_generation

Start Date

2023

Location: Europe and North America

Open in new tab Source

Statistics Canada

Statistics Canada conducted a pilot program on generating synthetic data for training purposes. The team created synthetic datasets from census data that includes sensitive information. These datasets were used in two hackathons, with the condition that they could not be publicly shared. Organizers highlighted that the synthetic datasets preserved the usefulness of the original data for analysis while minimizing the risk of revealing sensitive information. Hackathon participants successfully used these datasets for training purposes.

Region

north_america

Sector

public_sector

Scenario

data_augmentation

Start Date

2023

Location: Canada

Open in new tab Source

Talk to the City

Talk to the City is an open-source tool that uses advanced AI to analyze and summarize qualitative data, particularly human opinions. It aims to improve collective decision-making and enhance public discourse around policymaking by clustering similar arguments and creating summaries and visualizations. Talk to the City has been used in citizens' assemblies in Taiwan as of 2023.

Region

international

Sector

non-profit

Scenario

open-ended_exploration

Start Date

2023

Location: United States

Open in new tab Source

TaxGPT

TaxGPT is an independently developed generative AI chatbot that answers tax-related queries based on information from the Canada Revenue Agency website. Its goal is to make tax information more understandable to the general public. It was updated in 2024 and is currently operational.

Region

north_america

Sector

civic_tech

Scenario

inference_and_insight_generation

Start Date

2023

Location: Canada

Open in new tab Source

Tendios

Tendios is a Software-as-a-Service platform that uses AI to support governments and tenderers in the public procurement process. Tendios offers a chatbot where users from contracting authorities (e.g. governments) can inquire about various bids, and the platform provides a tool to automatically generate administrative and technical documents based on existing bids. The open data for this platform is provided by Spain's Public Sector Procurement Platform (PLACSP) and other public procurement data sources from throughout Europe. Tendios provides other services, including AI-powered open exploration of the available data and prediction dashboards of bidders and prices. Overall, Tendios is designed to streamline the typically lengthy and complicated bidding process.

Region

emea

Sector

private_sector

Scenario

open-ended_exploration, inference_and_insight_generation

Start Date

2023

Location: Spain

Open in new tab Source

The Harmonized Landsat and Sentinel-2 (HLS) Project

The Harmonized Landsat and Sentinel-2 (HLS) project by NASA aims to create a record of Earth's surface using images from multiple satellites. The HLS dataset combines data from four NASA satellites as well as US Geological Survey (USGS) sensors around the globe. The dataset was used to train NASA and IBM’s watsonx.ai geospatial foundation model, which can be used to develop AI systems that provide maps and analytics about natural disasters and environmental changes. The latest dataset includes information from across the globe (except Antarctica). This work was a collaboration between NASA, USGS, and several NASA research centers.

Region

north_america

Sector

public_sector

Scenario

pre-training

Start Date

2023

Location: United States

Open in new tab Source

UrbanistAI

UrbanistAI is a generative AI platform for urban planning, designed specifically to allow for citizen input and participation in the planning process. The UrbanistAI platform can be adopted by a municipality and trained on open local policy requirements, after which the tool can produce renderings that align with those policies. Citizens can experiment with text-to-image prompts that the tool will incorporate into visual renderings based on real photos of an urban area.

Region

emea

Sector

public_sector

Scenario

pre-training

Start Date

2023

Location: Finland

Open in new tab Source

Wobby

Wobby is a generative AI-powered interface that can answer queries related to specific open datasets and produce summaries and visualizations of those datasets as responses. The platform is focused primarily on democratizing access to open government data, and currently hosts datasets from organizations like Statbel (Belgium’s national statistical office), Statistics Netherlands, and Eurostat, as well as data from intergovernmental organizations like the World Bank. Wobby's latest update allows for automatic data updates and real-time analysis based on current information.

Region

emea

Sector

private

Scenario

inference_and_insight_generation

Start Date

2023

Location: Belgium

Open in new tab Source

AgricultureBERT

AgricultureBERT is a generative AI model for the agriculture domain that was developed with data from the United States National Agricultural Library. This model is used to answer questions related to agricultural knowledge, such as crop-growing best practices or fertilization techniques in different climates. The intention is to improve access to agricultural information and advance research in the field.

Region

north_america

Sector

civic_tech

Scenario

adaptation

Start Date

2022

Location: United States

Open in new tab Source

BioGPT

BioGPT is a generative AI model that can answer queries about biomedicine. BioGPT was trained using biomedical literature from PubMed. This tool was developed by representatives of Microsoft Research and Peking University.

Region

north_america

Sector

private_sector, academia

Scenario

inference_and_insight_generation

Start Date

2022

Location: United States

Open in new tab Source

BLOOM, a BigScience Initiative

BLOOM is an open-access, multilingual large language model (LLM) trained using a mix of publicly available datasets, including community-selected data and filtered web-crawled data. Its training corpus, known as ROOTS, includes open data from sources like Project Gutenberg, OpenSubtitles, and HAL (open-access scientific publications), as well as government data and open research repositories such as the Catalan Government Crawling and the United Nations Parallel Corpus. BLOOM is designed to generate human-like text in 46 languages and 13 programming languages, and it is available for use and further development by researchers and institutions worldwide.

Region

international

Sector

academia

Scenario

pre-training

Start Date

2022

Location: Global

Open in new tab Source

CROZ RenEUwable

This project is an AI-driven application that provides users with sustainability recommendations related to their energy consumption based on a specific set of queries. It was trained on open climate and energy datasets. This project won the EU Datathon 2022 in the European Green Deal category and is currently in development.

Region

emea

Sector

civic_tech

Scenario

inference_and_insight_generation

Start Date

2022

Location: Europe and North America

Open in new tab Source

European Cancer Imaging Initiative

The European Cancer Imaging Initiative (part of Europe's Beating Cancer Plan) will bring together cancer-related resources and databases into a single platform for health practitioners and researchers to use. The initiative aims to improve access to information and advance cancer- and AI-related research.

Region

emea

Sector

public_sector

Scenario

open-ended_exploration

Start Date

2022

Location: Europe and North America

Open in new tab Source

PubMedBERT (Biomed-NLP or BiomedBERT)

Microsoft researchers created PubMedBERT, a generative AI model pretrained on biomedical text from PubMed and research from PubMed Central. This model is used to help answer questions related to biomedical tasks. Training the LLM from the start on medical literature (as opposed to adapting an existing general-purpose model) helped improve the quality of the output.

Region

north_america

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2021

Location: United States

Open in new tab Source

UrbanSim

UrbanSim is an open-source platform that uses generative AI to model urban growth and simulate land use, transportation, and demographic shifts. The platform integrates various datasets, including open-source data on land use, population demographics, and transportation infrastructure, to generate development scenarios that help city planners and researchers make informed decisions about urban growth. UrbanSim aids in visualizing the impacts of policy changes, transportation development, and housing strategies, offering a dynamic tool for sustainable urban planning. The project emphasizes the use of open research data from official sources to simulate realistic and adaptive urban environments.

Region

international

Sector

private_sector

Scenario

adaptation

Start Date

2021

Location: Global

Open in new tab Source

ChemBERTa

ChemBERTa is designed to analyze molecules, similar to how language models read and understand text. Its goal is to help practitioners in the drug discovery and materials science domains. The authors used a curated dataset of chemical molecules from PubChem, maintained by the National Institutes of Health.

Region

north_america

Sector

academia

Scenario

adaptation

Start Date

2020

Location: United States

Open in new tab Source

Climate TRACE

Climate TRACE is an initiative that uses AI with satellite and remote sensing data to create an inventory of global emissions. Climate TRACE has developed AI models that analyze Earth imagery data to identify sources of emissions. For example, the model can identify a power plant and detect plumes of smoke to determine the plant's greenhouse gas emissions. The AI algorithms are trained on a wide range of data sources, many of which are open, such as image data from the European Space Agency's Sentinel-2 satellites. Additionally, the models are trained on "ground truth" emissions data from on-the-ground sensors that can be used in tandem with the remote sensing data. Climate TRACE has created a worldwide emissions map with tools to view changes over time, different sectors, and different greenhouse gases.

Region

international

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2020

Location: Global

Open in new tab Source

Extopia

The EXTOPIA Project, funded by the Luxembourg Ministry of Digitalisation, uses AI to analyze aerial images. EXTOPIA uses machine learning algorithms to detect changes in geographic databases (e.g. new buildings) and document them in the output.

Region

emea

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2020

Location: Europe and North America

Open in new tab Source

KB-BERT

KB-BERT is a natural language processing model developed by the National Library of Sweden (KB). The library's long-term digitization projects result in a large amount of national textual data on which the model can be trained. This material includes digitized archival newspapers, official reports of the Swedish government, legal e-deposits (e-books and e-magazines), and Swedish Wikipedia. The National Library describes this model as a tool that can eventually support automated classification of materials, improved searchability, and better access for researchers. Additionally, RISE, Sweden's research institute, has discussed an initiative to use this textual data for a language model for the public sector. This model could assist with daily tasks for Swedish authorities, like document and email management, and power an AI question and answer platform for citizens.

Region

emea

Sector

public_sector, academia

Scenario

pre-training

Start Date

2020

Location: Sweden

Open in new tab Source

Sam Petrino

Sam Petrino is a Spanish-language, generative AI-enabled chatbot on WhatsApp and other web platforms for citizen engagement in San Pedro Garza García (Mexico). It uses government data to answer frequently asked questions and provides a tool for filing reports. During the COVID-19 pandemic, it also facilitated vaccine registrations.

Region

latin_america_and_the_caribbean

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2020

Location: Mexico

Open in new tab Source

BioBERT

Researchers developed BioBERT, a generative AI model adapted to answer queries about the biomedical domain. The model was trained on biomedical literature from PubMed along with other data sources. The model aims to support research and improve access to information in biomedicine.

Region

north_america

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2019

Location: United States

Open in new tab Source

Boti

Buenos Aires City's chatbot, Boti, uses generative AI to provide residents and visitors with municipal information and services related to Buenos Aires. Introduced in 2019, it was the first municipal bot on WhatsApp globally. Boti offers an array of services, from reporting civic issues to scheduling appointments and accessing cultural insights, and uses open government data to train the model. It supports multilingual interactions and facilitates mobility by offering information on parking, EcoBici stations, and subway statuses.

Region

latin_america_and_the_caribbean

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2019

Location: Argentina

Open in new tab Source

Data Foundry Scotland

The Data Foundry Scotland is an open-data delivery platform from the National Library of Scotland that makes its digital collections available in machine learning-ready formats. The data includes sources like digitized archival books, newspapers, and historical military lists. The platform provides metadata and quality assurance, with special attention given to cultural heritage data. The Data Foundry is used for various projects, including an upcoming text and data mining platform.

Region

emea

Sector

public_sector, academia

Scenario

pre-training

Start Date

2019

Location: Scotland

Open in new tab Source

FinBERT

FinBERT is a generative AI model built to analyze financial documents. The model was developed with financial texts from Reuters and the open-source "Financial PhraseBank" dataset (from open research), which allows the AI to dissect the meaning of different types of financial language.

Region

emea

Sector

academia

Scenario

adaptation

Start Date

2019

Location: United States

Open in new tab Source

Gretel AI

Gretel is a synthetic data platform that helps developers generate artificial datasets with the same characteristics as real data, improving AI models while preserving privacy. The platform offers tools for training generative AI models, validating data quality and privacy, and generating synthetic data. Previous clients include the Government of South Australia and the United States Department of Justice.

Region

international

Sector

private

Scenario

data_augmentation

Start Date

2019

Location: United States

Open in new tab Source

Virtual Singapore

Virtual Singapore is a dynamic 3D digital twin model that leverages generative AI to simulate and analyze urban development scenarios. The platform integrates various open data sources, including satellite imagery, sensor data, and social media inputs, to create a real-time representation of the city. Using generative AI, the system generates scenarios for urban planning, infrastructure development, and emergency response planning. Virtual Singapore helps city planners visualize the impact of policy decisions, environmental changes, and demographic trends. The platform is built on open research data and open data from various governmental and institutional sources, supporting data-driven decision-making for sustainable urban growth.

Region

apac

Sector

public_sector

Scenario

open-ended_exploration

Start Date

2019

Location: Singapore

Open in new tab Source

Mostly AI

MOSTLY AI has developed a platform that produces synthetic data for data scientists, analysts, and developers. The system uses AI models to generate artificial datasets, enabling users to create and manage data for purposes including training, test data creation, and analytics. The platform also features a generative AI chatbot that allows users to analyze synthetic data using search queries.

Region

north_america

Sector

private_sector

Scenario

data_augmentation

Start Date

2017

Location: United States

Open in new tab Source

ELMo

ELMo, or Embeddings from Language Models, is an open-source model created by a team of AI researchers at the University of Washington and the Allen Institute for Artificial Intelligence. ELMo supports Natural Language Processing (NLP) systems by converting words into numerical vectors, which are then used to train machine learning models. The original ELMo model was trained on the 1 Billion Word Benchmark, a publicly available training dataset of nearly 1 billion words for statistical language models developed by researchers at Google, the University of Edinburgh, and Cantab Research Lab.

Region

north_america

Sector

non-profit, academia

Scenario

pre-training

Start Date

2014

Location: United States

Open in new tab Source