Observatory of Examples of How Open Data and Generative AI Intersect

A growing observatory of examples of how open data from official sources and generative artificial intelligence (AI) are intersecting across domains and geographies.

Share your project for inclusion. We seek to learn from generative AI initiatives that use open government and research data across a Spectrum of Scenarios. More information on each scenario can be found in our report: A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI.

Name
Region
Sector
Scenario
Start Date
Asclepius
Asclepius is a large language model for the medical domain trained on synthetic clinical notes generated from public biomedical information. The team chose to experiment with synthetic data given the privacy concerns associated with using patient data in LLMs. The authors indicate that an LLM trained on synthetic data can produce outputs of similar quality to those of models trained on patient data.

Region

apac

Sector

academia

Scenario

data_augmentation

Start Date

2024

Location: South Korea

CensusGPT
CensusGPT is a natural language interface for United States census data, developed as part of the textSQL project. It allows users to ask questions about census information in plain English, which are then converted into SQL queries that retrieve the relevant data from a census database. The tool aims to let more people analyze population statistics, demographics, and other census-related information without needing technical SQL knowledge.

Region

north_america

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

Citymeetings.nyc

Citymeetings.nyc is an independent initiative that uses LLMs to synthesize information from New York City Council meetings. It uses data from Legistar, an online platform where the government posts meeting summaries and agendas.

Region

north_america

Sector

civic_tech

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

ClassifAI

The Data Science Campus of the United Kingdom's Office for National Statistics has developed ClassifAI, an experimental tool that uses large language models to organize text into categories (e.g. industry). It aims to improve upon existing classification methods by offering greater flexibility and potentially higher accuracy for tasks such as categorizing labor market survey responses. The code has been released as open-source. The developers note that further assessment is needed before potential use in official statistics production.

Region

international

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United Kingdom

Clay

Trained on satellite imagery and earth observation data, Clay is a generative AI foundation model designed to understand and analyze Earth's surface. It can generate mathematical representations of any location on Earth at any given time, which can be used for various tasks like creating land cover maps, detecting crop or burn scars, and tracking deforestation. The AI model is open source.

Region

international

Sector

non-profit

Scenario

open-ended_exploration

Start Date

2024

Location: United States

CroissantLLM
Developed by researchers at various European and American universities as well as private technology companies, CroissantLLM is a large language model that aims to support English-French language queries. The model is trained on both web-scraped data and open government data from France. This initiative aims to improve LLMs' capability to analyze non-English data.

Region

emea

Sector

academia, private_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: France

Data Commons
Data Commons is a platform developed by Google that aggregates and standardizes public datasets from various global sources. The initiative uses AI and large language models to provide a natural language interface, enabling users to query complex data without requiring technical expertise. Through partnerships with organizations like the UN, Indian Institute of Technology Madras, and Feeding America, Data Commons offers specialized data portals on topics such as Sustainable Development Goals, India-specific information, and U.S. food security, presenting information through visualizations and analysis tools.

Region

international

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

Data Provenance Explorer
The Data Provenance Explorer is an interactive tool developed as part of an audit of AI training datasets by MIT's Center for Constructive Communication. The tool allows researchers to explore information about datasets, including their origins, licenses, creators, and other metadata. This resource aims to enhance transparency around AI datasets and promote more informed use of datasets in AI research and development.

Region

north_america

Sector

academia

Scenario

adaptation

Start Date

2024

Location: United States

DC Compass

The DC Compass AI assistant is a generative AI chat interface that answers user queries based on datasets from Open Data DC. The interface can provide a summary of a dataset along with supporting visualizations, graphs, and maps. Currently, this project is a pilot program running a beta test open to the public. The team notes that the quality of the output depends on the quality and breadth of the data available from Open Data DC.

Region

north_america

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

DRG-LLaMA
Researchers from the Mayo Clinic and University of Illinois Urbana-Champaign developed DRG-LLaMA, a tool designed for healthcare professionals involved in hospital billing and coding. DRG-LLaMA is an advanced large language model fine-tuned on clinical notes to improve the assignment of Diagnosis-Related Groups (DRGs) in the United States inpatient payment system. The model improves DRG assignment by analyzing patient discharge summaries to predict DRGs and their components, achieving higher accuracy than previous methods used for this task.

Region

north_america

Sector

public_sector, academia

Scenario

adaptation

Start Date

2024

Location: United States

F13 - InnoLab-bw
The Baden-Württemberg Innovation Laboratory, part of the state government of Baden-Württemberg in Germany, created F13, a generative AI chatbot to support government administrative tasks. This chatbot helps personnel summarize documents and provides research support.

Region

emea

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Germany

GeoLLM-Engine

GeoLLM-Engine, developed by researchers at CoStrategist R&D Group and Microsoft Corporation, is an interface for interacting with geospatial data. The system includes a set of tools for analyzing maps and conducting spatial research. The development team is currently focused on improving the quality of outputs and refining the user interface. GeoLLM-Engine aims to serve professionals in fields that utilize geospatial analysis, such as urban planning and environmental monitoring.

Region

north_america

Sector

private_sector

Scenario

pre-training, open-ended_exploration

Start Date

2024

Location: United States

GoldCoin

GoldCoin is a large language model developed for the legal domain by researchers at the Department of Computer Science and Engineering, HKUST, in Hong Kong SAR, China. It specializes in detecting violations of HIPAA privacy rules based on specific queries. The model was trained using legal data from Harvard University's Caselaw Access Project, which offers public access to United States legal decisions. The research team suggests that GoldCoin could potentially be adapted to address other privacy laws in the future.

Region

apac

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2024

Location: China

GovTech DSAID

GovTech's Data Science and Artificial Intelligence Division (DSAID) has developed a system to assist in drafting parliamentary replies* using artificial intelligence. The project uses machine learning techniques to train language models on past parliamentary data, aiming to generate responses that match the style and accuracy of official replies. This tool is designed to help public servants in Singapore more efficiently prepare answers to parliamentary questions, while also exploring the broader potential of customized AI models for government applications. *Parliamentary replies are official answers given by government ministers or representatives to questions asked by members of parliament during legislative sessions.

Region

apac

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Singapore

I14Y Interoperability Platform

The I14Y Interoperability Platform is Switzerland's national data catalogue, designed to improve access to data between authorities, businesses, and citizens. It provides a centralized repository for data collections, application interfaces, and government services from different levels of government. The platform offers services such as a searchable catalogue, concept definitions, news updates, and a handbook to support users in navigating and using Switzerland's data infrastructure.

Region

emea

Sector

public_sector

Scenario

data_augmentation

Start Date

2024

Location: Switzerland

Intern-LM-Law
Researchers from Shanghai AI Laboratory, Nanjing University, Eastern Institute of Technology, Ningbo, and Saarland University, Saarland Informatics Campus developed a chatbot that can address queries and analyze documents within the legal domain in China. The team fine-tuned the LLM using legal data from the Chinese National Legal Database along with other data sources. The team has made both the model and data sources publicly available.

Region

apac

Sector

academia

Scenario

adaptation

Start Date

2024

Location: China

LLaMandement
Developed by the Government of France, LLaMandement aims to support administrative agents in analyzing and drafting summaries of legal bills developed in the French Parliament for other ministries and departments. To fine-tune the pretrained model, the team used data from SIGNALE, a platform used in the French government's lawmaking process that includes data from several ministries, such as the Ministry of Ecological Transition and Territorial Cohesion and the Ministry of Culture.

Region

emea

Sector

public_sector

Scenario

adaptation

Start Date

2024

Location: France

LLM on FHIR - A Project to Demystify Health Records
Researchers at Stanford University developed a mobile application that uses artificial intelligence to help patients better understand their health records and medical information. The mobile application, called LLM on FHIR, translates complex medical data into plain language and can answer patients' health questions based on their personal medical history. While the application showed promise in making health information more accessible, the study also revealed challenges such as occasional inconsistent responses, highlighting areas for future improvement.

Region

north_america

Sector

academia

Scenario

open-ended_exploration

Start Date

2024

Location: United States

LLMoin
LLMoin is a chatbot developed by the City of Hamburg that aims to provide administrative support to government personnel. The tool is based on the Luminous language model from Aleph Alpha, which was developed in Germany. LLMoin is currently a pilot program undergoing testing.

Region

emea

Sector

public_sector

Scenario

adaptation

Start Date

2024

Location: Germany

LuminLab

LuminLab is an online platform that employs generative AI to offer information on improving building energy efficiency. The model is trained using open data from the Energy Performance Certificate dataset provided by the Sustainable Energy Authority of Ireland. The developers are currently working on enhancements, including the integration of geospatial data to generate 3D images of various areas, aiming to expand the platform's capabilities and visual representations.

Region

emea

Sector

academia

Scenario

adaptation

Start Date

2024

Location: Ireland

Microsoft AI for Good - Damage Assessment Visualizer for Hurricane Beryl in Grenada

Microsoft and Planet collaborated with humanitarian organizations to analyze the impact of Hurricane Beryl in Grenada. This experimental tool uses the Microsoft AI for Good Damage Assessment Visualizer to analyze satellite images from Planet, estimating damage to buildings and structures on the island of Carriacou. The tool provides visual data to support frontline workers in disaster response and logistics.

Region

international

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: Grenada

NextGen NHTS Origin-Destination Data
Researchers at the National Transportation Research Center are using machine learning and AI techniques to analyze truck transportation patterns across the United States. The team is combining truck trip data with population and employment statistics using advanced algorithms to model and predict truck flows between regions. This effort is helping to uncover factors influencing truck transportation, such as the nonlinear relationship between distance and truck trips, providing valuable insights for transportation planning and investment decisions.

Region

north_america

Sector

public_sector

Scenario

data_augmentation

Start Date

2024

Location: United States

SaulLM-7B
Developed by researchers in Portugal and France, SaulLM-7B is a large language model that summarizes legal documents. The model is pretrained on legal texts from the United States, Europe, and Australia.

Region

emea

Sector

academia, private_sector

Scenario

pre-training

Start Date

2024

Location: France

Sidekick
Sidekick is an AI chatbot developed by mySidewalk that answers queries about public issues. The chatbot's responses are drawn from several official data sources, including the U.S. Census Bureau, the United States Department of Agriculture, and the Bureau of Labor Statistics. One of its goals is to improve access to data for non-technical audiences.

Region

north_america

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2024

Location: United States

StatBot.Swiss
StatBot.Swiss is a benchmark dataset developed by Swiss researchers that can be used to test generative AI models' ability to answer queries in English and German. The dataset includes data from the OpenData.Swiss government portal. Moving forward, the team is looking into expanding it to include other languages such as French and Italian.

Region

emea

Sector

public_sector, academia

Scenario

pre-training

Start Date

2024

Location: Switzerland

Synthetic Australian Healthcare Data Using Synthea
In January 2024, researchers from the Australian e-Health Research Centre (CSIRO) and Macquarie University launched a study on using synthetic data to enhance access to healthcare information. They adapted the Synthea tool, which typically uses US census data, to incorporate Australian demographic and hospital data, creating around 117,000 synthetic health records specific to Queensland. The team used these records to analyze disease patterns, noting that while the synthetic data provides valuable access, further real-world testing is needed to ensure it accurately represents the local context.

Region

apac

Sector

public_sector, academia

Scenario

data_augmentation

Start Date

2024

Location: Australia

Synthetic Data for Official Statistics
This guide assists National Statistical Offices (NSOs) in managing data access using synthetic data while maintaining confidentiality. It is suitable for statisticians and data managers in government agencies interested in implementing synthetic data. The guide covers the creation of synthetic data, addresses privacy risks, and provides practical tips for application, including a case where the Office for National Statistics Data Science Campus in the United Kingdom created a synthetic dataset using the U.S. Census Bureau's income data to test the 2021 Census model.

Region

north_america

Sector

public_sector

Scenario

data_augmentation

Start Date

2024

Location: United States

TitiBot
Developed by BORDE (a Mexican non-profit), TitiBot is a Spanish-language WhatsApp chatbot that helps improve access to voting records on legislative reforms. It uses data from Mexico's Congress of the Union (e.g. parliamentary voting records) from between 2018 and 2024 and can provide summaries of the data.

Region

latin_america_and_the_caribbean

Sector

non-profit

Scenario

inference_and_insight_generation

Start Date

2024

Location: Mexico

ChatDoctor
ChatDoctor is a generative AI chatbot that can answer queries in the medical domain. The model was trained on patient conversations from an online medical platform. It also uses data from MedlinePlus (a government health information website for medical practitioners) in addition to other data sources.

Region

north_america

Sector

academia

Scenario

adaptation

Start Date

2023

Location: United States

covLLM
Researchers at Stanford University developed covLLM, a generative AI tool to support doctors in understanding the most up-to-date COVID-19 research. The model was trained on the COVID-19 Open Research Dataset (CORD-19) and can provide summaries of research based on specific queries. Its objective is to address healthcare professionals' need to stay updated on fast-evolving topics.

Region

north_america

Sector

academia

Scenario

adaptation

Start Date

2023

Location: United States

Democratic Fine-Tuning with a Moral Graph
Democratic Fine-Tuning with a Moral Graph (DFTmg) is a new method for aligning AI language models with human values through large-scale public discussions. The project used a survey of 500 Americans' political views, which the team anonymized and made public as open research data on GitHub. This process aims to develop AI models that make better decisions by incorporating public input into the training process. This work was supported by OpenAI.

Region

north_america

Sector

non-profit, private_sector

Scenario

open-ended_exploration

Start Date

2023

Location: Global

ESGReveal
ESGReveal uses Retrieval Augmented Generation to adapt Environmental, Social, and Governance (ESG) data from corporate reports to help users find information from these reports when searching a database or the internet. The generative AI model was trained on ESG reports from 166 companies on the Hong Kong Stock Exchange.

Region

apac

Sector

private_sector, academia

Scenario

adaptation

Start Date

2023

Location: Hong Kong

Generating a Fully Synthetic Human Services Dataset
This report, produced by researchers at the Urban Institute in collaboration with Allegheny County partners, describes the process of creating a synthetic version of the county's 2021 human services dataset. The synthetic data aims to replicate statistical properties of the confidential data while protecting individual privacy, enabling wider access to detailed human services information. The document covers the data synthesis methodology, evaluation of data quality and privacy risks, and the challenges of balancing utility and confidentiality in synthetic administrative data.

Region

north_america

Sector

public_sector

Scenario

data_augmentation

Start Date

2023

Location: United States

Llema
Llema is a generative AI model fine-tuned for the mathematics domain. It was fine-tuned using the Proof-Pile-2 dataset, which combines scientific papers with other mathematics datasets. The researchers have provided public access to the models, dataset, and code to encourage future research on AI and mathematics.

Region

north_america

Sector

academia, non-profit

Scenario

adaptation

Start Date

2023

Location: United States

Med-PaLM2
Med-PaLM2 is a generative AI chatbot by Google Research which seeks to provide long-form written answers to medical questions. Med-PaLM2 is fine-tuned using “publicly available question-answering data and physician writing responses” including MedQA and MedMCQA among other datasets. Med-PaLM2 achieved 86.5% accuracy on United States Medical Licensing Examination questions.

Region

north_america

Sector

private_sector

Scenario

adaptation

Start Date

2023

Location: United States

MILDSum
Developed by researchers and legal practitioners from the Indian Institute of Technology Kharagpur, MILDSum is a research initiative that aims to bring together open data from the legal domain (i.e. case judgements) to create Hindi summaries of case judgements that can be used for training purposes.

Region

apac

Sector

academia

Scenario

pre-training

Start Date

2023

Location: India

NEPAccess
NEPAccess, developed by the University of Arizona, employs AI and data science to improve the National Environmental Policy Act (NEPA) environmental review process. The project uses generative AI to compile insights from previous projects and assist in drafting environmental impact assessments (EIAs) on specific topics. By integrating open data from federal agencies, NEPAccess provides public access to a centralized database of environmental reviews. The project was funded by the National Science Foundation (NSF) from 2021 to 2024 and is now seeking new funding to build new features into its platform.

Region

north_america

Sector

academia

Scenario

open-ended_exploration

Start Date

2023

Location: United States

OpenAssistant Conversations
Researchers have released a free, public collection of conversations called OpenAssistant Conversations to help improve AI language models. This dataset, created by over 13,500 volunteers worldwide, includes conversations in 35 languages along with quality ratings. By making this resource freely available, the researchers aim to democratize the development of more user-friendly and capable AI assistants across various fields.

Region

emea

Sector

academia

Scenario

pre-training, adaptation

Start Date

2023

Location: Germany

Parla
Parla is an AI interface in development at CityLab Berlin. It aims to enhance access to public administration data across the city for both government officials and the general public. Functioning as both a retrieval system and an analytical tool, Parla accesses over 10,000 public documents from city departments, systems, and formats to answer specific queries. However, due to challenges like poorly structured data and insufficient metadata, Parla sometimes generates inaccurate outputs. To address this, Parla ensures its responses include source references, improving transparency and accountability.

Region

emea

Sector

civic_tech

Scenario

open-ended_exploration

Start Date

2023

Location: Germany

Phi-2
Phi-2 is an open-source small language model with 2.7 billion parameters that demonstrates outstanding reasoning and language understanding capabilities. Due to its small size, researchers use it to study AI model interactions, enhance safety features, and customize it for specific applications. The training data contains a mix of curated web data and synthetic data made to focus on common sense reasoning and general knowledge.

Region

north_america

Sector

private_sector

Scenario

data_augmentation

Start Date

2023

Location: United States

SELENA+
SELENA+, developed by Synapxe (a department within the Government of Singapore focused on healthtech), the National University of Singapore and the Singapore National Eye Center, uses generative AI to detect diabetes-related eye conditions, specifically, diabetic eye disease, glaucoma, and age-related macular degeneration. The tool analyzes imagery from the National Eye Center. The team plans to expand this tool to cardiovascular diseases in the future.

Region

apac

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2023

Location: Singapore

StatGPT
To help improve the accessibility and usability of their open data platform, the International Monetary Fund (IMF) is prototyping a new generative AI tool that they are calling StatGPT. StatGPT will act as a user interface that processes natural language requests to find relevant datasets from the IMF’s repository. StatGPT will help users find indicators, visualize data in tables and charts, and generate Python code for analysis. The team is currently developing interface features and will then seek to integrate it in Excel.

Region

international

Sector

multilateral_sector

Scenario

inference_and_insight_generation

Start Date

2023

Location: Europe and North America

Statistics Canada
Statistics Canada conducted a pilot program on generating synthetic data for training purposes. The team created synthetic datasets from census data that includes sensitive information. These datasets were used in two hackathons, with the condition that they could not be publicly shared. Organizers highlighted that the synthetic datasets preserved the usefulness of the original data for analysis while minimizing the risk of revealing sensitive information. Hackathon participants successfully used these datasets for training purposes.

Region

north_america

Sector

public_sector

Scenario

data_augmentation

Start Date

2023

Location: Canada

Talk to the City
Talk to the City is an open-source tool that uses advanced AI to analyze and summarize qualitative data, particularly human opinions. It aims to improve collective decision-making and enhance public discourse around policy making by clustering similar arguments and creating summaries and visualizations. Talk to the City has been used in citizens' assemblies in Taiwan as of 2023.

Region

international

Sector

non-profit

Scenario

open-ended_exploration

Start Date

2023

Location: United States

TaxGPT
TaxGPT is an independently developed generative AI chatbot that answers tax-related queries based on information from the Canada Revenue Agency website. Its goal is to make tax information at the population level more understandable. It was updated in 2024 and is currently operational.

Region

north_america

Sector

civic_tech

Scenario

inference_and_insight_generation

Start Date

2023

Location: Canada

Tendios
Tendios is a Software-as-a-Service company from Spain that developed a chatbot to support public tender analysis and bidding. The chatbot is trained on government tender documents and aims to improve public procurement processes.

Region

emea

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2023

Location: Spain

The Harmonized Landsat and Sentinel-2 (HLS) Project
The Harmonized Landsat and Sentinel-2 (HLS) project by NASA aims to create a record of Earth's surface using images from multiple satellites. The HLS dataset combines data from four NASA satellites as well as US Geological Survey (USGS) sensors around the globe. The dataset was used to train NASA and IBM's watsonx.ai geospatial foundation model, which can be used to develop AI systems that provide maps and analytics about natural disasters and environmental changes. The latest dataset includes information from across the globe (except Antarctica). This work was a collaboration between NASA, the US Geological Survey (USGS), and several NASA research centers.

Region

north_america

Sector

public_sector

Scenario

pre-training

Start Date

2023

Location: United States

Wobby

Wobby is a generative AI-powered interface that can answer queries about specific open datasets and respond with summaries and visualizations of those datasets. The platform is focused primarily on democratizing access to open government data, and currently hosts datasets from organizations like Statbel (Belgium's national statistical office), Statistics Netherlands and Eurostat, as well as data from intergovernmental organizations like the World Bank. Wobby's latest update allows for automatic data updates and real-time analysis based on current information.

Region

emea

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2023

Location: Belgium

AgricultureBERT
AgricultureBERT is a generative AI model for the agriculture domain that was developed with data from the United States National Agricultural Library. This model is used to answer questions related to agricultural knowledge such as crop growing best practices or fertilization techniques in different climates. The intention is to improve access to agricultural information and advance research in the field.

Region

north_america

Sector

civic_tech

Scenario

adaptation

Start Date

2022

Location: United States

BioGPT
BioGPT is a generative AI model that can answer queries about biomedicine. BioGPT was trained using biomedical literature from PubMed. This tool was developed by representatives of Microsoft Research and Peking University.

Region

north_america

Sector

private_sector, academia

Scenario

inference_and_insight_generation

Start Date

2022

Location: United States

CROZ RenEUwable
This project is an AI-driven application that gives users sustainability recommendations about their energy consumption based on a specific set of queries. It was trained on open climate and energy datasets. This project won the EU Datathon 2022 in the European Green Deal category and is currently in development.

Region

emea

Sector

civic_tech

Scenario

inference_and_insight_generation

Start Date

2022

Location: Europe and North America

European Cancer Imaging Initiative
The European Cancer Imaging Initiative (part of Europe's Beating Cancer Plan) will bring together cancer-related resources and databases into a single platform for health practitioners and researchers to use. The initiative aims to improve access to information and advance cancer- and AI-related research.

Region

emea

Sector

public_sector

Scenario

open-ended_exploration

Start Date

2022

Location: Europe and North America

PubMedBERT (Biomed-NLP or BiomedBERT)
Microsoft researchers created PubMedBERT, a generative AI model pretrained on biomedical text from PubMed and research from PubMed Central. This model is used to help answer questions related to biomedical tasks. Training the LLM on medical literature (as opposed to adapting the model) helped improve the quality of the output.

Region

north_america

Sector

private_sector

Scenario

inference_and_insight_generation

Start Date

2021

Location: United States

ChemBERTa
ChemBERTa is designed to analyze molecules, similar to how language models read and understand text. Its goal is to help practitioners within the drug discovery and materials science domains. The authors utilized a curated dataset of chemical molecules from PubChem, maintained by the National Institutes of Health.

Region

north_america

Sector

academia

Scenario

adaptation

Start Date

2020

Location: United States

Extopia
The EXTOPIA Project, funded by the Luxembourg Ministry of Digitalisation, uses AI to analyze aerial images. EXTOPIA uses machine learning algorithms to detect changes in geographic databases (e.g. new buildings) and document them in the output.

Region

emea

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2020

Location: Europe and North America

Sam Petrino
Sam Petrino is a Spanish-language, generative AI-enabled chatbot on WhatsApp and other web platforms for citizen engagement in San Pedro Garza García (Mexico). It uses government data to answer frequently asked questions and provides a tool for making reports. During the COVID-19 pandemic, it also facilitated vaccine registrations.

Region

latin_america_and_the_caribbean

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2020

Location: Mexico

BioBERT
Researchers developed BioBERT, a generative AI model adapted to answer queries about the biomedical domain. The model was trained on biomedical literature from PubMed along with other data sources. The model aims to support research and improve access to information in biomedicine.

Region

north_america

Sector

academia

Scenario

inference_and_insight_generation

Start Date

2019

Location: United States

Boti
Buenos Aires City's chatbot, Boti, uses generative AI to provide residents and visitors with municipal information and services related to Buenos Aires. Introduced in 2019, it was the first municipal bot on WhatsApp globally. Boti offers an array of services, from reporting civic issues to scheduling appointments and accessing cultural insights, using open government data to train the model. It supports multilingual interactions and facilitates mobility by offering information on parking, EcoBici stations and subway statuses.

Region

latin_america_and_the_caribbean

Sector

public_sector

Scenario

inference_and_insight_generation

Start Date

2019

Location: Argentina

FinBERT
FinBERT is a generative AI model built to analyze financial documents. The model was developed with financial texts from Reuters and the open-source "Financial Phrase Bank" dataset (from open research) which allows the AI to dissect the meaning of different types of financial language.

Region

emea

Sector

academia

Scenario

adaptation

Start Date

2019

Location: United States

Gretel AI
Gretel is a synthetic data platform that helps developers generate artificial datasets with the same characteristics as real data, improving AI models while preserving privacy. The platform offers tools for training generative AI models, validating data quality and privacy, and generating synthetic data. Previous clients include the Government of South Australia and the United States Department of Justice.

Region

international

Sector

private_sector

Scenario

data_augmentation

Start Date

2019

Location: United States

Mostly AI
MOSTLY AI has developed a platform that produces synthetic data for data scientists, analysts, and developers. The system uses AI models to generate artificial datasets, enabling users to create and manage data for purposes including training, test data creation, and analytics. The platform also features a generative AI chatbot that allows users to analyze synthetic data using search queries.

Region

north_america

Sector

private_sector

Scenario

data_augmentation

Start Date

2017

Location: United States

ELMo
ELMo, or Embeddings from Language Models, is an open-source model created by a team of AI researchers at the University of Washington and the Allen Institute for Artificial Intelligence. ELMo supports Natural Language Processing (NLP) systems by converting words into numbers, which are then used to train machine learning models. The original ELMo model was trained on the 1 Billion Word Benchmark, a publicly available training dataset of nearly 1 billion words for statistical language models, developed by researchers at Google, the University of Edinburgh and Cantab Research Lab.

Region

north_america

Sector

non-profit, academia

Scenario

pre-training

Start Date

2014

Location: United States