Observatory of Examples of How Open Data and Generative AI Intersect
A growing observatory of examples of how open data from official sources and generative artificial intelligence (AI) are intersecting across domains and geographies.
Asclepius is a large language model for the medical domain, trained on synthetic clinical notes generated from publicly available biomedical information. The team chose to experiment with synthetic data given the privacy concerns associated with using patient data in LLMs. The authors report that the LLM trained on synthetic notes produces outputs of similar quality to models trained on real patient data.
CensusGPT is a natural language interface for United States census data, developed as part of the textSQL project. It allows users to ask questions about census information in plain English; the questions are converted into SQL queries that retrieve the relevant data from a census database. The tool aims to let more people analyze population statistics, demographics, and other census-related information without needing to know SQL.
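To illustrate the pattern CensusGPT relies on, the sketch below shows text-to-SQL in miniature: the model sees the table schema and a plain-English question and returns SQL, which is then executed against the database. The schema, the call_llm stub, and its canned answer are hypothetical stand-ins for a real model call, not textSQL's actual code.

```python
# Minimal text-to-SQL sketch. `call_llm` is a hypothetical stand-in for an
# LLM API call; the schema and canned query are illustrative only.
import sqlite3

SCHEMA = "CREATE TABLE census (county TEXT, state TEXT, population INTEGER);"

def call_llm(prompt: str) -> str:
    # Placeholder: a real system would send `prompt` to an LLM here.
    return ("SELECT county, population FROM census "
            "WHERE state = 'CA' ORDER BY population DESC LIMIT 5;")

def answer(question: str, conn: sqlite3.Connection) -> list:
    prompt = (f"Given this SQLite schema:\n{SCHEMA}\n"
              f"Write one SQL query that answers: {question}\n"
              "Return only the SQL.")
    sql = call_llm(prompt)               # natural language -> SQL
    return conn.execute(sql).fetchall()  # SQL -> data

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute("INSERT INTO census VALUES ('Los Angeles', 'CA', 10014009)")
print(answer("Which California counties have the most people?", conn))
```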
Citymeetings.nyc is an independent initiative that uses LLMs to synthesize information from New York City Council meetings. It uses data from Legistar, an online platform where the government posts meeting summaries and agendas.
The Data Science Campus of the United Kingdom's Office for National Statistics has developed ClassifAI, an experimental tool that uses large language models to organize text into categories (e.g. industry). It aims to improve upon existing classification methods by offering greater flexibility and potentially higher accuracy for tasks such as categorizing labor market survey responses. The code has been released as open-source. The developers note that further assessment is needed before potential use in official statistics production.
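As a rough illustration of LLM-based classification of the kind ClassifAI explores, a model can be constrained to a fixed label set, with anything outside that set routed to human review. The label list and call_llm stub below are hypothetical; this is not the ONS code.

```python
# Sketch of constrained LLM classification; `call_llm` is a placeholder.
CATEGORIES = ["Agriculture", "Construction", "Manufacturing", "Retail"]

def call_llm(prompt: str) -> str:
    return "Construction"  # a real system would call an LLM API here

def classify(response: str) -> str:
    prompt = ("Assign this survey answer to exactly one category from "
              f"{CATEGORIES}.\nAnswer: {response}\nCategory:")
    label = call_llm(prompt).strip()
    # Guard against the model inventing labels outside the set.
    return label if label in CATEGORIES else "unknown (route to human review)"

print(classify("I lay bricks and plaster walls on building sites"))
```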
Trained on satellite imagery and earth observation data, Clay is a generative AI foundation model designed to understand and analyze Earth's surface. It can generate mathematical representations of any location on Earth at any given time, which can be used for tasks like creating land cover maps, detecting crops or burn scars, and tracking deforestation. The AI model is open source.
Developed by researchers at various European and American universities as well as private technology companies, CroissantLLM is a large language model that supports queries in both English and French. The model is trained on web-scraped data as well as open government data from France. The initiative aims to improve LLMs' ability to analyze non-English data.
Data Commons is a platform developed by Google that aggregates and standardizes public datasets from various global sources. The initiative uses AI and large language models to provide a natural language interface, enabling users to query complex data without requiring technical expertise. Through partnerships with organizations like the UN, Indian Institute of Technology Madras, and Feeding America, Data Commons offers specialized data portals on topics such as Sustainable Development Goals, India-specific information, and U.S. food security, presenting information through visualizations and analysis tools.
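Data Commons also publishes client libraries for programmatic access. The sketch below assumes the v1 Python client (pip install datacommons) and its get_stat_value helper; because the API has continued to evolve, the function name, key requirements, and identifiers are assumptions to verify against the current documentation.

```python
# Assumed usage of the Data Commons v1 Python client; verify against the
# current API docs before relying on it.
import datacommons as dc

# dc.set_api_key("YOUR_KEY")  # may be required, depending on the endpoint

# "geoId/06" is the identifier (DCID) for California; "Count_Person" is the
# statistical variable for total population.
population = dc.get_stat_value("geoId/06", "Count_Person")
print(f"California population: {population}")
```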
The Data Provenance Explorer is an interactive tool developed as part of an audit of AI training datasets by MIT's Center for Constructive Communication. The tool allows researchers to explore information about datasets, including their origins, licenses, creators, and other metadata. It aims to enhance transparency around AI datasets and promote more informed use of datasets in AI research and development.
The DC Compass AI assistant is a generative AI chat interface that answers user queries based on datasets from Open Data DC. The interface can provide a summary of a dataset along with supporting visualizations such as graphs and maps. The project is currently a pilot program running a public beta test. The team notes that the quality of the output depends on the quality of the data from Open Data DC as well as the breadth of data included.
Researchers from the Mayo Clinic and University of Illinois Urbana-Champaign developed DRG-LLaMA, a tool designed for healthcare professionals involved in hospital billing and coding. DRG-LLaMA is an advanced large language model fine-tuned on clinical notes to improve the assignment of Diagnosis-Related Groups (DRGs) in the United States inpatient payment system. The model analyzes patient discharge summaries to predict DRGs and their components, achieving higher accuracy than previous methods used for this task.
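The underlying recipe is single-label text classification over discharge summaries. Below is a generic sketch using the Hugging Face transformers library; the small stand-in model and the label count are placeholders (DRG-LLaMA itself fine-tunes LLaMA, and the exact number of DRG labels should be checked against the paper).

```python
# Generic sketch of DRG prediction as sequence classification, not the
# DRG-LLaMA training code. Model name and num_labels are placeholders.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "distilbert-base-uncased"  # stand-in; DRG-LLaMA fine-tunes LLaMA
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=738  # assumed count of MS-DRG codes; verify
)

inputs = tokenizer(
    "Discharge summary: 68-year-old admitted with pneumonia...",
    truncation=True, return_tensors="pt",
)
logits = model(**inputs).logits          # one score per candidate DRG
predicted_drg = logits.argmax(dim=-1).item()
print(predicted_drg)
```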
The Baden-Württemberg Innovation Laboratory within the Government of Germany created F13, a generative AI chatbot that supports government administrative tasks. The chatbot helps personnel summarize documents and provides research support.
GeoLLM-Engine, developed by researchers at CoStrategist R&D Group and Microsoft Corporation, is an interface for interacting with geospatial data. The system includes a set of tools for analyzing maps and conducting spatial research. The development team is currently focused on improving the quality of outputs and refining the user interface. GeoLLM-Engine aims to serve professionals in fields that utilize geospatial analysis, such as urban planning and environmental monitoring.
GoldCoin is a large language model developed for the legal domain by researchers at the Department of Computer Science and Engineering, HKUST, in Hong Kong SAR, China. It specializes in detecting violations of HIPAA privacy rules based on specific queries. The model was trained using legal data from Harvard University's Caselaw Access Project, which offers public access to United States legal decisions. The research team suggests that GoldCoin could potentially be adapted to address other privacy laws in the future.
GovTech's Data Science and Artificial Intelligence Division (DSAID) has developed a system to assist in drafting parliamentary replies* using artificial intelligence. The project uses machine learning techniques to train language models on past parliamentary data, aiming to generate responses that match the style and accuracy of official replies. This tool is designed to help public servants in Singapore more efficiently prepare answers to parliamentary questions, while also exploring the broader potential of customized AI models for government applications. *Parliamentary replies are official answers given by government ministers or representatives to questions asked by members of parliament during legislative sessions.
The I14Y Interoperability Platform is Switzerland's national data catalogue, designed to improve access to data between authorities, businesses, and citizens. It provides a centralized repository for data collections, application interfaces, and government services from different levels of government. The platform offers services such as a searchable catalogue, concept definitions, news updates, and a handbook to support users in navigating and using Switzerland's data infrastructure.
Researchers from Shanghai AI Laboratory, Nanjing University, Eastern Institute of Technology, Ningbo, and Saarland University, Saarland Informatics Campus developed a chatbot that can address queries and analyze documents within the legal domain in China. The team fine-tuned the LLM using legal data from the Chinese National Legal Database along with other data sources. The team has made both the model and data sources publicly available.
Developed by the Government of France, LLaMandement aims to support administrative agents in analyzing and drafting summaries of legal bills developed in the French Parliament for other ministries and departments. The team fine-tuned the pretrained model using data from SIGNALE, a platform used in the French government's lawmaking process that includes data from several ministries, such as the Ministry of Ecological Transition and Territorial Cohesion and the Ministry of Culture.
LLM on FHIR - A Project to Demystify Health Records
Researchers at Stanford University developed a mobile application that uses artificial intelligence to help patients better understand their health records and medical information. The application, called LLM on FHIR, translates complex medical data into plain language and can answer patients' health questions based on their personal medical history. While the application showed promise in making health information more accessible, the study also revealed challenges, such as occasionally inconsistent responses, highlighting areas for future improvement.
LLMoin is a chatbot developed by the City of Hamburg that aims to provide administrative support to government personnel. The tool is based on the Luminous language model from the German company Aleph Alpha. LLMoin is currently a pilot program undergoing testing.
LuminLab is an online platform that employs generative AI to offer information on improving building energy efficiency. The model is trained using open data from the Energy Performance Certificate dataset provided by the Sustainable Energy Authority of Ireland. The developers are currently working on enhancements, including the integration of geospatial data to generate 3D images of various areas, aiming to expand the platform's capabilities and visual representations.
Microsoft AI for Good - Damage Assessment Visualizer for Hurricane Beryl in Grenada
Microsoft and Planet collaborated with humanitarian organizations to analyze the impact of Hurricane Beryl in Grenada. This experimental tool uses the Microsoft AI for Good Damage Assessment Visualizer to analyze satellite images from Planet, estimating damage to buildings and structures on the island of Carriacou. The tool provides visual data to support frontline workers in disaster response and logistics.
Researchers at the National Transportation Research Center are using machine learning and AI techniques to analyze truck transportation patterns across the United States. The team is combining truck trip data with population and employment statistics using advanced algorithms to model and predict truck flows between regions. This effort is helping to uncover factors influencing truck transportation, such as the nonlinear relationship between distance and truck trips, providing valuable insights for transportation planning and investment decisions.
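As a toy illustration of fitting such a nonlinear relationship, the sketch below models trips between regions as a function of distance, population, and employment using scikit-learn's gradient boosting; the data is fabricated and the setup is illustrative, not the research team's actual pipeline.

```python
# Toy nonlinear trip-flow model on fabricated data; illustrative only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
distance = rng.uniform(10, 2000, n)     # miles between region pairs
population = rng.uniform(1e4, 5e6, n)   # destination population
employment = rng.uniform(1e3, 1e6, n)   # destination employment
# Fabricated ground truth: flows decay nonlinearly with distance.
trips = population * employment * np.exp(-distance / 400) / 1e9
trips += rng.normal(0, trips.std() * 0.1, n)

X = np.column_stack([distance, population, employment])
model = GradientBoostingRegressor().fit(X, trips)
print(model.predict([[100, 1e6, 2e5]]))  # predicted flow for one pair
```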
Developed by researchers in Portugal and France, SaulLM-7B is a large language model that summarizes legal documents. The model is pretrained on legal texts from the United States, Europe, and Australia.
Sidekick is an AI chatbot developed by mySidewalk that answers queries about public issues. The chatbot's responses are drawn from several official data sources, including the U.S. Census Bureau, the United States Department of Agriculture, and the Bureau of Labor Statistics. Among its goals, it seeks to improve access to data for non-technical audiences.
StatBot.Swiss is a benchmark dataset developed by Swiss researchers that can be used to test generative AI models' ability to answer queries in English and German. The dataset includes data from the OpenData.Swiss government portal. Moving forward, the team is looking to expand it to other languages such as French and Italian.
Synthetic Australian Healthcare Data Using Synthea
In January 2024, researchers from the Australian e-Health Research Centre (CSIRO) and Macquarie University launched a study on using synthetic data to enhance access to healthcare information. They adapted the Synthea tool, which typically uses US census data, to incorporate Australian demographic and hospital data, creating around 117,000 synthetic health records specific to Queensland. The team used these records to analyze disease patterns, noting that while the synthetic data provides valuable access, further real-world testing is needed to ensure it accurately represents the local context.
This guide assists National Statistical Offices (NSOs) in managing data access using synthetic data while maintaining confidentiality. It is suitable for statisticians and data managers in government agencies interested in implementing synthetic data. The guide covers the creation of synthetic data, addresses privacy risks, and provides practical tips for application, including a case in which the United Kingdom's Office for National Statistics Data Science Campus created a synthetic dataset using the U.S. Census Bureau's income data to test the 2021 Census model.
Developed by BORDE (a Mexican non-profit), TitiBot is a Spanish-language WhatsApp chatbot that helps improve access to voting records on legislative reforms. It uses data from Mexico's Congress of the Union (e.g. parliamentary voting records) from between 2018 and 2024 and can provide summaries of the data.
ChatDoctor is a generative AI chatbot that can answer queries in the medical domain. The model was trained on patient conversations from an online medical platform. It also uses data from MedlinePlus (a government health information website maintained by the United States National Library of Medicine) in addition to other data sources.
Researchers at Stanford University developed covLLM, a generative AI tool to support doctors in understanding the most up-to-date COVID-19 research. The model was trained on the COVID-19 Open Research Dataset (CORD-19) and can provide summaries of research based on specific queries. Its objective is to address healthcare professionals' need to stay current on fast-evolving topics.
Democratic Fine-Tuning with a Moral Graph (DFTmg) is a new method for aligning AI language models with human values through large-scale public discussions. The project surveyed 500 Americans about their political views, anonymized the responses, and made them public as open research data on GitHub. The process aims to develop AI models that make better decisions by incorporating public input into the training process. This work was supported by OpenAI.
ESGReveal uses retrieval-augmented generation (RAG) to extract Environmental, Social, and Governance (ESG) data from corporate reports, helping users find information from these reports when searching a database or the internet. The generative AI model was developed using ESG reports from 166 companies listed on the Hong Kong Stock Exchange.
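The core RAG loop is easy to sketch: retrieve the report passages most similar to the query, then pass them to a language model as context. In the illustration below, TF-IDF stands in for a dense retriever and call_llm is a hypothetical stub; it shows the pattern, not ESGReveal's implementation.

```python
# Minimal retrieval-augmented generation sketch; illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "Scope 1 emissions fell 12% year on year to 4.1 kt CO2e.",
    "The board comprises 40% independent directors.",
    "Water withdrawal intensity decreased by 8% in 2022.",
]

def retrieve(query: str, k: int = 2) -> list:
    # Rank passages by similarity to the query; TF-IDF stands in for
    # the dense embeddings a production retriever would use.
    vec = TfidfVectorizer().fit(passages + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(passages))[0]
    return [passages[i] for i in sims.argsort()[::-1][:k]]

def call_llm(prompt: str) -> str:
    return "Scope 1 emissions were 4.1 kt CO2e, down 12%."  # placeholder

query = "What were the company's Scope 1 emissions?"
context = "\n".join(retrieve(query))
print(call_llm(f"Context:\n{context}\n\nQuestion: {query}"))
```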
Generating a Fully Synthetic Human Services Dataset
This report, produced by researchers at the Urban Institute in collaboration with Allegheny County partners, describes the process of creating a synthetic version of the county's 2021 human services dataset. The synthetic data aims to replicate statistical properties of the confidential data while protecting individual privacy, enabling wider access to detailed human services information. The document covers the data synthesis methodology, evaluation of data quality and privacy risks, and the challenges of balancing utility and confidentiality in synthetic administrative data.
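One deliberately simple utility check of the kind such evaluations include is comparing marginal distributions between the confidential and synthetic files. The sketch below uses toy data, not the Allegheny County dataset or the Urban Institute's evaluation code.

```python
# Toy marginal-distribution utility check for synthetic data.
import pandas as pd

real = pd.DataFrame({"service": ["housing", "housing", "mental_health", "aging"]})
synth = pd.DataFrame({"service": ["housing", "mental_health", "mental_health", "aging"]})

comparison = pd.concat(
    [
        real["service"].value_counts(normalize=True).rename("real"),
        synth["service"].value_counts(normalize=True).rename("synthetic"),
    ],
    axis=1,
).fillna(0)
print(comparison)  # close marginals suggest the synthesis preserved them
```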
Llemma is a generative AI model for the mathematics domain, fine-tuned on the Proof-Pile-2 dataset, which combines scientific papers with other mathematics datasets. The researchers have provided public access to the models, dataset, and code to encourage future research on AI and mathematics.
Med-PaLM 2 is a generative AI chatbot by Google Research that seeks to provide long-form written answers to medical questions. Med-PaLM 2 is fine-tuned using "publicly available question-answering data and physician-written responses," including the MedQA and MedMCQA datasets, among others. Med-PaLM 2 achieved 86.5% accuracy on United States Medical Licensing Examination questions.
Developed by researchers and legal practitioners from the Indian Institute of Technology Kharagpur, MILDSum is a research initiative that brings together open data from the legal domain (i.e. case judgements) to create Hindi summaries of those judgements that can be used for training purposes.
NEPAccess, developed by the University of Arizona, employs AI and data science to improve the environmental review process under the National Environmental Policy Act (NEPA). The project uses generative AI to compile insights from previous projects and assist in drafting environmental impact assessments (EIAs) on specific topics. By integrating open data from federal agencies, NEPAccess provides public access to a centralized database of environmental reviews. The project was funded by the National Science Foundation (NSF) from 2021 to 2024 and is now seeking new funding to build additional features into its platform.
Researchers have released a free, public collection of conversations called OpenAssistant Conversations to help improve AI language models. This dataset, created by over 13,500 volunteers worldwide, includes conversations in 35 languages along with quality ratings. By making this resource freely available, the researchers aim to democratize the development of more user-friendly and capable AI assistants across various fields.
Parla is an AI interface in development at CityLab Berlin. It aims to enhance access to public administration data across the city for both government officials and the general public. Functioning as both a retrieval system and an analytical tool, Parla draws on over 10,000 public documents from city departments, spanning different systems and formats, to answer specific queries. However, due to challenges like poorly structured data and insufficient metadata, Parla sometimes generates inaccurate outputs. To address this, Parla includes source references in its responses, improving transparency and accountability.
Phi-2 is an open-source small language model with 2.7 billion parameters that demonstrates outstanding reasoning and language understanding capabilities. Due to its small size, researchers use it to study AI model interactions, enhance safety features, and customize it for specific applications. The training data contains a mix of curated web data and synthetic data designed to focus on common-sense reasoning and general knowledge.
SELENA+, developed by Synapxe (a department within the Government of Singapore focused on healthtech), the National University of Singapore and the Singapore National Eye Center, uses generative AI to detect diabetes-related eye conditions, specifically, diabetic eye disease, glaucoma, and age-related macular degeneration. The tool analyzes imagery from the National Eye Center. The team plans to expand this tool to cardiovascular diseases in the future.
To help improve the accessibility and usability of its open data platform, the International Monetary Fund (IMF) is prototyping a new generative AI tool called StatGPT. StatGPT acts as a user interface that processes natural language requests to find relevant datasets in the IMF's repository. It will help users find indicators, visualize data in tables and charts, and generate Python code for analysis. The team is currently developing interface features and will then seek to integrate the tool into Excel.
Statistics Canada conducted a pilot program on generating synthetic data for training purposes. The team created synthetic datasets from census data that includes sensitive information. These datasets were used in two hackathons, with the condition that they could not be publicly shared. Organizers highlighted that the synthetic datasets preserved the usefulness of the original data for analysis while minimizing the risk of revealing sensitive information. Hackathon participants successfully used these datasets for training purposes.
Talk to the City is an open-source tool that uses advanced AI to analyze and summarize qualitative data, particularly human opinions. It aims to improve collective decision-making and enhance public discourse around policymaking by clustering similar arguments and creating summaries and visualizations. Talk to the City has been used in citizens' assemblies in Taiwan as of 2023.
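The cluster-then-summarize pattern can be sketched in a few lines: embed each opinion, group similar ones, and hand each group to an LLM for summarization. The checkpoint named below is a commonly used sentence-embedding model assumed for illustration; this is not Talk to the City's actual code.

```python
# Sketch of clustering opinions by embedding similarity; illustrative only.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

opinions = [
    "Bike lanes make my commute safer.",
    "We need more protected cycling routes.",
    "Parking is impossible to find downtown.",
    "Downtown needs more parking garages.",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(opinions)
labels = KMeans(n_clusters=2).fit_predict(embeddings)

for cluster in sorted(set(labels)):
    members = [o for o, lab in zip(opinions, labels) if lab == cluster]
    # In the full pipeline, an LLM would summarize `members` here.
    print(cluster, members)
```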
TaxGPT is an independently developed generative AI chatbot that answers tax-related queries based on information from the Canada Revenue Agency website. Its goal is to make tax information more understandable to the general public. It was updated in 2024 and is currently operational.
Tendios is a Software-as-a-Service company from Spain that developed a chatbot to support public tender analysis and bidding. The chatbot is trained on government tender documents and aims to improve public procurement processes.
The Harmonized Landsat and Sentinel-2 (HLS) Project
The Harmonized Landsat and Sentinel-2 (HLS) project by NASA aims to create a consistent record of Earth's surface using images from multiple satellites. The HLS dataset combines data from the NASA/USGS Landsat satellites and the European Space Agency's Sentinel-2 satellites. The dataset was used to train NASA and IBM's watsonx.ai geospatial foundation model, which can be used to develop AI systems that provide maps and analytics about natural disasters and environmental changes. The latest dataset covers the entire globe except Antarctica. The work was a collaboration between NASA, the US Geological Survey (USGS), and several NASA research centers.
Wobby is a generative AI-powered interface that can answer queries about specific open datasets and produce summaries and visualizations of those datasets in response. The platform focuses primarily on democratizing access to open government data and currently hosts datasets from organizations like Statbel (Belgium's national statistical office), Statistics Netherlands, and Eurostat, as well as data from intergovernmental organizations like the World Bank. Wobby's latest update allows for automatic data updates and real-time analysis based on current information.
AgricultureBERT is a generative AI model for the agriculture domain that was developed with data from the United States National Agricultural Library. This model is used to answer questions related to agricultural knowledge such as crop growing best practices or fertilization techniques in different climates. The intention is to improve access to agricultural information and advance research in the field.
BioGPT is a generative AI model that can answer queries about biomedicine. BioGPT was trained using biomedical literature from PubMed. This tool was developed by representatives of Microsoft Research and Peking University.
This project is an AI-driven application that provides users with sustainability recommendations related to their energy consumption, based on a specific set of queries. It was trained on open climate and energy datasets. The project won the EU Datathon 2022 in the European Green Deal category and is currently in development.
The European Cancer Imaging Initiative (part of Europe's Beating Cancer Plan) will bring together cancer-related resources and databases into a single platform for health practitioners and researchers to use. The initiative aims to improve access to information and advance cancer- and AI-related research.
Microsoft researchers created PubMedBERT, a generative AI model pretrained on biomedical text from PubMed and research from PubMed Central. The model is used to help answer questions related to biomedical tasks. Pretraining the model from scratch on medical literature (as opposed to adapting a general-domain model) helped improve the quality of its output.
ChemBERTa is designed to analyze molecules in the way language models read and understand text. Its goal is to help practitioners in the drug discovery and materials science domains. The authors utilized a curated dataset of chemical structures from PubChem, which is maintained by the National Institutes of Health.
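In practice this means feeding SMILES strings (text encodings of molecular structure) through a transformer. The sketch below assumes one published ChemBERTa checkpoint on the Hugging Face Hub; the model ID is an assumption to verify, and the mean-pooling step is a common convention rather than the authors' prescribed usage.

```python
# Encoding a molecule (as a SMILES string) with a ChemBERTa-style model.
# The checkpoint ID is assumed; verify it on the Hugging Face Hub.
from transformers import AutoModel, AutoTokenizer

MODEL = "seyonec/ChemBERTa-zinc-base-v1"  # assumed checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
inputs = tokenizer(smiles, return_tensors="pt")
# Mean-pool token states into one molecule vector for downstream tasks
# such as property prediction.
embedding = model(**inputs).last_hidden_state.mean(dim=1)
print(embedding.shape)
```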
The EXTOPIA Project, funded by the Luxembourg Ministry of Digitalisation, uses AI to analyze aerial images. EXTOPIA uses machine learning algorithms to detect changes in geographic databases (e.g. new buildings) and document them in the output.
Sam Petrino is a Spanish-language, generative AI-enabled chatbot on WhatsApp and other web platforms for citizen engagement in San Pedro Garza García, Mexico. It uses government data to answer frequently asked questions and provides a tool for filing reports. During the COVID-19 pandemic, it also facilitated vaccine registrations.
Researchers developed BioBERT, a generative AI model adapted to answer queries about the biomedical domain. The model was trained on biomedical literature from PubMed along with other data sources. The model aims to support research and improve access to information in biomedicine.
Buenos Aires City's chatbot, Boti, uses generative AI to provide residents and visitors with municipal information and services related to Buenos Aires. Introduced in 2019, it was the first municipal bot on WhatsApp globally. Boti offers an array of services, from reporting civic issues to scheduling appointments and accessing cultural information, using open government data to train the model. It supports multilingual interactions and facilitates mobility by offering information on parking, EcoBici stations, and subway statuses.
FinBERT is a generative AI model built to analyze financial documents. The model was developed with financial texts from Reuters and the open-source "Financial PhraseBank" dataset (from open research), enabling the model to interpret different types of financial language.
Gretel is a synthetic data platform that helps developers generate artificial datasets with the same characteristics as real data, improving AI models while preserving privacy. The platform offers tools for training generative AI models, validating data quality and privacy, and generating synthetic data. Previous clients include the Government of South Australia and the United States Department of Justice.
MOSTLY AI has developed a platform that produces synthetic data for data scientists, analysts, and developers. The system uses AI models to generate artificial datasets, enabling users to create and manage data for purposes including training, test data creation, and analytics. The platform also features a generative AI chatbot that allows users to analyze synthetic data using search queries.
ELMo, or the Embeddings from Language Models, is an open source model created by a team of AI researchers at the University of Washington and the Allen Institute for Artificial Intelligence. ELMo supports Natural Language Processing (NLP) systems by converting words into numbers, which are then used to train machine learning models. The original ELMo model was trained on the 1 Billion Word Benchmark, which is a publicly available training dataset of nearly 1 billion words for statistical language models developed by researchers at Google, the University of Edinburgh and Cantab Research Lab.
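A sketch of what "converting words into numbers" looks like with ELMo follows, using the legacy allennlp client (version 0.9 and earlier); the class name and output shape reflect that old API as I understand it and should be verified, since the library has since been restructured.

```python
# Contextual ELMo embeddings via the legacy allennlp (<=0.9) API; assumed
# usage to verify against that version's documentation.
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # downloads pretrained weights on first use
vectors = elmo.embed_sentence(["The", "bank", "approved", "the", "loan"])
# Three layers of 1024-dim vectors per token; "bank" would get a different
# vector in a sentence about rivers, which is the point of contextual
# embeddings.
print(vectors.shape)  # expected (3, 5, 1024)
```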