The Intersections of Generative AI and Open Data: Latest Additions to the Observatory – July
Posted on 16th of July 2025 by Claire Skatrud, Hannah Chafetz, Stefaan Verhulst
How are governments and researchers using generative AI to make better use of open data? In what ways can AI help make public information more accessible, interpretable, or actionable? And what new types of public services or research tools are emerging at this intersection?
These are some of the questions explored in our Observatory of Open Data and Generative AI, a growing collection of real-world use cases showing how open data from official sources is being used with generative AI technologies.
Launched last year, the Observatory builds on the findings of our report, “A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI.”
In our latest update, we showcase 18 new examples from diverse sectors and regions. These cases offer a snapshot of the evolving relationship between open data and AI around the world.
Highlights include:
Pubbie, an AI assistant developed by the National Research Council of Canada to classify thousands of research publications and answer questions about them.
Chemma, a large language model trained on open scientific reaction data to assist chemists with experimentation.
A plain language AI summarization tool developed by the Government of Catalonia to help citizens better understand legal texts.
Below we provide the full list of examples added and key themes across these additions.
What’s New in the Observatory?
AI4Culture: A collection of deployable generative AI tools and open training datasets for translation, transcription, and text identification of cultural heritage documents in the European Union.
Case Operations Resource Assistant (CORA): An AI chatbot trained on procedural and policy data from Washington, DC’s Child and Family Services Agency, designed to answer staff questions about the agency’s case management system.
Catalonia’s AI Legal Texts Summarization Tool: A generative AI language tool to summarize legal texts in plain language for citizens of Catalonia.
Chemma: An open access LLM developed to assist with chemistry tasks such as yield prediction and reaction optimization based on open reaction datasets.
Climate Policy Scenario Generation for Sub-Saharan Africa: A retrieval-augmented generation (RAG) tool that can simulate climate policy scenarios in Sub-Saharan Africa based on United Nations Climate Change Conference documents.
Climate TRACE: An ongoing tracking platform using AI with satellite and remote sensing data to create an inventory of global emissions.
Data Foundry Scotland: An open data delivery platform from the National Library of Scotland that makes its digital collections available for machine learning and AI training.
eLangTech: The European Commission’s suite of AI-powered multilingual tools for translations, summaries, and briefings, trained on existing European Union policy, legislative, and governmental documents and official translations.
Extract: A Google Gemini-based model that processes and transcribes historical public planning documents for United Kingdom government councils, making open planning records more usable and accessible.
GENAI4LEX-B: An AI-powered legislative tool to support the Italian Chamber of Deputies in legal research, bill drafting, and document classification based on existing legislation and legal ontologies.
Justice Folder: A project by Spain’s Ministry of Justice to create AI and natural language processing (NLP) tools for plain language summaries of judicial documents, document classification and anonymization, and other services based on hundreds of thousands of open proceedings.
KB-BERT: A natural language processing model trained on national textual data from the National Library of Sweden, designed to support automated document management at the library. KB-BERT and its training data may advance plans for a public sector language model that can assist Swedish authorities with daily tasks.
m-KAILIN: A framework using several AI language models to refine open research data into high-quality structured training content for biomedical large language models.
Pubbie: A large language model developed by the National Research Council (NRC) of Canada to automate the categorization of, and access to, thousands of open NRC publications, with an additional conversational tool that can answer questions about the publications.
Synthetic Aperture Radar (SAR) Imagery Augmentation: An AI-driven pipeline that synthetically generates enhanced, higher-resolution versions of archival satellite imagery from ONERA, France’s national aerospace lab.
UrbanistAI: A generative AI platform that can create visual renderings of citizens’ urban planning suggestions and prompts based on open local policy requirements and street images.
Virtual Support Agent for e-Albania Portal: An AI chatbot agent for the e-Albania centralized government services platform. The retrieval-augmented generation tool responds to citizen queries based on up-to-date government information.
WeatherLab: An interactive website designed by Google to create AI-generated hurricane path predictions based on past open weather data and historical forecasts.
Key Themes
Across these additions, we identified four key themes.
Increasing Accessibility of Government Information: Catalonia’s AI Legal Texts Summarization Tool and Spain’s Justice Folder use generative AI to improve access to government information. These tools translate legalese into plain language, giving citizens up-to-date, searchable information. By simplifying the language, the LLMs remove the need for technical expertise to navigate open government texts.
Automating Daily Tasks in the Public Sector: We identified use cases that harness AI to assist with time-consuming tasks. For example, Extract automates the transcription and digitization of public planning documents, relieving public planners in the United Kingdom of a time-intensive, repetitive process. Likewise, Pubbie in Canada handles the previously manual classification of hundreds of new National Research Council reports each year.
Legislative, Judicial, and Policy Documents as Training Data: In several of our examples, AI models are trained on vast quantities of publicly available government documents, reflecting the increased use of these documents as collective textual datasets. GENAI4LEX-B, for instance, supports legislative research and bill drafting in Italy by retrieving relevant legal documents and ontologies. eLangTech, though not a retrieval tool like GENAI4LEX-B, trains its eSummary, eTranslation, and eBriefing tools on existing European Union legislation, policy documents, and official translations.
Enhancing Climate and Environmental Forecasting: Tools such as WeatherLab and the Climate Policy Scenario Generation tool for Sub-Saharan Africa reveal novel uses of AI in forecasting and simulation. These models build on open historical data and textual training material to generate future projections, whether 15-day simulations of hurricane paths or long-term climate scenarios.
***
Do you know of any real-world examples of generative AI and open data that should be included in the Observatory? Submit an example by visiting our Observatory.