The UK has set out an ambitious plan for artificial intelligence, but there’s a problem, writes Professor Elena Simperl, Director of Research, The Open Data Institute. Fragmented, poorly described datasets in public-sector repositories are so difficult to use that AI agents avoid them and instead search the internet for statistics from other, less reliable sources.

The Open Data Institute recently created and published a ready-to-use prototype of the National Data Library (NDL). The prototype, called NDL-Lite, featured 38GB of data from six public-sector sources: over 100,000 files aggregated, processed, and standardised into a central resource that researchers and innovators around the world can access using cutting-edge agentic AI. The prototype demonstrates that a nationwide, cross-government data asset could be created quickly and cheaply, but several barriers must be addressed to move it from a promising prototype to a workable reality.

Access to high-quality public-sector data will determine whether the UK can support innovation, empower public services, and compete globally. The ODI prototype shows that this can be done. In one example, an AI agent combined official data on electric vehicle charging infrastructure with Hansard debate transcripts to produce a short report on business opportunities in rural EV charging in under a minute.

Missing metadata and inconsistent formats

But many public datasets cannot support meaningful work by human data users, let alone AI agents, whether for financial analysis, customer service, or policy research. During testing, the ODI found that many official datasets, particularly on data.gov.uk, are difficult for both humans and AI systems to interpret. Titles are often vague or misleading, metadata is sparse or missing, formats are inconsistent, and even basic standards, such as date formats, vary. Datasets that look useful at first glance often turn out to be too narrow, outdated, or badly structured to support the kind of analyses users are trying to carry out.
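The date-format problem alone shows how small inconsistencies block reuse. As an illustration (the candidate formats here are hypothetical, not a catalogue of what the ODI actually found), a data cleaner might normalise mixed date strings to ISO 8601 like this:

```python
from datetime import datetime

# Hypothetical mix of date formats seen across public-sector files.
CANDIDATE_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d %B %Y"]

def to_iso8601(raw):
    """Try each known format; return an ISO 8601 date string, or None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flag for manual review rather than guessing

print(to_iso8601("03/11/2018"))    # -> 2018-11-03
print(to_iso8601("14 March 2024")) # -> 2024-03-14
```

Without a shared convention, every consumer of the data, human or AI, has to reinvent this kind of cleaning step for every dataset.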

Crime data is one example. In one experiment, a search for crime statistics returned datasets that appeared relevant, but many were local authority releases rather than national data that could be compared or combined. One important Home Office crime dataset on data.gov.uk had not been updated since 2018, making it unsuitable for current AI use. A newer version exists, but it could not be accessed through the Office for National Statistics’ API, which meant it could not be properly integrated into the prototype.

This meant that AI agents chose to avoid public-sector datasets and searched the internet for statistics from other, less reliable sources instead; quite the opposite of what the government would want. Public data is supposed to provide a trusted, authoritative foundation for AI use in research, policy making and product development; it cannot be acceptable for systems to bypass it in favour of whatever they can scrape from less reliable sources online.

ONS data was clean and standardised

In contrast, although datasets from the Office for National Statistics (ONS) lacked attached metadata, the clear titles made it easy to find, understand, and use up-to-date economic statistics. The datasets are clean and standardised, making them interoperable and reusable across many research domains. The difference is not in the volume of data, but in its quality and structure.

Data.gov.uk is working hard to tackle the problems of historic underinvestment. While budgets remain tight, data grouped around core national teams is being curated and cleaned, and a data manual for future best practices is being developed. However, there is still some way to go.  

Since 2024, the ODI has consistently recommended that any work on the NDL must begin with the quality of the underlying public sector datasets. This means improving inconsistent formats, missing metadata, misleading dataset titles, fragmented statistics and outdated datasets.

There is also a growing need to prepare data for AI applications, including the use of machine-readable metadata standards such as Croissant and clearer documentation around dataset limitations and potential biases, which AI systems can otherwise amplify. As a critical data provider for AI, the government has a responsibility to ensure its data is both usable and available equitably. Without this, UK-based start-ups and academic institutions will lose out to large, entrenched entities from the US and China. Alongside the data itself, developers need the code and tooling to use it. That’s why we published the NDL-Lite dataset with purpose-built infrastructure to enable access by humans and AI agents alike.
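Croissant, developed under MLCommons, describes datasets as JSON-LD built on schema.org vocabulary, so both search engines and AI agents can discover a dataset’s contents and files without downloading them. As a rough sketch only (the dataset name, description, and file details below are invented, and a real record would carry the full Croissant context and structure), a minimal schema.org-style record might be assembled like this:

```python
import json

# Illustrative only: this shows the general JSON-LD shape that
# Croissant builds on, not a complete, validated Croissant record.
record = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "example-ev-charging-points",          # invented name
    "description": "Hypothetical dataset of EV charging locations.",
    "license": "https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/",
    "distribution": [
        {
            "@type": "FileObject",
            "name": "charging_points.csv",          # invented file
            "encodingFormat": "text/csv",
        }
    ],
}

print(json.dumps(record, indent=2))
```

The point is less the specific fields than the habit: machine-readable descriptions of what a dataset contains, in what format, and under what licence, published alongside the data itself.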

NDL has huge potential, but...

The National Data Library has huge potential, allowing tools that can write code, analyse datasets and generate insights in seconds to work across the UK’s economic, scientific and policy data, unlocking new insights and faster decision-making. NDL-Lite demonstrates that building an initial prototype is not especially difficult. The tools are widely available, and the code can be developed openly over time.

The harder task is ensuring the data is reliable, usable and accessible. Without that, AI agents are likely to simply ignore the National Data Library. If ministers want the National Data Library to deliver real value, the immediate priority is not branding or another round of strategy papers: it is making government datasets fit for purpose.

