Measuring changes in government data over time with the Wayback Machine

    Question: can we measure what is new or missing over time?

    I used to work as a NASA contractor, and a small part of my role was trying to ensure that what NASA submitted in its data.json to data.gov was accurate and up to date.

    Recent reporting discussed how the total number of datasets on data.gov had decreased over time, based on the dataset counts data.gov itself reports. There was also some discussion of how the Internet Archive Wayback Machine has been useful in preserving old copies of edited or deleted websites, but does not cover many datasets that require user interaction to download or that have other complications related to size, format, etc.

    Since (1) a data.json file is required to exist for every U.S. federal government agency and to be available at a URL so it can be harvested into data.gov, and (2) data.json is a JSON file living at the end of a URL, so it should be snapshotted by the Internet Archive Wayback Machine, I wondered whether we could use the Wayback Machine to measure changes in an agency's data.json over time.

    It turns out we can, and I've put some experimental little scripts for doing so in a repository called data_dot_json_over_time.
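    As a rough sketch of the idea, the Wayback Machine's CDX API can list captures of any URL, including a data.json. The function below builds a CDX query and returns playback URLs for each distinct capture; the NASA URL and the date range in the comment are illustrative examples, not necessarily what the repository's scripts do.

```python
import json
import urllib.parse
import urllib.request

CDX_API = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(data_json_url, from_date, to_date):
    """Build a Wayback Machine CDX API query for captures of a data.json URL."""
    params = {
        "url": data_json_url,
        "output": "json",            # rows of [urlkey, timestamp, original, ...]
        "from": from_date,           # YYYYMMDD
        "to": to_date,
        "filter": "statuscode:200",  # only successful captures
        "collapse": "digest",        # skip consecutive identical captures
    }
    return CDX_API + "?" + urllib.parse.urlencode(params)

def list_snapshots(data_json_url, from_date, to_date):
    """Return (timestamp, playback_url) pairs for each distinct capture."""
    with urllib.request.urlopen(build_cdx_query(data_json_url, from_date, to_date)) as resp:
        rows = json.load(resp)
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    ts, orig = header.index("timestamp"), header.index("original")
    return [(r[ts], f"https://web.archive.org/web/{r[ts]}/{r[orig]}") for r in data]

# Example (requires network access):
# list_snapshots("https://data.nasa.gov/data.json", "20241001", "20250301")
```

    Collapsing on digest means you only download a data.json again when its content actually changed between captures.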

    Early learnings

    I have done initial analysis of the Department of Commerce, the Department of Education, and NASA.

    Only NASA shows drops in data in its data.json across the January 20th, 2025 boundary, but that may reflect the relative update speed of some datasets in NASA's data.json versus the other two agencies'. More details can be found in the data_dot_json_over_time repository.

    Data.json lags reality

    The Department of Education and the Department of Commerce (NOAA and others) don't have any datasets missing from data.json across the Biden/Trump administration boundary. Given reports of at least some NOAA datasets being taken down, at least temporarily, this may reflect data.json lagging reality: updates to data.json likely occur some time after dataset access has been removed or datasets no longer exist. This was somewhat expected, but the data confirms it.

    The lag between a dataset no longer being available and its removal from data.gov can be months or years; in some cases, the metadata might never be corrected. A major reason for this is that data.gov is a harvester of metadata from different agencies, and each government agency has its own process for updating or removing datasets. In most cases there are many intra-agency data systems, or even layers of data systems, with different types of manual or programmatic metadata harvesting. Each agency also uses many different data systems or platforms to host data, and how and when the state of those systems gets reflected in data.json or data.gov is incredibly variable. Some entries are manually added by researchers or data stewards and then never updated when the host for that data goes away. Other dataset descriptions in an agency's data.json are automatically updated by the same system that hosts the dataset, so those updates can land within a day, or within a month if the update script runs monthly.

    Some removed dataset identifiers appear to be version updates or identifier changes with no other changes, but not all

    Of the three agencies in the initial analysis set, only NASA's data.json showed dataset identifiers present in the first snapshot but not in the second, totaling 198 dataset identifiers. This compares data.json snapshots from 2024-10-09 and 2025-02-07.

    The 2024-10-09 data.json described 22,360 datasets. The 2025-02-07 data.json described 22,382, so even with some datasets seemingly missing from the newer data.json, the total number of datasets described by metadata went up.

    Of those 198 'missing' identifiers, at least 33 were cases where the dataset identifier changed but the title did not, suggesting these might just be identifier changes related to underlying system updates or version updates that didn't change the dataset title.
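    A minimal sketch of that comparison, assuming (as the DCAT-US schema behind data.json requires) that each entry in the "dataset" array carries "identifier" and "title" fields:

```python
def removed_identifiers(old, new):
    """Identifiers present in the older parsed data.json but not the newer one."""
    old_ids = {d["identifier"] for d in old["dataset"]}
    new_ids = {d["identifier"] for d in new["dataset"]}
    return old_ids - new_ids

def identifier_changed_same_title(old, new):
    """'Missing' identifiers whose exact title still appears in the newer
    snapshot under a different identifier -- likely an identifier change,
    not a true removal."""
    new_titles = {d["title"] for d in new["dataset"]}
    old_by_id = {d["identifier"]: d for d in old["dataset"]}
    return {i for i in removed_identifiers(old, new)
            if old_by_id[i]["title"] in new_titles}
```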

    Another 32 of the 'missing' 198 had nearly the same dataset title, with only the last part of the title changed in a way consistent with a version update. For example, a long title whose ending changes from "-v2" to "-v3".
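    One hypothetical heuristic for catching that pattern is to strip a trailing version token and compare the remaining titles; the regex below is an illustrative guess at what such tokens look like, not the rule the repository uses:

```python
import re

# Treat a trailing token like "-v2", "_v10", or " V3" as a version suffix.
VERSION_SUFFIX = re.compile(r"[-_ ]v\d+$", re.IGNORECASE)

def strip_version(title):
    """Remove a trailing version token from a dataset title, if present."""
    return VERSION_SUFFIX.sub("", title).strip()

def looks_like_version_update(old_title, new_titles):
    """True if some title in the newer snapshot matches old_title once the
    version suffix is stripped (e.g. "... -v2" replaced by "... -v3")."""
    base = strip_version(old_title)
    return any(strip_version(t) == base and t != old_title for t in new_titles)
```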

    Both of those situations likely do not reflect dataset removal so much as dataset evolution over time.

    The remaining 148 of the 198 did not fall into those two situations and require additional analysis to see whether they are true dataset removals. Most of those 148 have landing page and distribution URLs that return 200 status codes, or mostly do, suggesting the data is still available. Others return 403s, suggesting a need to sign in first, especially for some earth data systems that assume a user is signed in. A smaller subset may indeed be gone, but the current programmatic checks do not yet confirm this with high accuracy.
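    A sketch of what such a check could look like, with the caveat that the status-to-meaning mapping is my own rough interpretation and a real crawler would also retry, rate-limit, and respect robots.txt:

```python
import urllib.error
import urllib.request

def classify_status(code):
    """Rough interpretation of an HTTP status for dataset availability."""
    if code is None:
        return "unreachable"       # DNS failure, timeout, etc.
    if 200 <= code < 300:
        return "available"
    if code in (401, 403):
        return "auth-required"     # e.g. systems that expect a signed-in user
    if code in (404, 410):
        return "possibly-removed"
    return "inconclusive"

def check_url(url, timeout=10):
    """Return the HTTP status code for url, or None if unreachable."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code
    except (urllib.error.URLError, OSError):
        return None
```

    A 403 alone cannot distinguish "sign in required" from "truly gone", which is part of why the accuracy of purely programmatic checks is limited.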

    Possible future analysis pathways suggested by this analysis

    Some of the missing datasets whose metadata doesn't appear in the more recent data.json have several dataset distribution URLs listed as well as a landing page. For several of these datasets, only some of the URLs return 404 statuses while others return 200 or other statuses. There might be value in analyzing those statuses to see whether it is possible to programmatically recognize when a landing page is down but a download distribution URL is still up, as those datasets might be targets for archiving.
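    That flagging step might look something like the sketch below. The input shapes are illustrative (loosely following data.json's landingPage and distribution fields), and the choice of which codes count as "down" is an assumption:

```python
def archive_candidates(datasets, statuses):
    """Flag datasets whose landing page looks down but at least one
    distribution URL still returns 200.

    `datasets` maps identifier -> {"landingPage": url, "distribution": [urls]};
    `statuses` maps url -> HTTP status code (or None if unreachable).
    """
    flagged = []
    for ident, d in datasets.items():
        landing = statuses.get(d.get("landingPage"))
        dist_codes = [statuses.get(u) for u in d.get("distribution", [])]
        # "Down" here means unreachable, 404, or 410 -- an assumption, not a rule.
        if landing in (None, 404, 410) and any(c == 200 for c in dist_codes):
            flagged.append(ident)
    return flagged
```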

    If I continue working on this side project, I will update this blog post with what I learn.