Introduction to NASA's proposed Public Access Plan
NASA recently released a request for information on their proposed 2023 Public Access Plan. This effort is part of NASA's "Transform to Open Science (TOPS)", a push for open science across several US government agencies and the White House. A "Request for Information" is a standard way government agencies ask for feedback from the public. As someone with experience working on open source code release and development, I had thoughts. The comment period ends on 08/17/2023. This blog post was written as an exercise in considering what feedback might be given in response.
Background on NASA and public access to publications, code, data, etc.
NASA has a long history of making its publications, data, and code available to the public. However, that task is never as easy as it sounds. To give a few examples of complications:
- Scientists have an incentive to publish in high impact journals. Those journals often want to charge people to read them, which can reduce access to publicly funded research results, code, etc.
- Certain types of data that NASA wants to release to the public can get quite large, making it necessary to operate, and in some cases invent, systems that let the public access the data and work with it in ways that aren't so terrible that no one bothers.
- NASA and its contractors write both code that is extremely sensitive, for example code that is used to control rockets, and less sensitive code that can be widely distributed. It is difficult to create processes that work extremely well at both ends of the spectrum.
The Public Access Plan is an attempt to improve from the current situation to something a little better. You should go and read it at the link above as it won't be summarized here. However, I will include a few bullet points that the 2023 plan highlights as key revisions compared to the 2014 plan.
- There shall be no publication embargo period for peer-reviewed publications (a 12 month embargo had been allowed)
- Data that support peer-reviewed publications shall be made available in a public archive at the time of publication
- Software should be included as part of Open Access, subject to NASA software release requirements
- Software used to generate research findings/results should be made available in a public archive at the time of publication
- Other data products beyond peer-reviewed publications and software should be considered as part of Open Access
These shifts represent a step forward and respond directly to real frustrations: data and code from NASA-funded studies were not always available, or were available only with significant friction that limited their reuse and impact.
Personal experiences with public access at NASA
I previously worked as a contractor within NASA's Office of Chief Information Officer in what has been reorganized into Information, Data, and Analytics Services (IDAS) and among other things helped to run data.nasa.gov, code.nasa.gov, and the NASA GitHub organization. These experiences gave me insight into the challenges of releasing data, releasing code, and working in the open in the NASA context that inform my thoughts shared here.
NASA is made of many sub-organizations with different missions. This effort is driven by one subset of the larger NASA
It is worth pointing out that NASA is an agency made up of several sub-organizations. Each of these sub-organizations has different cultures, missions, and things they are better or worse at. These differences are more pronounced at NASA than I have experienced with BP or Microsoft. This public access plan is driven by the Science Mission Directorate. Compared to other parts of NASA, they fund academic research a lot more, work with outside partners a lot more, and less often work with sensitive code that needs to be tightly controlled.
NASA has a pile of policy documentation. Boundaries between these policies are complex.
NASA policy and regulations are written into NASA Procedural Requirements (NPRs) or NASA Policy Directives (NPDs). You can read them all online at the NODIS Library. You can think of NODIS as a gigantic file cabinet filled with rules written a little bit at a time over decades and decades. If you read the Public Access Plan closely, you'll see NPRs and NPDs mentioned in the sense that it acknowledges their existence and says they still apply but positions the Public Access Plan as additional and specific to research software. I interpret this fuzzily phrased statement as trying to create public access policies that do not require changing large amounts of existing NPRs and NPDs or require buy-in from every single sub-organization within NASA. Other sub-organizations have more sensitive code or data more often, so they might want other policies as they have different needs. Other sub-organizations are less motivated by public release. Additionally, NPRs and NPDs are typically changed in small bits every several years, so anything that requires large rewrites to many existing NPRs or NPDs becomes a very heavy boulder to roll up the hill.
Focus of my feedback on the Public Access Plan: Software Sharing and Archiving
This blog post will focus on the fifth question in NASA's Request for Information for feedback on the NASA Public Access Plan. The question in full is:
- Suggestions on sharing and archiving of software. Sites like GitHub and Zenodo offer ways to distribute and manage software. NASA is seeking suggestions on improving the archiving, sharing, and maintenance of software for reuse.
++++++ What follows is a tentative draft of my response to the request for information due on August 18th, 2023 focused on software sharing. Go to the bottom for a list of other areas that could be the focus of feedback. ++++++
Response to Request for Information Regarding NASA Proposed Public Access Plan
This document has been written in response to the request for information regarding the proposed May 2023 NASA Public Access Plan. While these are my personal opinions, I should note that I previously worked as a NASA contractor supporting code.nasa.gov and currently work for Microsoft, which owns GitHub. My comments are in response to question 5: “Suggestions on sharing and archiving of software. Sites like GitHub and Zenodo offer ways to distribute and manage software. NASA is seeking suggestions on improving the archiving, sharing, and maintenance of software for reuse.”
Plan should more explicitly define & explain differences between software archival and maintenance
In several places the proposed NASA Public Access Plan would benefit from more explicitly describing the differences between archival of a fixed version of the software versus maintaining an active and changing public code repository over time. For example, in the line below from page 20 discussing requirements that Software Management Plans must address, “maintained” and “preserved” could be mistakenly read as the same thing.
Plans for archiving and preserving of the software, as appropriate (use of existing databases or public repositories will be strongly encouraged), including how long the software will be preserved or maintained
Archived software is an unchanging and fixed version of the code. As stated in the plan, an archived version of the software is important for reproducibility, as without it someone else cannot attempt to reproduce a study and hope to get the same results. They might get slightly or greatly different results and not know why. As such, it is reasonable to have code archival as a requirement. However, archived code should be expressed as a minimum bar and not implied to be the only or best way to share software.
NASA should also encourage, although likely not require, use of free, public-facing version control systems (GitHub, GitLab, etc.) that allow others besides the original authors to easily access software in a manner such that the software can evolve over time. As NASA likely does not want to mention any particular product in policy, the term “free, public-facing software version control systems” could be used instead.
Making the code available to the public in these systems enables several behaviors that increase the value and impact of the software not possible with an archived version of the software alone.
Placing it on these types of systems enables users to discover it from other public code that use it as a dependency or share a topic tag.
More people are likely to discover it there as “free, public-facing software version control systems” are where developers spend a lot of their time, not NASA run systems that fewer know about such as NASA Technical Report Server or code.nasa.gov.
As these systems also allow the public to read software without downloading it, the software can be more quickly evaluated without having to download and open large files.
This reduces friction and increases rates of reuse.
Code that exists on “free, public-facing software version control systems” is also more likely to improve over time.
Users who are not the original authors can add an issue when a bug or security vulnerability is found or make pull requests to add new features.
Code for one study can be generalized with the insights of the community and turned into a tool that is reusable.
To summarize, software on “free, public-facing software version control systems” is more discoverable, reusable, secure, generalizable, and extendable. In contrast, software that only exists in archived form or as supplemental information to a publication is much less likely to be a part of these behaviors as it is by definition a static artifact.
NASA may want to only encourage rather than mandate that code be placed on a free, public-facing software version control system, as placing code where it can be easily reached, forked, bugs submitted, etc. opens up at least the possibility of a community growing around the software. A community demands time from the original authors, which they may not have. There are two options NASA could introduce in training materials to alleviate these concerns. One is to use built-in features of free, public-facing software version control systems and “archive” the repository. "Archived" is used here in the GitHub or GitLab meaning of archiving a software repository in place. An archived software repository is still visible and forkable by others who wish to continue its development, but issues and pull requests cannot be added, meaning bug reports and feature improvements stop, at least on the repository owned by the original owner. On archived software repositories, there is a banner stating it is “archived”, so it is obvious that the software is no longer being developed by its original authors. The second option is to add text to a README that explains the software repository is no longer being actively maintained. People may leave a bug report in an issue, but there is no expectation of a response.
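For training materials, the first option (archiving a repository in place) can also be done programmatically. Below is a minimal Python sketch, assuming GitHub's REST API endpoint for updating a repository (`PATCH /repos/{owner}/{repo}` with an `archived` field). The organization name, repository name, and token are hypothetical placeholders, and the function only builds the request rather than sending it.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_archive_request(owner: str, repo: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that marks a repository as archived."""
    body = json.dumps({"archived": True}).encode("utf-8")
    return urllib.request.Request(
        url=f"{GITHUB_API}/repos/{owner}/{repo}",
        data=body,
        method="PATCH",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # "example-org", "example-repo", and "YOUR_TOKEN" are placeholders.
    req = build_archive_request("example-org", "example-repo", "YOUR_TOKEN")
    print(req.get_method(), req.full_url)
    # Sending it would be: urllib.request.urlopen(req)
```

The same result is available through the repository settings page, which is likely the easier route for a one-off study repository.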
Suggest changing the previously mentioned text on page 20 to:
- Plans for preserving a fixed, archival version of the software that was used to produce the results in the publication, as appropriate (use of existing databases or archival systems will be strongly encouraged).
- Optionally state where the software will also exist on a free, public-facing software version control platform where others can add issues, report bugs, contribute new features, etc. If it will be, declare how the project will communicate to the public their availability to respond to contributions. No availability to respond may be a suitable answer.
- For software that has a community of users beyond the context of the study being funded, discuss plans for community health and governance.
Please note that the term “repository” has been removed from the original text. The term “code repository” is commonly used in modern software development to mean a single project collection of code files. However, in the Public Access Plan it is referred to in a way that means a data system to hold static versions of many thousands of software projects, datasets, etc. To avoid confusion, suggest the term “repository” either be not used or be defined within the plan. The modern software development definition of the word is used in this document.
To further clarify these points regarding archival, the following bullet point under requirements should be modified from:
All proposals or project plans submitted to NASA for scientific research funding will be required to include a Software Management Plan (SMP) that describes whether and how software generated through the course of the proposed research will be shared and preserved (including timeframe), or explains why software sharing and/or preservation are not possible or scientifically appropriate. At a minimum, SMPs must describe how software sharing and preservation will enable validation of published results, or how such results could be validated if software are not shared or preserved.
Into a version that clearly separates archived software from non-archived software that can change over time:
All proposals or project plans submitted to NASA for scientific research funding will be required to include a Software Management Plan (SMP) that describes whether and how software generated through the course of the proposed research will be shared and preserved (including timeframe), or explains why software sharing and/or preservation are not possible or scientifically appropriate. At a minimum, SMPs must describe how software preservation will enable validation of published results, or how such results could be validated if software are not shared or preserved. SMPs may optionally describe where a version of the software that will change and evolve post publication will be publicly accessed in addition to the fixed software archive used in the publication.
Additionally, on page 20 under “implementation” sub-heading there is the following statement:
Require all researchers to share their scientific software developed to support a scholarly publication at the time of publication. This includes the scientific software that are displayed in charts and figures or needed to validate the scientific conclusions of the publication. This requirement could be met by including the software as supplementary information to the published article, or through other means. The published article should indicate how the software can be accessed.
This statement is very similar to the fifth bullet point on page 19 under the “requirements” sub-heading. The statement puts too much emphasis on releasing code as supplementary information files where it is least likely to be discovered, accessed, reused, bug fixes submitted, etc. To maximize value of funded software creation for NASA and the larger community, suggest changing both bullet points to:
Require all researchers to share their scientific software developed to support a scholarly publication at the time of publication or prior. This includes the scientific software that are displayed in charts and figures or needed to validate the scientific conclusions of the publication. This requirement could be met by including the software as supplementary information to the published article, linking to archived versions of the software on a NASA-recognized archive service (such as NTRS or Zenodo), or through other means. The publication should indicate how the software can be accessed. Stating “upon request” is not a valid means of public release. If the software is available not just as an archive but also as a publicly accessible software repository that changes over time, both humans and machines should be able to discover it from the publication.
It is important that going from a publication to both the fixed, archived version of the software and the live, updating, or changing version of the software repository is as frictionless for users as possible to maximize the value of NASA’s funded software creation.
Educational Needs: Evolving Software, Archival Software, and DOIs
The plan discusses educational needs. There is a strong need to educate scientific software developers on the benefits and methods of both archives and having a live software repository on a public-facing version control system that can reflect changes over time. When asked for methods several times in the past, I have recommended using both Zenodo and GitHub. However, any options that enable the same behaviors, software evolution over time and DOIs tagged to fixed releases of said software, would be sufficient. Zenodo is foremost an archive service, but it has integrations with GitHub that make it easy to package a fixed version of a software repository into a GitHub Release and then turn that release artifact into an archived version on Zenodo with a unique DOI. The process is relatively simple, but the instructions to do so on Zenodo and GitHub are easy to miss and at times confusing. NASA has a role to play in making more scientific software developers aware of these processes and establishing them as at least best practice if not required.
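As a rough illustration of the release step, here is a minimal Python sketch, assuming GitHub's REST API endpoint for creating a release (`POST /repos/{owner}/{repo}/releases`). It assumes the repository has already been enabled in Zenodo's GitHub integration, in which case publishing the release is what prompts Zenodo to archive that version and assign a DOI. The names, tag, and token are hypothetical, and the request is built but not sent.

```python
import json
import urllib.request


def build_release_request(
    owner: str, repo: str, tag: str, notes: str, token: str
) -> urllib.request.Request:
    """Build (but do not send) a request that creates a GitHub Release for a tag."""
    payload = {
        "tag_name": tag,  # the fixed version being released, e.g. "v1.0.0"
        "name": tag,      # release title; reusing the tag keeps things simple
        "body": notes,    # free-text release notes
    }
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/releases",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # All arguments below are placeholders for illustration only.
    req = build_release_request(
        "example-org", "example-repo", "v1.0.0",
        "Version used for the publication.", "YOUR_TOKEN",
    )
    print(req.get_method(), req.full_url)
```

Most researchers would create the release through the GitHub web interface instead; the point of the sketch is only that a release is tied to a specific tag, which is what makes it a fixed version suitable for archiving.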
In terms of the Public Access Plan, there are a few minor changes that could be made to clarify when creating a software archive with a DOI has value.
On page 20, it could be interpreted that only software not related to a paper needs a DOI.
Software released independently from the peer-reviewed manuscript must be assigned a Unique Digital Object Identifier (DOI) to enable preservation, discovery, and citation of the software.
I believe this is wrong, as (1) people may hesitate to cite a paper DOI if the software is only a small part of the paper, and (2) people may need to cite a different version of the software than was released with the publication. The plan should instead add the additional text shown below:
Software released independently from the peer-reviewed manuscript must be assigned a Unique Digital Object Identifier (DOI) to enable preservation, discovery, and citation of the software. Software released at the same time as a publication with a DOI can optionally have its own DOI. DOIs may optionally be created for any additional versions of the software, commonly referred to in software development as a release.
GitHub or GitLab releases are fixed versions of software, so they meet some of the needs for archival, but the fact that they do not automatically get a DOI limits their use as archives without linking them to Zenodo, which gives them a DOI.
Educational Needs: Code Citation
Within academia, software is to some degree treated as less valuable than publications. Part of the reason for this is that papers get cited while code often does not. In addition to DOIs, NASA should encourage the use of citation files in code repositories, such as CITATION.cff, and push for community standards around code citation to enable more effective programmatic tracking of code, data, and publications across different types of systems. As an example, Zenodo consumes CITATION.cff files found in linked GitHub repositories. Suggest adding the following bullet points to the list of requirements on pages 18 and 19:
- NASA employees, contractors, and grantees are encouraged to use community standard citation files in their software projects.
- No matter where their code, data, and publications are stored, NASA employees, contractors, and grantees are encouraged to fill out system metadata to ensure maximum linkages between them.
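To make the idea of a citation file concrete, here is a minimal Python sketch that renders a small CITATION.cff using keys from version 1.2.0 of the Citation File Format. The title, author, version, date, and DOI below are made-up placeholders, and the quoting is deliberately naive (it does not escape double quotes inside values).

```python
from typing import List, Tuple


def render_citation_cff(
    title: str,
    version: str,
    doi: str,
    date_released: str,
    authors: List[Tuple[str, str]],
) -> str:
    """Render a minimal CITATION.cff; authors are (family name, given name) pairs."""
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f'title: "{title}"',
        f'version: "{version}"',
        f'doi: "{doi}"',
        f'date-released: "{date_released}"',  # expected form: YYYY-MM-DD
        "authors:",
    ]
    for family, given in authors:
        lines.append(f'  - family-names: "{family}"')
        lines.append(f'    given-names: "{given}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Every value here is a placeholder, including the DOI.
    print(render_citation_cff(
        title="Example Analysis Code",
        version="1.0.0",
        doi="10.5281/zenodo.0000000",
        date_released="2023-08-17",
        authors=[("Doe", "Jane")],
    ))
```

The resulting text would be saved as a file named CITATION.cff at the top level of the repository.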
Other areas of NASA's public release status quo that might be commented on:
- Make it easy for a wide variety of user types to stumble upon one of the many NASA websites and get an introduction to all the different places NASA data can be searched for. Currently, there is no single description of the NASA data and code ecosystem. This results in a lot of user frustration and lost time.
- What actions could be taken to improve discovery of NASA funded software outside of NASA operated software registry systems?
- While there are a variety of well-funded data archives for Earth Science, other scientific fields lack NASA-funded data archives, especially engineering-related fields. This makes it difficult for researchers in those fields to release open data, as existing NASA data archives will decline them as out of scope.
- Figure out how to make progress on ensuring data, code, and publications can be programmatically connected both by users and by NASA itself. The status quo is that this is impossible.