Why start our Infrastructure Series at the far end of downstream publishing, with digital preservation? One reason is that so-called dark archives (organizations such as CLOCKSS/LOCKSS and Portico that ensure long-term availability of scholarly content) are by their very nature deep infrastructure. Another reason is that digital preservation is widely accepted within the scholarly community as fundamental.
This blog series aims to surface the often-hidden work of infrastructure, and digital preservation is a great example of a robust and ongoing function that is by definition mostly ‘behind the scenes.’ In subsequent posts, we will work backwards, in a sense, from preserving published scholarly content to many of the issues, technologies and functions along the way. For more information on other established digital preservation organizations, please refer to the Keepers Registry.
Interview with Craig Van Dyck
Interview by Jennifer Kemp
What does the term ‘infrastructure’ mean to you/your organization, in the context of research communications?
For research communications, infrastructure means the shared systems that multiple parties rely upon to conduct their activities. This includes hardware, software, standards, best practices, and social contracts, as well as shared values, goals, and understandings. Many of these are specific to research communications, such as reference-linking via Crossref, or COUNTER usage reporting. Of course the scholarly community also relies upon the basic infrastructure of society such as highways, plumbing, and the Internet.
How do you describe what you do/how digital preservation works to people unfamiliar with it?
Long-term digital preservation of the scholarly literature is necessary because online content can be at risk of disappearing, and scholars require ongoing access to the resources that they use to conduct their own research. If, for whatever reason, scholarly content disappears from the Web, and thus is no longer available to researchers, then a preservation system like the CLOCKSS Archive can step in and provide access in cases when no one else will. It should also be noted that digital preservation is increasingly crucial outside of research communications as well. For example, nations and small communities are digitizing their cultural heritage, which needs to be preserved for the long term. My favorite example is a Mexican American neighborhood that is being displaced by gentrification; leaders in this community are digitizing artefacts such as flyers and photos and documents, so the centuries-old history of the community will be preserved, even as it is disappearing.
What is the one thing you wish ‘Silicon Valley’ would do or do differently to better support digital preservation?
Cloud computing has become ubiquitous in many environments. However, for long-term digital preservation, the leading providers are not necessarily the best options. Their commercial nature makes them less than perfectly trustworthy, and their costs for repeated access to content (which is important for ensuring the validity of the bits) are too high. ‘Silicon Valley’ could support a consortium that could offer a non-profit cloud solution that is suited for the academic community.
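To make "ensuring the validity of the bits" concrete: a common way to do this is a fixity check, in which the archive re-reads stored content and compares a freshly computed checksum against one recorded at ingest. The following is a minimal sketch of that general idea, not a description of how CLOCKSS implements it; the function names and sample content are hypothetical.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 checksum of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def fixity_ok(data: bytes, recorded_digest: str) -> bool:
    """True if the content still matches the checksum recorded at ingest."""
    return sha256_digest(data) == recorded_digest


# Hypothetical content; a real archive would read this back from storage.
original = b"An archived scholarly article."
recorded = sha256_digest(original)  # stored alongside the content at ingest

intact = fixity_ok(original, recorded)
damaged = fixity_ok(b"An archived scholarly articlf.", recorded)  # one byte changed
print(intact, damaged)
```

Because every check has to read the full content back, per-access charges on stored data add up quickly, which is the cost concern raised above.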
What is the one thing you wish non-technical people understood better about the challenges of digital preservation?
Just because a resource is available on the Web today does not mean that it will be available tomorrow. And scholarly research is very specific; not just “any old” resource will do. Researchers need ongoing access to specific articles, each of which reports on a specific set of scholarly activities that bear upon the highly specialized pursuits of researchers. It is easy to find resources on the Web that seem to provide information on any topic; but the scholarly record is the source for peer-reviewed, validated information.
How, if at all, does digital preservation differ when it comes to text vs data or journals, books, etc.?
The scholarly literature is diversifying and becoming more dynamic. It is not only about journals and books. Today scholars are finding new forms to express their findings. Multimedia, user-driven pathways, apps, data sets, 3D imaging, even virtual reality have all become elements of scholarly works. And there is more ancillary content such as preprints, post-publication comments and annotations, videos, and podcasts. What are the boundaries of “the scholarly literature”? What content must be preserved for the long term? What are the technical solutions for capturing, preserving, and replaying dynamic content, and if there are compromises, how do we make the trade-offs clear to authors and publishers? As a digital preservation service, we must embrace the evolution of the literature, and add capabilities to ensure the long-term availability of these new forms of scholarly output.
What other areas of infrastructure do you work most closely with/are most dependent on (& how)?
CLOCKSS works across multiple publishers’ platforms, each of which has its own special features and idiosyncrasies. We rely upon publishers’ best practices to provide us with efficiency in working with the content of hundreds of different publishers, e.g. DOIs, ISSNs, ISBNs, the JATS XML format, ONIX, ORCID. We find that practices in journal publishing are quite well established and predictable, while for books, less so, but still with a good bit of uniformity. However, when we work with the aforementioned new forms of content, norms have not been established, and economies of scale have not yet been achievable.
Explain in some detail the issue you think is the most vexing/interesting/consequential/etc.
In the immediate term, probably the most interesting issue is to understand which new forms of content to preserve, and how. For example, if an online book is ever-changing, which version do we preserve, or, if we are to preserve all versions, how can we do that in a scalable, sustainable way? In the longer term, we all need to be concerned about the funding support for long-term digital preservation, which is rarely a first priority, and sometimes is forgotten about, misunderstood, or taken for granted. One other aspect that bears mention is that many digital resources that academic libraries make available to their users, such as “gray literature” or general interest newspapers and magazines, are not covered by a preservation system.
In a perfect world, how would digital preservation be funded and governed?
It is best for any endeavor to be funded and governed by those who benefit from the endeavor. For research communications, that means researchers. However, it is probably not practical or even desirable for individual researchers to directly fund and govern digital preservation. Fortunately, research libraries and scholarly publishers are well-positioned to act as proxies for researchers, as they already do for collection development and for journal publishing, for example. CLOCKSS is governed by a Board of 12 research libraries and 12 scholarly publishers, and CLOCKSS is funded by 300 academic libraries and 300 scholarly publishers. This is a sustainable model.
What are your favorite blogs, conferences, Twitter accounts, etc. to keep up on digital preservation?
The Scholarly Kitchen blog by the Society for Scholarly Publishing (SSP), the annual Charleston Library Conference, and the biannual meetings of the Coalition for Networked Information (CNI) are three estimable primary resources for keeping up with digital preservation.
Favorite topical little-known fact or unsung hero?
The LOCKSS software (LOCKSS stands for Lots of Copies Keep Stuff Safe) was created at the Stanford University Libraries in the late 1990s by Vicky Reich and David Rosenthal. They are not unsung; justifiably, they are sung! What is little appreciated is the similarity of LOCKSS to blockchain. Some argue that LOCKSS is the first at-scale implementation of the blockchain concept. The CLOCKSS Archive uses the LOCKSS software.
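The "lots of copies" idea can be illustrated with a toy example: several nodes each hold a copy, the copies are compared by checksum, and any copy that disagrees with the majority is repaired from a copy that agrees. This is only a sketch of the general principle; the actual LOCKSS polling protocol is considerably more elaborate, and the node names and function below are hypothetical.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """SHA-256 checksum of a copy, used as that copy's 'vote'."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(replicas: dict[str, bytes]) -> dict[str, bytes]:
    """Return replicas with any minority copy replaced by the majority copy.

    Raises ValueError if no single version is held by a strict majority.
    """
    votes = Counter(digest(content) for content in replicas.values())
    winning_digest, count = votes.most_common(1)[0]
    if count <= len(replicas) // 2:
        raise ValueError("no majority agreement among replicas")
    # Pick any copy that matches the winning checksum as the repair source.
    good_copy = next(c for c in replicas.values() if digest(c) == winning_digest)
    return {node: good_copy for node in replicas}


# Hypothetical nodes; node_c holds a damaged copy.
replicas = {
    "node_a": b"article v1",
    "node_b": b"article v1",
    "node_c": b"articlX v1",
}
repaired = audit_and_repair(replicas)
print(repaired["node_c"])
```

The more independent copies take part, the harder it is for damage to a few of them to go unnoticed or to win the vote.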
What question do you wish we asked but didn’t and why?
What are the best opportunities for significantly improving the state-of-play in digital preservation? A Best Practice for embedding metadata in content so that web crawlers have more information about what they are encountering. Also, improved tools for capturing and replaying dynamic content. And, economies of scale for storage costs, rather than linearly escalating costs. The Mellon Foundation has been supportive with grant money, which allows multiple parties such as university presses and preservation agencies to collaborate to grapple with the challenges.
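As an illustration of what embedded metadata can give a crawler, the sketch below reads descriptive meta tags from the head of an HTML page using Python's standard library. The tag names (citation_title, citation_doi) follow a convention used by some scholarly indexing services and are an assumption here, not a best practice endorsed in this interview; the page content and DOI are made up.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects name/content pairs from <meta> tags as a page is parsed."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content:
            self.metadata[name] = content


# Hypothetical page; the title and DOI are placeholders.
page = """<html><head>
<meta name="citation_title" content="An Example Article">
<meta name="citation_doi" content="10.1234/example">
</head><body>...</body></html>"""

collector = MetaTagCollector()
collector.feed(page)
print(collector.metadata)
```

With tags like these in place, a preservation crawler can tell which article it has encountered without having to infer it from the page layout.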
More information: Craig Van Dyck and CLOCKSS
The CLOCKSS Archive is a not-for-profit joint venture between the world’s leading academic publishers and research libraries, whose mission is to build a sustainable, international, and geographically distributed dark archive with which to ensure the long-term survival of Web-based scholarly publications for the benefit of the greater global research community. https://www.clockss.org.
Craig Van Dyck is Executive Director of the CLOCKSS Archive, and has been working in scholarly communications since 1978. Before CLOCKSS, he was at Wiley for 18 years, as Vice President, Content Management, and at Springer-Verlag New York for 10 years, as Senior VP and Chief Operating Officer.
Craig served as Chairman of the Association of American Publishers Enabling Technologies Committee from 1995 to 1998, and was instrumental in the development of the Digital Object Identifier (DOI) system and of Crossref. He has served on the Boards of Directors of the International DOI Foundation, CLOCKSS, ORCID, Crossref, and the Society for Scholarly Publishing, and was a member of the Portico Advisory Committee.
Craig’s portfolio has always included industry collaboration to improve the infrastructure of scholarly communications.