Preservation risk management for Web resources: preserving Web content requires substantial resource commitments and flexible and innovative approaches to new technologies, organizational missions, and user expectations.

Author: Kenney, Anne R.

At the Core

This article:

* Discusses current Web preservation efforts

* Defines a risk-based preservation management program

* Introduces Cornell University's Project Prism

Actuaries spend their careers figuring out what benefits a company should offer, at what price, and for how long. Their job is to make sense of all the empirical and statistical evidence of age, gender, health, heredity, life styles, physical habits, and living and working conditions that serve as indicators of longevity, productivity, and obligation. How well they do their job depends on how good their evidence is, how skilled they are at reading it, and how risk tolerant their customers are.

Archivists and research librarians interested in preserving Web resources face a similar challenge. Libraries increasingly depend on digital assets they neither own nor manage. Academic libraries have dramatically increased their offerings of online resources. A 2001 survey of the 21 members of the Digital Library Federation revealed that 40 percent of their costs for digital libraries in 2000 went for commercial content. The big-ticket items were electronic scholarly journals that libraries license rather than own. Yet little direct evidence shows that publishers have developed full-scale digital preservation capabilities to protect this material, and research libraries continue to purchase the print versions for preservation purposes. However, no library appears ready to forgo access to the licensed content just because its long-term accessibility might be in question.

Research libraries also are including in their catalogs and gateways more open-access Web resources that are not covered by licenses or other formal arrangements. A spring 2001 survey of Cornell University's and the University of Michigan's Making of America (MOA) collections revealed that nearly 250 academic institutions link directly to the MOA collections, although neither university has committed to provide other entities with long-term access. Similarly, a review of the holdings of several research library gateways over the past few years indicates growth in the number of links to open-access Web resources that are managed with varying degrees of control. Approximately 65 percent of the electronic resources on Cornell's gateway are unrestricted, and additional open resources are included in aggregated sets that are available only to the campus community. In contrast, only six percent of Michigan's electronic resources are open-access materials.

Current Web Preservation Efforts

Estimates put the average life expectancy of a Web page between 44 days and two years, and a significant proportion of those that survive undergo some change in content within a year. Since 1998, Online Computer Library Center's (OCLC) Web Characterization Project has tracked trends in growth and content of the publicly available Web space. One of the more revealing statistics, IP address volatility, identifies the percentage of IP [Internet Protocol] addresses that remain extant from one year to the next. In a fairly consistent trend since 1998, slightly over half (55-56 percent) of the IP addresses identified in one year are still available the next. Within two years, a little over a third (35-37 percent) remain. Four years later, only 25 percent of the sample 1998 IP addresses could be located, according to OCLC.
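The OCLC figures can be turned into an implied year-over-year retention rate to show how attrition changes over time. The sketch below is illustrative only: it assumes the midpoints of the quoted ranges (55.5 and 36 percent), treats the three figures as if they described a single cohort, and uses a hypothetical helper named annual_retention. None of this comes from OCLC's own methodology.

```python
# Fraction of sampled IP addresses still locatable after N years.
# ASSUMPTION: midpoints of the quoted ranges (55-56% and 35-37%) are used,
# and the figures are treated as one cohort purely for illustration.
SURVIVAL = [(0, 1.00), (1, 0.555), (2, 0.36), (4, 0.25)]


def annual_retention(points):
    """Return (start_year, end_year, implied yearly retention) per interval.

    The yearly rate is the geometric average over the interval:
    (survival_end / survival_start) ** (1 / years_elapsed).
    """
    rates = []
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        rate = (s1 / s0) ** (1 / (t1 - t0))
        rates.append((t0, t1, rate))
    return rates


if __name__ == "__main__":
    for start, end, rate in annual_retention(SURVIVAL):
        print(f"years {start}-{end}: about {rate:.1%} retained per year")
```

Under these assumed midpoints, the implied yearly retention is roughly 55 to 56 percent in the first year, about 65 percent in the second, and about 83 percent per year across years two through four. If the assumptions hold, this would suggest that losses are steepest soon after an address is first observed.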

OCLC's annual review points to the instability of Web resources; it doesn't indicate whether those resources still exist elsewhere on the Web or whether the content has changed. While some resources disappear, others become unfindable due to the well-known problem that URLs change. A recent preservation review of the 75 Smithsonian Institution Web sites noted that an exhaustive search could not locate a copy of the first Smithsonian Web site, created in 1995. A URL may persist while content changes wildly: the editors of RLG DigiNews discovered that links in several past issues pointed to lapsed domain names that had been converted by others into pornography sites.

Much attention has been paid to unstable URLs and to creating administrative/preservation metadata, but to date no evidence suggests that research libraries are privileging open access sites that utilize some form of URN [Uniform Resource Name] or that document content change.

With the growing dependence on external digital assets, libraries and archives are undertaking some measures to protect their continued use of these resources. Efforts can be grouped into three areas: collaborating with publishers to preserve licensed content, developing policies and guidelines for creating and maintaining Web sites, and assuming archival custody for Web resources of interest.

Licensed Content

Publishers are developing their own preservation strategies as they realize the commercial benefits of creating deep content databases. Several are working with third parties to back up, store, and refresh digital content. OCLC recently announced the formation of the Digital and Preservation Resources Division to provide integrated solutions for creating, accessing, and preserving digital collections. With planning grants received in 2001 from The Andrew W. Mellon Foundation, seven research libraries and key commercial and scholarly publishers began exploring formal archiving arrangements for e-journals and developing plans for moving toward implementation.

Creating and Maintaining Sites

The World Wide Web Consortium's (W3C) "Web Content Accessibility Guidelines, Techniques, and Checklist" provides some recommendations for good resource management (e.g., use of standard formats and backward-compatible software) and has had a major impact on the development of Web materials worldwide. However, the W3C guidelines do not expressly address content stability, documentation of change, or good database management. In fact, preservation and records management issues are noticeably absent.

In the United States, Web preservation is more directly supported through government policies and guidelines to promote accountability, spurred in part by such legislation as the Paperwork Reduction Act. Governments also are promulgating specific policies and recommendations for preserving government-supported Web content. In January 2001, the U.S. National Commission on Libraries and Information Science published "A Comprehensive Assessment of Public Information Dissemination," which recommended legislation that would "formally recognize and affirm the concept that public information is a strategic national resource." Another recommendation is to "partner broadly, in and outside of government, to ensure permanent public availability of public information resources."

The archivist's perspective has been quite influential, as arguments are advanced to treat Web sites as important records in their own right. National archives in many countries are developing policies and guidelines. The U.S. Federal Records Act, as amended, requires that agencies identify and transfer Web site records to agency recordkeeping systems, including the National Archives and Records Administration (NARA), for permanent retention. NARA has issued several bulletins on the disposition of electronic records that include Web sites. It has also slowly begun to respond to this new form of recordkeeping and has appraised at least one federal Web site as a permanent record. In late 2000, NARA established an initiative to capture a snapshot of all federal Web sites at the end of the Clinton Administration. NARA also has contracted with the San Diego Supercomputer Center for a project to investigate the preservation of presidential Web sites.

The National Library of Australia (NLA) has been a world leader in promulgating guidelines for preservation. In December 2000 the NLA issued "Safeguarding Australia's Web Resources," which provides advice on creating, describing, naming, and managing Web resources. The Council on Library and Information Resources funded NLA's Safekeeping Project, which targets 170 key items accessible through Preserving Access to Digital Information (PADI). NLA staff wrote to the resource managers, encouraging them to voluntarily preserve these materials and outlining nine strategies for long-term access. According to Susan Thomas, PADI administrator, 116 resource owners responded and safekeeping arrangements have been made for 77 items to date. Negotiations are in progress for an additional 33 resources. Eight resource owners lacked the appropriate infrastructures to comply with the recommendations...
