Archiving WWW Sites
Union College, Schenectady, NY
May 12-14, 2002

Preliminary Report on the May 2002 Workshop 

This workshop investigated the feasibility and mechanisms for archiving, storing, and retrieving an institutional WWW site for long-term preservation and access.  The applicability of the results to the more general problem of archiving digital business records, communications, and student records of a college could follow from this discussion.

The initial plan for the two-day workshop was to identify the major issues in web archiving, including but not limited to the following:

  • Identify the various reasons for wanting to archive web-based materials; 

  • Identify archival storage and retrieval mechanisms: 
      i. archival media; 
      ii. archival file types; 
      iii. file type descriptions; 
      iv. method of transfer and backup; 
      v. retrieval and read software and hardware; 
      vi. duplicating software and hardware.

  • Identify the nature and content of WWW site documents and other files that should be considered for archiving; 

  • Identify types of linked subsections to be routinely archived, and the degree of functionality (i.e. hyperlinks) desired or needed in the archive; 

  • Investigate longevity and stability of available storage media; 

  • Investigate longevity and stability of available read/write mechanisms;

  • Address copyright and intellectual property issues.

A very useful introduction to the issues of web archiving can be found in "Web-archiving: an introduction to the issues," presented by Catherine Redfern at the DPC Forum "Web-archiving: managing and archiving online documents and records" on March 25, 2002.

Outline of the workshop

The first morning was spent defining what a web site is, from both a social (information) standpoint and a technical standpoint.  There is no such thing as "the" college web site.  There are many sites, folded in and through each other, with many different levels of software and hardware dependency.  Linking complicates the picture further.  So, when we ask whether and how we can archive the college web site, we need to spend some time and thought on exactly what this means, and on which parts really deserve to be, or even can be, archived for long-term preservation.

In further discussion, we identified three primary reasons for archiving web-based materials:

  • For historical purposes: so that people in the future can understand what the web was and how we used it.  For this type of archive, we want to keep not just content, but also the experience of how the web was used.

  • For institutional purposes: as a record of what the institution was like, what its policies were, and what courses were taught.  Much of this information is now kept both in print and on the web, but if important documents of record are ever transferred exclusively to the web, the institution should find ways of preserving the information.

  • For legal purposes: Many web pages are used to serve information important to personal and financial decisions.  More and more transaction-based web sites are being created, and as the web plays a more important role in the business of an institution, it becomes more important to be able to track those transactions to resolve future disputes.  It becomes important to know not only what information was available, but who entered it, who looked at it, and when.

These different reasons for archiving lead to different archiving strategies.  The choices among them are largely policy decisions rather than technical ones.

Much of the rest of the first day was devoted to the transaction-logging model, suitable for the last of these archiving requirements.  In this model, a "gate-keeper" computer intercepts and makes an archival record of each transaction involving the institution's web site, with the understanding that this might involve many different servers and their underlying database machines.  A minimal sketch of the idea appears below.
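
As a concrete (and deliberately simplified) illustration, the following Python sketch shows one way a gate-keeper might record transactions, here as a WSGI middleware that writes one log entry per request before handing the request on to the real site.  The file name and log fields are assumptions for illustration, not anything specified at the workshop.

    import datetime
    import json

    class GatekeeperMiddleware:
        """Wraps a WSGI application and logs every request it sees."""
        def __init__(self, app, logfile="transactions.log"):
            self.app = app          # the web application being guarded
            self.logfile = logfile  # append-only archival record

        def __call__(self, environ, start_response):
            record = {
                "when":   datetime.datetime.utcnow().isoformat(),
                "who":    environ.get("REMOTE_ADDR", "unknown"),
                "method": environ.get("REQUEST_METHOD", ""),
                "path":   environ.get("PATH_INFO", ""),
            }
            with open(self.logfile, "a") as log:
                log.write(json.dumps(record) + "\n")
            return self.app(environ, start_response)

A real gate-keeper would also have to record what was actually served, not just what was asked for, sit in front of every server involved, and protect the log itself from tampering.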

A crucial element in this model, and, as it turns out, in other archiving models as well, is the presence of metadata tags within each web page.  Briefly, meta-tags include fundamental information describing the page: date, author, owner, access restrictions, type of data, software used to create the page, software needed to view the page, etc.  (See the workshop Links page for references to meta-tag standards.)
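
To make this concrete, here is a small Python sketch, using only the standard library, of reading such meta-tags back out of a page.  The particular tag names and values are illustrative only; real pages would follow whatever metadata standard the institution adopts.

    from html.parser import HTMLParser

    class MetaTagReader(HTMLParser):
        """Collects name/content pairs from <meta> tags in a page."""
        def __init__(self):
            super().__init__()
            self.meta = {}

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                attrs = dict(attrs)
                if "name" in attrs and "content" in attrs:
                    self.meta[attrs["name"]] = attrs["content"]

    page = ('<html><head>'
            '<meta name="author" content="Registrar">'
            '<meta name="date" content="2002-05-14">'
            '</head></html>')
    reader = MetaTagReader()
    reader.feed(page)
    print(reader.meta)   # {'author': 'Registrar', 'date': '2002-05-14'}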

Because of this, questions about metadata and the classification of digital records also merit consideration.  Standard archival and library methods for classifying and indexing these documents may no longer be adequate, especially since many of these documents will carry their own metadata as assigned by the site authors (metadata which may be terrible for information-retrieval purposes).  Creation of institution-wide, if not wider, standards for metadata will be necessary if archiving is to be useful.

On the last morning of the workshop, we discussed more practical issues in archiving, including archiving for historical rather than legal reasons.  We discussed the "snapshot" method of archiving, in which, rather than capturing every transaction and every change in a college web site, we regularly (quarterly or annually) capture as much of the site as possible and store it on a relatively safe medium, with instructions (and perhaps software) for viewing the pages.  The rapid advance of web hardware and software, and the rapid evolution of storage media, mean that questions of logical and physical readability will arise immediately, as will questions of long-term logical and physical formats.  Nevertheless, the "snapshot" method seems to come closest to what the librarians in the group had in mind for a web archive.
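
A bare-bones Python sketch of a snapshot in this spirit follows.  It captures a single page only; the URL and directory names are placeholders, and a real capture would also have to follow links and collect images and stylesheets.

    import datetime
    import os
    import urllib.request

    def snapshot(url, archive_root="web-archive"):
        """Fetch one page and file it under a dated directory."""
        stamp = datetime.date.today().isoformat()      # e.g. "2002-05-14"
        outdir = os.path.join(archive_root, stamp)
        os.makedirs(outdir, exist_ok=True)
        with urllib.request.urlopen(url) as response:
            body = response.read()
        with open(os.path.join(outdir, "index.html"), "wb") as f:
            f.write(body)
        # Leave a note for future readers about how the capture was made.
        with open(os.path.join(outdir, "README.txt"), "w") as f:
            f.write("Captured %s on %s with urllib.\n" % (url, stamp))

    snapshot("http://www.union.edu/")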

Conclusions and next steps

The transaction-logging methodology discussed at the workshop seems challenging to implement, at best, and concerns were expressed that even if accessed pages were systematically logged with complete metadata tags describing both the technical dependencies and the content of the pages, we still might not have anything usable at the end of the day.  

A more realistic approach seemed to be to develop some best practices for archiving web sites based on the "snapshot" model.  While our facilitator cautioned that this practice would have little value from an information-management standpoint, it seemed more likely to succeed, and more likely to produce something usable down the road.

We began identifying some existing methods of taking web snapshots, and have built a considerable bibliography on the general subject of web archiving.  The next step will be a somewhat formal evaluation of some of these tools, from both IT and librarians' perspectives, with an eye toward using one of them as the basis for a larger, ongoing system of best practices for archiving college web sites.

There are still many issues left to solve.  One is ensuring systematic migration to media (and software) that remain readable; another is the thorny issue of web pages generated on the fly from underlying databases.
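
On the first of these issues, one common safeguard (offered here as a hedged sketch, not a workshop recommendation) is to record a checksum for every archived file, so that each migration to new media can be verified bit-for-bit.  The directory name is a placeholder.

    import hashlib
    import os

    def checksum_tree(root):
        """Return a {path: SHA-256 hex digest} map for every file under root."""
        sums = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    sums[path] = hashlib.sha256(f.read()).hexdigest()
        return sums

    before = checksum_tree("web-archive")
    # ... copy "web-archive" to the new medium, recompute the map there,
    # and compare the two; any mismatch flags a damaged or altered file.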

Nevertheless, the first lesson is:  Think about what it is that you need to preserve, and why.  Then start asking the technical questions.  Not the other way around.  The solutions are not one-size-fits-all, because the problems are not.

The second lesson is:  The chances are good that nothing you do now, if you do not rethink, refresh, and migrate to newer media, will last more than a few decades at best.


Sponsored by Union College and the Center for Educational Technology