Preliminary Report on the May 2002 Workshop

This workshop investigated the feasibility of, and mechanisms for, archiving, storing, and retrieving an institutional WWW site for long-term preservation and access. The results may also bear on the more general problem of archiving a college's digital business records, communications, and student records. The initial plan for the two-day workshop was to identify the major issues in web archiving, including but not limited to the following:
A very useful introduction to the issues of web archiving is "Web-archiving: an introduction to the issues," presented by Catherine Redfern to the DPC Forum "Web-archiving: managing and archiving online documents and records" on March 25, 2002.

Outline of the workshop

The first morning was spent defining what a web site is, from both a social (information) standpoint and a technical standpoint. There is no such thing as "the" college web site. There are many sites, folded in and through each other, each with its own levels of software and hardware dependency. Linking complicates the picture further. So when we ask whether and how we can archive the college web site, we need to spend some time and thought on exactly what this means, and on which parts really deserve to be, or even can be, archived for long-term preservation. In further discussion, we identified three primary reasons for archiving web-based materials:
These reasons for archiving lead to different archiving strategies, and the choices among them are largely policy decisions rather than technical ones. Much of the rest of the first day was spent on the transaction-logging model, which suits the last of these archiving requirements. In this model, a "gate-keeper" (a computer) intercepts and makes an archival record of each transaction involving the institution's web site, understanding that a single transaction might involve many different servers and their underlying database machines.

A crucial element in this model, and, as it turns out, in other archiving models, is the presence of metadata tags within each web page. Briefly, meta-tags carry fundamental information describing the page: date, author, owner, access restrictions, type of data, software used to create the page, software needed to view the page, etc. (See the workshop Links page for references to meta-tag standards.) Questions about metadata and the classification of digital records therefore also merit consideration. Standard archival and library methods for classifying and indexing these documents may no longer be adequate, especially since many of these documents will carry their own metadata as assigned by the site authors (metadata which may be terrible for information retrieval purposes). Creation of institution-wide, if not wider, standards for metadata will be necessary for archiving to be useful.
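To make the gate-keeper idea concrete, here is a minimal sketch, in Python, of such an intercepting server: it passes each request through to the real web server and appends one archival record per transaction. The backend address, log location, and record fields are illustrative assumptions, not anything prescribed at the workshop.

    # Hypothetical gate-keeper: a pass-through HTTP server that fetches each
    # requested page from the real web server and appends one archival record
    # per transaction.  Backend address, log location, and record fields are
    # illustrative assumptions.
    import json
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.request import urlopen

    BACKEND = "http://127.0.0.1:8080"   # assumed address of the real web server
    LOGFILE = "transactions.jsonl"      # one JSON record per line

    class GateKeeper(BaseHTTPRequestHandler):
        def do_GET(self):
            # Fetch the page from the backend on the client's behalf.
            with urlopen(BACKEND + self.path) as upstream:
                status = upstream.status
                content_type = upstream.headers.get("Content-Type", "text/html")
                body = upstream.read()

            # Archival record for this transaction.  In a fuller system the
            # author/owner/rights fields would be read from meta-tags in the
            # page itself rather than left out.
            record = {
                "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
                "client": self.client_address[0],
                "path": self.path,
                "status": status,
                "content_type": content_type,
                "size": len(body),
            }
            with open(LOGFILE, "a", encoding="utf-8") as log:
                log.write(json.dumps(record) + "\n")

            # Hand the unmodified page back to the client.
            self.send_response(status)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8000), GateKeeper).serve_forever()

Even this toy version hints at the difficulty noted in the conclusions below: every server and database behind the site would have to sit behind such a gate-keeper, and the usefulness of each record depends on the quality of the metadata it can capture.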
On the last morning of the workshop, we discussed more practical issues in archiving, including archiving for historical rather than legal reasons. We discussed the "snapshot" method of archiving, in which, rather than capturing every transaction and every change in a college web site, we regularly (quarterly or annually) capture as much of the site as possible and store it on a relatively safe medium, with instructions (and perhaps software) for viewing the pages. The rapid advance of web hardware and software, and the rapid evolution of storage media, mean that questions of logical and physical readability arise immediately, as do questions of long-term formats, both logical and physical.
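As a deliberately simplified illustration of the snapshot approach, the Python sketch below crawls one site, saves the HTML pages it can reach, and writes a manifest recording what was captured and when. The starting URL, file layout, and page limit are assumptions for illustration; real snapshot tools also handle images, scripts, dynamic pages, and link rewriting, which this sketch does not.

    # Hypothetical "snapshot" capture: a small breadth-first crawler that saves
    # every HTML page it can reach within one site, plus a manifest describing
    # when each page was captured.  Starting URL, output layout, and page limit
    # are illustrative assumptions.
    import json
    import os
    import time
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    START = "http://www.example-college.edu/"   # site to snapshot (assumed)
    OUTDIR = "snapshot-" + time.strftime("%Y%m%d")
    LIMIT = 500                                 # safety cap on pages fetched

    class LinkCollector(HTMLParser):
        """Collect href targets from <a> tags on one page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def local_path(url):
        """Map a URL to a file path inside the snapshot directory."""
        path = urlparse(url).path.strip("/") or "index"
        return os.path.join(OUTDIR, path.replace("/", "__") + ".html")

    def snapshot():
        os.makedirs(OUTDIR, exist_ok=True)
        queue, seen, manifest = [START], set(), []
        while queue and len(seen) < LIMIT:
            url = queue.pop(0)
            # Stay within the starting site and skip pages already captured.
            if url in seen or urlparse(url).netloc != urlparse(START).netloc:
                continue
            seen.add(url)
            try:
                with urlopen(url) as resp:
                    if "text/html" not in resp.headers.get("Content-Type", ""):
                        continue
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue
            dest = local_path(url)
            with open(dest, "w", encoding="utf-8") as f:
                f.write(html)
            manifest.append({"url": url, "file": dest,
                             "captured": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                                        time.gmtime())})
            collector = LinkCollector()
            collector.feed(html)
            queue.extend(urljoin(url, link) for link in collector.links)
        with open(os.path.join(OUTDIR, "manifest.json"), "w", encoding="utf-8") as f:
            json.dump(manifest, f, indent=2)

    if __name__ == "__main__":
        snapshot()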
Conclusions and next steps

The transaction-logging methodology discussed at the workshop seems challenging to implement, at best, and concerns were expressed that even if accessed pages were systematically logged with complete metadata tags describing both the technical dependencies and the content of the pages, we still might not have anything usable at the end of the day. A more realistic method seemed to be to develop a set of best practices for archiving web sites based on the "snapshot" model. While our facilitator cautioned that this practice would have little value from an information-management standpoint, it seemed to have a better likelihood of succeeding, and a good likelihood of producing something that is usable down the road.
We began to identify some existing methods of taking web snapshots, and have built a considerable bibliography on the general subject of web archiving. The next step will be a somewhat formal evaluation of some of these tools, from both IT and librarians' perspectives, with an eye toward using one of these applications as the basis for an ongoing, larger system of best practices for archiving college web sites. There are still many issues left to solve: one is ensuring systematic migration to media (and software) that continue to be viewable; another is the thorny issue of web pages generated on the fly from underlying databases.
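One small, concrete safeguard for the migration issue is to record a checksum for every archived file and re-verify those checksums each time the snapshot is copied to new media. The sketch below is an assumed illustration of that practice, not a workshop recommendation; the file names and directory layout follow the snapshot example above.

    # Hypothetical fixity check: record SHA-256 checksums for every file in a
    # snapshot directory, then verify them after the snapshot has been copied
    # to new media.  Paths and file names are illustrative assumptions.
    import hashlib
    import json
    import os
    import sys

    def file_digest(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def record(snapshot_dir):
        """Write checksums.json listing a digest for every file in the snapshot."""
        sums = {}
        for root, _dirs, files in os.walk(snapshot_dir):
            for name in files:
                if name == "checksums.json":
                    continue
                path = os.path.join(root, name)
                sums[os.path.relpath(path, snapshot_dir)] = file_digest(path)
        with open(os.path.join(snapshot_dir, "checksums.json"), "w") as f:
            json.dump(sums, f, indent=2)

    def verify(snapshot_dir):
        """Return the files that are missing or whose contents have changed."""
        with open(os.path.join(snapshot_dir, "checksums.json")) as f:
            sums = json.load(f)
        problems = []
        for relpath, digest in sums.items():
            path = os.path.join(snapshot_dir, relpath)
            if not os.path.exists(path) or file_digest(path) != digest:
                problems.append(relpath)
        return problems

    if __name__ == "__main__":
        # Usage: python fixity.py record|verify <snapshot-dir>
        action, directory = sys.argv[1], sys.argv[2]
        if action == "record":
            record(directory)
        else:
            print(verify(directory) or "all files verified")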
Nevertheless, the first lesson is: Think about what it is that you need to preserve, and
why. Then start asking the technical questions. Not the
other way around. The solutions are not one-size-fits-all, because
the problems are not. The second lesson is: The chances are good that nothing you do now, if you do not rethink, refresh, and migrate to newer media, will last more than a few decades at best.
Sponsored by Union College and the Center for Educational Technology.