Thursday, February 26, 2015

The Wayback Machine and the Cornell Web Lab

The D-Lib article "A Research Library Based on the Historical Collections of the Internet Archive," published in February 2006, describes an initiative at Cornell University to create a web archive for social science research in partnership with the Internet Archive.

While the Wayback Machine, described in this excellent January 2015 New Yorker article, archives as much of the web as it can (450 billion pages as of this writing), it is not indexed or readily searchable other than by URL or date. For social scientists looking to do serious analysis of social trends, more manageable (though still enormous), indexed collections are needed. Cornell's project aimed to harvest collections of archived web pages, approximately 10 billion at the time of the article, and develop methods of automated indexing to make them useful to researchers. Access would be through scripts or APIs. Designers envisioned researchers conducting projects to trace the development of ideas across the internet, follow the spread of rumors and news, and investigate the influence of social networks.

As William Arms, one of the developers of the Cornell project, notes in this article from 2008, researchers were spending as much as 90% of their time simply obtaining and cleaning up their data. "The Web Lab's strategy," he explains, "is to copy a large portion of the Internet Archive's Web collection to Cornell, mount it on a powerful computer system, organize it so that researchers have great flexibility in how they use it, and provide tools and services that minimize the effort required to use the data in research."

The project exists today as the Cornell Web Lab. Although direct links to its site were returning errors as of this writing, a detailed description of current activities can be found on this faculty page, and the suite of tools and services is still available through SourceForge.
