Crawled URL Index - JISC UK Web Domain Dataset (1996-2013)

& Andrew Jackson
The dataset comprises original compound index (CDX) files that have been re-assembled into 18 separate CDX files, one for each year of crawling activity represented (1996-2013). Please note that the individual CDX files are not sorted. To enable access to web archives, UKWA uses CDX files as indexes, making it possible to look up which ARC or WARC files contain which URLs and responses. In partnership with the...
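As an illustration of the lookup described above, the following is a minimal Python sketch that scans CDX lines for a given URL and reports which ARC or WARC file holds each capture. It assumes the common space-delimited 11-field layout announced by a header line of the form ` CDX N b a m s k r M S V g`; the files in this dataset may declare a different field order in their own header, so the layout should be checked before use. The function names and field labels (`find_captures`, `parse_header`, `original_url`, and so on) are hypothetical and chosen for this example only. Because the individual files are not sorted, the sketch does a linear scan rather than a binary search.

```python
# Assumed 11-field layout from a typical CDX header line. The files in this
# dataset may declare a different order in their own header line.
ASSUMED_HEADER = " CDX N b a m s k r M S V g"

# Hypothetical readable labels for the single-letter CDX field codes.
FIELD_NAMES = {
    "N": "urlkey",           # canonicalised ("massaged") URL
    "b": "timestamp",        # capture date
    "a": "original_url",     # URL as crawled
    "m": "mime",             # MIME type of the response
    "s": "status",           # HTTP response code
    "k": "digest",           # checksum of the content
    "r": "redirect",         # redirect target, or "-"
    "M": "meta",             # meta tags, or "-"
    "S": "compressed_size",  # compressed record size
    "V": "offset",           # offset of the record in the ARC/WARC file
    "g": "filename",         # name of the ARC/WARC file
}


def parse_header(line):
    """Turn a CDX header line into a list of field labels."""
    parts = line.split()
    if not parts or parts[0] != "CDX":
        raise ValueError("not a CDX header line: %r" % line)
    return [FIELD_NAMES.get(code, code) for code in parts[1:]]


def find_captures(lines, url, header=ASSUMED_HEADER):
    """Yield one dict per capture of `url`, scanning every line.

    The scan is linear because the CDX files are not sorted.
    """
    fields = parse_header(header)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # A header line inside the data overrides the assumed layout.
        if line.lstrip().startswith("CDX "):
            fields = parse_header(line)
            continue
        values = line.split(" ")
        if len(values) != len(fields):
            continue  # skip lines that do not match the declared layout
        record = dict(zip(fields, values))
        if record.get("original_url") == url:
            yield record
```

A call such as `find_captures(open(path, encoding="utf-8", errors="replace"), "http://www.example.co.uk/")`, where `path` and the URL are placeholders, would yield records whose `filename` and `offset` entries identify where each response is stored. For repeated lookups it would likely be worth sorting the files first so that a binary search becomes possible.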