The WARCnet Code Book of web archive data formats

WG5: The WARCnet Code Book of web archive data formats
June 2022, London
Sharon Healy, Karin de Wild, Niels Brügger, Peter Webster, Márton Németh, Vladimir Tybin

The aim of Working Group 5 is to discuss and formulate possible data formats
and thereby create a shared language between web archiving institutions and
research communities.
Projects:
1. A shared data vocabulary to request archived Web data
2. A glossary with terms used in web archive research (WG3)

Project 1:
Shared data vocabulary for requesting archived Web data

SolrWayback
SolrWayback is a search engine that can retrieve data from WARC files.
• Free text search in all resources (HTML pages, PDFs, metadata for different media types, URLs, etc.)
• CSV export of search results (with custom field selection).
• Image search (similar to Google Images).
• Visualization of search results such as:
- Interactive network graph (incoming/outgoing links)
- Statistics over time (e.g. size, number of incoming and outgoing links, etc.)

Ulrich Have (in an email when WG5 was established):
“a standard data format would be interesting as a kind of
future requirements document for researcher-ready-data”

Niels Brügger: a systematic description of data formats for web archive studies

Actions:
Existing data vocabularies
• Web archives
• Controlled vocabularies (schema.org, Wikidata, CIDOC-CRM, Dublin Core, etc.)
Datathons
• Identify data requests
• Identify terms / variables
• Write / improve definitions

Interoperability
Using controlled vocabularies offers the potential to request and interlink (machine-readable) data across digital heritage
collections.

Request #1
CDX (listing of all the resources within the Web archive)
• Domain
• Host
• Full resource URL
• Crawl date
• Hash
• Resource format (PDF, HTML, etc.)
• Link to instance in Wayback
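
As an illustration only, the fields of Request #1 could be carried in one record per captured resource and exported as CSV. The sketch below is not a WARCnet specification: the names `ResourceRecord`, `make_record` and `to_csv`, the base address `WAYBACK_BASE`, and the 14-digit timestamp form of the crawl date are all assumptions. The domain is derived with a deliberately naive rule (last two labels of the host), which is wrong for suffixes such as .co.uk.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from urllib.parse import urlsplit

# Hypothetical Wayback endpoint; replace with the archive's own address.
WAYBACK_BASE = "https://wayback.example.org/wayback"


@dataclass
class ResourceRecord:
    domain: str
    host: str
    url: str
    crawl_date: str        # assumed form: YYYYMMDDhhmmss
    digest: str            # the "Hash" field of Request #1
    resource_format: str   # e.g. a MIME type
    wayback_link: str


def naive_domain(host: str) -> str:
    """Last two labels of the host; a simplification, not a public-suffix lookup."""
    return ".".join(host.split(".")[-2:])


def make_record(url: str, crawl_date: str, digest: str, resource_format: str) -> ResourceRecord:
    host = urlsplit(url).hostname or ""
    return ResourceRecord(
        domain=naive_domain(host),
        host=host,
        url=url,
        crawl_date=crawl_date,
        digest=digest,
        resource_format=resource_format,
        wayback_link=f"{WAYBACK_BASE}/{crawl_date}/{url}",
    )


def to_csv(records) -> str:
    """Serialise records to CSV text with one header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(ResourceRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()
```

A call such as `to_csv([make_record("http://www.example.org/report.pdf", "20220601120000", "sha1:ABC", "application/pdf")])` would yield a header row followed by one data row.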

Request #2
Seeds and crawl policies
• seed URL
• crawl frequency (daily, weekly, etc.)
• capped? (yes/no)
• first crawl date
• last crawl date (or ongoing)
• crawl depth

Request #3
Links
• Source URL (full)
• Source URL (host)
• Source URL (domain)
• Source File Format (.html, etc.)
• Target URL (full)
• Target Host
• Target Domain
• Capture Date
• Link to source resource in Wayback
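
Most of the Request #3 fields can be derived from three inputs: the source URL, the target URL and the capture date. The sketch below shows one way to do that. The function name `link_row`, the key names, the base address `WAYBACK_BASE` and the timestamp form of the capture date are hypothetical, the domain rule is again the naive "last two labels" simplification, and the file format is read from the URL path extension rather than from the recorded content type.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical Wayback endpoint; replace with the archive's own address.
WAYBACK_BASE = "https://wayback.example.org/wayback"


def host_of(url: str) -> str:
    return urlsplit(url).hostname or ""


def naive_domain(host: str) -> str:
    """Last two labels of the host; a simplification, not a public-suffix lookup."""
    return ".".join(host.split(".")[-2:])


def link_row(source_url: str, target_url: str, capture_date: str) -> dict:
    """Build one row of the link list from a single observed hyperlink."""
    source_host = host_of(source_url)
    target_host = host_of(target_url)
    # Extension of the source path, e.g. ".html"; empty if the path has none.
    source_format = posixpath.splitext(urlsplit(source_url).path)[1]
    return {
        "source_url": source_url,
        "source_host": source_host,
        "source_domain": naive_domain(source_host),
        "source_file_format": source_format,
        "target_url": target_url,
        "target_host": target_host,
        "target_domain": naive_domain(target_host),
        "capture_date": capture_date,
        "wayback_link": f"{WAYBACK_BASE}/{capture_date}/{source_url}",
    }
```

Rows of this shape can be loaded directly as an edge list for the kind of network graph mentioned earlier, using the host or domain columns as nodes.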

Request #5
Metadata on a dataset
• Web archive
• Provenance (based on W3C PROV)
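
To make the provenance idea concrete, the sketch below describes a derived dataset with the core W3C PROV concepts: an entity (the dataset), an activity (the extraction) and an agent (the web archive), linked by wasGeneratedBy and wasAssociatedWith. The layout is loosely modelled on PROV-JSON and is not a validated serialisation; the function name `dataset_provenance`, the `ex:` prefix and all identifiers are hypothetical.

```python
import json


def dataset_provenance(dataset_id: str, archive_name: str, extraction_time: str) -> dict:
    """Return a PROV-style description of a dataset derived from a web archive."""
    entity = f"ex:{dataset_id}"
    return {
        "prefix": {"ex": "https://example.org/"},  # hypothetical namespace
        "entity": {entity: {"prov:label": "Derived dataset"}},
        "activity": {"ex:extraction": {"prov:endTime": extraction_time}},
        "agent": {"ex:archive": {"prov:label": archive_name}},
        # The dataset was generated by the extraction activity ...
        "wasGeneratedBy": {
            "_:g1": {"prov:entity": entity, "prov:activity": "ex:extraction"}
        },
        # ... which was carried out by the web archive.
        "wasAssociatedWith": {
            "_:a1": {"prov:activity": "ex:extraction", "prov:agent": "ex:archive"}
        },
    }


def to_json(document: dict) -> str:
    return json.dumps(document, indent=2, sort_keys=True)
```

A record like this could accompany every delivered dataset, so that a researcher can state which archive produced it and when.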

Next steps:
Collect existing data vocabularies
• Web archives
• Controlled vocabularies (schema.org, Wikidata, CIDOC-CRM, Dublin Core, etc.)
Datathons
• Identify data requests
• Identify terms / variables
• Write / improve definitions

Project 2:
Glossary for Web archival studies

Next steps:
• Add terms to the reference list in Zotero
• Add definitions to the reference list in Zotero
• Select terms for a glossary for early career researchers using Web archives
