WG5: The WARCnet Code Book of web archive data formats
June 2022, London
Sharon Healy, Karin de Wild, Niels Brügger, Peter Webster, Márton Németh, Vladimir Tybin
The aim of Working Group 5 is to discuss and formulate possible data formats
and thereby create a shared language between web archiving institutions and
research communities.
Projects:
1. A shared data vocabulary to request archived Web data
2. A glossary with terms used in web archive research (WG3)
Project 1:
Shared data vocabulary for requesting
archived Web data
SolrWayback
SolrWayback is a search engine that can retrieve data from WARC files.
• Free text search in all resources (HTML pages, PDFs, metadata for different media types, URLs, etc.)
• CSV export of search results (with custom field selection)
• Image search (similar to Google Images)
• Visualization of search results, such as:
- Interactive network graph (incoming/outgoing links)
- Statistics over time (e.g. size, number of incoming and outgoing links)
Ulrich Have (in an email when WG5 was established):
“a standard data format would be interesting as a kind of
future requirements document for researcher-ready-data”
Niels Brügger, a systematic description of data formats for web archive studies
Actions:
Existing data vocabularies
• Web archives
• Controlled vocabularies (schema.org, Wikidata, CIDOC-CRM, Dublin Core, etc.)
Datathons
• Identify data requests
• Identify terms / variables
• Write / improve definitions
Wikidata
Interoperability
Using controlled vocabularies offers the potential to request and interlink (machine-readable) data across digital heritage
collections.
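As a rough illustration of the interoperability point, the sketch below relabels local column names with Dublin Core term IRIs. The local column names (`url`, `crawl_date`, `resource_format`) and the choice of target terms are assumptions made for this example, not a mapping agreed by WG5. Columns without a mapping are passed through unchanged.

```python
# Namespace of the Dublin Core Metadata Terms vocabulary.
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical mapping from local column names to vocabulary terms.
FIELD_TO_TERM = {
    "url": DCTERMS + "identifier",
    "crawl_date": DCTERMS + "date",
    "resource_format": DCTERMS + "format",
}


def relabel(row):
    """Return a copy of row keyed by term IRIs wherever a mapping exists."""
    return {FIELD_TO_TERM.get(key, key): value for key, value in row.items()}
```

Two archives that export under different local column names could then be compared on the shared term IRIs rather than on their own labels.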
Request #1
CDX (listing of all the resources within the Web archive)
• Domain
• Host
• Full resource URL
• Crawl date
• Hash
• Resource format (PDF, HTML, etc.)
• Link to instance in Wayback
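The listing in Request #1 could be delivered as a CSV file with one row per captured resource. The following is a minimal sketch under stated assumptions: the column names, the `WAYBACK_BASE` address, and the helper names are hypothetical, and the domain is derived naively from the host.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from urllib.parse import urlsplit

# Hypothetical replay endpoint; each archive would substitute its own.
WAYBACK_BASE = "https://wayback.example.org/wayback"


@dataclass
class ResourceRecord:
    """One row of the Request #1 listing (column names are illustrative)."""
    domain: str
    host: str
    url: str
    crawl_date: str        # 14-digit timestamp, YYYYMMDDhhmmss
    digest: str            # content hash as recorded by the archive
    resource_format: str   # e.g. "text/html" or "application/pdf"
    wayback_link: str


def make_record(url, crawl_date, digest, resource_format):
    host = urlsplit(url).hostname or ""
    # Naive assumption: the domain is the last two labels of the host.
    # This is wrong for suffixes such as "co.uk"; a real export would
    # need a public-suffix lookup instead.
    domain = ".".join(host.split(".")[-2:])
    link = f"{WAYBACK_BASE}/{crawl_date}/{url}"
    return ResourceRecord(domain, host, url, crawl_date, digest,
                          resource_format, link)


def to_csv(records):
    """Serialise records as CSV text with a header row."""
    out = io.StringIO()
    names = [f.name for f in fields(ResourceRecord)]
    writer = csv.DictWriter(out, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return out.getvalue()
```

Calling `to_csv([make_record(...)])` yields a header line followed by one line per resource, which can be opened directly in a spreadsheet or loaded into a data frame.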
Request #2
Seeds and crawl policies
• seed URL
• crawl frequency (daily, weekly, etc.)
• capped? (yes/no)
• first crawl date
• last crawl date (or ongoing)
• crawl depth
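The fields of Request #2 can be sketched as a small record type with basic validation. The set of allowed frequencies, the field names, and the convention that a missing last crawl date means "ongoing" are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import Optional

# Illustrative list; an archive would supply its own controlled values.
FREQUENCIES = {"daily", "weekly", "monthly", "quarterly", "annually", "once"}


@dataclass
class SeedPolicy:
    seed_url: str
    crawl_frequency: str
    capped: bool
    first_crawl_date: date
    last_crawl_date: Optional[date]  # None means the crawl is ongoing
    crawl_depth: int

    def __post_init__(self):
        if self.crawl_frequency not in FREQUENCIES:
            raise ValueError(f"unknown frequency: {self.crawl_frequency}")
        if (self.last_crawl_date is not None
                and self.last_crawl_date < self.first_crawl_date):
            raise ValueError("last crawl date precedes first crawl date")

    @property
    def ongoing(self):
        return self.last_crawl_date is None

    def to_json(self):
        """Serialise with ISO dates; an open-ended crawl becomes 'ongoing'."""
        row = asdict(self)
        row["first_crawl_date"] = self.first_crawl_date.isoformat()
        row["last_crawl_date"] = (
            self.last_crawl_date.isoformat() if self.last_crawl_date else "ongoing"
        )
        return json.dumps(row)
```

Validating at construction time means a malformed row is rejected when the dataset is assembled rather than when a researcher first tries to use it.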
Request #3
Links
• Source URL (full)
• Source URL (host)
• Source URL (domain)
• Source File Format (.html, etc.)
• Target URL (full)
• Target Host
• Target Domain
• Capture Date
• Link to source resource in Wayback
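A link dataset of this shape lends itself to host-level network analysis. The sketch below builds simplified link rows and counts host-to-host edges. The key names are hypothetical, and the domain, file format, and Wayback link columns are left out for brevity.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the host part of a URL, or an empty string if there is none."""
    return urlsplit(url).hostname or ""


def link_row(source_url, target_url, capture_date):
    """Build one simplified Request #3 row (key names are illustrative)."""
    return {
        "source_url": source_url,
        "source_host": host_of(source_url),
        "target_url": target_url,
        "target_host": host_of(target_url),
        "capture_date": capture_date,
    }


def host_edges(rows):
    """Count how often each source host links to each target host."""
    return Counter((row["source_host"], row["target_host"]) for row in rows)
```

The resulting counter maps `(source_host, target_host)` pairs to link counts and can be passed to a graph library as a weighted edge list.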
Request #4
Webpages
To do.
Request #5
Metadata on a dataset
• Web archive
• Provenance (based on W3C PROV)
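Provenance for a delivered dataset could be recorded with the core W3C PROV concepts of entity, activity, and agent. The sketch below arranges them in a dictionary loosely following the PROV-JSON layout. The `ex:` identifiers, the labels, and the function name are hypothetical, and the slides do not specify which PROV serialisation WG5 intends.

```python
import json


def dataset_provenance(dataset_id, archive_id, extraction_id, started, ended):
    """Describe a dataset (entity) extracted (activity) by an archive (agent)."""
    return {
        "prefix": {"ex": "http://example.org/"},
        "entity": {dataset_id: {"prov:label": "Derived web archive dataset"}},
        "activity": {
            extraction_id: {"prov:startTime": started, "prov:endTime": ended}
        },
        "agent": {archive_id: {"prov:label": "Web archive"}},
        # The dataset was produced by the extraction run ...
        "wasGeneratedBy": {
            "_:g1": {"prov:entity": dataset_id, "prov:activity": extraction_id}
        },
        # ... which was carried out by the archive ...
        "wasAssociatedWith": {
            "_:a1": {"prov:activity": extraction_id, "prov:agent": archive_id}
        },
        # ... so the dataset is attributed to the archive.
        "wasAttributedTo": {
            "_:t1": {"prov:entity": dataset_id, "prov:agent": archive_id}
        },
    }


doc = dataset_provenance(
    "ex:dataset1", "ex:archive", "ex:extraction1",
    "2022-06-01T09:00:00", "2022-06-01T11:30:00",
)
provenance_json = json.dumps(doc, indent=2)
```

Shipping such a record alongside the data lets a researcher cite which archive produced the dataset and when the extraction ran.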
Next steps:
Collect existing data vocabularies
• Web archives
• Controlled vocabularies (schema.org, Wikidata, CIDOC-CRM, Dublin Core, etc.)
Datathons
• Identify data requests
• Identify terms / variables
• Write / improve definitions
Project 2:
Glossary for Web archival studies
Next steps:
• Add terms to the reference list in Zotero
• Add definitions to the reference list in Zotero
• Select terms for a glossary for early career researchers using Web archives
