Archiving the French Web:
the BnF web archiving workflow
Sara Aubry
Web Archiving Project Manager, IT department
Bibliothèque nationale de France
International Conference on Web archives and e-LD
Biblioteca Nacional de España, Madrid, July 9th 2013
Let’s start with some figures
• Programme started in 2000, industrialised in 2008-2012
• Collections:
– 1996 - now
– 20 000 websites for focused crawls, 2.5 million .fr domains for broad crawls
– 18.8 billion URLs, 370 TB, growing by 100 TB per year
• Resources:
– 9 Full Time Employees (5 librarians, 4 engineers)
– many partners inside and outside the Library, at both the national and international level
– 70 robots (648 GB RAM, 144 CPUs at 2.4 GHz)
Digital curation is not different!
• « Actions, tools and practices defined
and applied to collect, identify, select,
organize and preserve digital contents
(…) in order to use them and make them
available (…) »
Definition of Digital Archiving in Wikipedia
BnF workflow overview
Selecting
Collecting
Indexing
Accessing
Preserving
nas_preload
Selecting with BCWeb
• A form-based application, commonly called a
« curator tool »
– for content curators and researchers to nominate
websites to harvest
– giving basic information about them (content policies,
trends watch)
• Most important information for each website:
– Internet address/URL
– frequency (daily, monthly, yearly, once…)
– size/budget (small, medium, big)
– depth (entire domain, part of it)
Content curators
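The per-website information listed above can be pictured as a small record. The following is only an illustrative sketch: the class and field names (`Nomination`, `Frequency`, `Budget`, `Depth`) are hypothetical and are not taken from BCWeb, and the enumerations contain only the values named on the slide (the slide's frequency list is open-ended).

```python
from dataclasses import dataclass
from enum import Enum


class Frequency(Enum):
    # Only the values named on the slide; the real list may be longer.
    DAILY = "daily"
    MONTHLY = "monthly"
    YEARLY = "yearly"
    ONCE = "once"


class Budget(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    BIG = "big"


class Depth(Enum):
    ENTIRE_DOMAIN = "entire domain"
    PART_OF_DOMAIN = "part of it"


@dataclass(frozen=True)
class Nomination:
    """Hypothetical record for one nominated website."""
    url: str
    frequency: Frequency
    budget: Budget
    depth: Depth


# Example nomination with a placeholder URL.
nomination = Nomination(
    url="http://www.example.org/",
    frequency=Frequency.MONTHLY,
    budget=Budget.SMALL,
    depth=Depth.ENTIRE_DOMAIN,
)
print(nomination.url, nomination.frequency.value, nomination.budget.value)
```

Using enumerations rather than free text keeps the choices a curator makes in the form limited to values the crawl scheduler can interpret.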
The Web is made of HTML pages
1 HTML page, 48 URLs
• 1 HTML
• 1 text/css
• 4 javascript
• 17 image/png
• 5 image/jpeg
• 21 image/gif
All links and inclusions are URL references
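To illustrate how a single page expands into many URL references, here is a minimal sketch that collects links and inclusions from an HTML document using Python's standard library. It is an assumption-laden simplification: it looks only at `href` and `src` attributes, the sample page and `example.org` URLs are made up, and a production harvester inspects many more places (CSS, scripts, other attributes).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes assumed to carry URL references in this sketch.
URL_ATTRS = {"href", "src"}


class ReferenceExtractor(HTMLParser):
    """Collects absolute URLs found in href/src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.references = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                # Resolve relative references against the page URL.
                self.references.append(urljoin(self.base_url, value))


def extract_references(html, base_url):
    parser = ReferenceExtractor(base_url)
    parser.feed(html)
    return parser.references


# Made-up sample page: one stylesheet, one script, one image, one link.
SAMPLE = """<html><head>
<link rel="stylesheet" href="/style.css">
<script src="app.js"></script>
</head><body>
<img src="img/logo.png">
<a href="http://www.example.org/about">About</a>
</body></html>"""

refs = extract_references(SAMPLE, "http://www.example.org/index.html")
for ref in refs:
    print(ref)
```

Each printed URL is something a crawler would have to request separately in order to redisplay the page faithfully.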
Harvesting with Heritrix
• A harvester is a piece of software (crawler, spider, robot)
• Simulates what a person would do with a browser, but repeatedly and very fast
• Follows a looping process
• Repeated as long as new in-scope URLs are found and limits (budget and time) are not reached
Looping process: Pick a location → Make a request → Receive a response → Examine for references → Save the content (WARC)
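The looping process above can be sketched in a few lines. This is not Heritrix code and does not use its API: the in-memory `FAKE_WEB` dictionary, the `in_scope` rule (same host as the seed) and the `budget` expressed as a number of URLs are all assumptions made so the example runs offline. A real harvester does far more than this loop.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> list of URLs it references.
FAKE_WEB = {
    "http://example.org/": [
        "http://example.org/a",
        "http://example.org/b",
        "http://other.net/",
    ],
    "http://example.org/a": ["http://example.org/", "http://example.org/c"],
    "http://example.org/b": [],
    "http://example.org/c": [],
    "http://other.net/": [],
}


def in_scope(url, seed_host):
    # Assumed scope rule: stay on the seed's host.
    return urlparse(url).hostname == seed_host


def crawl(seed, fetch, budget):
    """Harvest URLs breadth-first until the frontier is empty or the budget is spent."""
    seed_host = urlparse(seed).hostname
    frontier = deque([seed])
    seen = {seed}
    saved = []
    while frontier and len(saved) < budget:
        url = frontier.popleft()        # pick a location
        references = fetch(url)         # make a request, receive a response
        for ref in references:          # examine for references
            if ref not in seen and in_scope(ref, seed_host):
                seen.add(ref)
                frontier.append(ref)
        saved.append(url)               # save the content (here: just the URL)
    return saved


def fake_fetch(url):
    return FAKE_WEB.get(url, [])


harvested = crawl("http://example.org/", fake_fetch, budget=3)
print(harvested)
```

The loop stops for either of the two reasons named on the slide: no new in-scope URL remains in the frontier, or the budget limit is reached. A time limit could be added to the `while` condition in the same way.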
Assets:
- open source
- small and large scale
- textual or all-media formats
- data structures
Digital curators: legal deposit department
Engineers: IT department
Challenges:
• rich media and ever-changing
environment
• social networks
• content behind paywalls (news sites, ebooks)
Piloting the crawls with
NetarchiveSuite
• Prepare, schedule, run and monitor harvests
of websites, perform QA
Digital curators: legal deposit department
Engineers: IT department
Offering access with Wayback
• Give readers the ability to
browse the web “as it
was” with:
– a regular web browser
– search and redisplay software
• An application called
“Web archives”
– Wayback: for URL search,
display and browsing
– Nutch prototype for
keyword search
– Guided paths for collection
highlights
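URL search in an access tool of this kind amounts to looking up the captures of a URL and choosing the one nearest to the date the reader asks for. The sketch below illustrates that idea only; it does not use Wayback's code or index format. The `INDEX` dictionary, the `closest_capture` function and the sample captures are hypothetical, and the 14-digit `YYYYMMDDhhmmss` timestamps are used here for illustration.

```python
from bisect import bisect_left
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamp, e.g. 20130709100000

# Hypothetical index: URL -> capture timestamps, sorted ascending.
INDEX = {
    "http://www.example.org/": [
        "20100105120000",
        "20110620083000",
        "20130709100000",
    ],
}


def closest_capture(index, url, wanted):
    """Return the capture timestamp nearest to `wanted`, or None if the URL is unknown."""
    captures = index.get(url, [])
    if not captures:
        return None
    # Fixed-width timestamps sort the same way as strings and as dates.
    pos = bisect_left(captures, wanted)
    # Only the neighbours just before and just after can be the closest.
    candidates = captures[max(pos - 1, 0):pos + 1]
    target = datetime.strptime(wanted, TS_FORMAT)
    return min(
        candidates,
        key=lambda ts: abs(datetime.strptime(ts, TS_FORMAT) - target),
    )


print(closest_capture(INDEX, "http://www.example.org/", "20110101000000"))
```

Keeping the timestamps sorted means the lookup needs only a binary search rather than a scan of every capture, which matters when a URL has been harvested many times.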
Challenges:
• links with our main Catalogue and
open data repository
• “smart” URL search
• full text search and indexing
• small-scale data mining projects with
researchers
Questions?
E-mail: sara.aubry@bnf.fr
Web site: http://www.bnf.fr
Twitter: http://twitter.com/DLWebBnF
