Specifying crawls
France Lasfargues
Internet Memory Foundation
Paris, France
france.lasfargues@internetmemory.net

Slide 1
Training Goals
➔ Help users specify their campaigns properly
➔ Help users understand what goes on in the back end of the ARCOMEM platform
➔ Set up a campaign in the Crawler Cockpit

Slide 2
Plan
What is the Web? Challenges and SOA
ARCOMEM platform
Crawler
Set-up a campaign in the ARCOMEM Crawler Cockpit

Slide 3
Introduction: How does the Web work?

➔ The Web is governed by protocols and standards:
• HTTP: Hypertext Transfer Protocol
• HTML: HyperText Markup Language
• URL: Uniform Resource Locator
• DNS: Domain Name System
➔ Each server has an address: an IP address
• Example: http://213.251.150.222/ -> http://collections.europarchive.org
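As an illustration of this address lookup (not part of the original slides), a minimal Python sketch that resolves a host name to its IP address, as a browser does via DNS before opening a connection; the printed address may differ from the one above, since DNS records change:

```python
import socket

# Resolve the host name used as the example on this slide to its IP address.
ip = socket.gethostbyname("collections.europarchive.org")
print(ip)  # e.g. 213.251.150.222 at the time the slide was written
```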

4
WWW
The Web is a large space of communication and information:
• managed by servers which talk to each other by convention (protocols) and through applications in a large network
• a naming space organized and controlled by ICANN

➔ World Wide Web: abbreviated as WWW and commonly known as the Web, a system of interlinked hypertext documents accessed via the Internet

Slide 5
HTTP - Hypertext Transfer Protocol

➔ Client/server model
• Request-response protocol in the client-server computing model
➔ How does it work?
• The client asks for content
• The server hosts the content and delivers it
• The browser locates the server via DNS, connects to it and sends it a request
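As an illustration of this request-response cycle (not part of the original slides), a minimal Python sketch that sends an HTTP GET request and reads the server's response, much as a browser or crawler does:

```python
import urllib.request

# Client side of the HTTP exchange: connect to the server, send a GET
# request for a resource, and read the response it delivers.
url = "http://www.w3.org/"
with urllib.request.urlopen(url) as response:
    print(response.status)                       # e.g. 200
    print(response.headers.get("Content-Type"))  # e.g. text/html; charset=utf-8
    body = response.read()
print(len(body), "bytes received")
```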

6
HTML - HyperText Markup Language
➔ Markup language for Web pages
➔ Written in the form of HTML elements
➔ Creates structured documents by denoting structural semantic elements for text such as headings, paragraphs, titles, links, quotes and other items
➔ Allows text and embedded objects such as images
➔ Example: http://www.w3.org/
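A small illustrative sketch (standard-library Python, not from the original slides): a crawler reads these HTML elements to find the links it will follow. The tiny document below is made up for the example.

```python
from html.parser import HTMLParser

# A made-up document containing the structural elements named above.
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>A heading</h1>
    <p>A paragraph with a <a href="http://www.w3.org/">link</a>.</p>
  </body>
</html>
"""

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> element."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)

parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['http://www.w3.org/']
```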

7
URI - URL
➔ URL - Uniform Resource Locator: specifies where an identified resource is available and the mechanism for retrieving it
➔ Examples:
– http://host.domain.extension/path/pageORfile
– http://www.europarchive.org
– http://collections.europarchive.org/
– http://www.europarchive.org/about.php
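As an illustration (not on the original slide), Python's standard urllib.parse splits a URL into the parts listed above; a crawler uses the same decomposition to decide which host and path a link points to:

```python
from urllib.parse import urlparse

# Split a URL into scheme, host (domain + extension) and path.
parts = urlparse("http://www.europarchive.org/about.php")
print(parts.scheme)  # 'http'
print(parts.netloc)  # 'www.europarchive.org'
print(parts.path)    # '/about.php'
```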

Samos 2013 – Workshop : The ARCOMEM Platform

8
Domain name and extension
➔ Managed by the Internet Corporation for Assigned Names and Numbers (ICANN), a non-profit organization; names are allocated by registrars
• http://www.icann.org
➔ ICANN coordinates the allocation and assignment to ensure the universal resolvability of:
• Domain names (forming a system referred to as «DNS»)
• Internet protocol («IP») addresses
• Protocol port and parameter numbers
➔ Several types of TLD
• First-level TLD: .com, .info, etc.
• gTLD: .aero, .biz, .coop, .info, .museum, .name and .pro
• ccTLD (country code Top Level Domains): .fr

9
What kind of content?
➔ Different types of content: text, multimedia, video, images
➔ Different types of producers:
• public: institutions, government, museums, TV...
• private: foundations, companies, press, individuals, blogs...
http://ec.europa.eu/index_fr.htm
http://iawebarchiving.wordpress.com/
http://www.nytimes.com/
➔ Each producer is in charge of its own content
• Information can disappear: fragility
• Size

10
Social web

➔ Focus on people's socialization and interaction
• Characteristics:
• Walled spaces in which users can interact
• Creation of social networks
➔ WEB ARCHIVE -> challenges in terms of content, privacy and technique
• Examples:
• Bookmark sharing (Del.icio.us, Digg), videos (Dailymotion, YouTube), photos (Flickr, Picasa)
• Communities (MySpace, Facebook)

11
Ex. of technical difficulties: videos
➔ Standard HTTP protocol
• obfuscated links to the video files
• dynamic playlists, channels or configuration files loaded by the player; several hops and redirects to the server hosting the video content
• e.g. YouTube
➔ Streaming protocols: RTSP, RTMP, MMS...
• real-time protocols implemented by video players, suited for large video files (control commands) or live broadcasts
• sometimes proprietary protocols (e.g. RTMP - Adobe)
• available tools: MPlayer, FLVStreamer, VLC

12
Deep / Hidden Web
• Deep web: content accessible behind a password, a database or a payment... and hidden from search engines

http://c.asselin.free.fr/french/schema_webinvisible.htm (diagram based on the figure "Distribution of Deep Web sites by content type" from the Bright Planet study)

13
How do we archive it?
➔ Challenges for archiving:
– dynamic websites
➔ Technical barriers:
• some JavaScript
• Flash animations
• pop-ups
• streaming video and audio
• restricted access
➔ Traps: spam and loops

14
What does a user need to do web archiving?
➔ A definition of the target content (website, URL, topic…)
➔ A tool to manage the campaign
➔ An intelligent crawler to archive the content

15
Management tools (1)
Several tools already exist, developed by libraries that do web archiving.
➔ NetarchiveSuite (http://netarchive.dk/suite/)
The NetarchiveSuite software was originally developed by the two national deposit libraries in Denmark, The Royal Library and The State and University Library, and has been running in production, harvesting the Danish web, since 2005. The French National Library and the Austrian National Library joined the project in 2008.
➔ Web Curator Tool: http://webcurator.sourceforge.net
Open-source workflow management application for selective web archiving developed by the National Library of New Zealand and the British Library, initiated by the International Internet Preservation Consortium.
➔ Archive-It: http://www.archive-it.org/
A subscription service by the Internet Archive to build and preserve collections: allows users to harvest, catalogue, manage and browse archived collections.
➔ Archivethe.net: http://archivethe.net/fr/
Service provided by the Internet Memory Foundation.
➔ ARCOMEM Crawler Cockpit

16
How does a crawler work?
➔ A crawler is a bot that parses web pages in order to index and/or archive them. The robot navigates by following links.
➔ Links are at the centre of the crawling problem:
• Explicit links: source code is available and the full path is explicitly stated
• Variable links: source code is available but uses variables to encode the path
• Opaque links: source code is not available

Example: http://www.thetimes.co.uk/tto/news/
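A minimal, illustrative Python sketch (not part of the original slides) of this fetch-extract-follow loop, limited to explicit links and to pages on the seed's own host:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> element (explicit links)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seed, max_pages=20):
    """Breadth-first crawl of explicit links, scoped to the seed's host."""
    scope_host = urlparse(seed).netloc
    frontier, seen = deque([seed]), {seed}
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                if "html" not in (resp.headers.get("Content-Type") or ""):
                    continue  # only parse HTML resources
                page = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # unreachable, refused or malformed URL: skip
        parser = LinkExtractor()
        parser.feed(page)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if urlparse(absolute).netloc == scope_host and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        print("fetched", url)

# crawl("http://www.w3.org/")
```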

17
Parameters
➔ The scoping function defines how deep the crawl will go
• Complete website or only specific content of a website
• Discovery or focused crawl
➔ Politeness
• Follow the common rules of politeness
➔ Robots.txt
• Follow it (see the sketch below)
➔ Frequency
• How often do I want to launch a crawl on this target?
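A minimal illustration of robots.txt compliance using Python's standard library (not from the original slides; the site and crawler names are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Placeholder site and user-agent, for illustration only.
rp = RobotFileParser()
rp.set_url("http://www.site.com/robots.txt")
rp.read()

if rp.can_fetch("MyArchivingBot", "http://www.site.com/actu/page.html"):
    print("allowed: fetch the page, respecting the politeness delay")
else:
    print("disallowed by robots.txt: skip this URL")
```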

18
ARCOMEM Crawlers
• IMF Crawler
• Adaptive Heritrix
• API Crawler

19
IMF Crawler
• Component name: IMF Large Scale Crawler
– The large scale crawler retrieves content from the web and stores it in an HBase repository. It aims at being scalable: crawling at a fast rate from the start and slowing down as little as possible as the number of visited URLs grows to hundreds of millions, all while observing politeness conventions (rate regulation, robots.txt compliance, etc.).
• Output:
– Web resources written to WARC files. We have also developed an importer to load these WARC files into HBase. Some metadata is also extracted: HTTP status code, identified out-links, MIME type, etc.

20
WARC: example
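The original slide showed a screenshot of a WARC record, which is not reproduced here. As an illustrative substitute (an assumption, not the project's own code), the open-source warcio library can write a WARC response record of the kind the IMF crawler produces:

```python
import io

from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

# Write a single WARC "response" record for an illustrative URL.
with open("example.warc.gz", "wb") as output:
    writer = WARCWriter(output, gzip=True)
    http_headers = StatusAndHeaders(
        "200 OK", [("Content-Type", "text/html")], protocol="HTTP/1.0"
    )
    record = writer.create_warc_record(
        "http://www.europarchive.org/",           # WARC-Target-URI
        "response",                               # WARC-Type
        payload=io.BytesIO(b"<html>...</html>"),  # archived HTTP body
        http_headers=http_headers,
    )
    writer.write_record(record)
```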

21
Adaptive Heritrix
➔ Component name: Adaptive Heritrix
➔ Description: Adaptive Heritrix is a modified version of the open-source crawler Heritrix that allows dynamic reordering of queued URLs
➔ Application Aware Helper

22
ARCOMEM Crawler Cockpit
• Requirements described by ARCOMEM user partners (SWR – DW)
• Designed and implemented by IMF
• A UI on top of the ARCOMEM system
• Demo: Crawler Cockpit
24
How does it work?

25
Crawler Cockpit: Functionality
• Launch crawls following scheduler specifications
• Set up a campaign by focusing on event, keywords, entities and URLs
• Monitor crawls and get real-time feedback on the progress of the crawlers
• Focus on target content by Social Media Category (blog, forum, video, photo...)
• Run crawls using the API crawler (Twitter, Facebook, YouTube, Flickr)
• Get a campaign overview with qualified statistics
• Refine at crawl time to focus better on the target content
• Decide what content to archive
• Run crawls with the HTML crawlers (Heritrix and IMF Crawler)
• Export the crawled content to a WARC file

26
Crawler Cockpit Navigation
• Set-up: a campaign is described by an intelligent crawl definition, which associates the content target with crawl parameters (schedule and technical parameters)
• Monitor: gives access to statistics provided by the crawler at run time
• Overview: global dashboard for a campaign. The information is organized by topic: general description of the campaign, metadata, current status, crawl activity, statistics and analysis
• Inspector: a tool to access the content as it is stored in HBase
• Report: specifications and parameters of a campaign
27
Set-up a campaign
• General description
• Distinct named entities (e.g. person, geographic location and organization), time period, free keywords and language
• A selection of up to nine SMC (Social Media Categories)
• Schedule: each campaign has a start and end date. The frequency of the crawl is defined by choosing an interval.

28
Focus on the scoping function
Domain: the entire web site
http://www.site.com
Path: only a specific directory of a website
http://www.site.com/actu
Sub-domain:
http://sport.site.com
Page + context:
http://www.site.com/home.html
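A minimal, illustrative sketch (not from the original slides) of how such scoping rules can be tested against a candidate URL; the site names are the placeholder examples above:

```python
from urllib.parse import urlparse

def in_scope(url, scope, value):
    """Return True if url falls inside the given scope rule."""
    parts = urlparse(url)
    if scope == "domain":       # entire web site
        return parts.netloc == value
    if scope == "path":         # a specific directory of the site
        host, _, prefix = value.partition("/")
        return parts.netloc == host and parts.path.startswith("/" + prefix)
    if scope == "subdomain":    # e.g. sport.site.com
        return parts.netloc == value
    if scope == "page":         # a single page (plus its embedded context)
        return url == value
    return False

print(in_scope("http://www.site.com/actu/article1.html", "path", "www.site.com/actu"))  # True
print(in_scope("http://sport.site.com/news.html", "domain", "www.site.com"))            # False
```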

29
Focus on the scheduler

Frequency: weekly, monthly, quarterly…
Interval: 1 to 9
Calendar: a campaign has a start date and an end date.

30
Campaign Overview

Global dashboard on a campaign:
• General description of the campaign
• Crawl activity
• Keywords
• Statistics
• Refine mode: the user can give more or less weight to a keyword.

31
CC Inspector Tab
The Inspector tab allows the user to:
• Check the quality of the content before indexing
• Access the content (from HBase), metadata and triples directly related to a resource
• Browse a list of URLs ranked by on-line analysis scores

32
CC Monitor Tab

The Monitor tab gives real-time statistics on the running crawl.

33
Crawler cockpit demo
• Online demo
• Feedback

34


Editor's Notes

• Slide 4: To find information online, I have to know its address. The Domain Name System (DNS) helps users navigate the Internet. Every computer connected to the Internet has a unique address called an "IP address". Since IP addresses (which are series of numbers) are difficult to memorize, the DNS allows a familiar series of letters (the "domain name") to be used instead. For example, instead of typing "192.0.34.163", you can type "www.icann.org".
• Slide 6: There are several protocols, such as the mail protocols POP3 (Post Office Protocol version 3) and SMTP (Simple Mail Transfer Protocol), DNS (Domain Name System), DHCP (Dynamic Host Configuration Protocol), FTP (File Transfer Protocol) and IMAP (Internet Message Access Protocol).
• Slide 8: URI: a Uniform Resource Identifier is a string of characters used to identify a name or a resource on the Internet.
• Slide 10: Online information is heterogeneous, and copies exist online.
• Slide 13: A lot of data is stored in databases hidden from search engines such as Google and is not available to such engines; moreover, many pages are created dynamically in answer to queries, so they do not exist before the user requests the information. This is an enormous reservoir. http://www.dailymotion.com/video/x9udyo_the-virtual-private-library-and-dee_news
• Slide 16: NetarchiveSuite (http://netarchive.dk/suite/), developed by the two national deposit libraries in Denmark, The Royal Library and The State and University Library, to plan, schedule and run web harvests for selective and broad crawls, with built-in bit-preservation functionality. Web Curator Tool (http://webcurator.sourceforge.net): open-source workflow management application for selective web archiving developed by the National Library of New Zealand and the British Library, initiated by the International Internet Preservation Consortium. Archive-It (http://www.archive-it.org/): a subscription service by the Internet Archive to build and preserve collections; allows users to harvest, catalogue, manage and browse archived collections. ARCOMEM crawler cockpit.
• Slide 20: http://www.arcomem.eu/wp-content/uploads/2012/05/D5_2.pdf
• Slide 22: http://www.arcomem.eu/wp-content/uploads/2012/05/D5_2.pdf
• Slide 25: A crawl is guided by the crawl specifications defined by the user. The crawl specification contains URLs to start the discovery from (seeds), keywords to look for in web pages, social web site APIs to query (and with which keywords) and Social Media Categories (SMC) to focus the crawl on. The seeds get fetched, and the corresponding content and the social site API query responses are inserted into the document store. The insertion triggers the online analysis process. The web resources and the links extracted from them are analyzed and scored by the online analysis modules. The links are sent to the crawler's URL queue, where their score is used to determine the order in which they should be crawled, thereby guiding the crawler. The newly crawled content gets written to the document store, completing the loop. On top of the prototype, a UI allows the user to target topics to archive and offers some analyses of the collected data.
• Slide 28: For each campaign, the archivist can select which SMC to focus on (blogs, video, discussion) and does the same for the API crawler (Facebook, Twitter, Flickr, YouTube…).
• Slide 31: The information is organized by topic: general description of the campaign, metadata, current status, crawl activity, statistics and analysis.
• Slide 33: At the top of the page, a progression bar gives an estimate of crawl progress until completion. It is the ratio between seen and unseen URLs recorded by the crawler. Seen URLs are all the URLs that have already been crawled; unseen URLs have been discovered but are waiting to be crawled.