Web Archive Retrieval Tools
Paul Doorenbosch Jaap Kamps Richard Rogers Arjen de Vries René Voorburg


       CATCH Meeting HiTime e-History, November 1, 2011
Information Access: Arjen de Vries
Web Archive: Paul Doorenbosch, René Voorburg
New Media: Richard Rogers
Jaap Kamps
Unlimited ways to publish/access/share information
Our daily lives take place “on the Web”
Ease of publishing on the Web comes at a price



           Web content is ephemeral



Web archives preserve the heritage of the future
Focus on use: Web research(ers)
Search support has massively improved
Complex tasks are still painstaking!




Many queries, tabs, notes, cut-and-paste, ...
Exploratory and faceted search
Interactively construct a (hidden) query
Search strategy from building blocks
Each block = data or manipulations
           Strategy Builder




Build a dedicated search engine “on the fly”
Research methods become search strategies


    Store, refine, reuse, share strategies


      Researchers enrich the archive
Archival selection determines future use
Digital humanities is a paradigm switch
Supporting Complex Search Tasks
Nick Belkin Charlie Clarke Ning Gao Jaap Kamps Jussi Karlgren
                         Thanks!
              SIGIR 2011 Workshop, July 28, 2011

WebART in 10 minutes

  • 1. Web Archive Retrieval Tools Paul Doorenbosch Jaap Kamps Richard Rogers Arjen de Vries René Voorburg CATCH Meeting HiTime e-History, November 1, 2011
  • 2. Information Access Paul Doorenbosch Arjen de Vries René Voorburg Web Archive Jaap Kamps New Media Richard Rogers
  • 3. Unlimited ways to publish/access/share information
  • 4. Our daily lives take place “on the Web”
  • 5. Ease of publishing on the Web comes at a price Web content is ephemeral Web archives preserve the heritage of the future
  • 6. Focus on use: Web research(ers)
  • 7. Search support has massively improved
  • 8. Complex tasks are still painstaking! Many queries, tabs, notes, cut-and-paste, ...
  • 10. Interactively construct a (hidden) query
  • 11. Search strategy from building blocks
  • 12. Each block = data or manipulations Strategy Builder Build a dedicated search engine “on the fly”
  • 13. Research methods become search strategies Store, refine, reuse, share strategies Researchers enrich the archive
  • 15. Digital humanities is a paradigm switch
  • 16. Supporting Complex Search Tasks Nick Belkin Charlie Clarke Ning Gao Jaap Kamps Jussi Karlgren Thanks! SIGIR 2011 Workshop, July 28, 2011

Editor's Notes

  • #2: Good afternoon. My name is Jaap Kamps and it is my pleasure to introduce the WebART (Web Archive Retrieval Tools) project.
  • #3:
  • #4: The project is a collaboration of three groups of researchers: (1) specialists working on Information Access (Computer Science, Arjen de Vries); (2) new media scholars working on the Web and the Web Archive (Humanities, Richard Rogers); and (3) Web archivists from the Dutch Web Archive (Heritage Sector, René Voorburg and Paul Doorenbosch). What is special is that all three groups are actively building technical tools: the Koninklijke Bibliotheek does large-scale crawling; the new media scholars build dedicated crawlers/screen-scrapers and analysis tools; and the computer scientists think they know the next generation of search tools.
  • #5: The Web is a unique object with an unprecedented size and growth curve, and by far the largest source of information on, basically, everything. The Web has had a revolutionary impact on how we publish, access, and share information.
  • #6: In fact, it has a fundamental impact on our daily lives, which increasingly take place “on the Web.”
  • #7: However, this increasing dependence on the Web comes at a price: the ease of publishing on the Web also makes it easy to lose information, and Web content tends to be ephemeral. This project addresses the problem of our future cultural heritage. Globally this has been addressed head-on by the Internet Archive, now supplemented by many national initiatives.
  • #8:
  • #9: We want to focus not on preservation but on use. That is, we critically assess the value of Web archives for realistic research scenarios, and develop information access tools and methods that maximize the archive’s utility for research. Web research tends to require complex selections and manipulations of the data.
  • #10: Search technology has advanced at an insane rate over the last decade. Who is old enough to remember the early days of the Web, when people spent a considerable part of their time collecting and organizing bookmarks?
  • #11: Despite the progress, complex tasks are still poorly supported by modern search engines! The best strategy is to slice and dice the complex information need into many small sub-requests, and then combine all the information into a coherent answer post hoc, outside the search engine.
  • #12: Some systems allow for more complex interaction, for example systems catering for exploratory or faceted search.
  • #13: Such systems create complex search queries in the back end, and in restricted domains much of the complexity can be hidden from the searcher.
  • #14: What if we had a way to open up this box and allow searchers to manipulate complex requests or search strategies directly, by combining several building blocks in unconstrained ways? Modern structured DB/IR technology allows for powerful, declarative queries or search strategies, turning a collection of Web pages into a high-dimensional data space.
  • #15: Each building block corresponds to a particular data source or manipulation of the data. The search strategy effectively builds a dedicated search engine “on the fly”.
  • #16: What will happen if we put these tools in the hands of the Web researchers? We will develop the appropriate building blocks and incrementally let them construct complex search strategies. Effectively, this means they can do their research on the fly, rather than face a turnaround time of weeks or months for developing the right kind of crawler and the right kind of analysis tool, and then executing them. Moreover, researchers can store the search strategies, reuse and refine them, and share them with colleagues. In essence, the research methods will evolve in parallel with the search strategies, at a much faster pace and scale than ever before...
  • #17:
  • #18: However, the chosen selection and archiving strategies for Web material will have a crucial impact on its future value as cultural heritage. What choices are made or enforced upon us? What is the missing Web? The broken Web? The banned Web? We will critically evaluate the state of the Web Archive; the resulting recommendations may prevent the loss of digital heritage.
  • #19: Progress is particularly thorny since we combine radically different research paradigms: the truth-finding paradigm of the exact sciences and the interpretative paradigm of the humanities. We are in the unique situation of three disciplines (Computer Science, Media Studies, Heritage Field) looking at the same object of study, while each sees it in a different way.
  • #20:
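
The idea in the notes of a search strategy assembled from building blocks, then stored, refined, and shared, can be illustrated with a small sketch. This is not the WebART Strategy Builder itself: the `Page` record, the three block names (`domain`, `crawled_between`, `contains`), the JSON layout, and the toy archive are all assumptions made up for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """Hypothetical record for one archived Web page."""
    url: str
    year: int
    text: str


# Each factory returns one building block: a function from pages to pages.
def domain(fragment):
    return lambda pages: (p for p in pages if fragment in p.url)


def crawled_between(first, last):
    return lambda pages: (p for p in pages if first <= p.year <= last)


def contains(term):
    return lambda pages: (p for p in pages if term.lower() in p.text.lower())


# Registry of block names, so a strategy can be written down as plain data.
BLOCKS = {"domain": domain, "crawled_between": crawled_between, "contains": contains}


def run_strategy(strategy, pages):
    """Apply the blocks of a strategy in order and return the surviving pages."""
    for step in strategy:
        block = BLOCKS[step["block"]](*step["args"])
        pages = block(pages)
    return list(pages)


# Toy stand-in for an archive (assumed data, not real crawl results).
ARCHIVE = [
    Page("http://example.nl/news", 2009, "Election coverage and debate"),
    Page("http://example.nl/blog", 2011, "Notes on the election results"),
    Page("http://example.org/about", 2011, "About this election archive"),
]

strategy = [
    {"block": "domain", "args": [".nl/"]},
    {"block": "crawled_between", "args": [2010, 2011]},
    {"block": "contains", "args": ["election"]},
]

# Store and share: the strategy is data, so it survives a JSON round trip.
saved = json.dumps(strategy)
shared = json.loads(saved)
results = run_strategy(shared, ARCHIVE)

# Refine: a colleague widens the crawl period without touching the other blocks.
refined = [dict(step) for step in shared]
refined[1] = {"block": "crawled_between", "args": [2009, 2011]}
refined_results = run_strategy(refined, ARCHIVE)

print([p.url for p in results])
print([p.url for p in refined_results])
```

Keeping the strategy as plain data rather than as code is one possible design choice; it is what makes storing, refining, and sharing cheap in this sketch, since a colleague only needs the JSON text and the same block registry.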