Thumbnail Summarization
Techniques For Web Archives
Ahmed AlSum*
Stanford University Libraries
Stanford CA, USA
aalsum@stanford.edu
Michael L. Nelson
Old Dominion University
Norfolk VA, USA
mln@cs.odu.edu
The 36th European Conference on Information Retrieval.
ECIR 2014, Amsterdam, Netherlands, 2014
* This research was conducted while Ahmed AlSum was at Old Dominion University.
ECIR 2014 Amsterdam, Netherlands
What is a Web Archive?
http://www.cs.odu.edu
Memento Terminology
• Original Resource (URI-R, R): http://www.amazon.com
• Memento (URI-M, M): http://web.archive.org/web/20110411070244/http://amazon.com
• TimeMap (URI-T, TM)
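The URI-M on this slide follows the Wayback-style layout, in which a 14-digit capture timestamp sits between the archive prefix and the URI-R. A minimal sketch of splitting such a URI-M into its Memento-Datetime and original resource is shown here; the function name `split_uri_m` and the regular expression are illustrative assumptions, not part of the Memento specification, and the timestamp is assumed to be in UTC.

```python
import re
from datetime import datetime, timezone

# Assumed Wayback-style layout: <archive prefix>/web/<14-digit timestamp>/<URI-R>
URI_M_PATTERN = re.compile(r"^(?P<prefix>.+?/web)/(?P<ts>\d{14})/(?P<uri_r>.+)$")


def split_uri_m(uri_m):
    """Return (memento_datetime, uri_r) for a Wayback-style URI-M."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError(f"not a Wayback-style URI-M: {uri_m}")
    captured = datetime.strptime(match.group("ts"), "%Y%m%d%H%M%S")
    # Assumption: the embedded timestamp is expressed in UTC.
    return captured.replace(tzinfo=timezone.utc), match.group("uri_r")


memento_datetime, uri_r = split_uri_m(
    "http://web.archive.org/web/20110411070244/http://amazon.com"
)
print(memento_datetime.isoformat(), uri_r)
```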
Thumbnails in Web Archive
[Figure: Internet Archive, UK Web Archive]
Thumbnail Creation Challenges
• Scalability in Time
  • The Internet Archive (IA) may need 361 years to create a thumbnail for each memento using one hundred machines.
• Scalability in Space
  • IA would need 355 TB to store one thumbnail per memento.
• Page quality
Thumbnail Usage Challenges
• This is a partial view of the first 700 thumbnails out of the 10,500 available mementos for www.apple.com
From 10,500 Mementos to 69 Thumbnails.
How many thumbnails do we need?
www.unfi.com on the live Web
40 Thumbnails are good.
METHODOLOGY
Visual Similarity and Text Similarity
[Figure labels: Similar, Different, HTML Text]
Correlation between
Visual Similarity and Text Similarity
• Text Similarity
  • SimHash
  • DOM Tree
  • Embedded resources
  • Memento Datetime (Capture time)
• Visual Similarity
  • Number of different pixels
Text Similarity
SimHash
• Compute 64-bit SimHash fingerprints with k = 4 for the two pages, then calculate the distance between them using the Hamming distance.
12 Sep 2012 - 00:12:27, SimHash: 147EDAA9977E9400
12 Sep 2012 - 19:54:05, SimHash: 157EFAAC97189100
Distance: 12 bits
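A minimal sketch of the SimHash comparison is given here, under stated assumptions: the slide does not define k, so it is treated as the shingle size in tokens; MD5 truncated to 64 bits stands in for whatever hash function was actually used; and the function names are illustrative. The fingerprints it produces will therefore not match the hexadecimal values on the slide.

```python
import hashlib


def shingles(text, k=4):
    """Split text on whitespace and return overlapping k-token shingles."""
    tokens = text.split()
    if len(tokens) < k:
        return [" ".join(tokens)] if tokens else []
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def simhash64(text, k=4):
    """Compute a 64-bit SimHash fingerprint from k-token shingles."""
    votes = [0] * 64
    for shingle in shingles(text, k):
        # Assumption: MD5 truncated to 8 bytes as a stable 64-bit hash.
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        hashed = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            votes[bit] += 1 if (hashed >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if votes[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


page_a = "welcome to the example store where everything is on sale today"
page_b = "welcome to the example store where everything is on sale tomorrow"
print(hamming_distance(simhash64(page_a), simhash64(page_b)))
```

A small distance suggests the two captures have nearly the same text; identical text always yields a distance of 0.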
Text Similarity
DOM Tree
• Transform each web page into a DOM tree.
• Calculate the difference using the Levenshtein distance.
• Levenshtein distance: the minimum number of insert, update, and delete operations needed to turn one tree into the other.
Pawlik, M., & Augsten, N. (2011). RTED: a robust algorithm for the tree edit distance. Proceedings of the VLDB Endowment, 5(4), 334–345.
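The edit-distance idea can be illustrated with a deliberately simplified stand-in. The sketch below flattens each page to its sequence of opening tags and computes the classic Levenshtein distance between the two sequences; it is not the RTED tree edit distance cited on the slide, and the class and function names are illustrative.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collect opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def dom_distance(html_a, html_b):
    """Simplified DOM distance: edit distance over flattened tag sequences."""
    return levenshtein(tag_sequence(html_a), tag_sequence(html_b))


old = "<html><body><p>a</p></body></html>"
new = "<html><body><p>a</p><div>b</div></body></html>"
print(dom_distance(old, new))
```

Flattening loses the parent-child structure, so this only approximates a true tree edit distance.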
Text Similarity
Embedded resources
• Extract the embedded resources from each page.
• Count the resources that have been added and the resources that have been removed.
Captures compared: 12 Sep 2012 - 00:12:27 and 12 Sep 2012 - 19:54:05

          Addition   Removal
Total     4          11
Images    1          9
JS        1          0
CSS       2          2
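The addition and removal counts can be sketched as set differences over the resource URLs found in each capture. This is a minimal illustration using only the standard library; which tags and attributes count as embedded resources (here `img src`, `script src`, and `link rel="stylesheet" href`) is an assumption, as are the names `ResourceExtractor` and `resource_changes`.

```python
from html.parser import HTMLParser


class ResourceExtractor(HTMLParser):
    """Collect URLs of images, scripts, and stylesheets referenced by a page."""

    def __init__(self):
        super().__init__()
        self.resources = {"images": set(), "js": set(), "css": set()}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and attributes.get("src"):
            self.resources["images"].add(attributes["src"])
        elif tag == "script" and attributes.get("src"):
            self.resources["js"].add(attributes["src"])
        elif tag == "link" and attributes.get("href"):
            rel = (attributes.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.resources["css"].add(attributes["href"])


def embedded_resources(html):
    parser = ResourceExtractor()
    parser.feed(html)
    parser.close()
    return parser.resources


def resource_changes(html_old, html_new):
    """Count added and removed resources per type between two captures."""
    old, new = embedded_resources(html_old), embedded_resources(html_new)
    return {
        kind: {"added": len(new[kind] - old[kind]),
               "removed": len(old[kind] - new[kind])}
        for kind in old
    }


# Hypothetical pages, not the captures shown on the slide.
old_page = ('<img src="a.png"><img src="b.png"><script src="x.js"></script>'
            '<link rel="stylesheet" href="s.css">')
new_page = ('<img src="a.png"><script src="x.js"></script>'
            '<script src="y.js"></script><link rel="stylesheet" href="s.css">')
print(resource_changes(old_page, new_page))
```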
Text Similarity
Memento datetime
• Calculate the difference, in seconds, between the capture times of the two pages.
Captures compared: 12 Sep 2012 - 00:12:27 and 12 Sep 2012 - 19:54:05
Difference: 70942 sec
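As a small illustration of this feature, the sketch below computes the gap in seconds between two capture times written as 14-digit Wayback-style timestamps. The function name is illustrative, and the second timestamp in the usage line is hypothetical, chosen to be exactly one day after the first.

```python
from datetime import datetime


def capture_gap_seconds(timestamp_a, timestamp_b):
    """Absolute difference in seconds between two 14-digit capture timestamps."""
    fmt = "%Y%m%d%H%M%S"
    a = datetime.strptime(timestamp_a, fmt)
    b = datetime.strptime(timestamp_b, fmt)
    return abs((b - a).total_seconds())


# The second timestamp is a hypothetical capture one day later.
print(capture_gap_seconds("20110411070244", "20110412070244"))  # 86400.0
```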
Visual Similarity
• Visual similarity is measured as the number of different pixels between two thumbnails. We resize the thumbnails to several dimensions (e.g., 64x64 and 128x128) and calculate the Manhattan distance between each pair.
Captures compared: 12 Sep 2012 - 00:12:27 and 12 Sep 2012 - 19:54:05
Distance: 0.65
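A minimal sketch of the pixel comparison follows. It assumes the two thumbnails have already been resized to the same dimensions and converted to grayscale, and it represents each as a nested list of 0-255 values so that no imaging library is needed. Dividing by the maximum possible difference to obtain a score between 0 and 1 is an assumption; the slides do not say how the reported distance is scaled.

```python
def manhattan_distance(thumb_a, thumb_b, max_value=255):
    """Normalised Manhattan (L1) distance between two equal-sized grayscale images."""
    same_shape = len(thumb_a) == len(thumb_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(thumb_a, thumb_b)
    )
    if not same_shape:
        raise ValueError("thumbnails must have identical dimensions")
    total = 0
    count = 0
    for row_a, row_b in zip(thumb_a, thumb_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    # Assumption: scale by the largest possible total difference.
    return total / (count * max_value) if count else 0.0


# Two hypothetical 2x2 thumbnails that differ in one pixel.
first = [[0, 255], [0, 0]]
second = [[0, 0], [0, 0]]
print(manhattan_distance(first, second))  # 0.25
```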
EXPERIMENT
Calculate the correlation between Visual Similarity and
Text Similarity
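The slides do not say which correlation coefficient was used. As an illustration only, the sketch below computes the Pearson coefficient between a list of text-similarity distances and the matching list of visual distances; the sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / sqrt(variance_x * variance_y)


# Hypothetical per-pair distances: SimHash bits and normalised pixel distance.
simhash_bits = [2, 5, 12, 20, 31]
pixel_distance = [0.05, 0.10, 0.40, 0.55, 0.90]
print(round(pearson(simhash_bits, pixel_distance), 3))
```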
Dataset
Fortune 500
• 499,540 mementos from 488 TimeMaps.
• For each memento, we downloaded the HTML and captured the thumbnail using PhantomJS.
Correlation between
Visual Similarity and Text Similarity
[Figure panels: SimHash, DOM tree, Embedded resources, Memento Datetime]
SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]
SELECTION ALGORITHMS
Using text similarity features to predict the visual
similarity.
#1: Threshold Grouping
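The slides give only the name of this algorithm, so the following is one plausible reading rather than the authors' procedure. It walks the TimeMap in capture order and selects a memento for a new thumbnail whenever its SimHash fingerprint is more than a chosen number of bits away from the last selected memento. The function name, the threshold value, and the fingerprints are hypothetical.

```python
def threshold_grouping(fingerprints, threshold):
    """Return indices of mementos chosen to represent each group.

    fingerprints: SimHash values in capture order.
    threshold: maximum Hamming distance (in bits) still treated as "similar".
    """
    selected = []
    for index, fingerprint in enumerate(fingerprints):
        if not selected:
            selected.append(index)  # always keep the first memento
            continue
        last = fingerprints[selected[-1]]
        if bin(fingerprint ^ last).count("1") > threshold:
            selected.append(index)  # page changed enough: start a new group
    return selected


# Hypothetical 4-bit fingerprints for five captures.
captures = [0b0000, 0b0001, 0b1111, 0b1110, 0b0000]
print(threshold_grouping(captures, threshold=2))  # [0, 2, 4]
```

Because each memento is compared only with the most recently selected one, a new capture can be processed without revisiting earlier ones.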
#2: Clustering technique
• Input:
  • TimeMap with n mementos
  • A set of features.
    • For example, F = {SimHash, Memento-Datetime}
• Task:
  • Cluster the n mementos into K clusters.
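The clustering task can be sketched with a basic K-medoids loop that alternates between assigning mementos to the nearest medoid and re-picking each cluster's medoid. This is a simplified illustration under stated assumptions: it uses only the SimHash feature with Hamming distance, the evenly spaced initial medoids are a naive choice made for determinism, and all names are illustrative. The medoid of each cluster would be the memento whose thumbnail is kept.

```python
def hamming(a, b):
    return bin(a ^ b).count("1")


def assign(items, medoids, distance):
    """Map each medoid index to the indices of the items closest to it."""
    clusters = {medoid: [medoid] for medoid in medoids}
    for index, item in enumerate(items):
        if index in clusters:
            continue  # a medoid always stays in its own cluster
        nearest = min(medoids, key=lambda m: distance(item, items[m]))
        clusters[nearest].append(index)
    return clusters


def k_medoids(items, k, distance=hamming, max_iterations=100):
    """Cluster items into k groups; returns {medoid_index: [member indices]}."""
    if not 0 < k <= len(items):
        raise ValueError("k must be between 1 and the number of items")
    # Assumption: naive, evenly spaced initial medoids.
    medoids = [i * len(items) // k for i in range(k)]
    for _ in range(max_iterations):
        clusters = assign(items, medoids, distance)
        # New medoid: the member with the smallest total distance to its cluster.
        updated = [
            min(members,
                key=lambda c: sum(distance(items[c], items[o]) for o in members))
            for members in clusters.values()
        ]
        if set(updated) == set(medoids):
            break
        medoids = updated
    return assign(items, medoids, distance)


# Hypothetical 8-bit fingerprints forming two visually distinct groups.
fingerprints = [0b00000000, 0b00000001, 0b00000011,
                0b11110000, 0b11110001, 0b11110011]
print(k_medoids(fingerprints, k=2))
```

Passing a different `distance` function, for example one that also weighs the Memento-Datetime gap, would extend the sketch to more than one feature.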
#2: Clustering technique
[Figure panels: SimHash Feature; SimHash and Datetime Features]
Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.
#3: Time Normalization
Selection Algorithms Comparison
                         Threshold Grouping   K clustering   Time Normalization
TimeMap Reduction        27%                  9% to 12%      23%
Image Loss               28                   78 - 101       109
# Features               1 feature            1 or more      1 feature
Preprocessing required   Yes                  Yes            No
Efficient processing     Medium               Extensive      Light
Incremental              Yes                  No             Yes
Online/offline           Both                 Both           Both
Generalization outside the Web Archive
• Summarize a website of n pages with only k thumbnails
Conclusions
• We explored the similarity between the text and the visual appearance of a web page.
• We found that the SimHash difference between the HTML text and the Levenshtein distance between the HTML DOM trees have the highest correlation with visual similarity.
• We presented three algorithms to select k thumbnails from n mementos per TimeMap.
aalsum@stanford.edu
@aalsum

Editor's Notes

  • #29: Verbally show this is the end. Explain this is an initial step in this area.