Carbon Dating the Web:
Estimating the Age of Web Resources
Hany SalahEldeen & Michael Nelson Carbon Dating the Web
Hany M. SalahEldeen & Michael L. Nelson
Old Dominion University
Department of Computer Science
Web Science and Digital Libraries Lab.
Motivation
In our research in social media,
resource sharing, and user
intention a question emerged…
When did a certain resource first
appear on the web?
First thought: Last Modified
Response Header
$ curl -I http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Expires: Wed, 08 May 2013 14:18:49 GMT
Date: Wed, 08 May 2013 14:18:49 GMT
Cache-Control: private, max-age=0
Last-Modified: Wed, 08 May 2013 08:03:02 GMT
ETag: "e419d850-22ae-4fe6-a0f4-8ab9477f0c0d"
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
The server responds with the last
modified date …
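Real creation date: 2012-02-11 (embedded in the URL)
Current server datetime: Date header (Wed, 08 May 2013 14:18:49 GMT)
Last modified date (incorrect): Last-Modified header (Wed, 08 May 2013 08:03:02 GMT)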
Lacks accuracy
The header is problematic because it is inaccurate in a large percentage of cases.
Here: 08 May 2013 (Last-Modified) ≠ 2012-02-11 (actual creation date)
Last modified date header is not
available
$ curl -I http://temporalweb.net/
HTTP/1.1 200 OK
Set-Cookie: 60gpBAK=R1224192509; path=/; expires=Sat, 11-May-2013 03:45:10 GMT
Date: Sat, 11 May 2013 02:37:55 GMT
Content-Type: text/html
Connection: keep-alive
Set-Cookie: 60gp=R152135972; path=/; expires=Sat, 11-May-2013 03:36:44 GMT
Server: Apache/2.2.X (OVH)
Accept-Ranges: bytes
Vary: Accept-Encoding
Sometimes it is not present in the response headers.
Second thought: Timestamp on
the page
But the timestamp is highly
inconsistent
… and dependent on the page’s
style/scheme.
So is its location on the page
Pages’ Timestamps Differ
Very dependent on the page’s scheme/style
Not consistent
Non-existent sometimes
Shortcomings of using timestamp
extraction
• M. Inoue and K. Tajima. Noise robust detection of the emergence and spread of topics
on the web. In Proceedings of the 2nd Temporal Web Analytics Workshop, TempWeb
'12, pages 9--16, New York, NY, USA, 2012. ACM.
M. Inoue and K. Tajima developed a technique for extracting
creation timestamps from web pages.
Shortcomings:
• Ambiguity (is 12/07 the 12th of July or the 7th of December?).
• Not generalizable.
• Highly dependent on the specific CMS.
• Highly dependent on the most prominent timestamp patterns.
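The ambiguity shortcoming can be illustrated with a minimal Python sketch. The helper name `candidate_dates` and the explicit `year` argument are assumptions made for illustration; they are not part of the technique by Inoue and Tajima.

```python
from datetime import datetime

def candidate_dates(text: str, year: int) -> set:
    """Return every calendar date that a bare 'NN/NN' string could denote."""
    candidates = set()
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):  # day-first vs. month-first reading
        try:
            candidates.add(datetime.strptime(f"{text}/{year}", fmt).date())
        except ValueError:
            pass  # not a valid date under this reading
    return candidates

# "12/07" has two valid readings: 12 July and 7 December.
print(sorted(candidate_dates("12/07", 2012)))
```

A string such as "25/07" yields a single candidate, because 25 cannot be a month; only values of 12 or below in both positions are ambiguous.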
But what if the resource itself
doesn’t exist any more?
Third thought: First existence in
public archives
Timestamp of the first memento
Shortcomings:
1. The page is not archived.
2. There is a delay between page creation and the archive’s first crawl.
3. A page may be published and then deleted before it is archived.
4. The archive’s quarantine period (12 months to 2 weeks).
Goal
Create a general-purpose tool that estimates the
creation date of a resource without relying on
any specific infrastructure
Target Specification
• Doesn’t rely on the infrastructure of the
hosting web server.
• Doesn’t rely on the state and template of the
resource.
• Highly generic.
• Fast response with no quarantine periods.
• High accuracy: estimates close to the real
creation date.
Idea
Moving objects leave trails…
Or:
Foo → if you were Aussie
Chad → if you were British
Idea
Web pages leave trails as well, from
the day they are created…
Web Trails
A web page can leave a trail of any of the
following kinds, each denoting its existence:
– References
– Links (anchors)
– Social media likes and interactions.
– URL shortening.
– Backlinks
The Assumptions
We make two reasonable assumptions:
1. We have no prior knowledge of the resource or
its hosting web server.
2. The creation date and the publishing date of a
resource coincide.
   Ex.: When you write a blog post, you publish it as
   soon as you create it.
Idea
The creation date of any of the associated
events/trails could serve as an estimate of the
resource’s creation date.
Scenario
Consider the following scenario: on Saturday
night, the 11th of February of last year, I wrote
a post about my work on the research group’s blog.
After creating the post I tweeted
about it …
https://twitter.com/hanysalaheldeen/status/168704224488730625
Then it picked up some speed on
Twitter and Facebook …
http://topsy.com/http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
The timeline of the resource
Given the events linked to the
existence of the resource, we will
examine ways to extract the first
observation of each
Age Estimation Methods
1. Resource and server analysis.
2. Backlinks analysis.
a) Web page backlinks.
b) Social media backlinks.
3. Archiving analysis.
4. Search engine indexing analysis
Resource and Server Analysis
Examine the server response and extract the last
modified date from the header, if it exists.
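A minimal sketch of this step in Python, using only the standard library. The function names `last_modified` and `fetch_headers` are hypothetical and chosen for illustration; this is not the code of the Carbon Date tool itself.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.request import Request, urlopen

def last_modified(headers) -> Optional[datetime]:
    """Parse the Last-Modified header; return None if it is absent or malformed."""
    for name, value in headers.items():
        if name.lower() == "last-modified":  # header names are case-insensitive
            try:
                return parsedate_to_datetime(value)
            except (TypeError, ValueError):
                return None
    return None

def fetch_headers(url: str, timeout: float = 10.0):
    """Issue a HEAD request (like `curl -I`) and return the response headers."""
    with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
        return response.headers

# Headers copied from the curl output shown earlier in the slides.
sample = {
    "Date": "Wed, 08 May 2013 14:18:49 GMT",
    "Last-Modified": "Wed, 08 May 2013 08:03:02 GMT",
}
print(last_modified(sample))
```

Returning None for a missing header keeps this observation optional, which matches the case where the server does not send Last-Modified at all.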
Observations recorded:
1. Last modified date from the response header.
Backlinks Analysis
• We use the Google search API to discover backlinks to A.
• We assume that B and C were created after A was created.
• But this assumption does not always hold.
• Page B or C could have been modified after its creation to link to A.
[Figure: pages B and C each link to A (the resource).]
Time Magazine
Ex.: Suppose the front page of Time magazine finally
decided to feature me as “Person of the Year”.
In this case, page B (Time magazine’s front page)
was modified to point to my page A.
[Figure: Time Magazine’s front page linking to Hany’s website.]
When did the link first appear?
To solve this problem:
1. We extract the timemap of the archived mementos of B.
2. We perform a binary search to locate the first appearance of the
link to A in B.
3. We take the timestamp of that first memento.
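The steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the mementos are sorted oldest first, the link to A stays in B once it has been added (otherwise binary search is not valid), and `contains_link` is a hypothetical predicate that in practice would fetch a memento and search its HTML.

```python
from typing import Callable, Optional, Sequence

def first_memento_with_link(
    mementos: Sequence[str],
    contains_link: Callable[[str], bool],
) -> Optional[int]:
    """Return the index of the earliest memento of B that links to A, or None."""
    lo, hi = 0, len(mementos)
    while lo < hi:
        mid = (lo + hi) // 2
        if contains_link(mementos[mid]):
            hi = mid        # link present: first appearance is at mid or earlier
        else:
            lo = mid + 1    # link absent: first appearance is later
    return lo if lo < len(mementos) else None

# Stand-in snapshots of page B; real code would dereference memento URIs.
snapshots = ["<p>old</p>", "<p>older news</p>", '<a href="http://a.example/">A</a>',
             '<a href="http://a.example/">A</a><p>more</p>']
index = first_memento_with_link(snapshots, lambda html: "http://a.example/" in html)
print(index)
```

Binary search needs only a logarithmic number of memento downloads, which matters when B has thousands of archived copies.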
[Figure: timeline of B’s mementos, marking the memento where the link to A first appeared.]
Observations recorded:
1. Last modified date from the response header.
2. First Appearance of a backlink.
Social Media Backlinks
• Similarly, you create a social backlink when you
tweet about a page
Topsy Otter API
• Up to 500 tweets
• Different shortened versions
• Break ties via the API epoch
Observations recorded:
1. Last modified date from the response header.
2. First Appearance of a backlink.
3. First Tweet published.
URL Shortening
http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
http://bit.ly/losing_revolution
Extract the number of clicks
Extract the creation date of the Bitly short URL
Observations recorded:
1. Last modified date from the response header.
2. First Appearance of a backlink.
3. First Tweet published.
4. First Bitly Shortened URL created.
Archives Analysis
• Download the Memento timemaps of the resource.
• Get the timestamp of the first memento.
• Furthermore, if the original headers exist for the first memento, we extract
the original last modified date.
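As a rough sketch of this step, the following Python parses a timemap in the link format used by the Memento protocol and returns the earliest memento. The archive host and URIs in the sample are made up for illustration, and the regular expressions cover only the simple quoted-attribute layout shown here.

```python
import re
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional, Tuple

# One link-format entry: <URI> followed by its ;-separated quoted attributes.
_ENTRY = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
_ATTR = re.compile(r'([\w-]+)="([^"]*)"')

def first_memento(timemap: str) -> Optional[Tuple[datetime, str]]:
    """Return (datetime, URI) of the earliest memento listed in a timemap."""
    found = []
    for uri, attr_text in _ENTRY.findall(timemap):
        attrs = dict(_ATTR.findall(attr_text))
        # rel may hold several tokens, e.g. "first memento".
        if "memento" in attrs.get("rel", "").split() and "datetime" in attrs:
            found.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return min(found) if found else None

# Hypothetical timemap for illustration only.
sample = (
    '<http://a.example.org>; rel="original",\n'
    '<http://archive.example.net/20010911203610/http://a.example.org>; '
    'rel="memento"; datetime="Tue, 11 Sep 2001 20:36:10 GMT",\n'
    '<http://archive.example.net/20000620180259/http://a.example.org>; '
    'rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT"\n'
)
print(first_memento(sample))
```

Taking the minimum over all entries, rather than trusting the "first" label, also works when timemaps from several archives are concatenated.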
Observations recorded:
1. Last modified date from the response header.
2. First Appearance of a backlink.
3. First Tweet published.
4. First Bitly Shortened URL created.
5. Time stamp of first memento in the archives.
Search Engine Index Analysis
• We use Google’s search API to extract the last crawled date.
• The time between resource creation and search engine discovery is relatively short.
• Drawback: the granularity is one day; no time of day is reported.
[Figure: last crawled dates]
Observations recorded:
1. Last modified date from the response header.
2. First Appearance of a backlink.
3. First Tweet published.
4. First Bitly Shortened URL created.
5. Time stamp of first memento in the archives.
6. Date of the last crawl by the search engine.
OK, now we have a collection of
sources that return candidate creation dates.
What do we do next?
Timestamps Accumulation
• We collect the obtained dates and take the
leftmost (earliest) one as the estimated creation date.
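The accumulation step amounts to taking a minimum over whichever methods returned a date. A minimal sketch, in which the method names and the sample timestamps are illustrative placeholders rather than real measurements:

```python
from datetime import datetime
from typing import Mapping, Optional, Tuple

def leftmost(
    observations: Mapping[str, Optional[datetime]],
) -> Optional[Tuple[str, datetime]]:
    """Return (method, timestamp) of the earliest observation, or None if none exist."""
    found = [(ts, name) for name, ts in observations.items() if ts is not None]
    if not found:
        return None
    ts, name = min(found)
    return name, ts

# Placeholder values; None marks a method that returned nothing.
# All timestamps must be consistently naive or consistently timezone-aware.
observations = {
    "last_modified": datetime(2013, 5, 8, 8, 3, 2),
    "first_tweet": datetime(2012, 2, 12, 3, 0, 0),
    "first_memento": None,
}
print(leftmost(observations))
```

Keeping the method name alongside the timestamp makes it possible to see later which source produced the winning estimate.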
Next step: Verifying our methods
Estimated Age Verification
1. Collect a dataset of webpages of known
creation/publishing date.
2. Compare the estimates produced by our
method with the actual dates recorded.
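One way to tabulate such a comparison is sketched below. It assumes that an "exact" match means the estimate falls on the same calendar day as the known date; the slides do not state the matching criterion, so treat that as an assumption.

```python
from datetime import date
from typing import Dict, Iterable, Optional, Tuple

def summarize(pairs: Iterable[Tuple[date, Optional[date]]]) -> Dict[str, int]:
    """Count estimates found, exact matches, and misses for (actual, estimated) pairs."""
    total = found = exact = 0
    for actual, estimated in pairs:
        total += 1
        if estimated is None:
            continue              # no method produced an estimate
        found += 1
        if estimated == actual:   # assumption: exact means the same calendar day
            exact += 1
    return {"total": total, "found": found, "exact": exact, "missing": total - found}

# Illustrative pairs only, not data from the study.
pairs = [
    (date(2012, 2, 11), date(2012, 2, 11)),
    (date(2012, 2, 11), date(2012, 3, 1)),
    (date(2012, 2, 11), None),
]
print(summarize(pairs))
```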
Gold Standard Data Collection
We collected pages from four different
categories of collections to ensure variation.
1. News Sites.
2. Social Media and Blogs.
3. Long Standing Domains.
4. Manual Random Extraction.
News Sites
Using RSS and Atom feeds or XML sitemaps we
extracted numerous pages along with their
respective publishing dates.
1. Google News (29,154 pages)
2. BBC (3,703 pages)
3. CNN (18,519 pages)
4. Yahoo News (34,588 pages)
5. The Hollywood Gossip (6,859 pages)
Social Sites
We randomly selected different resources with
no regard to popularity to avoid the inherent
bias:
1. Pinterest (55,463 posts)
2. Tumblr (52,513 posts)
3. YouTube (78,000 posts)
4. WordPress (2,405,901 posts)
5. Blogger (32,417 posts)
Long Standing Domains
• Extract the top 500 domains from Alexa.com.
• Query their DNS registry dates.
• We were able to extract 167 dates.
Manual Random Extraction
• We extracted 90 different random URLs
obtained from random walks on the web and
visually inspected them to extract the creation
date.
• We also included the 10 URLs analyzed by Jatowt et al.*
* A. Jatowt, Y. Kawai, and K. Tanaka. Detecting age of page content. In Proceedings of the 9th
annual ACM international workshop on Web information and data management, WIDM '07,
pages 137--144, New York, NY, USA, 2007. ACM.
Gold Standard Data Collection
From each source we randomly selected 100 unique URLs to create our gold standard
dataset.
Evaluation
• We applied our 6 methods to 1,200 resources.
• We took the leftmost (earliest) estimate for each.

                             Number of Resources   Percentage
An estimation found                          910          76%
Exact matching estimation                    393          33%
No estimation found                          290          24%
Total resources                            1,200         100%
Actual Vs. Estimated Dates
So what happens if one of these 6
methods fails?
Isolation and Elimination
Carbon Date API
http://cd.cs.odu.edu/cd/<Your URL Here>
Carbon Date API on GitHub
• Because the web service responds slowly, we advise
that you download the module and install it on your
machine.
• https://github.com/HanySalahEldeen/CarbonDate
Extra Slides
Without Bitly
Without Google

More Related Content

PPTX
A simple guide to the new facebook timeline
PPT
Dating Fossils And Rocks
PPT
Carbon 14 Dating
PPTX
Group 3: Logarithms (carbon dating)
PDF
Radiocarbon dating theoretical concepts & practical applications
PPT
Dating fossils and rocks
PPTX
Maths3 project
PDF
Logarithms (carbon dating 2)
A simple guide to the new facebook timeline
Dating Fossils And Rocks
Carbon 14 Dating
Group 3: Logarithms (carbon dating)
Radiocarbon dating theoretical concepts & practical applications
Dating fossils and rocks
Maths3 project
Logarithms (carbon dating 2)

Viewers also liked (16)

PPT
Geologic time primer & carbon dating review
PPTX
Science 7.2-Half-Life
PPT
Carbon Presentation
PPTX
Chapter 16.3: Absolute Age Dating
PDF
Indian Food Processing Industry Presentation 060109
PPTX
Uranium enrichment and extraction from ores
PDF
Food science basics 2 - Food Chemistry Basics
PPTX
Food chemistry 1 (m.s.m)
PPTX
Food Technology Carbohydrates
PPTX
Food Technology Proteins
PPTX
Food chemistry
PPT
Food processing
PPT
Pp Presentation Dating Rocks
PPTX
Food processing industry.
PPTX
7.2 Half-Life
Geologic time primer & carbon dating review
Science 7.2-Half-Life
Carbon Presentation
Chapter 16.3: Absolute Age Dating
Indian Food Processing Industry Presentation 060109
Uranium enrichment and extraction from ores
Food science basics 2 - Food Chemistry Basics
Food chemistry 1 (m.s.m)
Food Technology Carbohydrates
Food Technology Proteins
Food chemistry
Food processing
Pp Presentation Dating Rocks
Food processing industry.
7.2 Half-Life
Ad

More from heinestien (8)

PDF
MLEARN 210 B Autumn 2018: Lecture 1
PDF
Doctoral Defense: Hany SalahEldeen
PPTX
Zen & the art of data mining
PDF
Reading the Correct History? Modeling Temporal Intention in Resource Sharing
PPTX
Tpdl Doctoral consortium 2012
PPTX
Losing My Revolution Long Paper TPDL2012
PDF
Hany's JCDL Doctoral Consortium
PPTX
Hany's Doctoral Consortium
MLEARN 210 B Autumn 2018: Lecture 1
Doctoral Defense: Hany SalahEldeen
Zen & the art of data mining
Reading the Correct History? Modeling Temporal Intention in Resource Sharing
Tpdl Doctoral consortium 2012
Losing My Revolution Long Paper TPDL2012
Hany's JCDL Doctoral Consortium
Hany's Doctoral Consortium
Ad

Recently uploaded (20)

PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Assigned Numbers - 2025 - Bluetooth® Document
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Review of recent advances in non-invasive hemoglobin estimation
PPTX
sap open course for s4hana steps from ECC to s4
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPT
Teaching material agriculture food technology
PDF
Machine learning based COVID-19 study performance prediction
PDF
cuic standard and advanced reporting.pdf
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Electronic commerce courselecture one. Pdf
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PPTX
A Presentation on Artificial Intelligence
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
The Rise and Fall of 3GPP – Time for a Sabbatical?
Assigned Numbers - 2025 - Bluetooth® Document
The AUB Centre for AI in Media Proposal.docx
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
MYSQL Presentation for SQL database connectivity
Encapsulation_ Review paper, used for researhc scholars
Review of recent advances in non-invasive hemoglobin estimation
sap open course for s4hana steps from ECC to s4
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Teaching material agriculture food technology
Machine learning based COVID-19 study performance prediction
cuic standard and advanced reporting.pdf
“AI and Expert System Decision Support & Business Intelligence Systems”
Mobile App Security Testing_ A Comprehensive Guide.pdf
Reach Out and Touch Someone: Haptics and Empathic Computing
Electronic commerce courselecture one. Pdf
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
A Presentation on Artificial Intelligence
Advanced methodologies resolving dimensionality complications for autism neur...
Per capita expenditure prediction using model stacking based on satellite ima...

Carbon Dating The Web: Estimating the Age of Web Resources

  • 1. Carbon Dating the Web: Estimating the Age of Web Resources Hany SalahEldeen & Michael Nelson Carbon Dating the Web Hany M. SalahEldeen & Michael L. Nelson Old Dominion University Department of Computer Science Web Science and Digital Libraries Lab.
  • 2. Motivation In our research in social media, resource sharing, and user intention a question emerged… Hany SalahEldeen & Michael Nelson 1 Carbon Dating the Web When did a certain resource first appear on the web?
  • 3. First thought: Last Modified Response Header Hany SalahEldeen & Michael Nelson 2 Carbon Dating the Web $ curl -I http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html HTTP/1.1 200 OK Content-Type: text/html; charset=UTF-8 Expires: Wed, 08 May 2013 14:18:49 GMT Date: Wed, 08 May 2013 14:18:49 GMT Cache-Control: private, max-age=0 Last-Modified: Wed, 08 May 2013 08:03:02 GMT ETag: "e419d850-22ae-4fe6-a0f4-8ab9477f0c0d" X-Content-Type-Options: nosniff X-XSS-Protection: 1; mode=block
  • 4. The server responds with the last modified date … Hany SalahEldeen & Michael Nelson 2 Carbon Dating the Web $ curl -I http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html HTTP/1.1 200 OK Content-Type: text/html; charset=UTF-8 Expires: Wed, 08 May 2013 14:18:49 GMT Date: Wed, 08 May 2013 14:18:49 GMT Cache-Control: private, max-age=0 Last-Modified: Wed, 08 May 2013 08:03:02 GMT ETag: "e419d850-22ae-4fe6-a0f4-8ab9477f0c0d" X-Content-Type-Options: nosniff X-XSS-Protection: 1; mode=block Real Creation date Current Server datetime Last modified date (Incorrect)
  • 5. Lacks accuracy Hany SalahEldeen & Michael Nelson 2 Carbon Dating the Web $ curl -I http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html HTTP/1.1 200 OK Content-Type: text/html; charset=UTF-8 Expires: Wed, 08 May 2013 14:18:49 GMT Date: Wed, 08 May 2013 14:18:49 GMT Cache-Control: private, max-age=0 Last-Modified: Wed, 08 May 2013 08:03:02 GMT ETag: "e419d850-22ae-4fe6-a0f4-8ab9477f0c0d" X-Content-Type-Options: nosniff X-XSS-Protection: 1; mode=block Real Creation date Current Server datetime Last modified date (Incorrect) Problematic as it is inaccurate in a large percentage of cases. 08 May 2013 ≈ 2012-02-11
  • 6. Last modified date header is not available Hany SalahEldeen & Michael Nelson 3 Carbon Dating the Web % curl -I http://temporalweb.net/ HTTP/1.1 200 OK Set-Cookie: 60gpBAK=R1224192509; path=/; expires=Sat, 11-May-2013 03:45:10 GMT Date: Sat, 11 May 2013 02:37:55 GMT Content-Type: text/html Connection: keep-alive Set-Cookie: 60gp=R152135972; path=/; expires=Sat, 11-May-2013 03:36:44 GMT Server: Apache/2.2.X (OVH) Accept-Ranges: bytes Vary: Accept-Encoding Sometimes it is not present in the response headers.
  • 7. Second thought: Timestamp on the page Hany SalahEldeen & Michael Nelson 4 Carbon Dating the Web
  • 8. But the timestamp is highly inconsistent Hany SalahEldeen & Michael Nelson 5 Carbon Dating the Web
  • 9. … and dependent on the page’s style/scheme. Hany SalahEldeen & Michael Nelson 6 Carbon Dating the Web
  • 10. So as its location on the page Hany SalahEldeen & Michael Nelson 7 Carbon Dating the Web
  • 11. Pages’ Timestamps Differ Hany SalahEldeen & Michael Nelson 8 Carbon Dating the Web Very dependent on the page’s scheme/style Not consistent Non-existent sometimes
  • 12. Shortcomings of using timestamp extraction Hany SalahEldeen & Michael Nelson 9 Carbon Dating the Web • M. Inoue and K. Tajima. Noise robust detection of the emergence and spread of topics on the web. In Proceedings of the 2nd Temporal Web Analytics Workshop, TempWeb '12, pages 9 {16, New York, NY, USA, 2012. ACM M. Inoue and K. Tajima developed a technique of extracting creation timestamps on web pages. Shortcomings: • Ambiguity (12/07 is it the 12th of July or the 7th of December?). • Non generalizable. • Highly dependent on the specific CMS • Highly dependent on the most prominent timestamp patterns.
  • 13. But what if the resource itself doesn’t exist any more? Hany SalahEldeen & Michael Nelson 10 Carbon Dating the Web
  • 14. Third thought: First existence in public archives Hany SalahEldeen & Michael Nelson 11 Carbon Dating the Web Timestamp of the first memento
  • 15. Shortcomings: Hany SalahEldeen & Michael Nelson 12 Carbon Dating the Web 1- The page is not archived
  • 16. Shortcomings: Hany SalahEldeen & Michael Nelson 12 Carbon Dating the Web 2- Delay between page creation and archive’s first crawl.
  • 17. Shortcomings: Hany SalahEldeen & Michael Nelson 12 Carbon Dating the Web 3- A page is published then deleted before it is archived.
  • 18. Shortcomings: Hany SalahEldeen & Michael Nelson 12 Carbon Dating the Web 4- The archive’s quarantine (12 month- 2 weeks)
  • 19. Goal Create a tool that estimates with generality the creation date of the resource without relying on specific infrastructures Hany SalahEldeen & Michael Nelson 13 Carbon Dating the Web
  • 20. Target Specification • Doesn’t rely on the infrastructure of the hosting web server. • Doesn’t rely on the state and template of the resource. • Highly generic. • Fast response with no quarantine periods. • High accuracy, getting close estimates to real creation date. Hany SalahEldeen & Michael Nelson 14 Carbon Dating the Web
  • 21. Idea Moving objects leave trails… Hany SalahEldeen & Michael Nelson 15 Carbon Dating the Web
  • 22. Idea Moving objects leave trails… Hany SalahEldeen & Michael Nelson 15 Carbon Dating the Web Or: Foo  If you were Aussie Chad  if you were British
  • 23. Idea Web pages leave trails as well since the day they were created… Hany SalahEldeen & Michael Nelson 16 Carbon Dating the Web
  • 24. Web Trails A web page could leave a trail of one of the following denoting its existence: – References – Links (anchors) – Social media likes and interactions. – URL shortening. – Backlinks Hany SalahEldeen & Michael Nelson 17 Carbon Dating the Web
  • 25. The Assumptions We can propose reasonable assumptions that: 1. We have no prior knowledge of the resource or its hosting web server. 2. The creation date and the publishing date of a resource coincide.  Ex.: When you write a blog, you publish it as soon as you create it. Hany SalahEldeen & Michael Nelson 18 Carbon Dating the Web
  • 26. Idea The creation date of any of the associated events/trails could be an estimate of the creation date. Hany SalahEldeen & Michael Nelson 19 Carbon Dating the Web Web Resource
  • 27. Scenario Let’s consider the following scenario, on Saturday night on the 11th of February of last year I wrote a blog post about my work on the research group’s blog page. Hany SalahEldeen & Michael Nelson 20 Carbon Dating the Web
  • 28. After creating the post I tweeted about it … Hany SalahEldeen & Michael Nelson 21 Carbon Dating the Web https://twitter.com/hanysalaheldeen/status/168704224488730625
  • 29. Then it picked up some speed on Twitter and Facebook … Hany SalahEldeen & Michael Nelson 22 Carbon Dating the Web http://topsy.com/http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
  • 30. The timeline of the resource Hany SalahEldeen & Michael Nelson 23 Carbon Dating the Web
  • 31. Given the events linked to the existence of the resource we will examine ways to extract first observations Hany SalahEldeen & Michael Nelson 24 Carbon Dating the Web
  • 32. Age Estimation Methods 1. Resource and server analysis. 2. Backlinks analysis. a) Web page backlinks. b) Social media backlinks. 3. Archiving analysis. 4. Search engine indexing analysis Hany SalahEldeen & Michael Nelson 25 Carbon Dating the Web
  • 33. Resource and Server Analysis Hany SalahEldeen & Michael Nelson 26 Carbon Dating the Web $ curl -I http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html HTTP/1.1 200 OK Content-Type: text/html; charset=UTF-8 Expires: Wed, 08 May 2013 14:18:49 GMT Date: Wed, 08 May 2013 14:18:49 GMT Cache-Control: private, max-age=0 Last-Modified: Wed, 08 May 2013 08:03:02 GMT ETag: "e419d850-22ae-4fe6-a0f4-8ab9477f0c0d" X-Content-Type-Options: nosniff X-XSS-Protection: 1; mode=block Examine the server response and extract the last modified date from the header if exists.
  • 34. Observations recorded: Hany SalahEldeen & Michael Nelson 27 Carbon Dating the Web 1. Last modified date from the response header.
  • 35. Age Estimation Methods 1. Resource and server analysis. 2. Backlinks analysis. a) Web page backlinks. b) Social media backlinks. 3. Archiving analysis. 4. Search engine indexing analysis Hany SalahEldeen & Michael Nelson 28 Carbon Dating the Web
  • 36. Backlinks Analysis • We use Google search API to discover backlinks of A. • B & C were created after A was created. • But this assumption is not completely true. • Page B or C could be modified later to its creation of A Hany SalahEldeen & Michael Nelson 29 Carbon Dating the Web A (The resource) B C
  • 37. Time Magazine Ex.: If the front page of Time magazine decided to finally feature me as “Person of the Year” In this case page B (Time magazine’s front page) was modified to point to my page A Hany SalahEldeen & Michael Nelson 30 Carbon Dating the Web Hany’s Website Time Magazine
  • 38. When did the link first appear? To solve this problem: 1. We extract the timemap of the archived mementos of B. 2. Perform binary search to allocate the first appearance of the link to A in B. 3. Get the timestamp of that first memento. Hany SalahEldeen & Michael Nelson 31 Carbon Dating the Web time I first appeared here!
  • 39. Observations recorded: Hany SalahEldeen & Michael Nelson 32 Carbon Dating the Web 1. Last modified date from the response header. 2. First Appearance of a backlink.
  • 40. Social Media Backlinks Hany SalahEldeen & Michael Nelson 33 Carbon Dating the Web • Similarly, you create a social backlink when you tweet about a page
  • 41. Topsy Otter API Hany SalahEldeen & Michael Nelson 34 Carbon Dating the Web Upto500Tweets
  • 42. Topsy Otter API Hany SalahEldeen & Michael Nelson 34 Carbon Dating the Web Different shortened versions
  • 43. Topsy Otter API Hany SalahEldeen & Michael Nelson 34 Carbon Dating the Web Break ties via the API epoch
  • 44. Observations recorded: Hany SalahEldeen & Michael Nelson 35 Carbon Dating the Web 1. Last modified date from the response header. 2. First Appearance of a backlink. 3. First Tweet published.
  • 45. URL Shortening Hany SalahEldeen & Michael Nelson 36 Carbon Dating the Web http://ws-dl.blogspot.com/2012/02/2012-02-11- losing-my-revolution-year.html http://bit.ly/losing_revolution Extract number of clicks Creation Date of the Bitly
  • 46. Observations recorded: Hany SalahEldeen & Michael Nelson 37 Carbon Dating the Web 1. Last modified date from the response header. 2. First Appearance of a backlink. 3. First Tweet published. 4. First Bitly Shortened URL created.
  • 47. Age Estimation Methods 1. Resource and server analysis. 2. Backlinks analysis. a) Web page backlinks. b) Social media backlinks. 3. Archiving analysis. 4. Search engine indexing analysis Hany SalahEldeen & Michael Nelson 38 Carbon Dating the Web
  • 48. Archives Analysis • Download the memento timemaps of the resource. • Get the timestamp of the first memento. • Furthermore, if the original headers exist for the first memento, we extract the original last modified date.
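As a rough sketch of the "get timestamp of first memento" step, the snippet below parses a timemap in the link format used by the Memento protocol and returns the earliest memento datetime. The sample timemap text, the `archive.example` host, and the function name `earliest_memento_datetime` are assumptions made for illustration; a real timemap would be downloaded from an archive or aggregator.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

# Hypothetical link-format timemap; URIs and dates are illustrative only.
SAMPLE_TIMEMAP = '''
<http://example.org/page>; rel="original",
<http://archive.example/timemap/http://example.org/page>; rel="self"; type="application/link-format",
<http://archive.example/20120211000000/http://example.org/page>; rel="first memento"; datetime="Sat, 11 Feb 2012 00:00:00 GMT",
<http://archive.example/20130508080302/http://example.org/page>; rel="last memento"; datetime="Wed, 08 May 2013 08:03:02 GMT"
'''

# One link-value: <URI> followed by its parameters, up to the next "<".
LINK_VALUE = re.compile(r'<([^>]+)>\s*;([^<]*)')
REL = re.compile(r'rel="([^"]*)"')
DATETIME = re.compile(r'datetime="([^"]*)"')


def earliest_memento_datetime(timemap_text: str) -> Optional[datetime]:
    """Return the earliest datetime among entries whose rel includes 'memento'."""
    datetimes = []
    for _uri, params in LINK_VALUE.findall(timemap_text):
        rel = REL.search(params)
        stamp = DATETIME.search(params)
        if rel and stamp and "memento" in rel.group(1).split():
            datetimes.append(parsedate_to_datetime(stamp.group(1)))
    return min(datetimes) if datetimes else None


earliest = earliest_memento_datetime(SAMPLE_TIMEMAP)
print(earliest)
```

Taking the minimum over all entries, rather than trusting a `first memento` label, keeps the sketch usable when timemaps from several archives are concatenated.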
  • 49. Observations recorded: 1. Last modified date from the response header. 2. First appearance of a backlink. 3. First tweet published. 4. First Bitly shortened URL created. 5. Timestamp of first memento in the archives.
  • 50. Age Estimation Methods 1. Resource and server analysis. 2. Backlinks analysis. a) Web page backlinks. b) Social media backlinks. 3. Archiving analysis. 4. Search engine indexing analysis.
  • 51. Search Engine Index Analysis • We use Google’s search API to extract the last crawled date. • The time between resource creation and search engine discovery is relatively short. • Drawback: granularity is by day, not by time of day. [Figure: last crawled dates]
  • 52. Observations recorded: 1. Last modified date from the response header. 2. First appearance of a backlink. 3. First tweet published. 4. First Bitly shortened URL created. 5. Timestamp of first memento in the archives. 6. Date of the last crawl by the search engine.
  • 53. OK, now we have a collection of sources that return creation dates. What do we do next?
  • 54. Timestamps Accumulation • We collect the obtained dates and take the leftmost (earliest) creation date recorded.
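The accumulation step can be sketched as taking the minimum over whichever of the six observations returned a date. The source names and all dates below are illustrative placeholders rather than measured values, and `earliest_estimate` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Dict, Optional, Tuple


def earliest_estimate(
    observations: Dict[str, Optional[datetime]],
) -> Optional[Tuple[str, datetime]]:
    """Return (source, date) for the earliest observation, skipping missing ones."""
    found = {name: date for name, date in observations.items() if date is not None}
    if not found:
        return None  # corresponds to "no estimation found"
    source = min(found, key=found.get)
    return source, found[source]


# Placeholder observations for one resource; None means the method returned nothing.
observations = {
    "last_modified_header": datetime(2013, 5, 8, 8, 3, 2),
    "first_backlink": None,
    "first_tweet": datetime(2012, 2, 11, 17, 30, 0),
    "first_bitly": datetime(2012, 2, 12, 9, 0, 0),
    "first_memento": datetime(2012, 2, 14, 0, 0, 0),
    "search_engine_last_crawl": datetime(2012, 2, 13, 0, 0, 0),
}

estimate = earliest_estimate(observations)
print(estimate)
```

Each source can only observe a resource after it exists, so the earliest observation is the tightest upper bound on the creation date that the collected evidence supports.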
  • 55. Timestamps Accumulation
  • 56. Next step: Verifying our methods
  • 57. Estimated Age Verification 1. Collect a dataset of webpages of known creation/publishing date. 2. Compare the estimated results from our method and the actual dates recorded.
  • 58. Gold Standard Data Collection: We collect the pages from 4 different categories of collections to ensure variation. 1. News Sites. 2. Social Media and Blogs. 3. Long Standing Domains. 4. Manual Random Extraction.
  • 59. News Sites: Using RSS and Atom feeds or XML sitemaps we extracted numerous pages along with their respective publishing dates. 1. Google News (29,154 pages) 2. BBC (3,703 pages) 3. CNN (18,519 pages) 4. Yahoo News (34,588 pages) 5. The Hollywood Gossip (6,859 pages)
  • 60. Social Sites: We randomly selected different resources with no regard to popularity to avoid the inherent bias: 1. Pinterest (55,463 posts) 2. Tumblr (52,513 posts) 3. YouTube (78,000 posts) 4. WordPress (2,405,901 posts) 5. Blogger (32,417 posts)
  • 61. Long Standing Domains • Extract the top 500 domains from Alexa.com. • Query their DNS registry dates. • We were able to extract 167 dates.
  • 62. Manual Random Extraction • We extracted 90 random URLs obtained from random walks on the web and visually inspected them to extract the creation date. • We also included the 10 URLs analyzed by Jatowt et al.* * A. Jatowt, Y. Kawai, and K. Tanaka. Detecting age of page content. In Proceedings of the 9th Annual ACM International Workshop on Web Information and Data Management, WIDM '07, pages 137-144, New York, NY, USA, 2007. ACM.
  • 63. Gold Standard Data Collection • From each source we randomly selected 100 unique URLs to create our gold standard dataset.
  • 64. Evaluation • Applied our 6 methods to 1200 resources. • Take the leftmost (earliest) estimate.
      Outcome                      Number of Resources   Percentage
      An estimation found          910                   76%
      Exact matching estimation    393                   33%
      No estimation found          290                   24%
      Total Resources              1200                  100%
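The percentages in the table follow directly from the counts; a short arithmetic check, using only the numbers reported on the slide (the variable names are ours):

```python
# Counts reported in the evaluation table.
total = 1200
estimation_found = 910
exact_match = 393
no_estimation = 290

# Every resource either produced an estimate or did not.
assert estimation_found + no_estimation == total


def pct(count: int) -> int:
    """Share of the 1200 resources, rounded to the nearest whole percent."""
    return round(100 * count / total)


percentages = {
    "An estimation found": pct(estimation_found),
    "Exact matching estimation": pct(exact_match),
    "No estimation found": pct(no_estimation),
}
print(percentages)
```

Note that the exact-match share is computed against all 1200 resources, not against the 910 for which an estimate was found.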
  • 65. Evaluation
  • 66. Actual Vs. Estimated Dates
  • 67. So what happens if one of these 6 methods fails?
  • 68. Isolation and Elimination
  • 69. Carbon Date API
  • 70. http://cd.cs.odu.edu/cd/<Your URL Here>
  • 71. Carbon Date API on GitHub • Because the hosted service responds slowly, we advise that you download the module and install it on your machine. • https://github.com/HanySalahEldeen/CarbonDate
  • 72. Extra Slides
  • 73. Without Bitly
  • 74. Without Google