A Framework for Aggregating
Private and Public Web Archives
Mat Kelly
Old Dominion University, Norfolk, VA
Advisor: Michele C. Weigle
JCDL 2015 Doctoral Consortium
June 21, 2015
The Problem
[Figure labels: private archive (×4); other (×2)]
Not All Archives Can Be Aggregated
[Figure labels: private archive (×4); other (×2); TimeMap]
t = k ≠ t = k-1
[Figure labels: 1 year ago; 2 years ago; 10 years ago; …; 180 years ago; TimeMap]
private
archive
Proactive Preservation
• Just-in-time WARC creation
• Personal and potentially private web archiving
• Mitigates deferral problem
Public vs. Private
Web Archiving
• Public Web Archiving
– Relies on deferred capture
– Uses WARC, Memento, etc.
– Integrates with other public web archives
• Private Web Archiving
– Same tools, less overhead, less bureaucracy
– Uses WARC, Memento, etc.
– Does not integrate
Typical Web Archive Access
1. Web User Interface
2. Memento
– TimeGate
– TimeMap
– Accept-Datetime (content negotiation)
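To make the Accept-Datetime negotiation concrete, the following is a minimal sketch of how a client might build the header it sends to a TimeGate (URI-G). The function name `accept_datetime_header` is an assumption made for illustration; only the header name and its GMT HTTP-date format come from the Memento specification (RFC 7089).

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict[str, str]:
    """Return the request header a client would send to a TimeGate (URI-G)."""
    if when.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    # RFC 7089 expects an HTTP-date expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": http_date}


# The desired datetime below is only an example value.
headers = accept_datetime_header(datetime(2015, 2, 28, 15, 57, 3, tzinfo=timezone.utc))
print(headers)
```

A TimeGate that supports negotiation would be expected to answer such a request by pointing to the memento closest to the requested datetime.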
[Figure labels: URI-G; TimeMap]
Aggregating Multiple Web Archives
• Memento Aggregator
– Temporally Sorted TimeMap combined from
multiple archives
– Allows temporal gaps in one archive to be filled in
by another
TimeMap
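The aggregation step can be sketched as a merge of per-archive TimeMaps followed by a sort on capture datetime. This is an illustrative sketch, not the Memento Aggregator's actual implementation; the archive hosts, URI-Ms, and the helper names `aggregate_timemaps` and `utc` are all hypothetical.

```python
from datetime import datetime, timezone

Memento = tuple[datetime, str]  # (capture datetime, URI-M)


def aggregate_timemaps(timemaps: list[list[Memento]]) -> list[Memento]:
    """Combine per-archive TimeMaps into one list ordered by capture datetime."""
    combined = [memento for timemap in timemaps for memento in timemap]
    return sorted(combined, key=lambda memento: memento[0])


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical archives and URI-Ms, for illustration only.
archive_a = [
    (utc(2015, 2, 28), "http://archive-a.example/20150228/http://example.com/"),
    (utc(2015, 3, 10), "http://archive-a.example/20150310/http://example.com/"),
]
archive_b = [
    (utc(2015, 3, 5), "http://archive-b.example/20150305/http://example.com/"),
]

merged = aggregate_timemaps([archive_a, archive_b])
for when, uri_m in merged:
    print(when.date(), uri_m)
```

Here the capture from the second archive falls between the two captures of the first, which is the gap-filling behavior described in the bullets.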
Archive Supplementation
• More capturesgreater temporal coverage
• Content on Deep Web
• A large chunk of the Web is not preserved
– Tools’ inability
– Inconsistency over time due to personalization
Concerns in Aggregating Private
Web Archives
• Privacy
• Inconsistency of page representation
– URI is insufficient key for access
• Archival integrity
– Has the private archive’s content been manipulated?
Why Individuals Might Want
Personalized Aggregations
• Show my private web archive captures
• Concerned about exposing sensitive info to
public
– But still want to view temporally inline
• Private/Restricted Archives are becoming ever
more common
Temporal Supplementation
My Archives Have
What They May Have Missed
The Concerns Distilled
• Access Control
– And indicators for private web archives (PWAs)
• Preservation of Private Content
• Interoperability without privacy compromise
Web Archive Usage Pattern 1:
Direct Access
[Figure labels: OR; TimeMap]
Web Archive Usage Pattern 2:
Web Archive Aggregation
• Better results for a URI due to more sources
for capture
[Figure label: TimeMap]
Previous Patterns: Status Quo
• Patterns 1 and 2 are status quo
– provided by framework
• Querying web archives currently only
considers public web content
– URI for lookup
• Framework introduces 2 new entities
– Memento Meta Aggregator (MMA)
– Private Web Archive Adapter (PWAA)
Memento Meta Aggregator (MMA)
• Functional superset of the Memento Aggregator (MA)
• Can act as intermediary client to relay MA
results to ultimate user
• Allows just-in-time (JIT) inclusion of archives
– as specified at query time
• Set of archives aggregated can be dynamic
– e.g., Results must not include IA
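One way to picture the just-in-time behavior is an aggregator whose archive set is adjusted per query. The sketch below is a toy model under stated assumptions: the class shape, the `include`/`exclude` parameters, the archive names, and the memento identifiers are all hypothetical and not part of any existing Memento software.

```python
from typing import Callable, Optional

Memento = tuple[str, str]  # (ISO 8601 capture datetime, URI-M)
Lookup = Callable[[str], list[Memento]]  # maps a URI-R to one archive's mementos


class MetaAggregator:
    """Toy MMA: the set of archives queried can change at query time."""

    def __init__(self, default_archives: dict[str, Lookup]):
        self.default_archives = dict(default_archives)

    def timemap(self, uri_r: str,
                include: Optional[dict[str, Lookup]] = None,
                exclude: tuple[str, ...] = ()) -> list[Memento]:
        # Start from the defaults, add archives named at query time,
        # then drop any the user asked to leave out.
        sources = {**self.default_archives, **(include or {})}
        for name in exclude:
            sources.pop(name, None)
        mementos = [m for lookup in sources.values() for m in lookup(uri_r)]
        return sorted(mementos)  # same-format ISO 8601 strings sort chronologically


def fixed(mementos: list[Memento]) -> Lookup:
    """Hypothetical stand-in for a real archive lookup."""
    return lambda uri_r: list(mementos)


mma = MetaAggregator({
    "ia": fixed([("2015-02-28T15:57:03Z", "ia-1"), ("2015-03-10T14:07:21Z", "ia-2")]),
    "archive-it": fixed([("2015-03-05T21:59:22Z", "ait-1")]),
})
mine = {"my-archive": fixed([("2015-03-06T12:34:57Z", "mine-1")])}

with_mine = mma.timemap("https://www.facebook.com/", include=mine)
without_ia = mma.timemap("https://www.facebook.com/", include=mine, exclude=("ia",))
print([uri_m for _, uri_m in with_mine])
print([uri_m for _, uri_m in without_ia])
```

The second query corresponds to the "Results must not include IA" example: the same aggregator returns a different set because the archive list was specified when the query was made.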
Aggregating My Captures
[Figure labels: Various public web archives; My web archives; MY CNN CAPTURES; MY BANK CAPTURES]
The Current Memento Aggregator
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; counts 100, 30, 10]
Accessing the Aggregator
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; counts 100, 30, 10]
Accessing the Aggregator
…does not include our archives
[Figure labels: MY CNN CAPTURES (NOT AGGREGATED); MY BANK CAPTURES (NOT AGGREGATED); counts 100, 30, 10, 140]
Pattern 3: Aggregator Relay
Access via the Meta Aggregator
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; counts 100, 30, 10, 140, 140]
Web Archive Usage Pattern 4:
Including Additional Archives in Aggregation
Access via the Meta Aggregator
…allows our archives to be included
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; counts 100, 30, 10, 15, 140, 155]
MMAs Allow Our Public Captures
to be Shared
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; counts 100, 30, 10, 15, 140, 155, 155, 155]
Web Archive Usage Pattern 5:
Recursive MMA Access
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; …; Bob’s public CAPTURES; The organization’s public CAPTURES 1; The organization’s public CAPTURES 2; contains A B C D; contains B C D; contains C D; A; B C; D; counts 10, 5, 15, 15, 20, 35, 35, 15, 50, 50]
New Framework Entity 1:
Memento Meta Aggregator
• Allow dynamic and JIT set of archives
• Superset can be recursively constructed
• Sets can be shared
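Recursive construction can be sketched by letting an aggregator treat another aggregator as just one more source. The classes and archive names below are hypothetical; the capture counts (10, 5, 20, 15, yielding 15, 35, 50) echo the numbers on the Pattern 5 slide, though which archive holds which count is an assumption.

```python
class Archive:
    """Hypothetical archive that holds a fixed number of captures per URI-R."""

    def __init__(self, name: str, capture_count: int):
        self.name = name
        self.capture_count = capture_count

    def mementos(self, uri_r: str) -> set[str]:
        return {f"{self.name}/{i}/{uri_r}" for i in range(self.capture_count)}


class MetaAggregator:
    """Toy MMA whose sources may be archives or other MMAs."""

    def __init__(self, *sources):
        self.sources = sources

    def mementos(self, uri_r: str) -> set[str]:
        merged: set[str] = set()
        for source in self.sources:
            # A set avoids double counting if an archive is reachable twice.
            merged |= source.mementos(uri_r)
        return merged


uri = "http://www.cnn.com/"
personal = MetaAggregator(Archive("my-public", 10), Archive("bob-public", 5))
org_1 = MetaAggregator(personal, Archive("org-public-1", 20))
org_2 = MetaAggregator(org_1, Archive("org-public-2", 15))

print(len(personal.mementos(uri)), len(org_1.mementos(uri)), len(org_2.mementos(uri)))
```

Because each level only needs the `mementos` method of its sources, a shared set of archives can be reused by pointing a new aggregator at an existing one.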
My public captures
can be integrated
with public web archives’
Private Web Archive Adapter
(PWAA)
• Regulates access to Private Web Archives
(PWAs)
• Acts as token authorizer
• With credentials OK, relays results as if
querying the PWA directly
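The PWAA's role as token authorizer can be sketched as follows. This is a minimal illustration, assuming a key-for-token exchange like the one in the walkthrough slides; the class, method names, and `AccessDenied` exception are hypothetical, and a real deployment would more likely build on an established scheme such as OAuth 2.0 bearer tokens.

```python
import secrets
from typing import Callable


class AccessDenied(Exception):
    """Raised when a key or token is not recognized."""


class PrivateWebArchiveAdapter:
    """Toy PWAA: exchanges a known key for a reusable token, then relays lookups."""

    def __init__(self, archive_lookup: Callable[[str], list[str]], authorized_keys: set[str]):
        self._lookup = archive_lookup
        self._keys = set(authorized_keys)
        self._tokens: set[str] = set()

    def get_token(self, key: str) -> str:
        if key not in self._keys:
            raise AccessDenied("unknown key")
        token = secrets.token_hex(4)  # random reusable token
        self._tokens.add(token)
        return token

    def mementos(self, uri_r: str, token: str) -> list[str]:
        if token not in self._tokens:
            raise AccessDenied("missing or invalid token")
        # With valid credentials, behave as if the PWA were queried directly.
        return self._lookup(uri_r)


# Hypothetical private archive holding three captures of any URI-R.
pwaa = PrivateWebArchiveAdapter(
    lambda uri_r: [f"private-capture-{i}" for i in range(3)],
    authorized_keys={"abcd1234"},
)
token = pwaa.get_token("abcd1234")
mementos = pwaa.mementos("https://bank.example/", token)
print(len(mementos), "captures")
```

Raising an error on a bad token is one design choice; silently returning zero mementos is the alternative, and which is preferable is left open here.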
User Establishes Access with PWA
GET TOKEN for PWA
Key: abcd1234
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; counts 100, 30, 10; 3 captures; 10,000 captures]
MMA Relays Request
GET TOKEN for PWA
Key: abcd1234
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; counts 100, 30, 10; 3 captures; 10,000 captures]
PWAA Accepts Request
Generates Reusable Token
ACCESS OK
Token: 4f33c64
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; counts 100, 30, 10; 3 captures; 10,000 captures]
User Submits Request for URI-R
with Token
GET mementos for URI
Token: 4f33c64
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; counts 100, 30, 10; 3 captures; 10,000 captures]
MMA Relays Request (again)
GET mementos for URI
Token: 4f33c64
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; counts 100, 30, 10; 3 captures; 10,000 captures]
PWAA Verified & Relays Request
MA Gets Mementos, per usual
Token: 4f33c64
OK
GET mementos for URI
GET mementos for URI
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; counts 100, 30, 10; 3 captures; 10,000 captures]
Archives Return Mementos
Token: 4f33c64 OK
Returning mementos
Return mementos for URI
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; counts 100, 30, 10; 3 captures; 10,000 captures]
PWAA Relays TimeMap
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; TimeMap (×3); counts 100, 30, 10; 3 captures; 10,000 captures; 140; 10,000; 10,143; 140 captures]
MMA Annotates and Aggregates
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; TimeMap (×3); counts 100, 30, 10; 3 captures; 10,000 captures; 10,143; 140 captures; 3 captures; 10,000 captures]
Web Archive Usage Pattern 6:
Aggregating Public & Private Archives
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; TimeMap; counts 100, 30, 10; 3 captures; 10,000 captures; 10,143 captures]
Regulated Access Can Be Shared
GET mementos for URI
Token: 4f33c64
GET mementos for URI
Token: c5463b4
GET TOKEN for PWA
Key: 2265eef3
No/invalid token returned
Access denied or 0 mementos
[Figure labels: MY CNN CAPTURES; MY BANK CAPTURES; 3 captures; 10,000 captures]
Aggregating Multiple PWAs
GET TOKENs for PWAs
Key: abcd1234, Archive: My
Key: cab45cbf, Archive: Linda
Key: b0b01b, Archive: Bob
[Figure labels: MY BANK CAPTURES; Linda’s Private Captures; Bob’s Private Captures; 3 captures; 5 captures; 10 captures]
Aggregating Multiple PWAs
Access OK
Token: 7790ca
Access OK
Token: b0b01b
ACCESS DENIED
[Figure labels: MY BANK CAPTURES; Linda’s Private Captures; Bob’s Private Captures; 3 captures; 5 captures; 10 captures]
PWAs Can Then be Aggregated
GET mementos for URI
Token: 7790ca, Archive: My
Token: null, Archive: Linda
Token: b0b01b, Archive: Bob
[Figure labels: MY BANK CAPTURES; Linda’s Private Captures; Bob’s Private Captures; 3 captures; 5 captures; 10 captures; counts 3, 10, ø, 13]
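The multi-archive request above can be sketched as one token per private archive, where an archive with a null or invalid token simply contributes nothing. The helper names are hypothetical, and Linda's valid token (`c0ffee`) is invented for illustration since the slides show her archive denying access; the capture counts 3, 5, and 10 follow the figure.

```python
from typing import Callable, Optional

Adapter = Callable[[str, Optional[str]], list[str]]


def make_pwaa(valid_token: str, captures: list[str]) -> Adapter:
    """Hypothetical adapter that returns captures only for the right token."""
    def lookup(uri_r: str, token: Optional[str]) -> list[str]:
        return list(captures) if token == valid_token else []
    return lookup


def aggregate_pwas(adapters: dict[str, Adapter],
                   tokens: dict[str, Optional[str]],
                   uri_r: str) -> list[str]:
    """Query every private archive with its own token and combine the results."""
    mementos: list[str] = []
    for archive, lookup in adapters.items():
        mementos.extend(lookup(uri_r, tokens.get(archive)))
    return sorted(mementos)


adapters = {
    "My": make_pwaa("7790ca", [f"my-{i}" for i in range(3)]),
    "Linda": make_pwaa("c0ffee", [f"linda-{i}" for i in range(5)]),  # invented token
    "Bob": make_pwaa("b0b01b", [f"bob-{i}" for i in range(10)]),
}
tokens = {"My": "7790ca", "Linda": None, "Bob": "b0b01b"}

result = aggregate_pwas(adapters, tokens, "https://bank.example/")
print(len(result), "captures aggregated")
```

With Linda's token missing, only the 3 and 10 captures from the other two archives are combined, matching the total of 13 in the figure.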
Sample TimeMap
...
, <http://web.archive.org/web/20150228155703/https://facebook.com/>;rel="memento";
datetime="Sat, 28 Feb 2015 15:57:03 GMT"
, <http://web.archive.org/web/20150228163939/http://www.facebook.com/>;rel="memento";
datetime="Sat, 28 Feb 2015 16:39:39 GMT"
, <http://web.archive.org/web/20150303162841/https://www.facebook.com/>;rel="memento";
datetime="Tue, 03 Mar 2015 16:28:41 GMT"
, <http://users2machine.local/web/20150305000101/https://www.facebook.com/>;rel="memento";
datetime="Thu, 05 Mar 2015 00:01:00 GMT"; key="e395935019ee467c797034ee410cc91e"
, <//wayback.archive-it.org/all/20150305215922/https://facebook.com/>;rel="memento";
datetime="Thu, 05 Mar 2015 21:59:22 GMT"
, <http://previouslyUnaggregated.org/web/20150306123457/https://www.facebook.com/>;rel="memento";
datetime="Fri, 06 Mar 2015 12:34:57 GMT"
, <http://web.archive.org/web/20150310140721/https://www.facebook.com/>;rel="memento";
datetime="Tue, 10 Mar 2015 14:07:21 GMT"
...
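A client consuming such a TimeMap could separate private from public entries by checking for the `key` attribute. The parser below is a minimal sketch for the link-format fragment shown on the slide, not a full RFC 7089 or RFC 8288 parser; the `key` attribute comes from the slide, while the function and regular expressions are assumptions for illustration.

```python
import re

# One entry: <URI-M> followed by ;name="value" attributes.
ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')

SAMPLE = '''
, <http://web.archive.org/web/20150303162841/https://www.facebook.com/>;rel="memento";
datetime="Tue, 03 Mar 2015 16:28:41 GMT"
, <http://users2machine.local/web/20150305000101/https://www.facebook.com/>;rel="memento";
datetime="Thu, 05 Mar 2015 00:01:00 GMT"; key="e395935019ee467c797034ee410cc91e"
'''


def parse_timemap(text: str) -> list[dict[str, str]]:
    """Return one dict per entry with its URI-M and attributes."""
    entries = []
    for uri_m, attrs in ENTRY.findall(text):
        entries.append({"uri": uri_m, **dict(ATTR.findall(attrs))})
    return entries


entries = parse_timemap(SAMPLE)
private = [e for e in entries if "key" in e]
public = [e for e in entries if "key" not in e]
print(len(public), "public,", len(private), "private")
```

Only the entry served from the user's own machine carries a `key`, so it is the only one a client would need to present credentials for.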
Access Token Included in TimeMap
...
, <http://users2machine.local/web/20150305000101/https://www.facebook.com/>;rel="memento";
datetime="Thu, 05 Mar 2015 00:01:00 GMT"; key="e395935019ee467c797034ee410cc91e"
...
MY PRIVATE FACEBOOK CAPTURES
My Public Web Archive,
Now Aggregated
...
, <http://previouslyUnaggregated.org/web/20150306123457/https://www.facebook.com/>;rel="memento";
datetime="Fri, 06 Mar 2015 12:34:57 GMT"
...
MY PUBLIC FACEBOOK CAPTURES
Evaluation Plan
• How effective is the Framework?
• Scalability ramifications of additional
infrastructure?
• Is public-private tokenization the most suitable
method for persistent access?
• How can a single archive be sub-divided
between private/public and access controlled?
Previous Work
PUBLICATIONS
Preservation and Replay
PDA 2013 - Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
JCDL 2012 - WARCreate - Create Wayback-Consumable WARC Files from Any Webpage
Evaluating Capture
IJDL 2015 - Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources
IJDL 2015 - The Impact of JavaScript on Archivability
JCDL 2014 - Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources
JCDL 2014 - The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript
D-Lib 2013 - A Method for Identifying Personalized Representations in the Archives
TPDL 2013 - On the Change in Archivability of Websites Over Time
Archival Integration
JCDL 2015 - Mobile Mink: Merging Mobile and Desktop Archived Webs
JCDL 2014 - Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento
SOFTWARE PRODUCTS
WARCreate – preserve from the browser
WAIL – private web archiving all-in-one suite
Mink – integrate the live and archived web
Current Work
• Other approaches of archival lookup beyond
URI
• Appropriate metadata to indicate private web
content in WARC files
• Existing integration attempts by private web
archives & individuals
Dissertation Plan
✓ Background Research
✓ PhD Requirements (Coursework, Qualifying Exam, etc.)
✓ Build preliminary framework model
✓ JCDL Doctoral Consortium
EXTENDED RESEARCH
• Research prevalence of private web archives
• Research access control methods in web archiving and other domains
• Investigate other access patterns and expound on those defined
• PhD Candidacy Exam describing merit of research plan
• Implement feedback received from candidacy exam committee
• Programmatically implement MMA and PWAA
CASE STUDIES (real-world application)
• Publicly Available Non-Aggregated Archives (e.g., Rhizome)
• Deep web preservation/access (bank account/Facebook feeds)
• DISSERTATION DEFENSE
Preliminary Publication Plan
JCDL 2016: Evaluation of User Access Patterns for Private Web Archives
TPDL 2016: Methods in adding JIT Inclusion of Private Web Archives in Memento
ACM SACMAT*: Research exploring tokenization and similar methods for archival access establishment
iPres 2016: Research investigating URI clash & other needed identifiers for distinguishing archived content from the “deep web” from archived content from the public live web.
* Symposium on Access Control Models and Technologies
Future Research Questions
• Can a PWAA perform content negotiation[1] on
the private-public spectrum?
• What level of security is needed?
– e.g., reporting UNAUTHORIZED vs. 0 mementos
[1] RFC 2295: https://www.ietf.org/rfc/rfc2295.txt
Summation
• Why?
– No means exists to integrate private and public web
archives.
• How to Evaluate?
– Does this framework fit real world needs? Scalable?
• When will I know I am done?
– Any public/private web archive* can be integrated.
* -compliant
References
• D. Abrams, R. Baecker, and M. Chignell. Information Archiving with Bookmarks: Personal Web Space Construction and
Archiving. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 41–48, 1998.
• A. AlSum, M. Weigle, M. Nelson, and H. Van de Sompel. Profiling Web Archive Coverage for Top-Level Domain and Content
Language. International Journal on Digital Libraries, 14(3-4):149–166, 2014.
• J. F. Brunelle, M. Kelly, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. Not All Mementos Are Created Equal: Measuring The
Impact Of Missing Resources. In Proceedings of JCDL, pages 321–330, London, England, 2014.
• J. F. Brunelle, M. Kelly, M. C. Weigle, and M. L. Nelson. The Impact of JavaScript on Archivability. International Journal on
Digital Libraries, pages 1–23, 2015.
• J. F. Brunelle and M. L. Nelson. An Evaluation of Caching Policies for Memento TimeMaps. In Proceedings of JCDL, pages
267–276, 2013.
• D. Gomes, S. Freitas, and M. J. Silva. Design and Selection Criteria for a National Web Archive. In Research and Advanced
Technology for Digital Libraries, pages 196–207. Springer, 2006.
• D. Hardt. The OAuth 2.0 Authorization Framework. IETF RFC 6749, October 2012.
• M. Jones and D. Hardt. The OAuth 2.0 Authorization Framework: Bearer Token Usage. IETF RFC 6750, October 2012.
• M. Kelly, J. F. Brunelle, M. C. Weigle, and M. L. Nelson. A Method for Identifying Personalized Representations in the
Archives. D-Lib Magazine, 19(11/12), Nov/Dec 2013.
• M. Kelly, J. F. Brunelle, M. C. Weigle, and M. L. Nelson. On the Change in Archivability of Websites Over Time. In Proceedings
of the International Conference on Theory and Practice of Digital Libraries (TPDL), pages 35–47, Valletta, Malta, 2013.
• M. Kelly, M. L. Nelson, and M. C. Weigle. Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving Using
XAMPP. Poster and demo presented at Personal Digital Archiving, February 2013.
• M. Kelly, M. L. Nelson, and M. C. Weigle. The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and
JavaScript. In Proceedings of JCDL, pages 25–28, London, England, September 2014.
• M. Kelly and M. C. Weigle. WARCreate - Create Wayback-Consumable WARC Files from Any Webpage. In Proceedings of
JCDL, pages 437–438, Washington, DC, June 2012.
• C. C. Marshall. Rethinking Personal Digital Archiving, Part 1. D-Lib Magazine, 14(3/4), Mar/Apr 2008.
• C. C. Marshall. Rethinking Personal Digital Archiving, Part 2. D-Lib Magazine, 14(3/4), Mar/Apr 2008.
• J. Niu. Functionalities of Web Archives. D-Lib Magazine, 18(3/4), Mar/Apr 2012.
• M. Phillips. PANDORA, Australia’s Web Archive, and the Digital Archiving System that Supports It.
http://pandora.nla.gov.au/pandas.html, 2003.
• H. C.-H. Rao, Y.-F. Chen, and M.-F. Chen. A Proxy-based Personal Web Archiving Service. SIGOPS Oper. Syst. Rev., 35(1):61–72,
Jan. 2001.
• A. Rauber, M. Kaiser, and B. Wachter. Ethical Issues in Web Archive Creation and Usage-Towards a Research Agenda. In 8th
International Web Archiving Workshop (IWAW08), 2008.
• D. Rosenthal. Re-thinking Memento Aggregation. http://blog.dshr.org/2013/03/re-thinking-memento-aggregation.html,
2013.
• T. Schwarz, M. Baker, S. Bassi, B. Baumgart, W. Flagg, C. van Ingen, K. Joste, M. Manasse, and M. Shah. Disk Failure
Investigations at the Internet Archive. In Work-in-Progress session, NASA/IEEE Conference on Mass Storage Systems and
Technologies (MSST2006), 2006.
• S. Strodl, F. Motlik, K. Stadler, and A. Rauber. Personal & Soho Archiving. In Proceedings of JCDL, pages 115–123, 2008.
• M. Thelwall and L. Vaughan. A fair history of the Web? Examining country balance in the Internet Archive. Library &
Information Science Research, 26(2):162–176, 2004.
• B. Tofel. ‘Wayback’ for Accessing Web Archives. In 7th International Web Archiving Workshop (IWAW07), 2007.
• H. Van de Sompel, M. Nelson, and R. Sanderson. HTTP Framework for Time-Based Access to Resource States – Memento.
IETF RFC 7089, December 2013.
• T. Wang, M. Srivatsa, and L. Liu. Fine-Grained Access Control of Personal Data. In Proceedings of the 17th ACM Symposium
on Access Control Models and Technologies, pages 145–156, 2012.
JCDL 2015 Doctoral
Consortium
A Framework for Aggregating Private and
Public Web Archives
65
A Framework for Aggregating
Private and Public Web Archives
Mat Kelly
Old Dominion University, Norfolk, VA
Advisor: Michele C. Weigle
JCDL 2015 Doctoral Consortium
June 21, 2015

More Related Content

PPTX
Slides
PPTX
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery
PPTX
Detecting Off-Topic Web Pages at #CUWARC
PPTX
Digital Collection Management with CONTENTdm and Omeka
PDF
Jabes 2008 - Conférence inaugurale, la grande révélation : penser les ressour...
PPTX
ResourceSync Tutorial
PDF
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...
PDF
Preservation as a Process MetaArchive and Distributed Digital Preservation
Slides
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery
Detecting Off-Topic Web Pages at #CUWARC
Digital Collection Management with CONTENTdm and Omeka
Jabes 2008 - Conférence inaugurale, la grande révélation : penser les ressour...
ResourceSync Tutorial
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...
Preservation as a Process MetaArchive and Distributed Digital Preservation

What's hot (20)

PDF
Mind the gap! Reflections on the state of repository data harvesting
PDF
DHO Intro to CMS for DH Workshop
PDF
Metadata Provenance Tutorial Part 2: Interoperable Metadata Provenance
PPTX
Authority Control: Wikipedia + Wikidata
PPTX
Your Digital Preservation Cookbook
PPTX
"Web Archive services framework for tighter integration between the past and ...
PPTX
Reminiscing about interoperability
PPT
Access to Content via Link Resolvers
PDF
DSpace Update from Open Repositories 2014
PDF
The Dark Side of Digital Preservation: Distributed Digital Preservation
PDF
Curation and Digital Storytelling
PPT
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
PDF
Integration of Web Protégé into DBpedia
PPTX
A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary ...
PDF
Cataloging Landscape Update: RDA and LC Working Group on the Future of Biblio...
PPTX
Best Practices for Descriptive Metadata
PDF
Web Archive Profiling Through Fulltext Search
PPT
Achieving Link Integrity for Managed Collections
PPTX
Introduction to Linked Data Platform (LDP)
PDF
Cenitpede: Analyzing Webcrawl
Mind the gap! Reflections on the state of repository data harvesting
DHO Intro to CMS for DH Workshop
Metadata Provenance Tutorial Part 2: Interoperable Metadata Provenance
Authority Control: Wikipedia + Wikidata
Your Digital Preservation Cookbook
"Web Archive services framework for tighter integration between the past and ...
Reminiscing about interoperability
Access to Content via Link Resolvers
DSpace Update from Open Repositories 2014
The Dark Side of Digital Preservation: Distributed Digital Preservation
Curation and Digital Storytelling
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Integration of Web Protégé into DBpedia
A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary ...
Cataloging Landscape Update: RDA and LC Working Group on the Future of Biblio...
Best Practices for Descriptive Metadata
Web Archive Profiling Through Fulltext Search
Achieving Link Integrity for Managed Collections
Introduction to Linked Data Platform (LDP)
Cenitpede: Analyzing Webcrawl
Ad

Similar to JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Public Web Archives (20)

PPTX
Information sharing about Columbia University Library’s recent web archiving ...
PPTX
Building an Accessible Digital Institution
PPTX
Aggregating Private and Public Web Archives Using the Mementity Framework
PPTX
Archiving Web-Based #musetech for Institutional Memory
PDF
Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial
PDF
IR-AUDIT
PPTX
Uc3 pasig-asis&t-2013-08-20-support-of-data-intensive-research
PPTX
Capture All the URLs: First Steps in Web Archiving
PDF
How you and your gateway can benefit from the services of the Science Gateway...
PPTX
Web archiving challenges and opportunities
PPT
JISC PoWR poster
PPT
Unleashing library services with web 2.0 (ss)
PPTX
METRO Conference 2014: How collaboration can save [more of] the web: recent p...
PDF
Time -Travel on the Internet
PPT
Estermann Wikidata and Heritage Data 20170914
PPTX
Linked Energy Data Generation
PPTX
PPTX
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...
PDF
Essentials of Open Source Documentation
PPTX
BS 8878: Systematic Approaches to Documenting Web Accessibility Policies and ...
Information sharing about Columbia University Library’s recent web archiving ...
Building an Accessible Digital Institution
Aggregating Private and Public Web Archives Using the Mementity Framework
Archiving Web-Based #musetech for Institutional Memory
Exploiting Wikipedia for Information Retrieval Tasks, SIGIR Tutorial
IR-AUDIT
Uc3 pasig-asis&t-2013-08-20-support-of-data-intensive-research
Capture All the URLs: First Steps in Web Archiving
How you and your gateway can benefit from the services of the Science Gateway...
Web archiving challenges and opportunities
JISC PoWR poster
Unleashing library services with web 2.0 (ss)
METRO Conference 2014: How collaboration can save [more of] the web: recent p...
Time -Travel on the Internet
Estermann Wikidata and Heritage Data 20170914
Linked Energy Data Generation
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...
Essentials of Open Source Documentation
BS 8878: Systematic Approaches to Documenting Web Accessibility Policies and ...
Ad

More from Mat Kelly (17)

PPTX
Client-Assisted Memento Aggregation Using the Prefer Header
PDF
A Framework for Aggregating Public and Private Web Archives
PDF
Impact of URI Canonicalization on Memento Count
PPTX
Exploring Aggregation of Personal, Private, and Institutional Web Archives
PPTX
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
PDF
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
PDF
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
PPTX
Browser-Based Digital Preservation
PPTX
Archive What I See Now - Archive-It Partner Meeting 2013 2013
PDF
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
PPTX
Digital Preservation 2013
PDF
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
PPTX
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
PDF
The Revolution Will Not Be Archived
PPTX
WARCreate - Create Wayback-Consumable WARC Files from Any Webpage
PPTX
NDIIPP/NDSA 2011 - YouTube Link Restoration
PPTX
NDIIPP/NDSA 2011 - Archive Facebook
Client-Assisted Memento Aggregation Using the Prefer Header
A Framework for Aggregating Public and Private Web Archives
Impact of URI Canonicalization on Memento Count
Exploring Aggregation of Personal, Private, and Institutional Web Archives
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Browser-Based Digital Preservation
Archive What I See Now - Archive-It Partner Meeting 2013 2013
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
Digital Preservation 2013
Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
An Extensible Framework for Creating Personal Web Archives of Content Behind ...
The Revolution Will Not Be Archived
WARCreate - Create Wayback-Consumable WARC Files from Any Webpage
NDIIPP/NDSA 2011 - YouTube Link Restoration
NDIIPP/NDSA 2011 - Archive Facebook


JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Public Web Archives

  • 1. A Framework for Aggregating Private and Public Web Archives. Mat Kelly, Old Dominion University, Norfolk, VA. Advisor: Michele C. Weigle. JCDL 2015 Doctoral Consortium, June 21, 2015.
  • 2. The Problem [diagram labels: private archive, private archive, other private archive, other private archive]
  • 3. All Archives Cannot Be Aggregated [diagram labels: private archive, private archive, other private archive, other private archive, TimeMap]
  • 11. [figure labels: t = k ≠ t = k-1]
  • 14. [timeline figure: 1 year ago, 2 years ago, 10 years ago, …, 180 years ago; TimeMap]
  • 15. [diagram label: private archive]
  • 16. Proactive Preservation
      • Just-in-time WARC creation
      • Personal and potentially private web archiving
      • Mitigates deferral problem
  • 17. Public vs. Private Web Archiving
      • Public Web Archiving
          – Relies on deferred capture
          – Uses WARC, Memento, etc.
          – Integrates with other public web archives
      • Private Web Archiving
          – Same tools, less overhead, less bureaucracy
          – Uses WARC, Memento, etc.
          – Does not integrate
  • 18. Typical Web Archive Access
      1. Web User Interface
      2. Memento TimeGate (URI-G) and TimeMap
          – Accept-Datetime (content negotiation)
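As a minimal sketch of the second access route, the snippet below prepares (but does not send) a TimeGate request carrying an Accept-Datetime header. The TimeGate base URI and the function name are hypothetical; a real archive publishes its own URI-G.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate base; each Memento-compliant archive has its own URI-G.
TIMEGATE_BASE = "http://archive.example.org/timegate/"


def timegate_request(uri_r: str, when: datetime) -> tuple[str, dict[str, str]]:
    """Return the URI-G and headers for datetime negotiation on uri_r."""
    # Accept-Datetime carries an HTTP-date expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_BASE + uri_r, {"Accept-Datetime": http_date}


uri_g, headers = timegate_request(
    "https://www.facebook.com/",
    datetime(2015, 3, 5, 0, 1, 1, tzinfo=timezone.utc),
)
print(uri_g)
print(headers["Accept-Datetime"])
```

The TimeGate would normally answer with a redirect to the capture closest to the requested datetime; that network step is left out here on purpose.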
  • 19. Aggregating Multiple Web Archives
      • Memento Aggregator
          – Temporally sorted TimeMap combined from multiple archives
          – Allows temporal gaps in one archive to be filled in by another
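The temporal merge can be sketched as follows, assuming each archive already returns its TimeMap sorted by capture datetime. The `Memento` tuple, the `aggregate` function, and the example URIs are illustrative names, not part of any existing aggregator.

```python
import heapq
from datetime import datetime
from typing import Iterable, NamedTuple


class Memento(NamedTuple):
    when: datetime  # capture datetime
    uri_m: str      # URI of the capture


def aggregate(*timemaps: Iterable[Memento]) -> list[Memento]:
    """Merge per-archive TimeMaps (each sorted by datetime) into one."""
    return list(heapq.merge(*timemaps, key=lambda m: m.when))


# Two hypothetical archives; the second fills a gap in the first.
archive_a = [
    Memento(datetime(2015, 2, 28, 15, 57, 3), "http://archive-a.example/1"),
    Memento(datetime(2015, 3, 10, 14, 7, 21), "http://archive-a.example/2"),
]
archive_b = [
    Memento(datetime(2015, 3, 5, 0, 1, 1), "http://archive-b.example/1"),
]

combined = aggregate(archive_a, archive_b)
for memento in combined:
    print(memento.when.isoformat(), memento.uri_m)
```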
  • 20. Archive Supplementation
      • More captures → greater temporal coverage
      • Content on Deep Web
      • A large chunk of the Web is not preserved
          – Tools’ inability
          – Inconsistency over time due to personalization
  • 21. Concerns in Aggregating Private Web Archives
      • Privacy
      • Inconsistency of page representation
          – URI is insufficient key for access
      • Archival integrity
          – Has the private archive's content been manipulated?
  • 22. Why Individuals Might Want Personalized Aggregations
      • Show my private web archive captures
      • Concerned about exposing sensitive info to public
          – But still want to view temporally inline
      • Private/Restricted Archives are becoming ever more common
  • 23. Temporal Supplementation
  • 24. My Archives Have What They May Have Missed
  • 25. The Concerns Distilled
      • Access Control
          – And indicators for PWA
      • Preservation of Private Content
      • Interoperability without privacy compromise
  • 26. Web Archive Usage Pattern 1: Direct Access [diagram labels: OR, TimeMap]
  • 27. Web Archive Usage Pattern 2: Web Archive Aggregation
      • Better results for a URI due to more sources for capture
  • 28. Previous Patterns: Status Quo
      • Patterns 1 and 2 are status quo
          – provided by framework
      • Querying web archives currently only considers public web content
          – URI for lookup
      • Framework introduces 2 new entities
          – Memento Meta Aggregator (MMA)
          – Private Web Archive Adapter (PWAA)
  • 29. Memento Meta Aggregator (MMA)
      • Functional superset of a Memento Aggregator (MA)
      • Can act as intermediary client to relay MA results to ultimate user
      • Allows just-in-time (JIT) inclusion of archives
          – as specified at query time
      • Set of archives aggregated can be dynamic
          – e.g., Results must not include IA
  • 30. Aggregating My Captures [diagram: various public web archives; my web archives: MY CNN CAPTURES, MY BANK CAPTURES]
  • 31. The Current Memento Aggregator [diagram: MY CNN CAPTURES, MY BANK CAPTURES; capture counts: 100, 30, 10]
  • 32. Accessing the Aggregator [diagram: MY CNN CAPTURES, MY BANK CAPTURES; capture counts: 100, 30, 10]
  • 33. Accessing the Aggregator …does not include our archives [diagram: MY CNN CAPTURES and MY BANK CAPTURES marked NOT AGGREGATED; capture counts: 100, 30, 10; total: 140]
  • 34. Pattern 3: Aggregator Relay. Access via the Meta Aggregator [diagram: MY CNN CAPTURES, MY BANK CAPTURES; capture counts: 100, 30, 10; totals: 140, 140]
  • 35. Web Archive Usage Pattern 4: Including Additional Archives in Aggregation. Access via the Meta Aggregator …allows our archives to be included [diagram: MY CNN CAPTURES, MY BANK CAPTURES; capture counts: 100, 30, 10, 15; totals: 140, 155]
  • 36. MMAs Allow Our Public Captures to be Shared [diagram: MY CNN CAPTURES, MY BANK CAPTURES; capture counts: 100, 30, 10, 15; totals: 140, 155, 155, 155]
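The arithmetic in slides 33 to 36 can be sketched as a query-time choice of archive set. The archive names and the `aggregate_counts` function are hypothetical; only the counts (100, 30, 10, 15) and totals (140, 155) come from the slides.

```python
def aggregate_counts(default_archives, include=None, exclude=()):
    """Total captures for the archive set chosen at query time.

    default_archives: capture counts of archives the MA already knows.
    include: extra archives added just in time for this query.
    exclude: archive names the requester does not want in the result.
    """
    selected = {
        name: count
        for name, count in default_archives.items()
        if name not in exclude
    }
    selected.update(include or {})
    return sum(selected.values())


# Counts from the slides; the names are placeholders.
public = {"public-archive-1": 100, "public-archive-2": 30, "public-archive-3": 10}

relay_total = aggregate_counts(public)                         # Pattern 3
with_mine = aggregate_counts(public, {"my-public-captures": 15})  # Pattern 4
without_first = aggregate_counts(public, exclude={"public-archive-1"})

print(relay_total, with_mine, without_first)
```

Passing `exclude` models the "results must not include IA" case from slide 29, under the assumption that exclusion is expressed by archive name.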
  • 37. Web Archive Usage Pattern 5: Recursive MMA Access [diagram labels: MY CNN CAPTURES; MY BANK CAPTURES; Bob’s public captures; the organization’s public captures 1; the organization’s public captures 2; nested sets "contains A B C D", "contains B C D", "contains C D"; counts: 10, 5, 15, 15, 20, 35, 35, 15, 50, 50]
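One way to picture recursive access is an MMA that queries its own archives plus any upstream MMAs. This is a sketch under that assumption; the class, its fields, and the cycle guard are illustrative, and the sets A to D mirror the "contains" labels on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class MetaAggregator:
    name: str
    # Archives this MMA queries directly.
    archives: list[str] = field(default_factory=list)
    # Other MMAs whose results this MMA relays.
    upstream: list["MetaAggregator"] = field(default_factory=list)

    def resolve(self, _seen=None) -> list[str]:
        """Return every archive reachable from this MMA, without duplicates."""
        seen = set() if _seen is None else _seen
        if self.name in seen:  # guard against MMAs that reference each other
            return []
        seen.add(self.name)
        result = list(self.archives)
        for mma in self.upstream:
            for archive in mma.resolve(seen):
                if archive not in result:
                    result.append(archive)
        return result


inner = MetaAggregator("inner", ["C", "D"])
middle = MetaAggregator("middle", ["B"], [inner])
outer = MetaAggregator("outer", ["A"], [middle])

print(outer.resolve())
print(middle.resolve())
print(inner.resolve())
```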
  • 38. New Framework Entity 1: Memento Meta Aggregator
      • Allow dynamic and JIT set of archives
      • Superset can be recursively constructed
      • Sets can be shared
      My public captures can be integrated with public web archives’
  • 39. Private Web Archive Adapter (PWAA)
      • Regulates access to Private Web Archives (PWAs)
      • Acts as token authorizer
      • With credentials OK, relays results as if querying the PWA directly
  • 40. User Establishes Access with PWA [message: GET TOKEN for PWA, Key: abcd1234; diagram: MY CNN CAPTURES, MY BANK CAPTURES; capture counts: 100, 30, 10, 3 captures, 10,000 captures]
  • 41. MMA Relays Request [message: GET TOKEN for PWA, Key: abcd1234]
  • 42. PWAA Accepts Request, Generates Reusable Token [message: ACCESS OK, Token: 4f33c64]
  • 43. User Submits Request for URI-R with Token [message: GET mementos for URI, Token: 4f33c64]
  • 44. MMA Relays Request (again) [message: GET mementos for URI, Token: 4f33c64]
  • 45. PWAA Verifies & Relays Request; MA Gets Mementos, per usual [messages: Token: 4f33c64 OK; GET mementos for URI; GET mementos for URI]
  • 46. Archives Return Mementos [messages: Token: 4f33c64 OK; Returning mementos; Return mementos for URI]
  • 47. PWAA Relays TimeMap [diagram: three TimeMaps; capture counts: 100, 30, 10 (140 captures), 3 captures, 10,000 captures; total: 10,143]
  • 48. MMA Annotates and Aggregates [diagram: three TimeMaps; 140 captures, 3 captures, 10,000 captures; total: 10,143]
  • 49. Web Archive Usage Pattern 6: Aggregating Public & Private Archives [diagram: one TimeMap; capture counts: 100, 30, 10, 3 captures, 10,000 captures; total: 10,143 captures]
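The token exchange in slides 40 to 49 can be sketched as a small adapter class. Everything here is an assumption for illustration: the class and method names, the use of random hex tokens, and the in-memory lookup standing in for the private archive. Only the key `abcd1234` and the count of 3 captures are taken from the slides.

```python
import secrets


class PrivateWebArchiveAdapter:
    """Hypothetical PWAA: issues reusable tokens and relays TimeMap lookups."""

    def __init__(self, valid_keys, lookup):
        self._valid_keys = set(valid_keys)
        self._lookup = lookup  # callable: URI-R -> list of capture URIs
        self._tokens = set()

    def get_token(self, key):
        """Exchange a key for a reusable token, or None if the key is unknown."""
        if key not in self._valid_keys:
            return None
        token = secrets.token_hex(4)
        self._tokens.add(token)
        return token

    def get_mementos(self, uri_r, token):
        """Relay the lookup as if querying the PWA directly, if the token is valid."""
        if token not in self._tokens:
            return []  # this sketch answers with 0 mementos rather than an error
        return self._lookup(uri_r)


# A stand-in private archive holding 3 captures of one URI-R.
captures = {"https://bank.example/": ["capture-1", "capture-2", "capture-3"]}
pwaa = PrivateWebArchiveAdapter({"abcd1234"}, lambda uri: captures.get(uri, []))

token = pwaa.get_token("abcd1234")
granted = pwaa.get_mementos("https://bank.example/", token)
denied = pwaa.get_mementos("https://bank.example/", "not-a-token")
print(len(granted), len(denied))
```

In the framework the MMA sits between the user and the PWAA and relays both calls; the relay is omitted here to keep the sketch short.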
  • 50. Regulated Access Can Be Shared [messages: GET mementos for URI, Token: 4f33c64; GET mementos for URI, Token: c5463b4; GET TOKEN for PWA, Key: 2265eef3; No/invalid token returned; Access denied or 0 mementos; diagram: MY CNN CAPTURES, MY BANK CAPTURES; 3 captures, 10,000 captures]
  • 51. Aggregating Multiple PWAs [message: GET TOKENs for PWAs; Key: abcd1234, Archive: My; Key: cab45cbf, Archive: Linda; Key: b0b01b, Archive: Bob; diagram: MY BANK CAPTURES, Linda’s Private Captures, Bob’s Private Captures; 3 captures, 5 captures, 10 captures]
  • 52. Aggregating Multiple PWAs [responses: Access OK, Token: 7790ca; Access OK, Token: b0b01b; ACCESS DENIED; diagram: MY BANK CAPTURES, Linda’s Private Captures, Bob’s Private Captures; 3 captures, 5 captures, 10 captures]
  • 53. PWAs Can Then be Aggregated [message: GET mementos for URI; Token: 7790ca, Archive: My; Token: null, Archive: Linda; Token: b0b01b, Archive: Bob; diagram: MY BANK CAPTURES, Linda’s Private Captures, Bob’s Private Captures; 3 captures, 5 captures, 10 captures; returned: 3, 10, ø; total: 13]
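The multi-archive case can be sketched as one request carrying a token per private archive, where an archive with a missing or wrong token contributes nothing. The function and dictionary names are hypothetical, and the assignment of 3 captures to "My" and 10 to "Bob" is an assumption read off the label order in the diagram; the tokens and the total of 13 are from the slides.

```python
# Capture counts per private archive (assignment to names is assumed).
ARCHIVE_COUNTS = {"My": 3, "Linda": 5, "Bob": 10}

# Tokens each PWAA issued earlier; Linda's PWAA denied access.
ISSUED_TOKENS = {"My": "7790ca", "Bob": "b0b01b"}


def aggregate_private(request_tokens):
    """Return per-archive capture counts for archives whose token checks out."""
    results = {}
    for archive, token in request_tokens.items():
        if token is not None and ISSUED_TOKENS.get(archive) == token:
            results[archive] = ARCHIVE_COUNTS[archive]
        else:
            results[archive] = 0  # denied archives add no mementos
    return results


results = aggregate_private({"My": "7790ca", "Linda": None, "Bob": "b0b01b"})
total = sum(results.values())
print(results, total)
```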
  • 54. Sample TimeMap
      ...
      , <http://web.archive.org/web/20150228155703/https://facebook.com/>;rel="memento"; datetime="Sat, 28 Feb 2015 15:57:03 GMT"
      , <http://web.archive.org/web/20150228163939/http://www.facebook.com/>;rel="memento"; datetime="Sat, 28 Feb 2015 16:39:39 GMT"
      , <http://web.archive.org/web/20150303162841/https://www.facebook.com/>;rel="memento"; datetime="Tue, 03 Mar 2015 16:28:41 GMT"
      , <http://users2machine.local/web/20150305000101/https://www.facebook.com/>;rel="memento"; datetime="Thu, 05 Mar 2015 00:01:00 GMT"; key="e395935019ee467c797034ee410cc91e"
      , <//wayback.archive-it.org/all/20150305215922/https://facebook.com/>;rel="memento"; datetime="Thu, 05 Mar 2015 21:59:22 GMT"
      , <http://previouslyUnaggregated.org/web/20150306123457/https://www.facebook.com/>;rel="memento"; datetime="Fri, 06 Mar 2015 12:34:57 GMT"
      , <http://web.archive.org/web/20150310140721/https://www.facebook.com/>;rel="memento"; datetime="Tue, 10 Mar 2015 14:07:21 GMT"
      ...
  • 55. Access Token Included in TimeMap [same TimeMap as slide 54; callout: MY PRIVATE FACEBOOK CAPTURES, the users2machine.local entry carrying key="e395935019ee467c797034ee410cc91e"]
  • 56. My Public Web Archive, Now Aggregated [same TimeMap as slide 54; callout: MY PUBLIC FACEBOOK CAPTURES]
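A client could pick out the annotated private entries with a small parser. This is a sketch that handles only entries shaped like the sample TimeMap (URI, `rel`, `datetime`, optional `key`); the regular expression and function name are illustrative, and treating a `key` attribute as the marker of a private capture is an assumption drawn from slide 55.

```python
import re
from email.utils import parsedate_to_datetime

# Matches one entry: <URI>;rel="..."; datetime="..." with an optional key="...".
ENTRY = re.compile(
    r'<(?P<uri>[^>]*)>\s*;\s*rel="(?P<rel>[^"]*)"\s*;\s*'
    r'datetime="(?P<datetime>[^"]*)"(?:\s*;\s*key="(?P<key>[^"]*)")?'
)


def parse_timemap(text):
    """Return a list of dicts, one per memento entry found in text."""
    entries = []
    for match in ENTRY.finditer(text):
        entries.append({
            "uri": match["uri"],
            "rel": match["rel"],
            "datetime": parsedate_to_datetime(match["datetime"]),
            "key": match["key"],  # None when the entry has no key attribute
        })
    return entries


SAMPLE = '''
<http://web.archive.org/web/20150303162841/https://www.facebook.com/>;rel="memento"; datetime="Tue, 03 Mar 2015 16:28:41 GMT"
, <http://users2machine.local/web/20150305000101/https://www.facebook.com/>;rel="memento"; datetime="Thu, 05 Mar 2015 00:01:00 GMT"; key="e395935019ee467c797034ee410cc91e"
'''

entries = parse_timemap(SAMPLE)
private = [entry for entry in entries if entry["key"] is not None]
print(len(entries), private[0]["uri"])
```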
  • 57. Evaluation Plan
      • How effective is the Framework?
      • Scalability ramifications of additional infrastructure?
      • Is public-private tokenization most suitable method for persistent access?
      • How can a single archive be sub-divided between private/public and access controlled?
  • 58. Previous Work
      PUBLICATIONS
      • Preservation and Replay
          – PDA 2013 - Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving
          – JCDL 2012 - WARCreate - Create Wayback-Consumable WARC Files from Any Webpage
      • Evaluating Capture
          – IJDL 2015 - Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources
          – IJDL 2015 - The Impact of JavaScript on Archivability
          – JCDL 2014 - Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources
          – JCDL 2014 - The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript
          – D-Lib 2013 - A Method for Identifying Personalized Representations in the Archives
          – TPDL 2013 - On the Change in Archivability of Websites Over Time
      • Archival Integration
          – JCDL 2015 - Mobile Mink: Merging Mobile and Desktop Archived Webs
          – JCDL 2014 - Mink: Integrating the Live and Archived Web Viewing Experience Using Web Browsers and Memento
      SOFTWARE PRODUCTS
      • WARCreate – preserve from the browser
      • WAIL – private web archiving all-in-one suite
      • Mink – Integrate the live and archived web
  • 59. Current Work
      • Other approaches of archival lookup beyond URI
      • Appropriate metadata to indicate private web content in WARC files
      • Existing integration attempts by private web archives & individuals
  • 60. Dissertation Plan
      ✓ Background Research
      ✓ PhD Requirements (Coursework, Qualifying Exam, etc.)
      ✓ Build preliminary framework model
      ✓ JCDL Doctoral Consortium
      EXTENDED RESEARCH
      • Research prevalence of private web archives
      • Research access control methods in web archiving and other domains
      • Investigate other access patterns and expound on those defined
      • PhD Candidacy Exam describing merit of research plan
      • Implement feedback received from candidacy exam committee
      • Programmatically implement MMA and PWAA
      CASE STUDIES (real-world application)
      • Publicly Available Non-Aggregated Archives (e.g., Rhizome)
      • Deep web preservation/access (bank account/Facebook feeds)
      • DISSERTATION DEFENSE
  • 61. Preliminary Publication Plan
      • JCDL 2016: Evaluation of User Access Patterns for Private Web Archives
      • TPDL 2016: Methods in adding JIT Inclusion of Private Web Archives in Memento
      • ACM SACMAT*: Research exploring tokenization and similar methods for archival access establishment
      • iPres 2016: Research investigating URI clash & other needed identifiers for distinguishing archived content from the “deep web” from archived content from the public live web
      * Symposium on Access Control Models and Technologies
  • 62. Future Research Questions
      • Can a PWAA perform content negotiation[1] on the private-public spectrum?
      • What level of security is needed?
          – e.g., reporting UNAUTHORIZED vs. 0 mementos
      [1] RFC2295 https://www.ietf.org/rfc/rfc2295.txt
  • 63. Summation
      • Why?
          – No means exists to integrate private and public web archives.
      • How to Evaluate?
          – Does this framework fit real world needs? Scalable?
      • When will I know I am done?
          – Any public/private web archive* can be integrated.
      * -compliant