Warcbase
Building a Scalable Platform
on HBase and Hadoop
Part Two: Historian Use Case
Jimmy Lin
University of Maryland
College Park, MD
Ian Milligan
University of Waterloo
Waterloo, ON Canada
Why should a historian care?
The sheer amount of social,
cultural, and political
information generated every
day presents new
opportunities for historians.
Could one even study the 1990s and beyond without web archives?
No.
Historians need to do this now, or
we’re going to be left behind.
Nightmare Scenario
• Wayback Machine won’t be enough. We won’t use that.
• Historians rely uncritically on date-ordered keyword
search results, putting them at the mercy of search
algorithms they do not understand;
• Historians are completely left out of post-1996
research, letting everybody else do the work (à la the
Culturomics project/Nature magazine article);
• Our profession gets left behind…
Unlocking an Archive-It
Collection
• Archive-It has amazing collections of social,
cultural, political, and economic records generated
by everyday people, leaders, businesses,
academics, and beyond.
• Stories waiting to be told.
• The data is there, but the problem is access.
Example Dataset
• Archive-It Collection 227,
Canadian Political Parties and
Political Interest Groups
(University of Toronto)
• October 2005 - Present
• All major and minor political
parties, as well as organized
political interest groups (Council
of Canadians, Coalition to
Oppose the Arms Trade,
Assembly of First Nations, etc.)
• Started by now-retired librarian,
hard to get details on seed list
Two Main Approaches
• Warcbase
• Link extraction and analytics
• Full-text extraction and analytics
• Full-text faceted search
• UK Web Archive’s Shine Solr front end
Using Warcbase to
analyze links and full-text
Basic Link Statistics
• Count number of pages per domain
• Count number of links for each crawl so they can
be normalized (very important)
• Run on the command line using relatively simple Pig
scripts
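Why normalization matters can be sketched in a few lines of Python. This is only an illustration, not part of Warcbase: the function name, the domain `example-party.ca`, and all figures below are hypothetical. It assumes you already have per-crawl link totals (as the counting script produces) and per-crawl counts for a particular source/target pair.

```python
def normalize_link_counts(link_counts, crawl_totals):
    """Express each (crawl, source, target) count as a share of all links in that crawl."""
    shares = {}
    for (crawl, source, target), count in link_counts.items():
        total = crawl_totals.get(crawl, 0)
        if total > 0:  # skip crawls with no recorded total
            shares[(crawl, source, target)] = count / total
    return shares


# Hypothetical figures, for illustration only.
crawl_totals = {"200810": 100000, "200902": 20000}
link_counts = {
    ("200810", "example-party.ca", "facebook.com"): 2000,
    ("200902", "example-party.ca", "facebook.com"): 1000,
}

shares = normalize_link_counts(link_counts, crawl_totals)
for key in sorted(shares):
    print(key, f"{shares[key]:.1%}")
```

In this made-up case the raw count halves between crawls, yet the share of links rises from 2% to 5%, because the later crawl was much smaller. Comparing raw counts across crawls of different sizes would suggest the opposite trend.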
Example Script (counting
number of links for each crawl)
register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

DEFINE ArcLoader org.warcbase.pig.ArcLoader();
DEFINE ExtractLinks org.warcbase.pig.piggybank.ExtractLinks();

-- Load every record from the collection's ARC files
raw = load '/shared/collections/CanadianPoliticalParties/arc/' using ArcLoader as
  (url: chararray, date: chararray, mime: chararray, content: bytearray);

-- Keep HTML pages that carry a crawl date
a = filter raw by mime == 'text/html' and date is not null;
-- Truncate the date to YYYYMM and emit one row per extracted link
b = foreach a generate SUBSTRING(date, 0, 6) as date, url,
  FLATTEN(ExtractLinks((chararray) content, url));
-- Group by crawl month and count the links in each group
c = group b by $0;
d = foreach c generate group, COUNT(b);
Social Media Appearances -
Twitter
(20080611220246,http://creativecommons.org/,twitter)
(20080711224545,http://www.pm.gc.ca/eng/feature.asp?pageId=105,twitter)
(20080712030632,http://www.pm.gc.ca/fra/feature.asp?pageId=105,twitter)
(20080712142357,http://www.pm.gc.ca/eng/media.asp?category=2&id=1814,twitter)
(20080930221618,http://www.ndp.ca/home,twitter)
(20080930221618,http://www.ndp.ca/home,twitter)
(20080930221638,http://www.liberal.ca/default_e.aspx,twitter)
(20080930221641,http://www.liberal.ca/story_15081_e.aspx,twitter)
(20080930221714,http://www.liberal.ca/video_e.aspx,twitter)
(20080930221903,http://www.ndp.ca/page/5246,twitter)
(20080930221904,http://www.ndp.ca/twitterblogwidget/ndp-twitter.php?lang=en,twitter)
(20080930222049,http://greenparty.ca/en/action,twitter)
(20080930222124,http://www.ndp.ca/bloggingtools,twitter)
(20080930222825,http://greenparty.ca/en/campaign/35053,twitter)
(20080930223014,http://greenparty.ca/en/campaign/35068,twitter)
(20080930223240,http://www.liberal.ca/depth_e.aspx,twitter)
(20080930223258,http://www.liberal.ca/enews_e.aspx,twitter)
(20080930223315,http://www.liberal.ca/glance_e.aspx,twitter)
(20080930223320,http://www.liberal.ca/story_15073_e.aspx,twitter)
(20080930223323,http://www.liberal.ca/gallery_e.aspx,twitter)
Social Media Appearances -
Facebook
(20070418135140,http://www.liberal.ca/glance_e.aspx,facebook)
(20070418135947,http://greenparty.ca/en/blog/activemenu/menu?page=2,facebook)
(20070418140056,http://greenparty.ca/en/blog/activemenu/book?page=2,facebook)
(20070418140511,http://greenparty.ca/en/blog/popular?page=3,facebook)
(20070418140516,http://www.liberal.ca/glance_f.aspx,facebook)
(20070418141139,http://greenparty.ca/en/blog/431,facebook)
(20070418141930,http://greenparty.ca/en/blog?page=2,facebook)
(20070418143749,http://greenparty.ca/en/node/1280,facebook)
(20070418143900,http://greenparty.ca/en/blog/activemenu/activemenu/book?page=2,facebook)
(20070418144002,http://greenparty.ca/en/blog/activemenu/activemenu/menu?page=2,facebook)
(20070418151727,http://www.equalvoice.ca/youth/,facebook)
(20070418151734,http://www.equalvoice.ca/youth/index.htm,facebook)
(20070418151843,http://www.equalvoice.ca/youth/Bios.htm,facebook)
(20070418153832,http://greenparty.ca/fr/node/1280,facebook)
(20070418154008,http://greenparty.ca/en/blog/activemenu/activemenu/activemenu/menu?page=2,facebook)
(20070418154112,http://greenparty.ca/en/blog/activemenu/activemenu/activemenu/book?page=2,facebook)
(20070518134656,http://www.liberal.ca/glance_e.aspx,facebook)
(20070518134918,http://www.liberal.ca/conversation_e.aspx,facebook)
(20070518134918,http://www.liberal.ca/conversation_e.aspx,facebook)
(20070518134941,http://www.ndp.ca/page/4733,facebook)
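One way to work with tuples of this shape, (timestamp, URL, platform), is to find the earliest capture in which each site mentions a platform. The Python sketch below is an assumption about how one might post-process such output, not something shown in the slides. The function name and regular expression are hypothetical, and the sample rows are copied from the Facebook list.

```python
import re
from urllib.parse import urlparse

# Matches lines such as (20070418135140,http://www.liberal.ca/glance_e.aspx,facebook)
TUPLE_RE = re.compile(r"^\((\d{14}),(.+),(\w+)\)$")


def first_appearances(lines):
    """Map (host, platform) to the earliest 14-digit capture timestamp seen."""
    first = {}
    for line in lines:
        match = TUPLE_RE.match(line.strip())
        if not match:
            continue  # ignore lines that are not well-formed tuples
        timestamp, url, platform = match.groups()
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        key = (host, platform)
        # Fixed-width timestamps compare correctly as strings.
        if key not in first or timestamp < first[key]:
            first[key] = timestamp
    return first


sample = [
    "(20070418135140,http://www.liberal.ca/glance_e.aspx,facebook)",
    "(20070418135947,http://greenparty.ca/en/blog/activemenu/menu?page=2,facebook)",
    "(20070518134656,http://www.liberal.ca/glance_e.aspx,facebook)",
    "(20070518134941,http://www.ndp.ca/page/4733,facebook)",
]

for key, ts in sorted(first_appearances(sample).items()):
    print(key, ts)
```

The result is the first appearance within the archive's captures, which depends on the crawl schedule. It is an upper bound on when a site began mentioning a platform, not the actual date.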
Link Analysis
• Extracting links by domain (tab-separated values):
200810	conservative.ca	digg.com	2325
200810	conservative.ca	facebook.com	2325
200810	conservative.ca	mycampaign.conservative.ca	7902
[..]
200902	liberal.ca	ctv.ca	16
200902	liberal.ca	del.icio.us	1118
200902	liberal.ca	digg.com	1118
Other Cases
• Extracting all links to the mainstream media, or
thinktanks, or other political parties
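A case like this could be handled by filtering the tab-separated link output against a list of target domains. The Python sketch below assumes four columns (crawl month, source domain, target domain, link count), which is how the Link Analysis rows appear to be laid out. The `MEDIA_DOMAINS` set and the function name are hypothetical; a real study would need a curated domain list.

```python
import csv
import io
from collections import Counter

# Hypothetical category list, for illustration only.
MEDIA_DOMAINS = {"ctv.ca", "cbc.ca", "theglobeandmail.com"}


def links_to_category(tsv_text, category_domains):
    """Sum link counts per (crawl, source) for targets in the given category."""
    totals = Counter()
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if len(row) != 4:
            continue  # skip elided or malformed rows such as "[..]"
        crawl, source, target, count = (field.strip() for field in row)
        if target in category_domains:
            totals[(crawl, source)] += int(count)
    return totals


# Rows copied from the Link Analysis slide.
sample = (
    "200902\tliberal.ca\tctv.ca\t16\n"
    "200902\tliberal.ca\tdel.icio.us\t1118\n"
    "200902\tliberal.ca\tdigg.com\t1118\n"
)

print(links_to_category(sample, MEDIA_DOMAINS))
```

The same function works for think tanks or other political parties by swapping in a different domain set.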
2005 Canadian Federal Election
Text Analysis
register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

DEFINE ArcLoader org.warcbase.pig.ArcLoader();
DEFINE ExtractRawText org.warcbase.pig.piggybank.ExtractRawText();
DEFINE ExtractTopLevelDomain org.warcbase.pig.piggybank.ExtractTopLevelDomain();

-- Load every record from the collection's ARC files
raw = load '/shared/collections/CanadianPoliticalParties/arc/' using ArcLoader as
  (url: chararray, date: chararray, mime: chararray, content: bytearray);

-- Keep HTML pages that carry a crawl date
a = filter raw by mime == 'text/html' and date is not null;
-- Truncate the date to YYYYMM and reduce each URL to its domain without a leading "www."
-- (the backslashes in the pattern were lost on the slide; restored here)
b = foreach a generate SUBSTRING(date, 0, 6) as date,
  REPLACE(ExtractTopLevelDomain(url), '^\\s*www\\.', '') as url, content;
-- Restrict to a single domain and strip the markup
c = filter b by url == 'greenparty.ca';
d = foreach c generate date, url, ExtractRawText((chararray) content) as text;
store d into 'cpp.text-greenparty';
Text Analysis
• Now have a circumscribed corpus for a specified
query (e.g. liberal.ca, ndp.ca, or conservative.ca)
• Can now use standard text analysis tools, etc. to
extract meaning
• LDA (topic modeling)
• NER (named entity recognition)
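Before handing the extracted text to an LDA or NER tool, it usually helps to group it by crawl month. The Python sketch below assumes the stored output is tab-separated (Pig's default storage format) with three fields: month, domain, and text. It also assumes each record sits on one line, which may not hold if the extracted text contains newlines. The function name and sample records are hypothetical.

```python
from collections import defaultdict


def corpus_by_month(lines):
    """Group tab-separated (month, domain, text) records into one text blob per month."""
    texts = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3:
            continue  # skip malformed records
        month, _domain, text = parts
        if text.strip():
            texts[month].append(text.strip())
    return {month: " ".join(chunks) for month, chunks in texts.items()}


# Hypothetical records, for illustration only.
sample = [
    "200510\tgreenparty.ca\tFirst page text\n",
    "200510\tgreenparty.ca\tSecond page text\n",
    "200811\tgreenparty.ca\tLater page text\n",
    "broken line without tabs\n",
]

for month, text in sorted(corpus_by_month(sample).items()):
    print(month, "->", text)
```

Each monthly blob can then be passed to whichever topic-modeling or entity-recognition tool you prefer, giving per-month results like the NER counts on the following slides.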
NER
October 2005
  62476 Stephen Harper
  30234 Michael Chong
  30109 Gwynne Dyer
  28011 ami Entrez
  26238 Paul Martin
  22303 Harper
NER
November 2008
  3188 Stéphane Dion
  2557 Stephen Harper
  2471 Stephen HarperLaureen
  2410 Dion
  2356 Harper
Visualizing Interface
Next Step?
Shine
• UK Web Archive’s Shine
(https://github.com/ukwa/shine)
• Indexing as bottleneck
• ~ 250GB of WARCs takes ~
5 days on a single machine
• Hadoop indexer available if
data in HDFS
• ~ 90GB index size
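The figures above allow a rough back-of-the-envelope estimate for other collection sizes. The Python sketch below assumes indexing time and index size scale linearly with the volume of WARCs on a comparable single machine, which is a simplification. The function name and the 1,000 GB example are hypothetical.

```python
def estimate_shine_indexing(collection_gb, ref_gb=250.0, ref_days=5.0, ref_index_gb=90.0):
    """Scale the reported ~250 GB / ~5 days / ~90 GB figures linearly to another size."""
    days = collection_gb * ref_days / ref_gb
    index_gb = collection_gb * ref_index_gb / ref_gb
    return days, index_gb


# Hypothetical 1,000 GB collection.
days, index_gb = estimate_shine_indexing(1000)
print(f"~{days:.0f} days of indexing, ~{index_gb:.0f} GB index")
```

Under these assumptions the reference rate is about 50 GB of WARCs per day, with an index roughly 36% the size of the input. This is why indexing becomes the bottleneck for larger collections without the Hadoop indexer.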
Examples
Shine
• Advantages: accessible to the general public,
easy to use, interactive trend diagram allows
digging down for context, can move down to level
of document itself.
• Disadvantage: keyword searching requires you
know what to look for; random sampling misleading
when tens of thousands of records; etc.
• Doesn’t take advantage of what makes web
sources so powerful: hyperlinks
Building connections
between Warcbase and
Shine
Conclusions &
Thanks
Jimmy Lin
University of Maryland
College Park, MD
Ian Milligan
University of Waterloo
Waterloo, ON Canada

