Ancient History of the UK Web
With support from and thanks to Ning Wang and Adham Tamer
Josh Cowls, Scott A. Hale, Helen Margetts,
Eric T. Meyer, Ralph Schroeder, Taha Yasseri
Past Web Archive Activities at OII
• 2008-2009. JISC/NEH Transatlantic Digitisation Collaboration: World Wide Web of
Humanities (Jisc & NEH funded)
– OII, Internet Archive, Hanzo Archives
– Meyer, E.T., Carpenter, K., Middleton, M. (2009). World Wide Web of Humanities: Final
Report to JISC. Online:
http://www.jisc.ac.uk/media/documents/programmes/digitisation/humanitiesfinalreport.pdf
• 2010. Researcher Engagement with Web Archives (Jisc funded)
– OII, VKS
– Dougherty, M., Meyer, E.T., Madsen, C., van den Heuvel, C., Thomas, A., Wyatt, S. (2010).
Researcher Engagement with Web Archives: State of the Art. London: JISC. Online:
http://ssrn.com/abstract=1714997 and http://ie-repository.jisc.ac.uk/544/
– Thomas, A., Meyer, E.T., Dougherty, M., van den Heuvel, C., Madsen, C., Wyatt, S. (2010).
Researcher Engagement with Web Archives: Challenges and Opportunities for
Investment. London: JISC. Online: http://ssrn.com/abstract=1715000 and http://ie-repository.jisc.ac.uk/543/
– Dougherty, M., Meyer, E.T. (2014). Community, Tools, and Practices in Web Archiving:
The state of the art in relation to social science and humanities research needs. Journal
of the American Society of Information Science & Technology.
http://onlinelibrary.wiley.com/doi/10.1002/asi.23099/abstract
• 2011. Using Web Archives: A Futures Perspective (IIPC funded)
– OII
– Meyer, E.T., Thomas, A.J., Schroeder, R. (2011). Web Archives: The Future(s). London:
IIPC. Online: http://ssrn.com/abstract=1830025
Recent Web Archive Activities at OII
• 2013-2015: Jisc Big Data project (Jisc funded)
– OII, British Library
– Prepare and release hyperlink corpus
• 2014-2015: Big UK Domain Data for the Arts and Humanities (AHRC
funded)
– IHR, OII, British Library
– Supporting researchers in Arts & Humanities to use web archive data
– Producing edited book of empirical studies concerning the history of
the UK web
• First paper from these combined projects
– Hale, S.A., Yasseri, T., Cowls, J., Meyer, E.T., Schroeder, R., Margetts, H.
(2014, July). Mapping the UK webspace: Fifteen years of British
universities on the web. ACM WebSci’14, Bloomington, Indiana.
http://papers.ssrn.com/abstract=2435481 or
http://arxiv.org/abs/1405.2856
Big Data:
Demonstrating the Value of the UK Web Domain Dataset
for Social Science Research
This project aims to enhance JISC's UK Web
Domain archive, a 30 TB archive of the .uk
country-code top level domain collected from
1996 to 2010. It will extract link graphs from the
data and disseminate social science research
using the collection.
February 2012 - February 2014
Taming a mammoth:
Web Archive Dataset Preparation
30 TB compressed data
6.2 TB metadata and links
2.5 TB temporal links
30 TB compressed data in (w)arc format
– Approx. 4.5 million files
– Mix of binary and plain text payloads along
with header data
– Two formats: old arc and newer warc
Housed at the BL, with access restrictions
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://hits.guardian.co.uk/b/ss/guardiangu-blogs,guardiangu-news,guardiangu-network/1/H.22.2/56938?ns=guardian&pageName=Prisoner+of+war+camps+in+the+UK+mapped+and+listed.+Download+the+data%3AGraphic%3A1476560&ch=News&c3=GU.co.uk&c4=History+%28Books+genre%29%2CBooks%2CSecond+world+war+%28News%29%2CGermany%2CUK+news%2CTechnology&c5=Not+commercially+useful%2CCorporate+IT&c6=Simon+Rogers&c7=10-Nov-08&c8=1476560&c9=Graphic&c10=Blogpost&c11=News&c13=&c25=Datablog&c30=content&h2=GU%2FNews%2Fblog%2FDatablog&c2=GUID:(none)
WARC-Date: 2010-12-05T02:58:00Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-IP-Address: 66.235.138.18
WARC-Record-ID: <urn:uuid:7d5ce147-9b4b-46cb-8975-ee93b4d0dda8>
Content-Type: application/http; msgtype=response
Content-Length: 740
HTTP/1.1 302 Found
Date: Sun, 05 Dec 2010 02:58:00 GMT
Server: Omniture DC/2.0.0
X-C: ms-4.3.1
Expires: Sat, 04 Dec 2010 02:58:00 GMT
Last-Modified: Mon, 06 Dec 2010 02:58:00 GMT
Cache-Control: no-cache, no-store, must-revalidate, max-age=0, proxy-revalidate, no-transform, private
Pragma: no-cache
ETag: "4CFAFFB8-0E4C-7443902F"
Vary: *
P3P: policyref="/w3c/p3p.xml", CP="NOI DSP COR NID PSA OUR IND COM NAV STA"
Location: http://b.scorecardresearch.com/r?c2=6035250&d.c=gif&d.o=guardiangu-network&d.x=243551159&d.t=page&d.u=http%3A%2F%2Fwww.guardian.co.uk%2Fnews%2Fdatablog%2F2010%2Fnov%2F08%2Fprisoner-of-war-camps-uk
xserver: www422
Content-Length: 0
Keep-Alive: timeout=15
Connection: close
Content-Type: text/plain
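The record above has two layers: a WARC header block, then the captured HTTP response. A minimal Python sketch of reading the first layer follows. It assumes the record is already decompressed and decoded to text, and the function name parse_warc_headers is hypothetical. Real (w)arc files are gzip-compressed and may carry binary payloads, so a dedicated WARC library would be the safer choice in practice.

```python
def parse_warc_headers(record_text):
    """Split the WARC header block of one record into (version, headers)."""
    lines = record_text.strip().splitlines()
    version = lines[0].strip()            # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break                         # a blank line ends the WARC header block
        # Split on the first colon only, so values such as timestamps stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers


# Shortened copy of the record shown above (the long target URI is left out).
sample = """WARC/1.0
WARC-Type: response
WARC-Date: 2010-12-05T02:58:00Z
WARC-IP-Address: 66.235.138.18
Content-Type: application/http; msgtype=response
Content-Length: 740

HTTP/1.1 302 Found
Server: Omniture DC/2.0.0
"""

version, headers = parse_warc_headers(sample)
print(version, headers["WARC-Type"], headers["WARC-Date"])
```

The HTTP status line and the HTTP headers after the blank line are left untouched here; they would need a second pass of the same kind.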
Extract meta-data and links (wat format)
– Approx. 4.5 million files
– 6.2 TB on disk, compressed
– Housed at OII
– Structured JSON
– Different formats for records derived from arc and warc files
{
  "Container": {
    "Filename": "DOTUK-HISTORICAL-1996-2010-GROUP-AA-XAAAAA-20110428000000-00000.arc.gz",
    "Offset": "88937",
    "Compressed": true,
    "Gzip-Metadata": {
      "Header-Length": "10",
      "Inflated-CRC": "-1223265901",
      "Inflated-Length": "26073",
      "Deflate-Length": "4463",
      "Footer-Length": "8"
    }
  },
  "Envelope": {
    "ARC-Header-Length": "102",
    "ARC-Header-Metadata": {
      "Date": "20080509081524",
      "Target-URI": "http://www.ukhomeinteriors.co.uk/content/ext_corbels.php",
      "Content-Length": "25970",
      "Content-Type": "text/html",
      "IP-Address": "83.223.106.10"
    },
    "Payload-Metadata": {
      "Actual-Content-Type": "application/http; msgtype=response",
      "Block-Digest": "sha1:MCCZNOKBJHTZ5MMMCUJGBPE25C2TVUWF",
      "HTTP-Response-Metadata": {
        "Headers-Length": "591",
        "HTML-Metadata": {
          "Head": {
            "Title": "Exterior Corbels",
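The excerpt above is cut off, but the key paths it does show are enough to illustrate how fields can be pulled from a WAT record. The sketch below is illustrative only: the helper dig is hypothetical, the sample record is a hand-completed miniature with a placeholder URI, and the WARC-Header-Metadata key used for warc-derived records is an assumption that does not appear in the excerpt.

```python
import json


def dig(record, *keys, default=None):
    """Follow a chain of keys through nested dicts; return default if one is missing."""
    node = record
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Hand-completed miniature of an arc-derived record; the URI is a placeholder.
wat_line = json.dumps({
    "Envelope": {
        "ARC-Header-Metadata": {
            "Date": "20080509081524",
            "Target-URI": "http://www.example.co.uk/page.html",
            "Content-Type": "text/html",
        },
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {"Head": {"Title": "Exterior Corbels"}}
            }
        },
    }
})

record = json.loads(wat_line)
uri = dig(record, "Envelope", "ARC-Header-Metadata", "Target-URI")
date = dig(record, "Envelope", "ARC-Header-Metadata", "Date")
title = dig(record, "Envelope", "Payload-Metadata", "HTTP-Response-Metadata",
            "HTML-Metadata", "Head", "Title")
# Assumed key for warc-derived records; absent here, so the default comes back.
warc_uri = dig(record, "Envelope", "WARC-Header-Metadata", "WARC-Target-URI")
print(uri, date, title, warc_uri)
```

Returning a default instead of raising keeps one code path usable for both record layouts, which matters when the two formats are mixed across millions of files.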
Plain text lists
Build our own ad hoc Hadoop cluster, fix
incompatibilities, divide into smaller batches
– Build plain text lists of pages and hyperlinks
– Remove error pages (e.g., 404 Not Found)
– Remove pages not in .uk
– Standardize dates (many formats)
– Standardize hyperlinks (trailing /, etc.)
– Fix or remove the many invalid hyperlinks (whitespace,
invalid characters, etc.)
Load results into Apache Hive (2.5 TB)
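The cleaning steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's actual Hadoop code: all function names are hypothetical, "error page" is taken to mean any HTTP status of 400 or above, only the two date formats visible in the sample records are handled, and the example URLs are placeholders.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw):
    """Return a canonical form of a hyperlink, or None if it is unusable."""
    url = raw.strip()
    if not url or any(ch.isspace() for ch in url):
        return None                       # embedded whitespace: treat as invalid
    try:
        parts = urlsplit(url)
    except ValueError:
        return None                       # e.g. malformed bracketed host
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    path = parts.path or "/"              # "http://a.uk" and "http://a.uk/" become one
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def in_uk(url):
    """True if the host of the URL sits under the .uk top-level domain."""
    host = urlsplit(url).hostname or ""
    return host == "uk" or host.endswith(".uk")


def is_error_status(status):
    """Assumption: any 4xx or 5xx response counts as an error page."""
    return status >= 400


def to_epoch(stamp):
    """Convert a 14-digit crawl date or an ISO 8601 UTC date to Unix seconds."""
    for fmt in ("%Y%m%d%H%M%S", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            parsed = datetime.strptime(stamp, fmt)
        except ValueError:
            continue
        return int(parsed.replace(tzinfo=timezone.utc).timestamp())
    return None                           # unknown format: leave for manual review


raw_links = [
    "http://Example.AC.uk",
    " http://example.ac.uk/ ",
    "http://example.com/",
    "http://bad host.uk/",
    "mailto:someone@example.ac.uk",
]
clean = [u for u in map(normalize_url, raw_links) if u and in_uk(u)]
print(clean)
print(to_epoch("20101205025800") == to_epoch("2010-12-05T02:58:00Z"))
```

The first two inputs collapse to the same string, which is the point of standardizing before counting links.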
Source | Destination | Time | LinkText
http://octopus.well.ox.ac.uk:80/ | http://octopus.well.ox.ac.uk:80/links.html | 1032758438 | Links
http://octopus.well.ox.ac.uk:80/ | http://octopus.well.ox.ac.uk:80/projects.html | 1001793436 | Projects
http://octopus.well.ox.ac.uk:80/computing.shtml | http://debian.org/ | 1075794060 | Debian/GNU
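Rows of this shape are enough to count links between second-level domains per year. The sketch below assumes that the Time column holds Unix epoch seconds (the sample values fall in 2001 to 2004 under that reading) and uses placeholder URLs with the three timestamps from the table. The function names are hypothetical, and in the project itself this kind of aggregation would run as a Hive query rather than in Python.

```python
from collections import Counter
from datetime import datetime, timezone
from urllib.parse import urlsplit


def second_level_domain(url):
    """Return e.g. 'ac.uk' for a host under .uk, or None for any other host."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if len(labels) < 2 or labels[-1] != "uk":
        return None
    return ".".join(labels[-2:])


def crawl_year(epoch_seconds):
    """Year of the crawl, assuming the value is Unix seconds in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).year


# (source, destination, time, link text); URLs are placeholders.
rows = [
    ("http://www.example.ac.uk/", "http://www.example.ac.uk/links.html", 1032758438, "Links"),
    ("http://www.example.ac.uk/", "http://www.example.co.uk/", 1001793436, "Projects"),
    ("http://www.example.ac.uk/computing.shtml", "http://debian.org/", 1075794060, "Debian/GNU"),
]

counts = Counter(
    (crawl_year(when), second_level_domain(src), second_level_domain(dst))
    for src, dst, when, _text in rows
)
for key, n in sorted(counts.items(), key=lambda item: item[0][0]):
    print(key, n)
```

A destination outside .uk shows up with None as its domain, so such links can be kept or dropped explicitly at analysis time.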
Overall Statistics
Third-level domains, e.g. ox.ac.uk
Relative size of second-level-domains
Number of links within SLD per node
Cross-domain links (2010): absolute, and normalized to target size
Case of ac.uk
Mapping the UK Webspace:
Fifteen Years of British Universities on the Web
Hale et al., WebSci'14, available: http://arxiv.org/abs/1405.2856
121 UK universities: websites and links
1) League table ranking
2) Group affiliation
3) Geographical location
Group Affiliations
League table ranking
Geography
Colour ~ intensity
Gravity Law
\sigma_{ij} = \frac{s_{ij}}{s_i^{out} s_j^{in}}, \qquad s_{ij} = \frac{s_i^{out} s_j^{in}}{r^{0.28}}
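A short numeric sketch may help read the two gravity-law expressions. It rests on assumptions not spelled out on the slide: s_ij is taken as the number of links from university i to university j, s_i^out and s_j^in as the total outgoing and incoming link counts, and r as the geographical distance between the two. The function names and the scale constant are hypothetical.

```python
import math


def normalized_strength(s_ij, s_out_i, s_in_j):
    """sigma_ij: observed links from i to j, divided by s_i^out * s_j^in."""
    return s_ij / (s_out_i * s_in_j)


def gravity_estimate(s_out_i, s_in_j, r_ij, exponent=0.28, scale=1.0):
    """Links expected under the gravity law: scale * s_i^out * s_j^in / r^exponent."""
    return scale * s_out_i * s_in_j / r_ij ** exponent


# Illustrative numbers only, not taken from the dataset.
sigma = normalized_strength(s_ij=50, s_out_i=1000, s_in_j=2000)
near = gravity_estimate(1000, 2000, r_ij=10.0)
far = gravity_estimate(1000, 2000, r_ij=20.0)
print(sigma)
print(far / near, 2 ** -0.28)
```

With an exponent of 0.28, doubling the distance cuts the expected link count by a factor of 2^0.28, roughly 18 percent, which is a weak distance effect.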
Big UK Domain Data for the Arts and
Humanities
Primary aim: developing a methodological and
theoretical framework within which to study over 15
years of UK domain data – with lessons for the
future study of web archives more generally
Big UK Domain Data for the Arts and
Humanities
The dataset:
– Crawled from 1996 to 2013
– Approximately 65 TB, billions of words
– Building interface to allow search by retrieval
date, target domain of links, sentiment
– Allow qualitative and quantitative analysis – and
iteration between multiple research techniques
Big UK Domain Data for the Arts and
Humanities
Key outputs:
– Ten bursary projects using web archive data to
investigate a broad range of topics, for example…
• Armed services recruitment online
• The accessibility of the web for disabled users
• Online discussions of ‘Beat’ poetry
– An edited book of empirical studies concerning the
history of the UK web, featuring chapters on, for
example…
• Constitutional and institutional change in UK government
• The BBC’s online presence
• The ‘web of faith’ online
Next
● Studies underway at OII, BL, IHR
● Book and articles
– Study overall growth of .uk
– Case study of .gov.uk
– Study of media and select committee
visibility
● Releasing the data openly
