What is Hadoop? 
Hadoop Driven Digital Preservation 
Clemens Neudecker 
KB National Library of the Netherlands 
SCAPE & OPF Hackathon 
Vienna, 2 Dec 2013
Timeline
• Dec 2004: Dean/Ghemawat (Google) MapReduce paper
• 2005: Doug Cutting and Mike Cafarella (Yahoo) 
create Hadoop, at first only to extend Nutch 
(the name is derived from Doug’s son’s toy elephant) 
• 2006: Yahoo runs Hadoop on 5-20 nodes 
This work was partially supported by the SCAPE Project. 
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Timeline 
• March 2008: Cloudera founded
• July 2008: Hadoop wins the terabyte sort benchmark
(first time a Java program won this competition)
• April 2009: Amazon introduces “Elastic MapReduce”
as a service on S3/EC2
• June 2011: Hortonworks founded
Timeline 
• 27 Dec 2011: Apache Hadoop release 1.0.0
• June 2012: Facebook claims the “biggest Hadoop cluster”,
totalling more than 100 petabytes in HDFS
• 2013: Yahoo runs Hadoop on 42,000 nodes,
computing about 500,000 MapReduce jobs per day
• 15 Oct 2013: Apache Hadoop release 2.2.0 (YARN)
Contributions 2006 - 2011 
(Cf. http://hortonworks.com/blog/reality-check-contributions-to-apache-hadoop/)
“Core” Hadoop 
• Hadoop Common (formerly Hadoop Core) 
• Hadoop MapReduce 
• Hadoop YARN (MapReduce 2.0) 
• Hadoop Distributed File System (HDFS) 
The wider Hadoop Ecosystem 
• Ambari, ZooKeeper (managing & monitoring)
• HBase, Cassandra (database) 
• Hive, Pig (data warehouse and query language) 
• Mahout (machine learning) 
• Chukwa, Avro, Oozie, Giraph, and many more 
The wider Hadoop Ecosystem 
(Cf. http://www.slideshare.net/cloudera/the-hadoop-stack-then-now-and-in-the-future-eli-collins-charles-zedlewski-cloudera)
“Hadoop is a hammer”
• “Hadoop is a hammer. Start by figuring out what house
you’re gonna build.”
Alistair Croll
• “If all you have is a hammer, throw away everything
that is not a nail!”
Jimmy Lin
MapReduce in 41 words (including “library”) 
Goal: count the number of books in the library. 
• Map: 
You count up shelf #1, I count up shelf #2. 
(The more people we get, the faster this part goes) 
• Reduce: 
We all get together and add up our individual counts. 
(Cf. http://www.chrisstucchio.com/blog/2011/mapreduce_explained.html)
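The library analogy can be sketched in a few lines of plain Python. This is only an illustration of the map and reduce steps on one machine, not Hadoop code: the shelf data and the names count_shelf, add_counts and count_books are made up for the example, and a thread pool stands in for the "people" counting in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical library: each shelf is a list of book titles.
shelves = [
    ["Book A", "Book B", "Book C"],
    ["Book D"],
    ["Book E", "Book F"],
]

def count_shelf(shelf):
    """Map step: one worker counts the books on a single shelf."""
    return len(shelf)

def add_counts(total, count):
    """Reduce step: combine a running total with one partial count."""
    return total + count

def count_books(shelves, workers=3):
    # Each shelf is counted independently, so more workers can
    # count more shelves at the same time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_shelf, shelves))
    # Everyone "gets together" and the partial counts are added up.
    return reduce(add_counts, partial_counts, 0)

total_books = count_books(shelves)
print(total_books)
```

The map step needs no coordination between workers, which is why it scales with the number of workers; only the reduce step has to see all partial results.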
MapReduce in a nutshell
[Diagram labels: Task 1, Task 2, Task 3, each with an aggregated result; output data with the final aggregated result. © Sven Schlarb]
MapReduce “v1” issues
• JobTracker as a single point of failure
• Deficiencies in scalability, memory consumption,
threading model, reliability and performance
(https://issues.apache.org/jira/browse/MAPREDUCE-278)
• Aim to support programming paradigms other than
MapReduce (e.g. BSP, bulk synchronous parallel)
MapReduce vs YARN 
(Cf. http://hortonworks.com/blog/office-hours-qa-on-yarn-in-hadoop-2/)
When to use Hadoop? 
• Generally, whenever “standard tools” no longer work
because of sheer data size
(rule of thumb: if your data fits on a regular hard drive,
you’re better off sticking to Python/SQL/Bash/etc.!)
• Aggregation across large data sets: use the power of 
Reducers! 
• Large-scale ETL operations (extract, transform, load) 
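What "aggregation with Reducers" means can be shown with a small plain-Python sketch: mappers emit (key, value) pairs, the pairs are grouped by key, and a reducer folds each group into one result. Nothing here uses Hadoop itself; the record layout (file name, format, size in bytes) and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical input records: (file name, format, size in bytes).
records = [
    ("a.tif", "TIFF", 400),
    ("b.jp2", "JPEG2000", 120),
    ("c.tif", "TIFF", 350),
    ("d.pdf", "PDF", 80),
]

def map_record(record):
    """Map step: emit one (key, value) pair per input record."""
    _name, fmt, size = record
    yield fmt, size

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_sizes(key, values):
    """Reduce step: aggregate all values that share one key."""
    return key, sum(values)

def total_size_per_format(records):
    pairs = (pair for record in records for pair in map_record(record))
    return dict(reduce_sizes(k, v) for k, v in shuffle(pairs).items())

totals = total_size_per_format(records)
print(totals)
```

On a cluster, each reducer would receive only the groups for its share of the keys, so the aggregation is spread across machines in the same way the map step is.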
Reading 
• Tom White: Hadoop: The Definitive Guide
(get the 3rd ed. for the extra YARN chapter)
• YARN explained (really quite well):
http://blog.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/
• Jimmy Lin and Chris Dyer: Data-Intensive Text Processing with MapReduce:
http://lintool.github.io/MapReduceAlgorithms/ed1n.html
Happy Hadooping! 

