Hadoop Hands On
Successes and failures to drive
evolution
Benoit PERROUD
Software Engineer @Verisign & Apache Committer
GITI BigData, EPFL, November 6, 2012
Disclaimer

   •     I apologize for speaking “Frenglish”

   •     The views and statements expressed in this talk do not necessarily reflect the
         views of VeriSign, Inc. The company and any other person involved in it do not
         warrant the accuracy, reliability, currency or completeness of those views or
         statements and do not accept any legal liability whatsoever arising from any
         reliance on the views, statements and subject matter of the talk.

   •     Apache, Apache Hadoop, Hadoop, Cassandra, Apache Cassandra, Solr, Apache
         Solr, HBase, Apache HBase, Tomcat, Apache Tomcat, ZooKeeper, Apache
         ZooKeeper, Lucene, Apache Lucene and the yellow elephant logo are either
         registered trademarks or trademarks of the Apache Software Foundation in the
         United States and/or other countries.
   •     Java, GlassFish and the Java logo are registered trademarks of Oracle and/or its
         affiliates.
   •     Python and the Python logo are either registered trademarks or trademarks of the
         Python Software Foundation
   •     MongoDB, Mongo and the leaf logo are registered trademarks of 10gen, Inc.
   •     All other marks are the property of their respective owners.

Verisign Public                                                                             2
Let’s talk about Hadoop!




Hadoop 10k Feet View

   1. MapReduce Processing Framework
           • Map → Combine → Shuffle → Reduce
   2. Distributed File System (HDFS)
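The four MapReduce stages listed above can be sketched in plain Python. This is a single-process illustration of the data flow only, not Hadoop's API: the function names and the word-count task are assumptions chosen for the example, and each inner list stands in for one input split handled by one mapper.

```python
from collections import Counter, defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def combine_phase(pairs):
    """Combine: pre-aggregate one mapper's output locally to reduce shuffle traffic."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle_phase(mapper_outputs):
    """Shuffle: group the values coming from all mappers by key."""
    grouped = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: fold each key's list of partial counts into a final count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(splits):
    """Run the four stages in order over a list of input splits."""
    combined = [combine_phase(map_phase(split)) for split in splits]
    return reduce_phase(shuffle_phase(combined))


if __name__ == "__main__":
    splits = [["hadoop stores data", "hadoop processes data"], ["data is everywhere"]]
    print(word_count(splits))
```

The combiner is optional in real jobs; it is shown here because the slide lists it as a distinct stage between Map and Shuffle.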




Credit: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
Your first Hadoop Deployment

   • Pseudo-distributed mode on a single node
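As a rough sketch of what pseudo-distributed mode involves, the script below renders the three site files usually edited for a single-node Hadoop 1.x setup. The property names are the Hadoop 1.x ones; the host name and port numbers are assumptions for illustration and should be adapted to the local install, and the helper name `render_site_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hadoop 1.x property names for a single-node, pseudo-distributed setup.
# "localhost" and the ports are illustrative assumptions, not values from the talk.
PSEUDO_DISTRIBUTED = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},  # one node, so one replica
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}


def render_site_xml(properties):
    """Build a <configuration> document holding one <property> per entry."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for filename, props in PSEUDO_DISTRIBUTED.items():
        print(f"<!-- {filename} -->")
        print(render_site_xml(props))
```

All daemons (NameNode, DataNode, JobTracker, TaskTracker) then run as separate processes on the same machine, which is why the replication factor is set to 1.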




Going Distributed

   • TaskTracker (TT) and DataNode (DN) are moved to a
     dedicated box




NameNode Single Point of Failure

   • The NameNode crashes: configure a Primary NameNode (PNN) and a Secondary NameNode (SNN)




                            NFS HA setup is not detailed here.


Bringing Data into the Cluster

   • Data could be internal to the company, but also
     external.




                                Data Retrieval and Stream Ingestion
                                are oversimplified.

Dealing with API Changes

   • Integration/Validation Cluster setup




                                   The Validation Cluster is omitted
                                   from the following slides for clarity.

Cluster Is Growing




Add Monitoring




Turn On Rack Awareness
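Rack awareness in Hadoop is typically enabled by pointing the cluster at a topology script (the `topology.script.file.name` property in Hadoop 1.x). Hadoop invokes the script with one or more IP addresses or host names as arguments and expects one rack path per argument on standard output. The following is a minimal sketch of such a script; the address-to-rack table and the rack names are hypothetical and would normally come from the site's own inventory.

```python
#!/usr/bin/env python3
import sys

# Hypothetical host-to-rack table; a real cluster would load this from inventory data.
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

# Rack reported for any host missing from the table.
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return one rack path per host, in the same order as the arguments."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Print one rack path per command-line argument, separated by spaces.
    print(" ".join(resolve(sys.argv[1:])))
```

With rack information available, HDFS can place block replicas on more than one rack, so losing a single rack does not lose every copy of a block.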




Split the Cluster to Production and Research




Data Retrieval through REST End Point




Data Retrieval with Search Features




Data Retrieval: Adding a Cache




Data Visualization Tools




Upstream Updates Channel




Realtime Updates




Future Evolutions

   • Hadoop Next Gen
           • YARN (2.0)


   • Graph processing
           • Neo4j
           • Google Pregel / Apache Hama


   • Incremental Updates

   • Real-time ad hoc queries
           • Cloudera Impala / Google Dremel



Conclusion

   • Hadoop has gained huge momentum
   • Technologies (around Hadoop) are evolving really fast
   • There is no “One size fits all” solution
           • Design is heavily driven by customer needs
   • Data quality is a hidden requirement




Conclusion #2

   • Data Scientists cost a lot
   • Running on commodity hardware still costs a lot
   • No one has the full understanding of the full data flow
           • And you need several FTEs just to track the architecture
   • There is a high risk of misusing this software
   • Hiring engineers with deep knowledge (meaning:
     hands-on experience) of some of this software is
     already a challenge




Recommended Reading

  Hadoop in Practice
  by Alex Holmes
  Senior Software Engineer @Verisign




Q&A
                     Benoit PERROUD
                  bperroud@verisign.com




Thank You




© 2012 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and
designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United
States and in foreign countries. All other trademarks are property of their respective owners.
