Hadoop Successes and Failures to Drive Deployment Evolution

Hadoop Hands On
Successes and failures to drive evolution
Benoit PERROUD
Software Engineer @Verisign & Apache Committer
GITI BigData, EPFL, November 6, 2012

Disclaimer

• I apologize for speaking “Frenglish”

• The views and statements expressed in this talk do not necessarily reflect the
views of VeriSign, Inc and any other person involved in the company do not
warrant the accuracy, reliability, currency or completeness of those views or
statements and do not accept any legal liability whatsoever arising from any
reliance on the views, statements and subject matter of the talk.

• Apache, Apache Hadoop, Hadoop, Cassandra, Apache Cassandra, Solr, Apache
Solr, HBase, Apache HBase, Tomcat, Apache Tomcat, ZooKeeper, Apache
ZooKeeper, Lucene, Apache Lucene and the yellow elephant logo are either
registered trademarks or trademarks of the Apache Software Foundation in the
United States and/or other countries.
• Java, Glassfish and the Java logo are registered trademarks of Oracle and/or its
affiliates
• Python and the Python logo are either registered trademarks or trademarks of the
Python Software Foundation
• MongoDB, Mongo and the leaf logo are registered trademarks of 10gen, Inc.
• All other marks are the property of their respective owners.

Verisign Public 2

Let’s talk about Hadoop!


Hadoop 10k Feet View

1. MapReduce Processing Framework
• Map → Combine → Shuffle → Reduce
2. Distributed File System (HDFS)
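
The four MapReduce phases listed above can be sketched in plain Python as a word count. This is only an in-memory illustration of the data flow, not Hadoop code: the function names (map_phase, combine_phase, shuffle_phase, reduce_phase, word_count) and the sample input are assumptions made for this example, and each "split" stands in for the block of input that one mapper would read from HDFS.

```python
from collections import Counter, defaultdict


def map_phase(split):
    """Map: emit one (word, 1) pair per word found in an input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def combine_phase(pairs):
    """Combine: pre-aggregate the output of a single mapper to shrink the shuffle."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle_phase(mapper_outputs):
    """Shuffle: group the values of every key across all mappers, sorted by key."""
    grouped = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reduce: fold the list of values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(splits):
    """Run the four phases in order; each split is handled by its own 'mapper'."""
    mapper_outputs = [combine_phase(map_phase(split)) for split in splits]
    return reduce_phase(shuffle_phase(mapper_outputs))


if __name__ == "__main__":
    # Hypothetical input: two splits, as if two mappers each read one HDFS block.
    splits = [
        ["hadoop stores data", "hadoop processes data"],
        ["data is everywhere"],
    ]
    print(word_count(splits))
```

In a real cluster the combine step runs on the mapper's node, so only the pre-aggregated pairs cross the network during the shuffle.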

Credit: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

Your first Hadoop Deployment

• Pseudo-distributed mode on a single node


Going Distributed

• TaskTracker (TT) and DataNode (DN) are moved to a
dedicated box


NameNode Single Point of Failure

• The NameNode crashes: configure a Primary NameNode (PNN) and a Secondary NameNode (SNN)

NFS HA setup is not detailed here.


Bringing Data into the Cluster

• Data could be internal to the company, but also
external.

Data Retrieval and Stream Ingestion
are oversimplified.


Dealing with API Changes

• Integration/Validation Cluster setup

The Validation Cluster is omitted
from later slides for clarity.


Cluster Is Growing


Add Monitoring


Turn On Rack Awareness
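
Rack awareness in Hadoop is typically enabled by pointing the cluster at a topology script (the property is topology.script.file.name in Hadoop 1.x and net.topology.script.file.name in 2.x). Hadoop calls the script with one or more IP addresses or hostnames as arguments and reads one rack path per argument from standard output. A minimal sketch of such a script follows; the subnet-to-rack table and the rack names are hypothetical and would have to match the real data-center layout.

```python
#!/usr/bin/env python
import sys

# Hypothetical subnet-to-rack table; the trailing dot keeps "10.1.1." from
# matching addresses such as 10.1.10.x.
RACKS = {
    "10.1.1.": "/dc1/rack1",
    "10.1.2.": "/dc1/rack2",
}

# Rack path returned for any host that matches no known subnet.
DEFAULT_RACK = "/default-rack"


def resolve(host):
    """Return the rack path for one IP address (or hostname) given by Hadoop."""
    for prefix, rack in RACKS.items():
        if host.startswith(prefix):
            return rack
    return DEFAULT_RACK


def main(argv):
    """Print one rack path per argument, in the same order as the arguments."""
    for host in argv:
        print(resolve(host))


if __name__ == "__main__":
    main(sys.argv[1:])
```

With racks known, HDFS can place block replicas on more than one rack, so losing a single rack or its switch does not take every copy of a block offline.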


Split the Cluster to Production and Research


Data Retrieval through REST End Point


Data Retrieval with Search Features


Data Retrieval: Add a Cache
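
The slide does not say which cache product sits in front of the retrieval layer, so the following is only a generic sketch of the idea: a read-through cache with a time-to-live that answers repeated lookups from memory and falls back to the slow backend on a miss. The class name ReadThroughCache, the loader callback and the TTL value are all assumptions made for this example.

```python
import time


class ReadThroughCache:
    """Serve repeated reads from memory; call the backend only on a miss."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader      # hypothetical slow lookup, e.g. a REST call
        self._ttl = ttl_seconds    # how long a cached value stays valid
        self._clock = clock        # injectable so that expiry can be tested
        self._entries = {}         # key -> (expiry time, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired entry: load from the backend and remember the result.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    def slow_lookup(key):
        # Stand-in for a query against the data store behind the REST end point.
        return key.upper()

    cache = ReadThroughCache(slow_lookup, ttl_seconds=30.0)
    print(cache.get("example.com"), cache.get("example.com"))
    print("hits:", cache.hits, "misses:", cache.misses)
```

The TTL bounds how stale an answer can be, which matters once upstream updates start flowing into the cluster.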


Data Visualization Tools


Upstream Updates Channel


Realtime Updates


Future Evolutions

• Hadoop Next Gen
• YARN (2.0)

• Graph processing
• Neo4j
• Google Pregel / Apache Hama

• Incremental Updates

• Real time ad hoc queries
• Cloudera Impala / Google Dremel


Conclusion

• Hadoop has gained huge momentum
• Technologies (around Hadoop) are evolving really fast
• There is no “One size fits all” solution
• Design is heavily driven by customer needs
• Data quality is a hidden requirement


Conclusion #2

• Data Scientists cost a lot
• Running on commodity hardware still costs a lot
• No one has the full understanding of the full data flow
• And you need several FTEs just to track the architecture
• You have a high risk of misusing this software
• Hiring engineers with deep knowledge (meaning:
hands-on experience) of some of these tools is
already a challenge


Recommended Reading

Hadoop In Practice
by Alex Holmes
Senior Software Engineer @Verisign


Q&A
Benoit PERROUD
bperroud@verisign.com


Thank You

© 2012 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and
designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United
States and in foreign countries. All other trademarks are property of their respective owners.
