Infrastructure and Stack




Presented by John Dougherty, Viriton
10/03/2012
john.dougherty@viriton.com
What is Hadoop?

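• Apache’s open-source implementation of the ideas in Google’s GFS
  and MapReduce papers (HBase is the counterpart to BigTable)
• Written in Java; its daemons and jobs run on the Java VM
• HDFS favors large, sequential, write-once files; column-based
  storage comes from HBase on top of HDFS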
• Grants the ability to read/write/manipulate very
  large data sets/structures.
What is Hadoop? (cont.)



[Image: side-by-side comparison, "VS."]
What is BigTable?
• Google’s distributed storage system; its design is the
  basis for HBase, which runs on Hadoop

• Uses a commodity approach to hardware

• Extreme scalability and redundancy
Commodity Perspective
• Commercial Hardware cost vs. failure rate
  – Roughly double the cost of commodity
  – Roughly 5% failure rate


• Commodity Hardware cost vs. failure rate
  – Roughly half the cost of commercial
  – Roughly 10-15% failure rate
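
The trade-off above can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes the slide's failure rates mean the fraction of machines lost over some period, normalizes commodity cost to 1.0, and uses a hypothetical helper, `cost_per_working_node`, that is not part of the original deck.

```python
def cost_per_working_node(unit_cost, failure_rate):
    """Spend needed per machine that is still running, given a failure fraction."""
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    return unit_cost / (1.0 - failure_rate)


# Figures from the slide; commodity cost is normalized to 1.0 (assumption).
commercial = cost_per_working_node(unit_cost=2.0, failure_rate=0.05)
commodity_best = cost_per_working_node(unit_cost=1.0, failure_rate=0.10)
commodity_worst = cost_per_working_node(unit_cost=1.0, failure_rate=0.15)

print(f"commercial:      {commercial:.2f}")
print(f"commodity (10%): {commodity_best:.2f}")
print(f"commodity (15%): {commodity_worst:.2f}")
```

Under these assumptions, commodity hardware stays cheaper per working node even at the 15% failure rate, which is the argument for designing the software to tolerate failures.
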
Breaking Down the Complexity
What is HDFS?
• Backend file system for the Hadoop platform

• Allows for easy operability/node management

• Certain technologies can replace or augment it
  – HBase (augments HDFS)
  – Cassandra (can replace HDFS)
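
HDFS gets its redundancy by splitting files into blocks and keeping several replicas of each block on different nodes. The toy model below sketches that idea in plain Python. It is not HDFS code and does not reproduce HDFS's real placement policy: the round-robin placement, the node names, and the helpers `place_blocks` and `readable_after_failure` are assumptions made for illustration. The 64 MB block size and replication factor of 3 are the Hadoop 1.x defaults.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Toy model: split a file into fixed-size blocks and assign each block
    to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = -(-file_size // block_size)  # ceiling division
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


def readable_after_failure(placement, failed):
    """A file stays readable if every block has a replica on a live node."""
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


MB = 1024 * 1024
layout = place_blocks(200 * MB, 64 * MB, ["n1", "n2", "n3", "n4"])
print(layout)
print(readable_after_failure(layout, failed={"n2"}))
print(readable_after_failure(layout, failed={"n1", "n2", "n3"}))
```

Losing one node leaves every block readable in this model; losing three of the four nodes does not, because the first block has all of its replicas on the failed nodes.
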
What works with Hadoop?
• Middleware and connectivity tools improve functionality

• Hive and Pig (which began as Hadoop sub-projects) and Cassandra
  (a separate Apache project) help to connect to and use the data

• Each application set has different uses




Layout of Middleware
Schedulers/Configurators
• ZooKeeper
  – Coordination service: keeps configuration, naming, and
    synchronization consistent across many nodes
  – Can be integrated easily
• Oozie
  – A workflow scheduler for Hadoop jobs
  – Open source
• Flume
  – Collects and aggregates logs from distributed sources
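
Oozie describes a workflow as a directed acyclic graph of actions, and a scheduler has to run each action only after the ones it depends on. The sketch below shows only that ordering idea, using Python's standard library. It is not Oozie's API or its XML workflow format, and the action names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "aggregate": {"clean_logs"},
    "export_report": {"aggregate"},
}


def run_order(dag):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(workflow)
print(order)
```
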
Middleware
• Hive
  – Data warehouse layer that runs natively on Hadoop
  – Uses HiveQL, a SQL-like language, to create queries
  – Easily extensible with plugins such as user-defined functions (UDFs)
• Pig
  – Hive-like in that it uses its own language (Pig Latin)
  – Easily extensible; a procedural data-flow language, less like SQL than HiveQL
• Sqoop
  – Transfers bulk data between relational databases and Hadoop
  – Limited, but powerful
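
Hive of this era turned a HiveQL query into one or more MapReduce jobs. As a rough sketch of what that means, the plain-Python code below performs the map, shuffle, and reduce steps behind a query along the lines of `SELECT page, COUNT(*) FROM visits GROUP BY page`. The `visits` table, the `page` column, and the three helper functions are hypothetical, and none of this is Hive's or Hadoop's actual API.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (key, 1) for every record, like a mapper feeding a COUNT(*)."""
    for rec in records:
        yield rec["page"], 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle/sort."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical rows of a 'visits' table.
visits = [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]
counts = reduce_phase(shuffle(map_phase(visits)))
print(counts)
```

On a real cluster the map and reduce steps run in parallel on many nodes, which is where the scalability comes from; the logic per key stays this simple.
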
How can Hadoop/HBase/MapReduce help?

• You have one or more very large data sets
• You require results on your data in a timely
  manner
• You don’t enjoy spending millions on
  infrastructure
• Your data is large enough to give a classic
  RDBMS headaches
Column Based Data
• Developer woes
  – Extract/Transform/Load (ETL) is a concern for complicated
    schemas
  – Egress/Ingress between existing queries/results
    becomes complicated
  – Solutions are deployed with walls of functionality
  – Hard questions turn into hard queries
Column Based Data (cont.)
• Developer joys
  – You can now process petabytes, scaling into exabytes and beyond
  – Your extended datasets can be aggregated: not
    easily, but in ways that were not possible before
  – You can extend your daily queries to include
    historical data, and even combine it with existing
    real-time data usage
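
One reason column-based storage suits large aggregations is that a query over a single field only has to read that field. The sketch below contrasts a row layout with a column layout in plain Python. The records and field names are hypothetical, and this is a simplified picture: it does not reproduce how HBase lays out column families on disk.

```python
# Hypothetical records; field names are illustrative only.
rows = [
    {"user": "a", "region": "east", "bytes": 120},
    {"user": "b", "region": "west", "bytes": 300},
    {"user": "c", "region": "east", "bytes": 80},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for rec in records:
        for name, value in rec.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited to total one field.
total_from_rows = sum(rec["bytes"] for rec in rows)
# Column layout: only the 'bytes' column is visited.
total_from_columns = sum(columns["bytes"])

print(total_from_rows, total_from_columns)
```

Both layouts give the same total; the difference is how much unrelated data has to be scanned to get it, which matters more as the data set grows.
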
Future Projects/Approaches

• Cross-discipline data sharing/comparisons
• Complex statistical models re-constructed
• Massive data set conglomeration and
  standardization (Public sector data, mostly)
How some software makes it easier
• Alteryx
   – Visual interface, very similar to Talend’s
   – Allows easy integration into reporting (Crystal Reports)
• Qubole
   – This will be expanded on shortly
   – Easy to use interface and management of data
• Hortonworks (Open Source)
   – Management utility for internal cluster deployments
• Cloudera (Open, to an extent)
   – Management utility from Cloudera, also for internal deployments


Hadoop Infrastructure (Oct. 3rd, 2012)
