The Good, The Bad and the Ugly
How to tame the Big Data Beast
Guy Loewenberg
May 2013
Overview
• Data Explosion
Overview
• Big Data: A collection of data sets so large and complex that
it becomes difficult to process using on-hand database
management tools or traditional data processing applications
• Hadoop: A framework that allows distributed
processing of large data-sets across clusters of
computers using a simple programming model
• 1000 Kilobytes = 1 Megabyte
• 1000 Megabytes = 1 Gigabyte
• 1000 Gigabytes = 1 Terabyte
• 1000 Terabytes = 1 Petabyte
• 1000 Petabytes = 1 Exabyte
• 1000 Exabytes = 1 Zettabyte
• 1000 Zettabytes = 1 Yottabyte
• 1000 Yottabytes = 1 Brontobyte (informal name, not an SI unit)
• 1000 Brontobytes = 1 Geopbyte (informal name, not an SI unit)
Most US SME corporations
Most US large corporations
Leaders like Facebook & Google
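The unit ladder above can be turned into a small conversion helper. This is a minimal sketch, not part of the original talk: the `UNITS` list, `humanize` and `to_bytes` are hypothetical names, and a base "Bytes" entry (1000 Bytes = 1 Kilobyte) is assumed so the ladder has a starting point.

```python
# Decimal (powers of 1000) unit names from the slide. "Bytes" is an assumed
# base entry; "Brontobytes" and "Geopbytes" are informal names, not SI units.
UNITS = ["Bytes", "Kilobytes", "Megabytes", "Gigabytes", "Terabytes",
         "Petabytes", "Exabytes", "Zettabytes", "Yottabytes",
         "Brontobytes", "Geopbytes"]


def humanize(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    index = 0
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    return f"{value:.1f} {UNITS[index]}"


def to_bytes(amount: int, unit: str) -> int:
    """Convert a whole amount of a named unit back to bytes."""
    return amount * 1000 ** UNITS.index(unit)


if __name__ == "__main__":
    print(humanize(to_bytes(2, "Petabytes")))
```

Every step up the list multiplies by 1000, so a petabyte-scale cluster holds a million times more data than a gigabyte-scale database.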
Hadoop Basics
• Designed to scale
• Uses commodity hardware
• Processes data in batches
• Can process data at very large scale (petabytes)
Core Hadoop
• Core Hadoop is built from two main systems:
– Hadoop Distributed File System (HDFS)
– MapReduce programming framework
Hadoop architecture
• Hadoop Distributed File System (HDFS):
self-healing high-bandwidth clustered
storage.
– The NameNode controls HDFS,
whereas the DataNodes perform
block replication and read/write
operations and carry the
HDFS workload
– They work in master/slave mode.
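To make the NameNode/DataNode split concrete, here is a toy simulation of block bookkeeping and "self-healing". It is a sketch under loud assumptions: `place_blocks` and `re_replicate` are hypothetical helpers, placement is plain round-robin, and real HDFS placement is rack-aware and far more involved.

```python
import math
from itertools import cycle


def place_blocks(file_size_mb, block_size_mb, datanodes, replication=3):
    """Return {block_index: [datanode, ...]} using round-robin placement.

    Toy stand-in for the NameNode's bookkeeping (assumption: no rack
    awareness, no load balancing).
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    ring = cycle(datanodes)
    return {block: [next(ring) for _ in range(replication)]
            for block in range(num_blocks)}


def re_replicate(block_map, failed_node, datanodes):
    """Simulate self-healing: replace replicas lost on a failed node."""
    live = [node for node in datanodes if node != failed_node]
    healed = {}
    for block, replicas in block_map.items():
        kept = [node for node in replicas if node != failed_node]
        candidates = [node for node in live if node not in kept]
        # Copy the block to other live nodes until the replica count is restored.
        while len(kept) < len(replicas) and candidates:
            kept.append(candidates.pop(0))
        healed[block] = kept
    return healed


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]
    layout = place_blocks(200, 64, nodes)
    print(re_replicate(layout, "dn2", nodes))
```

The master only tracks which node holds which block; the data itself lives on the slaves, which is why losing one DataNode costs replicas rather than files.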
Hadoop architecture
• MapReduce: Distributed fault-tolerant resource
management and scheduling coupled with a
scalable data programming abstraction.
– The JobTracker schedules
jobs and allocates tasks
to TaskTracker nodes, which
execute the requested map
and reduce processes
– They work in master/slave mode
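The programming abstraction itself is small. The sketch below runs the classic word count through map, shuffle and reduce steps in a single Python process; the function names are illustrative only, and on a real cluster the JobTracker would distribute the map and reduce calls across TaskTracker nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce sequentially in one process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["big data big cluster", "Data beast"]))
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be spread over many machines without coordination between tasks.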
Hadoop software architecture
• MapReduce: Parallel data processing framework for large data sets
• HDFS: Hadoop Distributed File System
• Oozie: MapReduce job scheduler
• HBase: Key-value database
• Pig: Analysis language for large data sets
• Hive: High-level language for analyzing large data sets
• ZooKeeper: Distributed coordination system
• Solr / Lucene: Search engine and query engine library
What Hadoop can’t do
• Hadoop lets you perform batch analysis on whatever
data you have stored within Hadoop. That data does
not have to be structured.
– Many solutions take advantage of Hadoop's low storage cost
to store structured data there instead of in an RDBMS, but
shifting data back and forth between Hadoop and an RDBMS
would be overkill.
– Transactional data is highly complex: a single transaction on an
e-commerce site can generate many steps that all have to be
completed quickly. That scenario is not ideal for Hadoop.
– Structured data sets that require very low latency are also a poor fit.
Comparing RDBMS to MapReduce
Attribute     RDBMS                  MapReduce
Data size     Gigabytes              Petabytes
Access        Interactive and batch  Batch
Structure     Fixed schema           Unstructured schema
Language      SQL                    Procedural (Java, C++, Ruby, etc.)
Integrity     High                   Low
Scaling       Nonlinear              Linear
Updates       Read and write         Write once, read many times
Latency       Low                    High
What Hadoop can do
• High data volume, stored in Hadoop, and queried at
length later using MapReduce functions
– index building
– pattern recognition
– creating recommendation engines
– sentiment analysis
• Hadoop should be integrated within your existing IT
infrastructure in order to capitalize on the countless
pieces of data that flow into your organization.
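Index building, the first use case listed, maps naturally onto the same model. The following is a minimal single-process sketch of an inverted index; `map_document` and `build_index` are hypothetical names, and the grouping step that Hadoop's shuffle would perform is done here with an in-memory dictionary.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map: emit (term, doc_id) once for each distinct term in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def build_index(documents):
    """Group mapped pairs by term and reduce each group to a sorted posting list."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term, emitted_id in map_document(doc_id, text):
            postings[term].add(emitted_id)
    return {term: sorted(ids) for term, ids in postings.items()}


if __name__ == "__main__":
    docs = {
        "d1": "hadoop stores data",
        "d2": "Hadoop processes data in batches",
    }
    print(build_index(docs)["hadoop"])
```

The documents are written once and scanned in full, which matches the batch, write-once/read-many profile described in the comparison table.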
Hadoop Maturity?!
• Inaccessible to analysts without programming ability
• Clusters keep no record of who changed which record and when
it was changed
• Storage functionality that IT teams have long depended on (snapshots,
mirroring) is lacking in HDFS
• Incompatibility with existing tools
• Data without structure has limited value, and applying
structure at query time requires a lot of Java code
• Limited documentation
• Limited troubleshooting capabilities
Choosing your infrastructure
• Define what you want to achieve
– POC
– Scale (few, tens, hundreds)
– One-time, periodic, continuous
• Infrastructure design
– Servers, storage, network, rack-space
– Form a joint team of Hadoop application developers and infrastructure
specialists (facilities/server/network) when building a solution
– Virtual machines vs. physical machines (I/O performance, high
CPU load, network)
Choosing your infrastructure
• Network infrastructure
– Data movement between nodes (rack-awareness,
replication factor)
– Data between sites (Hosting/Service)
• Storage (architecture, disks)
– Local disks, JBOD
– Increase default block-size
• Operations
– Monitor
– Backup (configuration files, journal, Checkpoint …)
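The replication factor and block size mentioned above drive both raw disk requirements and the number of blocks the NameNode has to track. The sketch below is a rough sizing aid only: `plan_storage` is a hypothetical helper, the 64 MB block size and replication factor of 3 are assumed defaults that should be checked against the actual cluster configuration, and decimal units are used for simplicity.

```python
import math


def plan_storage(dataset_tb, block_size_mb=64, replication=3):
    """Estimate block count and raw disk needed for a dataset.

    Assumes 1 TB = 1,000,000 MB and ignores small-file overhead,
    compression and temporary space.
    """
    dataset_mb = dataset_tb * 1_000_000
    blocks = math.ceil(dataset_mb / block_size_mb)
    return {
        "blocks": blocks,                        # entries the NameNode tracks
        "block_replicas": blocks * replication,  # copies stored on DataNodes
        "raw_tb": dataset_tb * replication,      # disk consumed across the cluster
    }


if __name__ == "__main__":
    for size in (64, 128):
        print(size, plan_storage(100, block_size_mb=size))
```

Doubling the block size halves the number of blocks to manage without changing the raw capacity, which is one reason for raising the default on large data sets.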
Performance & Scale considerations
• Consider running each of the following on a dedicated,
standalone server that is not shared with other Hadoop
processes:
– NameNode, Secondary NameNode and/or Checkpoint
Node
– JobTracker and the HBase (or any DB) Master
• Consider a dedicated physical environment
Thank you!
Hadoop - The Good, The Bad and the Ugly
Guy Loewenberg
SUPPORTING SLIDES
HDFS Architecture
Improving RDBMS with Hadoop
• Accelerating nightly batch business processes.
• Storage of extremely high volumes of enterprise data
• Creation of automatic redundant backups
• Improving the scalability of applications
• Use of Java for data processing instead of SQL.
• Produce just-in-time feeds for dashboards and business intelligence
• Handling urgent, ad hoc requests for data
• Turning unstructured data into relational data
• Taking on tasks that require massive parallelism
• Moving existing algorithms, code, frameworks, and components to
a highly distributed computing environment.

Editor's Notes

  • #7: NameNode and DataNode are HDFS components that work in master/slave mode. The NameNode is the major component that controls HDFS, whereas the DataNodes perform block replication and read/write operations and carry the HDFS workload.
  • #8: JobTracker and TaskTracker are also components that work in master/slave mode. The JobTracker controls, among other things, the map and reduce tasks at individual nodes. The TaskTrackers run at the node level and maintain communication with the JobTracker for all nodes within the cluster.
  • #9: The main components include:
    – Hadoop: Java software framework to support data-intensive distributed applications
    – ZooKeeper: A highly reliable distributed coordination system
    – MapReduce: A flexible parallel data processing framework for large data sets
    – HDFS: Hadoop Distributed File System
    – Oozie: A MapReduce job scheduler
    – HBase: Key-value database
    – Hive: A high-level language built on top of MapReduce for analyzing large data sets
    – Pig: Enables the analysis of large data sets using Pig Latin. Pig Latin is a high-level language compiled into MapReduce for parallel data processing.