4. Hadoop: Guy Loewenberg

The Good, The Bad and the Ugly
How to tame the Big Data Beast
Guy Loewenberg
May 2013

Overview
• Big Data: A collection of data sets so large and complex that
it becomes difficult to process using on-hand database
management tools or traditional data processing applications
• Hadoop: A framework that allows distributed
processing of large data sets across clusters of
computers using a simple programming model
• 1000 Kilobytes = 1 Megabyte
• 1000 Megabytes = 1 Gigabyte
• 1000 Gigabytes = 1 Terabyte
• 1000 Terabytes = 1 Petabyte
• 1000 Petabytes = 1 Exabyte
• 1000 Exabytes = 1 Zettabyte
• 1000 Zettabytes = 1 Yottabyte
• 1000 Yottabytes = 1 Brontobyte
• 1000 Brontobytes = 1 Geopbyte
Scale labels from the slide graphic, smallest to largest: most US SME corporations; most US large corporations; leaders like Facebook & Google
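The unit ladder above is plain powers of 1000, which can be sketched in a few lines of Python. This is an illustrative helper, not part of Hadoop; the function names are hypothetical. Note that Brontobyte and Geopbyte are informal names rather than standard SI prefixes.

```python
# Decimal (1000-based) unit ladder, in the order listed on the slide.
UNITS = ["Kilobyte", "Megabyte", "Gigabyte", "Terabyte", "Petabyte",
         "Exabyte", "Zettabyte", "Yottabyte", "Brontobyte", "Geopbyte"]


def kilobytes_in(unit):
    """Return how many kilobytes make up one of the given unit."""
    return 1000 ** UNITS.index(unit)


def largest_unit(kilobytes):
    """Express a size given in kilobytes in the largest unit whose value is >= 1."""
    value = float(kilobytes)
    index = 0
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    return value, UNITS[index]


if __name__ == "__main__":
    print(kilobytes_in("Petabyte"))
    print(largest_unit(2_500_000))
```

Each step up the ladder multiplies by 1000, so one Petabyte is 1000 to the fourth power Kilobytes.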

Hadoop Basics
• Designed to scale
• Uses commodity hardware
• Processes data in batches
• Can process data at very large scale (petabytes)

Core Hadoop
• Core Hadoop is built from two main systems:
– Hadoop Distributed File System (HDFS)
– MapReduce programming framework

Hadoop architecture
• Hadoop Distributed File System (HDFS):
self-healing, high-bandwidth clustered
storage.
– The NameNode controls HDFS (it holds the
file system metadata), whereas the DataNodes
store the blocks, perform block replication,
serve read/write operations and drive the
workloads for HDFS
– They work in master/slave mode.
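The NameNode/DataNode split and the "self-healing" property can be illustrated with a toy model. This is a minimal sketch, not the HDFS implementation: real HDFS placement is rack-aware rather than round-robin, and the function names, node names, the 128 MB block size and the replication factor of 3 are assumptions chosen for illustration.

```python
import math


def place_blocks(file_size_mb, block_size_mb, datanodes, replication=3):
    """Split a file into blocks and assign each block's replicas to distinct
    DataNodes in round-robin order. The returned block-to-nodes map stands in
    for the metadata a NameNode keeps."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)]
                   for r in range(replication)]
        for block_id in range(num_blocks)
    }


def heal(block_map, failed_node, datanodes):
    """Re-replicate every block that lost a replica on the failed node onto a
    live node that does not already hold that block."""
    live = [node for node in datanodes if node != failed_node]
    healed = {}
    for block_id, holders in block_map.items():
        survivors = [node for node in holders if node != failed_node]
        if len(survivors) < len(holders):
            candidates = [node for node in live if node not in survivors]
            if candidates:
                survivors.append(candidates[0])
        healed[block_id] = survivors
    return healed


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]
    block_map = place_blocks(300, 128, nodes)
    print(block_map)
    print(heal(block_map, "dn2", nodes))
```

In this model the master only tracks where blocks live; the copies themselves sit on the slave nodes, which is why losing one DataNode does not lose data.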

Hadoop architecture
• MapReduce: Distributed fault-tolerant resource
management and scheduling coupled with a
scalable data programming abstraction.
– The JobTracker schedules
jobs and allocates tasks
to TaskTracker nodes, which
execute the requested map and
reduce processes
– They work in master/slave mode
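The map and reduce processes mentioned above can be sketched as a single-process word count. This is only a simulation of the programming model in plain Python, assuming whitespace-separated words; it does not use the Hadoop API, and the function names are hypothetical. On a real cluster the map and reduce calls would run as tasks on different TaskTracker nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(["big data", "Big beast"]))
```

Because each map call sees only one line and each reduce call sees only one key, the work can be spread over many nodes without shared state.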

Hadoop software architecture
• MapReduce: Parallel data processing framework for large data sets
• HDFS: Hadoop Distributed File System
• Oozie: MapReduce job scheduler
• HBase: Key-value database
• Pig: Large data sets analysis language
• Hive: High-level language for analyzing large data sets
• ZooKeeper: Distributed coordination system
• Solr / Lucene: Search engine, query engine library

What Hadoop can’t do
• Hadoop lets you perform batch analysis on whatever
data you have stored within Hadoop. That data does
not have to be structured
– Many solutions take advantage of Hadoop's low storage cost
to store structured data there instead of in an RDBMS. But
shifting data back and forth between Hadoop and an RDBMS
would be overkill.
– Transactional data is highly complex: a single transaction on an
e-commerce site can generate many steps that all have to be
executed quickly. That scenario is not ideal for Hadoop.
– Structured data sets that require very low latency are also
a poor fit.

Comparing RDBMS to MapReduce
Aspect     | RDBMS                 | MapReduce
Data size  | Gigabytes             | Petabytes
Access     | Interactive and batch | Batch
Structure  | Fixed schema          | Unstructured schema
Language   | SQL                   | Procedural (Java, C++, Ruby, etc.)
Integrity  | High                  | Low
Scaling    | Nonlinear             | Linear
Updates    | Read and write        | Write once, read many times
Latency    | Low                   | High

What Hadoop can do
• High data volume, stored in Hadoop, and queried at
length later using MapReduce functions
– index building
– pattern recognition
– creating recommendation engines
– sentiment analysis
• Hadoop should be integrated within your existing IT
infrastructure in order to capitalize on the countless
pieces of data that flow into your organization.
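Index building, the first use case listed above, maps naturally onto the same batch model: the map step emits (term, document) pairs and the reduce step collects the documents per term. The following is a minimal single-process sketch under that assumption, with hypothetical function names and document IDs; it is not Hadoop code.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def build_index(docs):
    """Reduce step: collect, for every term, the sorted list of documents
    that contain it (an inverted index)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term, emitted_id in map_doc(doc_id, text):
            postings[term].add(emitted_id)
    return {term: sorted(ids) for term, ids in postings.items()}


if __name__ == "__main__":
    docs = {
        "d1": "hadoop stores data",
        "d2": "hadoop processes data in batches",
    }
    print(build_index(docs)["hadoop"])
```

The data is written once and then scanned in full, which matches the high-volume, query-later pattern this slide describes.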

Hadoop Maturity?!
• Inaccessible to analysts without programming ability
• Clusters keep no record of who changed which record and when
it was changed
• Storage functionality that enterprises have always depended on
(snapshots, mirroring) is lacking in HDFS.
• Incompatibility with existing tools
• Data without structure has limited value and applying the
structure at query time requires a lot of Java code.
• Limited documentation
• Limited troubleshooting capabilities

Choosing your infrastructure
• Define what you want to achieve
– Proof of concept (POC)
– Scale (few, tens, hundreds)
– One-time, periodic, continuous
• Infrastructure design
– Servers, storage, network, rack-space
– Form a joint team of Hadoop app/dev and infrastructure
specialists (facilities/server/network) when building a solution
– Virtual machines vs. physical machines (I/O performance, high
CPU, network)

Choosing your infrastructure
• Network infrastructure
– Data movement between nodes (rack-awareness,
replication factor)
– Data between sites (Hosting/Service)
• Storage (architecture, disks)
– Local disks, JBOD
– Increase the default block size
• Operations
– Monitor
– Backup (configuration files, journal, Checkpoint …)
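The effect of the replication factor and the block size on infrastructure sizing is simple arithmetic, sketched below. This is a rough estimate under stated assumptions: decimal units, files much larger than one block, and illustrative values (100 TB of data, replication factor 3, block sizes of 64 MB and 128 MB). The function name and result keys are hypothetical.

```python
import math


def storage_plan(data_tb, replication=3, block_size_mb=128):
    """Estimate raw disk capacity and block counts for a given data volume.

    Assumes decimal units (1 TB = 1,000,000 MB) and treats the data as one
    large file, so small-file overhead is ignored.
    """
    data_mb = data_tb * 1000 * 1000
    blocks = math.ceil(data_mb / block_size_mb)
    return {
        "raw_tb": data_tb * replication,        # disk needed across the cluster
        "blocks": blocks,                       # entries the NameNode must track
        "block_replicas": blocks * replication  # copies stored on DataNodes
    }


if __name__ == "__main__":
    print(storage_plan(100, replication=3, block_size_mb=64))
    print(storage_plan(100, replication=3, block_size_mb=128))
```

Doubling the block size halves the number of blocks the NameNode has to track, while the raw capacity needed is driven by the replication factor alone.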

Performance & Scale considerations
• Consider running each of the following on a dedicated,
standalone server that is not shared with other Hadoop
processes:
– NameNode, Secondary NameNode and/or Checkpoint
Node
– JobTracker and the HBase (or any DB) master
• Consider a dedicated physical environment

Thank you!
Hadoop - The Good, The Bad and the Ugly
Guy Loewenberg

Improving RDBMS with Hadoop
• Accelerating nightly batch business processes
• Storing extremely high volumes of enterprise data
• Creating automatic redundant backups
• Improving the scalability of applications
• Using Java for data processing instead of SQL
• Producing just-in-time feeds for dashboards and business intelligence
• Handling urgent, ad hoc requests for data
• Turning unstructured data into relational data
• Taking on tasks that require massive parallelism
• Moving existing algorithms, code, frameworks, and components to
a highly distributed computing environment
