Big Data and Hadoop
What is the need for big data technology
when we already have robust, high-performing
relational database management systems?
Data was stored in a structured format: primary keys,
rows, columns, tuples, and foreign keys.
It served mainly transactional data analysis.
Later, data warehouses were used for offline data.
(Analysis done within the enterprise)
With massive use of the Internet and social networking (Facebook,
LinkedIn), data became less structured.
Data was stored on a central server.
‘Big Data’ is similar to ‘small data’, but
bigger
…and because it is bigger, it requires
different approaches:
› Techniques, tools and architecture
…with an aim to solve new problems
› …or old problems in a better way
Volume: data quantity
Velocity: data speed
Variety: data types
HADOOP
 Open-source data storage and processing API
 Massively scalable, automatically parallelizable
 Based on work from Google
GFS + MapReduce + BigTable
 Current Distributions based on Open Source and Vendor Work
Apache Hadoop
Cloudera – CDH4 w/ Impala
Hortonworks
MapR
AWS
Windows Azure HDInsight
HDFS (Storage): self-healing, high-bandwidth clustered storage
MapReduce (Processing): fault-tolerant distributed processing
HDFS is a file system written in Java
Sits on top of a native file system
Provides redundant storage for massive
amounts of data
Runs on cheap, commodity hardware
Data is split into blocks and stored on multiple
nodes in the cluster
› Each block is usually 64 MB or 128 MB (configurable)
Each block is replicated multiple times (configurable)
› Replicas stored on different data nodes
Best suited to large files, 100 MB+
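The split-and-replicate arithmetic above can be sketched in a few lines of Python. This is a simulation for illustration only, not the Hadoop API; the 128 MB block size and 3x replication are the configurable defaults mentioned above.

```python
import math

def hdfs_storage(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how a file is split and stored in HDFS."""
    # The file is split into fixed-size blocks (last one may be partial).
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # Every byte is stored `replication` times across different DataNodes.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, raw_storage_mb

# A 1 GB file with 128 MB blocks and 3x replication:
blocks, raw = hdfs_storage(1024)
print(blocks, raw)  # 8 blocks, 3072 MB of raw cluster storage
```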
Master Nodes:
• NameNode
› only 1 per cluster
› metadata server and database
› SecondaryNameNode helps with some housekeeping
• JobTracker
› only 1 per cluster
› job scheduler
Slave Nodes:
• DataNodes
› 1-4000 per cluster
› block data storage
• TaskTrackers
› 1-4000 per cluster
› task execution
A single NameNode stores all metadata
Filenames, locations on DataNodes of each
block, owner, group, etc.
All information maintained in RAM for fast lookup
File system metadata size is limited to the amount
of available RAM on the NameNode
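The RAM limit can be made concrete with a back-of-the-envelope estimate. The figure of roughly 150 bytes per in-memory object used below is a commonly quoted rule of thumb, not an exact number, so treat the result as an order-of-magnitude sketch.

```python
def namenode_heap_estimate(num_files, blocks_per_file, bytes_per_object=150):
    # Each file and each block is an in-memory object on the NameNode;
    # ~150 bytes per object is a rough rule of thumb (an assumption here).
    objects = num_files + num_files * blocks_per_file
    return objects * bytes_per_object

# 10 million single-block files:
heap_bytes = namenode_heap_estimate(10_000_000, 1)
print(heap_bytes / 1024**3)  # roughly 2.8 GB of NameNode heap
```

This is why HDFS favors a modest number of large files over millions of tiny ones: metadata, not disk space, becomes the bottleneck.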
DataNodes store file contents
Stored as opaque ‘blocks’ on the underlying
filesystem
Different blocks of the same file will be stored
on different DataNodes
Same block is stored on three (or more)
DataNodes for redundancy
DataNodes send heartbeats to NameNode
› After a period without any heartbeats, a DataNode is
assumed to be lost
› NameNode determines which blocks were on the lost
node
› NameNode finds other DataNodes with copies of
these blocks
› These DataNodes are instructed to copy the blocks
to other nodes
› Replication is actively maintained
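The re-replication decision described above can be sketched as a small simulation. This is plain Python, not NameNode internals; the block and node names are made up.

```python
def blocks_to_rereplicate(block_locations, live_nodes, replication=3):
    """For each block, report how many new copies are needed after node loss.
    block_locations: dict of block_id -> set of DataNode names holding a replica.
    live_nodes: set of DataNodes still sending heartbeats."""
    work = {}
    for block, nodes in block_locations.items():
        surviving = nodes & live_nodes            # replicas on live nodes
        missing = replication - len(surviving)
        if missing > 0:                           # under-replicated block
            work[block] = missing
    return work

locations = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn1", "dn4", "dn5"}}
# dn1's heartbeats stop, so the NameNode treats it as dead:
print(blocks_to_rereplicate(locations, {"dn2", "dn3", "dn4", "dn5"}))
# {'blk_1': 1, 'blk_2': 1}
```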
The Secondary NameNode is not a failover
NameNode
It performs memory-intensive administrative
functions for the NameNode
It should run on a separate machine
[Diagram: cluster layout. Each slave node runs a datanode daemon and a tasktracker on top of the local Linux file system; the namenode runs the namenode daemon; the job submission node runs the jobtracker.]
MapReduce
[Diagram: a MapReduce job is submitted by a client computer to the JobTracker on the master node; a TaskTracker on each slave node runs the task instances. In our case: circe.rc.usf.edu]
Preparing for MapReduce
Loading files: blocks of 64 MB or 128 MB
File system: native file system, HDFS, or cloud
Output: immutable
You define: Input, Map, Reduce, Output
Use Java or another programming language
Work with key-value pairs
Input: a set of key/value pairs
User supplies two functions:
› map(k,v) → list(k1,v1)
› reduce(k1, list(v1)) → v2
(k1,v1) is an intermediate key/value pair
Output is the set of (k1,v2) pairs
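This contract can be modeled in a few lines of plain Python. The sketch below is an in-process simulation of the semantics (map, shuffle/sort by key, reduce), not the Hadoop API.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal in-process model of the MapReduce contract:
    map(k, v) -> list of (k1, v1); reduce(k1, [v1, ...]) -> v2."""
    intermediate = []
    for k, v in records:
        intermediate.extend(map_fn(k, v))         # map phase
    intermediate.sort(key=itemgetter(0))          # shuffle/sort by key
    return [(k1, reduce_fn(k1, [v1 for _, v1 in group]))
            for k1, group in groupby(intermediate, key=itemgetter(0))]

# Sum values per key (keys in the input records are unused here):
out = run_mapreduce([(None, ("a", 2)), (None, ("a", 3)), (None, ("b", 1))],
                    lambda k, v: [v],
                    lambda k1, vals: sum(vals))
print(out)  # [('a', 5), ('b', 1)]
```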
InputFormat
Map function
Partitioner
Sorting & Merging
Combiner
Shuffling
Merging
Reduce function
OutputFormat
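The Partitioner stage above decides which reducer receives each intermediate key. A sketch of the default hash-partitioning idea follows; crc32 stands in for Java's hashCode() so the example stays deterministic across runs (Python's built-in hash() is randomized per process).

```python
from zlib import crc32

def default_partition(key, num_reducers):
    # Same idea as Hadoop's default HashPartitioner:
    # partition = hash(key) mod numReduceTasks.
    return crc32(key.encode("utf-8")) % num_reducers

# Every occurrence of a key routes to the same reducer,
# which is what lets reduce() see all values for that key together:
print(default_partition("hadoop", 4) == default_partition("hadoop", 4))  # True
```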
[Diagram: the Master Node runs the Name Node and Job Tracker; Slave Nodes 1-3 each run a Task Tracker and a Data Node.]
MapReduce Example -
WordCount
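A minimal in-process sketch of WordCount in Python. A real job would be written against the Hadoop MapReduce API, typically in Java; this only mirrors the map, shuffle, and reduce logic.

```python
from collections import defaultdict

def wordcount_map(_, line):
    # Map: emit (word, 1) for every word in the input line.
    return [(word, 1) for word in line.split()]

def wordcount_reduce(word, counts):
    # Reduce: sum all the 1s emitted for this word.
    return sum(counts)

def word_count(lines):
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):     # key = line offset, value = line
        for word, one in wordcount_map(offset, line):
            grouped[word].append(one)         # shuffle: group values by key
    return {w: wordcount_reduce(w, c) for w, c in grouped.items()}

print(word_count(["big data", "big hadoop"]))
# {'big': 2, 'data': 1, 'hadoop': 1}
```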
InputFormat:
› TextInputFormat
› KeyValueTextInputFormat
› SequenceFileInputFormat
OutputFormat:
› TextOutputFormat
› SequenceFileOutputFormat
Probably the most complex aspect of MapReduce!
Map side
› Map outputs are buffered in memory in a circular buffer
› When the buffer reaches a threshold, its contents are “spilled” to disk
› Spills are merged into a single, partitioned file (sorted within each
partition): the combiner runs here
Reduce side
› First, map outputs are copied over to the reducer machine
› “Sort” is a multi-pass merge of map outputs (happens in
memory and on disk): the combiner runs here
› The final merge pass goes directly into the reducer
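The map-side buffer/spill/merge behavior can be simulated in miniature. The buffer_limit threshold and the in-memory "spills" list are stand-ins for the real buffer size and on-disk spill files.

```python
import heapq

def map_side_shuffle(pairs, buffer_limit=3):
    """Sketch of the map-side spill/merge: buffer (key, value) pairs,
    sort and 'spill' each full buffer, then merge the sorted spills."""
    spills, buffer = [], []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= buffer_limit:      # buffer hits its threshold
            spills.append(sorted(buffer))    # spill to 'disk', sorted by key
            buffer = []
    if buffer:
        spills.append(sorted(buffer))        # final partial spill
    return list(heapq.merge(*spills))        # multi-pass merge into one run

pairs = [("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)]
print(map_side_shuffle(pairs))
# [('a', 1), ('a', 1), ('b', 1), ('b', 1), ('c', 1)]
```

Because each spill is already sorted, the merge step only has to interleave sorted runs, which is exactly why a heap-based multi-way merge works here.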
[Diagram: each Mapper writes into a circular buffer in memory, spills to disk, and merges spills (the Combiner runs on the merged spills); the resulting intermediate files on disk are shuffled to Reducers, which merge inputs from other mappers (the Combiner may run again) before reducing.]
Writable: defines a de/serialization protocol.
Every data type in Hadoop is a Writable.
WritableComparable: defines a sort order. All keys must be
of this type (but not values).
IntWritable, LongWritable, Text, …: concrete classes for
different data types.
SequenceFiles: a binary encoding of a sequence of
key/value pairs.
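The Writable idea (a compact, well-defined byte encoding per type) can be illustrated in Python. This is a simplification: IntWritable really is a 4-byte big-endian integer, but Hadoop's Text uses a variable-length prefix rather than the fixed 4-byte length prefix used here.

```python
import struct

def write_int(value):
    # IntWritable-style: a 4-byte big-endian integer.
    return struct.pack(">i", value)

def read_int(buf):
    return struct.unpack(">i", buf[:4])[0]

def write_text(s):
    # Text-style (simplified): a length prefix followed by UTF-8 bytes.
    # Real Hadoop Text uses a variable-length (vint) prefix instead.
    data = s.encode("utf-8")
    return struct.pack(">i", len(data)) + data

print(read_int(write_int(42)))  # 42
```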
[Code slides: the Map function, the Reduce function, and running the program as a MapReduce job.]
Hadoop Cluster
You
1. Load data into HDFS
2. Develop code locally
3. Submit MapReduce job
3a. Go back to Step 2
4. Retrieve data from HDFS
Applications for Big Data Analytics
Homeland Security
Finance
Smarter Healthcare
Multi-channel sales
Telecom
Manufacturing
Traffic Control
Trading Analytics
Fraud and Risk
Log Analysis
Search Quality
Retail: Churn, NBO
CASE STUDY 1:
Environment Change Prediction to Assist Farmers Using Hadoop
Thank you
Questions ?