Big Data and Hadoop
What is the need for big data technology
when we already have robust, high-performing
relational database management systems?
Data was stored in a structured format: primary keys,
rows, columns, tuples, and foreign keys.
It served mainly transactional data analysis.
Later, data warehouses were used for offline data.
(Analysis done within the enterprise)
With massive use of the Internet and social networking (Facebook,
LinkedIn), data became less structured.
Data was stored on a central server.
‘Big Data’ is similar to ‘small data’, but
bigger
…and because it is bigger, it requires
different approaches:
› Techniques, tools and architecture
…with an aim to solve new problems
› …or old problems in a better way
Volume: data quantity
Velocity: data speed
Variety: data types
HADOOP
 Open-source data storage and processing API
 Massively scalable, automatically parallelizable
 Based on work from Google
GFS + MapReduce + BigTable
 Current Distributions based on Open Source and Vendor Work
Apache Hadoop
Cloudera – CDH4 w/ Impala
Hortonworks
MapR
AWS
Windows Azure HDInsight
HDFS (Storage): self-healing, high-bandwidth clustered storage
MapReduce (Processing): fault-tolerant distributed processing
HDFS is a file system written in Java
Sits on top of a native file system
Provides redundant storage for massive
amounts of data
Runs on cheap, commodity hardware
Data is split into blocks and stored on multiple
nodes in the cluster
› Each block is usually 64 MB or 128 MB (configurable)
Each block is replicated multiple times (configurable)
› Replicas stored on different data nodes
Best suited to large files, 100 MB+
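The split-and-replicate arithmetic above can be sketched in a few lines of Python. This is a simulation for illustration only, not the Hadoop API; the 128 MB block size and 3x replication are the configurable defaults mentioned above.

```python
import math

def hdfs_storage(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how a file is split and stored in HDFS."""
    # The file is split into fixed-size blocks (last one may be partial).
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # Every byte is stored `replication` times across different DataNodes.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, raw_storage_mb

# A 1 GB file with 128 MB blocks and 3x replication:
blocks, raw = hdfs_storage(1024)
print(blocks, raw)  # 8 blocks, 3072 MB of raw cluster storage
```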
Master Nodes:
• NameNode
› only 1 per cluster
› metadata server and database
› SecondaryNameNode helps with some housekeeping
• JobTracker
› only 1 per cluster
› job scheduler
Slave Nodes:
• DataNodes
› 1-4000 per cluster
› block data storage
• TaskTrackers
› 1-4000 per cluster
› task execution
A single NameNode stores all metadata
Filenames, locations on DataNodes of each
block, owner, group, etc.
All information maintained in RAM for fast lookup
File system metadata size is limited to the amount
of available RAM on the NameNode
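The RAM limit can be made concrete with a back-of-the-envelope estimate. The figure of roughly 150 bytes per in-memory object used below is a commonly quoted rule of thumb, not an exact number, so treat the result as an order-of-magnitude sketch.

```python
def namenode_heap_estimate(num_files, blocks_per_file, bytes_per_object=150):
    # Each file and each block is an in-memory object on the NameNode;
    # ~150 bytes per object is a rough rule of thumb (an assumption here).
    objects = num_files + num_files * blocks_per_file
    return objects * bytes_per_object

# 10 million single-block files:
heap_bytes = namenode_heap_estimate(10_000_000, 1)
print(heap_bytes / 1024**3)  # roughly 2.8 GB of NameNode heap
```

This is why HDFS favors a modest number of large files over millions of tiny ones: metadata, not disk space, becomes the bottleneck.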
DataNodes store file contents
Stored as opaque ‘blocks’ on the underlying
filesystem
Different blocks of the same file will be stored
on different DataNodes
Same block is stored on three (or more)
DataNodes for redundancy
DataNodes send heartbeats to NameNode
› After a period without any heartbeats, a DataNode is
assumed to be lost
› NameNode determines which blocks were on the lost
node
› NameNode finds other DataNodes with copies of
these blocks
› These DataNodes are instructed to copy the blocks
to other nodes
› Replication is actively maintained
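The re-replication decision described above can be sketched as a small simulation. This is plain Python, not NameNode internals; the block and node names are made up.

```python
def blocks_to_rereplicate(block_locations, live_nodes, replication=3):
    """For each block, report how many new copies are needed after node loss.
    block_locations: dict of block_id -> set of DataNode names holding a replica.
    live_nodes: set of DataNodes still sending heartbeats."""
    work = {}
    for block, nodes in block_locations.items():
        surviving = nodes & live_nodes            # replicas on live nodes
        missing = replication - len(surviving)
        if missing > 0:                           # under-replicated block
            work[block] = missing
    return work

locations = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn1", "dn4", "dn5"}}
# dn1's heartbeats stop, so the NameNode treats it as dead:
print(blocks_to_rereplicate(locations, {"dn2", "dn3", "dn4", "dn5"}))
# {'blk_1': 1, 'blk_2': 1}
```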
The Secondary NameNode is not a failover
NameNode
It performs memory-intensive administrative
functions for the NameNode
It should run on a separate machine
[Diagram: cluster layout. Each slave node runs a datanode daemon and a tasktracker on top of the local Linux file system; the namenode runs the namenode daemon; the job submission node runs the jobtracker.]
MapReduce
[Diagram: a MapReduce job is submitted by a client computer to the JobTracker on the master node; a TaskTracker on each slave node runs the task instances. In our case: circe.rc.usf.edu]
Preparing for MapReduce
Loading files: blocks of 64 MB or 128 MB
File system: native file system, HDFS, or cloud
Output: immutable
You define: Input, Map, Reduce, Output
Use Java or another programming language
Work with key-value pairs
Input: a set of key/value pairs
User supplies two functions:
› map(k,v) → list(k1,v1)
› reduce(k1, list(v1)) → v2
(k1,v1) is an intermediate key/value pair
Output is the set of (k1,v2) pairs
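This contract can be modeled in a few lines of plain Python. The sketch below is an in-process simulation of the semantics (map, shuffle/sort by key, reduce), not the Hadoop API.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal in-process model of the MapReduce contract:
    map(k, v) -> list of (k1, v1); reduce(k1, [v1, ...]) -> v2."""
    intermediate = []
    for k, v in records:
        intermediate.extend(map_fn(k, v))         # map phase
    intermediate.sort(key=itemgetter(0))          # shuffle/sort by key
    return [(k1, reduce_fn(k1, [v1 for _, v1 in group]))
            for k1, group in groupby(intermediate, key=itemgetter(0))]

# Sum values per key (keys in the input records are unused here):
out = run_mapreduce([(None, ("a", 2)), (None, ("a", 3)), (None, ("b", 1))],
                    lambda k, v: [v],
                    lambda k1, vals: sum(vals))
print(out)  # [('a', 5), ('b', 1)]
```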
InputFormat
Map function
Partitioner
Sorting & Merging
Combiner
Shuffling
Merging
Reduce function
OutputFormat
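The Partitioner stage above decides which reducer receives each intermediate key. A sketch of the default hash-partitioning idea follows; crc32 stands in for Java's hashCode() so the example stays deterministic across runs (Python's built-in hash() is randomized per process).

```python
from zlib import crc32

def default_partition(key, num_reducers):
    # Same idea as Hadoop's default HashPartitioner:
    # partition = hash(key) mod numReduceTasks.
    return crc32(key.encode("utf-8")) % num_reducers

# Every occurrence of a key routes to the same reducer,
# which is what lets reduce() see all values for that key together:
print(default_partition("hadoop", 4) == default_partition("hadoop", 4))  # True
```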
[Diagram: the Master Node runs the Name Node and Job Tracker; Slave Nodes 1-3 each run a Task Tracker and a Data Node.]
MapReduce Example -
WordCount
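A minimal in-process sketch of WordCount in Python. A real job would be written against the Hadoop MapReduce API, typically in Java; this only mirrors the map, shuffle, and reduce logic.

```python
from collections import defaultdict

def wordcount_map(_, line):
    # Map: emit (word, 1) for every word in the input line.
    return [(word, 1) for word in line.split()]

def wordcount_reduce(word, counts):
    # Reduce: sum all the 1s emitted for this word.
    return sum(counts)

def word_count(lines):
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):     # key = line offset, value = line
        for word, one in wordcount_map(offset, line):
            grouped[word].append(one)         # shuffle: group values by key
    return {w: wordcount_reduce(w, c) for w, c in grouped.items()}

print(word_count(["big data", "big hadoop"]))
# {'big': 2, 'data': 1, 'hadoop': 1}
```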
InputFormat:
› TextInputFormat
› KeyValueTextInputFormat
› SequenceFileInputFormat
OutputFormat:
› TextOutputFormat
› SequenceFileOutputFormat
Probably the most complex aspect of MapReduce!
Map side
› Map outputs are buffered in memory in a circular buffer
› When the buffer reaches a threshold, its contents are “spilled” to disk
› Spills are merged into a single, partitioned file (sorted within each
partition): the combiner runs here
Reduce side
› First, map outputs are copied over to the reducer machine
› “Sort” is a multi-pass merge of map outputs (happens in
memory and on disk): the combiner runs here
› The final merge pass goes directly into the reducer
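The map-side buffer/spill/merge behavior can be simulated in miniature. The buffer_limit threshold and the in-memory "spills" list are stand-ins for the real buffer size and on-disk spill files.

```python
import heapq

def map_side_shuffle(pairs, buffer_limit=3):
    """Sketch of the map-side spill/merge: buffer (key, value) pairs,
    sort and 'spill' each full buffer, then merge the sorted spills."""
    spills, buffer = [], []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= buffer_limit:      # buffer hits its threshold
            spills.append(sorted(buffer))    # spill to 'disk', sorted by key
            buffer = []
    if buffer:
        spills.append(sorted(buffer))        # final partial spill
    return list(heapq.merge(*spills))        # multi-pass merge into one run

pairs = [("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)]
print(map_side_shuffle(pairs))
# [('a', 1), ('a', 1), ('b', 1), ('b', 1), ('c', 1)]
```

Because each spill is already sorted, the merge step only has to interleave sorted runs, which is exactly why a heap-based multi-way merge works here.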
[Diagram: each Mapper writes into a circular buffer in memory, spills to disk, and merges spills (the Combiner runs on the merged spills); the resulting intermediate files on disk are shuffled to Reducers, which merge inputs from other mappers (the Combiner may run again) before reducing.]
Writable: defines a de/serialization protocol.
Every data type in Hadoop is a Writable.
WritableComparable: defines a sort order. All keys must be
of this type (but not values).
IntWritable, LongWritable, Text, …: concrete classes for
different data types.
SequenceFiles: a binary encoding of a sequence of
key/value pairs.
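The Writable idea (a compact, well-defined byte encoding per type) can be illustrated in Python. This is a simplification: IntWritable really is a 4-byte big-endian integer, but Hadoop's Text uses a variable-length prefix rather than the fixed 4-byte length prefix used here.

```python
import struct

def write_int(value):
    # IntWritable-style: a 4-byte big-endian integer.
    return struct.pack(">i", value)

def read_int(buf):
    return struct.unpack(">i", buf[:4])[0]

def write_text(s):
    # Text-style (simplified): a length prefix followed by UTF-8 bytes.
    # Real Hadoop Text uses a variable-length (vint) prefix instead.
    data = s.encode("utf-8")
    return struct.pack(">i", len(data)) + data

print(read_int(write_int(42)))  # 42
```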
[Code slides: the Map function, the Reduce function, and running the program as a MapReduce job.]
Hadoop Cluster
You
1. Load data into HDFS
2. Develop code locally
3. Submit MapReduce job
3a. Go back to Step 2
4. Retrieve data from HDFS
Applications for Big Data Analytics
Homeland Security
Finance
Smarter Healthcare
Multi-channel sales
Telecom
Manufacturing
Traffic Control
Trading Analytics
Fraud and Risk
Log Analysis
Search Quality
Retail: Churn, NBO
CASE STUDY 1:
Environment Change Prediction to Assist Farmers Using Hadoop
Thank you
Questions ?