Simplified Data Processing
On Large Clusters
1
Presented By
Dipen Shah
110420107064
Harsh Kevadia
110420107049
Nancy Sukhadia
110420107025
2
What Is a Cluster?
 A computer cluster consists of a set of loosely connected or
tightly connected computers that work together so that in many
respects they can be viewed as a single system.
 The components of a cluster are usually connected to each
other through fast local area networks.
 Clusters are usually deployed to improve performance and
availability over that of a single computer.
3
Introduction
 On the web, huge amounts of data, so-called Big Data, are being
stored, processed, and retrieved within a few milliseconds.
 Big Data cannot be stored, processed, and retrieved on a single
machine.
4
Contd..
 How do large IT companies store their data, and how is that data
processed and retrieved?
 Big Data requires a great deal of processing power for computation
and storage.
5
How To Divide a Large Input Set Into
Smaller Input Sets?
 The master node takes the input, divides it into smaller sub-problems,
and distributes them to worker nodes.
 The worker node processes the smaller problem, and passes the
answer back to its master node.
 This division can cause problems for data that must be processed in
sequence.
 The output of one item may be the input of another.
 The approach is only suitable for data items that are independent of
each other, so that each can be processed without waiting for the
output of the previous item.
6
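The splitting step above can be sketched in a few lines of Python. This is a minimal illustration, not the actual master-node code: the name `split_input` and the `chunk_size` parameter are assumptions, and a real master would split by bytes or files rather than list items.

```python
def split_input(data, chunk_size):
    """Divide `data` into independent sub-problems of at most chunk_size items each."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Ten input items split into chunks of at most four for the worker nodes.
chunks = split_input(list(range(10)), 4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each chunk can then be handed to a different worker, which is exactly why the technique only works when chunks are independent.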
How To Divide Work Among Worker
Nodes In the Same Cluster?
 The master node estimates the time required for a typical
computation and also considers the priority of each processing task.
 It checks every worker node’s schedule and processing speed.
 After analysing this data, it assigns the work to a worker node.
7
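One way to picture this scheduling decision: a minimal sketch, assuming each worker is described by its currently queued work and its processing speed (both hypothetical metrics), in which the master picks the worker with the earliest estimated finish time.

```python
def pick_worker(workers, task_cost):
    """Pick the worker with the earliest estimated finish time.

    `workers` maps a worker name to (queued_work, speed); the estimated
    finish time for a new task is (queued_work + task_cost) / speed.
    """
    return min(workers, key=lambda w: (workers[w][0] + task_cost) / workers[w][1])

# w1 is busy, w3 is idle but slow, w2 is idle and fast.
workers = {"w1": (10.0, 1.0), "w2": (2.0, 1.0), "w3": (2.0, 0.2)}
print(pick_worker(workers, 5.0))  # w2: (2 + 5) / 1.0 = 7 beats 15 and 35
```

Task priority could be folded in as a weight on `task_cost`; the key idea is that the master combines schedule and speed before assigning work.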
Dividing Input Creates Problems
And Affects The Output.
 Large input sets may be interrelated, or the sequence of inputs may
matter, so that the inputs must be processed in the given order.
 We need to develop an algorithm that takes care of all these
problems.
8
Dividing Input So That Optimized
Performance Can Be Achieved.
 How do we divide a problem into sub-problems so that we get
optimized performance? Optimized here means minimum time
required, minimum resources allocated to the process, and efficient
coordination between the worker nodes in the cluster.
9
What If Worker Node Fails?
 The master node divides work among the workers and pings each
worker node periodically.
 What happens if a worker node does not respond, or fails?
10
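The periodic ping described above amounts to a heartbeat timeout check. A minimal sketch, where the function name `find_failed` and the three-second timeout are illustrative assumptions rather than anything from the MapReduce paper:

```python
def find_failed(last_heartbeat, timeout, now):
    """Return workers whose last heartbeat is older than `timeout` seconds."""
    return [w for w, t in last_heartbeat.items() if now - t > timeout]

# w2 last answered 5 seconds ago, past the 3-second timeout.
heartbeats = {"w1": 100.0, "w2": 95.0}
print(find_failed(heartbeats, timeout=3.0, now=100.0))  # ['w2']
```

Once a worker is flagged as failed, the master can reschedule its tasks on a healthy worker.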
What Happens When the Master Node
Fails?
 There is only a single master.
 All computation is aborted if the master node fails.
11
Programming Model
 The computation takes a set of input key/value pairs, and produces a set of
output key/value pairs.
 The user of the MapReduce library expresses the computation as two
functions: Map and Reduce.
 Map, written by the user, takes an input pair and produces a set of
intermediate key/value pairs.
 The MapReduce library groups together all intermediate values associated
with the same intermediate key I and passes them to the Reduce function.
 The Reduce function, also written by the user, accepts an intermediate key I
and a set of values for that key. It merges together these values to form a
possibly smaller set of values.
 Typically just zero or one output value is produced per Reduce invocation.
 The intermediate values are supplied to the user's reduce function via an
iterator.
 This allows us to handle lists of values that are too large to fit in memory.
12
Example:
 Consider the problem of counting the number of occurrences of each word in a large collection of
documents.
 The user would write code similar to the following pseudo-code:
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
13
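The word-count pseudo-code can be run end to end in a few lines of Python. This is a single-machine sketch of the model, not the MapReduce library itself; the names `map_fn`, `reduce_fn`, and `map_reduce` are illustrative, and the grouping of intermediate values happens in memory.

```python
from collections import defaultdict

def map_fn(name, contents):
    # Emit an intermediate (word, 1) pair for every word in the document.
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    # Sum all counts emitted for this word.
    return sum(counts)

def map_reduce(documents):
    intermediate = defaultdict(list)  # group intermediate values by key
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            intermediate[key].append(value)
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

docs = {"d1": "the quick fox", "d2": "the lazy dog"}
print(map_reduce(docs))  # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```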
Cont.
• The map function emits each word plus an associated count of occurrences.
• The reduce function sums together all counts emitted for a particular word.
• In addition, the user writes code to fill in a mapreduce specification object with the
names of the input and output files, and optional tuning parameters.
• The user then invokes the MapReduce function, passing it the specification object.
• The user's code is linked together with the MapReduce library (implemented in
C++).
14
Types:
 Even though the previous pseudo-code is written in terms of string
inputs and outputs, conceptually the map and reduce functions
supplied by the user have associated types:
map (k1,v1) -> list(k2,v2)
reduce (k2,list(v2)) -> list(v2)
 I.e., the input keys and values are drawn from a different domain
than the output keys and values.
 Furthermore, the intermediate keys and values are from the same
domain as the output keys and values.
 Our C++ implementation passes strings to and from the user-defined
functions and leaves it to the user code to convert between strings
and appropriate types.
15
MapReduce: Examples
 Distributed Grep
 Count of URL Access frequency
 Reverse Web Link graph
 Term Vector per host
 Inverted Index
16
Implementation
 Assumption
 Execution Overview
 Master Data Structure
 Fault Tolerance
 Implementation Issues
17
Assumption
 Cluster of commodity PC configurations
 Networking
 Failures
 Storage
 Job scheduling system
18
Execution Overview
1. Split the input set into smaller pieces
2. The program is copied onto the cluster machines; the master copy assigns work to the workers
3. Each map worker reads its input split and produces intermediate output
4. The worker saves the intermediate output on its local disk
5. Reduce workers collect the intermediate data from the local disks
6. The data is sorted (externally, because it is too large for memory) before reducing
7. The master creates the output files and wakes up the user program
19
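The seven steps above can be simulated in a single process. A minimal sketch (all names are illustrative assumptions) in which an in-memory sort stands in for step 6, where a real system would need an external sort:

```python
from itertools import groupby

def run_job(splits, map_fn, reduce_fn):
    # Steps 3-4: each map worker processes one input split and saves
    # its intermediate pairs "locally" (here, one list per worker).
    local_outputs = [map_fn(split) for split in splits]
    # Step 5: reduce workers collect the intermediate pairs from every map worker.
    pairs = [pair for output in local_outputs for pair in output]
    # Step 6: sort by key so all values for a given key are adjacent.
    pairs.sort(key=lambda kv: kv[0])
    # Step 7: produce the final output (here, a dict returned to the caller).
    return {key: reduce_fn(key, [v for _, v in group])
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

# Word count over two input splits.
result = run_job(["a b a", "b c"],
                 lambda s: [(w, 1) for w in s.split()],
                 lambda k, vs: sum(vs))
print(result)  # {'a': 2, 'b': 2, 'c': 1}
```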
Function: (execution-flow diagram) 20
Master Data Structure
 State and identity of each worker machine
 Locations of the intermediate files
 Updates to those locations as map tasks complete
 Sizes of the intermediate files
21
Fault Tolerance
 Worker Failure
 Master Failure
 Master Election
1. Manually
2. Highest IP Address
3. Highest MAC Address
22
Implementation Issues
 Backup tasks
 Network Bandwidth
 Locality
23
Conclusion
 We attribute this success to several reasons. First, the model is easy to
use, even for programmers without experience with parallel and
distributed systems, since it hides the details of parallelization, fault
tolerance, locality optimization, and load balancing. Second, a
large variety of problems are easily expressible.
 Google uses MapReduce for its web search service, for sorting, for
data mining, for machine learning, and for many other systems.
24
Thank You
Q/A!
26