By: Ahmed Gamil
 Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
 Archives: Scanned documents, statements, medical records, e-mails, etc.
 Docs: XLS, PDF, CSV, HTML, JSON, etc.
 Business Apps: CRM, ERP systems, HR, project management, etc.
 Media: Images, video, audio, etc.
 Social Networks: Twitter, Facebook, Google+, LinkedIn, etc.
 Public Web: Wikipedia, news, weather, public finance, etc.
 Data Stores: RDBMS, NoSQL, Hadoop, file systems, etc.
 Machine Log Data: Application logs, event logs, server data, CDRs, clickstream data, etc.
 Sensor Data: Smart electric meters, medical devices, car sensors, road cameras, etc.
The three V's of Big Data:
• Volume: data quantity
• Velocity: data speed
• Variety: data types
• Facebook ingests 500 terabytes of new data every day.
• A Boeing 737 can generate 240 terabytes of data in a single flight.
• Smartphones, the data they create and consume, and sensors embedded in everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.
 Clickstreams and ad impressions capture user behavior at millions of events per second.
 High-frequency stock trading algorithms reflect market changes within microseconds.
 Machine-to-machine processes exchange data between billions of devices.
 Infrastructure and sensors generate massive log data in real time.
 Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
 Big Data isn't just numbers, dates, and strings. Big
Data is also geospatial data, 3D data, audio and
video, and unstructured text, including log files
and social media.
 Traditional database systems were designed to handle smaller volumes of structured data, fewer updates, and a predictable, consistent data structure.
 Big Data analysis must include many different types of data.
 Every day we create 2.5 quintillion (10^18) bytes of data.
 90% of the data in the world today has been created in the last two years alone.
k  kilo   10^3  = 1000^1    2^10 = 1024^1
M  mega   10^6  = 1000^2    2^20 = 1024^2
G  giga   10^9  = 1000^3    2^30 = 1024^3
T  tera   10^12 = 1000^4    2^40 = 1024^4
P  peta   10^15 = 1000^5    2^50 = 1024^5
E  exa    10^18 = 1000^6    2^60 = 1024^6
Z  zetta  10^21 = 1000^7    2^70 = 1024^7
Y  yotta  10^24 = 1000^8    2^80 = 1024^8
 Examining large amounts of data
 Extracting appropriate information
 Identifying hidden patterns and unknown correlations
 Competitive advantage
 Better business decisions: strategic and operational
 Effective marketing, customer satisfaction, increased revenue
 Data Storage (a standard disk is 1 TB)
 Data Processing
 Data Transfer (a standard disk reads at roughly 100 MB/s; see the quick arithmetic below)
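To see why data transfer is the binding constraint, a quick back-of-the-envelope calculation (using the 1 TB disk and 100 MB/s figures above) helps: scanning one 1 TB disk sequentially takes 1,000,000 MB ÷ 100 MB/s = 10,000 s, or roughly 2.8 hours. Split the same terabyte across 100 disks read in parallel and the scan drops to about 100 s, which is exactly the strategy described next.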
• Fragment data into small pieces
• Process the pieces in parallel
• Collect the results (see the sketch below)
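As a toy illustration of this strategy on a single machine (plain Java threads standing in for cluster nodes; all names here are illustrative and not part of Hadoop), consider a minimal fragment-process-collect sketch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FragmentAndCollect {
    public static void main(String[] args) throws Exception {
        int[] data = new int[1_000_000];   // stand-in for a "big" data set
        Arrays.fill(data, 1);

        int pieces = 8;                    // 1. fragment into small pieces
        int chunk = data.length / pieces;
        ExecutorService pool = Executors.newFixedThreadPool(pieces);

        List<Future<Long>> partials = new ArrayList<>();
        for (int i = 0; i < pieces; i++) {
            final int from = i * chunk;
            final int to = (i == pieces - 1) ? data.length : from + chunk;
            Callable<Long> task = () -> {  // 2. process each piece in parallel
                long sum = 0;
                for (int j = from; j < to; j++) sum += data[j];
                return sum;
            };
            partials.add(pool.submit(task));
        }

        long total = 0;
        for (Future<Long> f : partials) {
            total += f.get();              // 3. collect the partial results
        }
        pool.shutdown();
        System.out.println("total = " + total);  // prints total = 1000000
    }
}
```

Hadoop applies the same pattern, but across machines rather than threads, and with the framework handling fragmentation, scheduling, and failures.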
 Open-source software framework for
distributed storage and distributed
processing of very large data sets on
computer clusters.
 Google File System – 2003
 MapReduce – 2004
 Hadoop 0.1.0 released – 2006
 Hadoop Release 2.6.4 – 2016
 Storage part:
◦ Hadoop Distributed File System (HDFS)
 Processing part:
◦ MapReduce
 Distributed
 Scalable
 Portable file-system
 Written in Java
 An HDFS cluster consists of:
◦ A single NameNode: a master server that manages the file system namespace and regulates access to files by clients.
◦ A number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on.
 An HDFS file consists of a number of blocks.
 Each block is typically 64 MB (the default in early Hadoop releases; later 2.x releases default to 128 MB).
 Each block is replicated a specified number of times (the sketch after this list shows how both values can be set).
 The replicas of a block are stored on different DataNodes, chosen to balance load across DataNodes and to provide both fast transfer and resiliency if a rack fails.
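Both the block size and the replication factor are configurable. A minimal sketch using Hadoop's standard Java FileSystem API (dfs.replication and dfs.blocksize are the usual HDFS configuration keys; the path and the chosen values are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettingsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide defaults (normally set in hdfs-site.xml):
        conf.set("dfs.replication", "3");                   // three copies of each block
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);  // 128 MB blocks

        FileSystem fs = FileSystem.get(conf);

        // Per-file override: this create() overload takes the buffer size,
        // replication factor, and block size explicitly.
        FSDataOutputStream out = fs.create(
                new Path("/user/demo/big.log"),  // illustrative path
                true,                            // overwrite if present
                4096,                            // I/O buffer size
                (short) 2,                       // replication for this file only
                64L * 1024 * 1024);              // 64 MB blocks for this file only
        out.writeUTF("hello HDFS");
        out.close();
        fs.close();
    }
}
```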
 A standard directory structure is used in HDFS.
 HDFS files exist in directories that may in turn be
sub-directories of other directories, and so on.
 There is no concept of a current directory within
HDFS.
 The NameNode executes HDFS file system
namespace operations like opening, closing, and
renaming files and directories. It also determines
the mapping of blocks to DataNodes.
 The list of blocks belonging to each HDFS file, the current locations of the block replicas on the DataNodes, the state of the file, and the access control information make up the metadata for the cluster, which is managed by the NameNode.
 DataNodes are responsible for serving read and write requests from the file system's clients. DataNodes also perform block replica creation, deletion, and replication upon instruction from the NameNode (a client-side sketch follows).
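To make this division of labor concrete, here is a hedged sketch of a client asking the NameNode for a file's metadata, including where each block replica lives, and then streaming the bytes themselves from DataNodes, using Hadoop's standard FileSystem API (the path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/user/demo/big.log");  // illustrative path

        // Metadata comes from the NameNode: the file's status and the
        // DataNodes holding each block replica.
        FileStatus st = fs.getFileStatus(p);
        for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println("offset " + b.getOffset()
                    + ", length " + b.getLength()
                    + ", hosts " + String.join(",", b.getHosts()));
        }

        // The bytes themselves are streamed directly from DataNodes:
        try (FSDataInputStream in = fs.open(p)) {
            System.out.println("first byte: " + in.read());
        }
        fs.close();
    }
}
```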
 MapReduce is the heart of Hadoop.
 It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.
 Provides a parallel programming model.
 Moves computation to where the data is.
 Handles scheduling, fault tolerance.
 Status reporting and monitoring.
 Introduced by Google.
 The data set should be big enough that splitting it up improves overall performance rather than hurting it.
 The computations are generally not dependent on external input.
 The calculations/processing that run on one subset of the data can be merged with the results from another subset.
 The resultant data set should be smaller than the initial data set.
 Map
◦ Takes an input pair and produces intermediate key/value pairs.
◦ All intermediate pairs are then grouped according to a common intermediate key.
 Reduce
◦ Accepts an intermediate key and the set of values for that key.
◦ Merges these values together to form a possibly smaller set of values.
◦ The Reduce function typically produces zero or one output value per invocation (the toy example after this list makes the contract concrete).
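As a toy, in-memory illustration of this contract in plain Java (not Hadoop code; the input strings are made up): map each input to intermediate (key, value) pairs, group by key, then reduce each group to a single value.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceToy {
    public static void main(String[] args) {
        List<String> lines = List.of("SQL DW SQL", "BI SQL");

        // Map: each line -> intermediate (word, 1) pairs.
        // Group: collect all values that share an intermediate key.
        Map<String, List<Integer>> grouped = lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(
                        word -> word,
                        Collectors.mapping(word -> 1, Collectors.toList())));

        // Reduce: merge each key's values into one output value.
        grouped.forEach((word, ones) -> System.out.println(
                word + " -> " + ones.stream().mapToInt(Integer::intValue).sum()));
        // prints: SQL -> 3, DW -> 1, BI -> 1 (in some order)
    }
}
```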
 User Program
 Map Workers
 Reduce Workers
 Return to the User Program
 Execution typically begins with the user program.
 The MapReduce libraries imported into the program split the input data set.
 Every machine in the cluster has a separate instance of the program running on it.
 One of the copies of the program is special: it is called the Master.
 The rest of the instances work under the master and are referred to as Workers.
 There are X map tasks and Y reduce tasks to perform. The Master picks idle workers and assigns each of them a map task or a reduce task.
 The worker that is assigned a Map task takes its split of the input data and generates a key/value pair for each segment of input data.
 The worker then invokes the user-defined Map function.
 The resulting values of the Map function are buffered in memory. The data in these temporary buffers is later written to disk.
 The on-disk locations of these contents are passed to the Master.
 The Master then finds idle workers and passes these locations to them to perform the Reduce task.
 A reduce worker, when notified by the Master, uses remote procedure calls to read the buffered data from the Map workers.
 When a reduce worker has read all the intermediate data, it groups together all the data with the same intermediate key.
 Many different keys may map to the same reduce task because of the parallel nature of the processing; hence the sorting step above is required.
 Each unique key and its data are passed by the reduce worker to the user's Reduce function.
 The output of the Reduce function is written to an output file, usually on a distributed file system.
 After all Map and Reduce tasks have completed, the Master returns control to the user program.
(Diagram slides: the word-count example's data flow from input to output.)
 Input: In this step, the sample file is fed to MapReduce.
 Split: In this step, Hadoop splits our sample input file into four parts, each part made up of one line from the input file.
 Map: In this step, each split is fed to a mapper: the map() function containing the logic for how to process the input data, which in our case is the line of text present in the split.
 Combine: This is an optional step, often used to improve performance by reducing the amount of data transferred across the network. It is essentially the same as the reducer (the reduce() function) and acts on the output of each mapper. In our example, the key/value pairs from the first mapper, "(SQL, 1), (DW, 1), (SQL, 1)", are combined, and the output of the corresponding combiner becomes "(SQL, 2), (DW, 1)".
 Shuffle and Sort: In this step, the output of all the mappers is collected, shuffled, sorted, and arranged to be sent to the reducers.
 Reduce: In this step, the collective data from the various mappers, after being shuffled and sorted, is aggregated, and the word counts are produced as (key, value) pairs such as (BI, 1), (DW, 2), (SQL, 5), and so on.
 Output: In this step, the output of the reducer is written to a file on HDFS. A complete sketch of this word-count job follows.
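The classic Hadoop word-count job, written against the org.apache.hadoop.mapreduce API, exercises all of these steps. This is a minimal sketch in the style of the standard Hadoop tutorial, not code taken from the original slides:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce step (also reused as the optional Combine step): sum counts per word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // the optional Combine step
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Compiled into a jar, it would be launched with something like `hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output` (paths illustrative); the output directory on HDFS then holds the (word, count) pairs described above.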
Thank You