By: Ahmed Gamil
 Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
 Archives: Scanned documents, statements, medical records, e-mails, etc.
 Docs: XLS, PDF, CSV, HTML, JSON, etc.
 Business Apps: CRM, ERP systems, HR, project management, etc.
 Media: Images, video, audio, etc.
 Social Networks: Twitter, Facebook, Google+, LinkedIn, etc.
 Public Web: Wikipedia, news, weather, public finance, etc.
 Data Stores: RDBMS, NoSQL, Hadoop, file systems, etc.
 Machine Log Data: Application logs, event logs, server data, CDRs, clickstream data, etc.
 Sensor Data: Smart electric meters, medical devices, car sensors, road cameras, etc.
The three V's of Big Data:
• Volume: data quantity
• Velocity: data speed
• Variety: data types
• Facebook ingests 500 terabytes of new data every day.
• A Boeing 737 can generate 240 terabytes of data in a single flight.
• Smartphones, the data they create and consume, and sensors embedded in everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.
 Clickstreams and ad impressions capture user behavior at millions of events per second.
 High-frequency stock trading algorithms reflect market changes within microseconds.
 Machine-to-machine processes exchange data between billions of devices.
 Infrastructure and sensors generate massive log data in real time.
 Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
 Big Data isn't just numbers, dates, and strings. Big
Data is also geospatial data, 3D data, audio and
video, and unstructured text, including log files
and social media.
 Traditional database systems were designed to handle smaller volumes of structured data, fewer updates, and a predictable, consistent data structure.
 Big Data analysis must include many different types of data.
 Every day we create 2.5 quintillion (10^18) bytes of data.
 90% of the data in the world today has been created in the last two years alone.
k  kilo   10^3  = 1000^1    2^10 = 1024^1
M  mega   10^6  = 1000^2    2^20 = 1024^2
G  giga   10^9  = 1000^3    2^30 = 1024^3
T  tera   10^12 = 1000^4    2^40 = 1024^4
P  peta   10^15 = 1000^5    2^50 = 1024^5
E  exa    10^18 = 1000^6    2^60 = 1024^6
Z  zetta  10^21 = 1000^7    2^70 = 1024^7
Y  yotta  10^24 = 1000^8    2^80 = 1024^8
 Examining large amounts of data
 Extracting appropriate information
 Identifying hidden patterns and unknown correlations
 Competitive advantage
 Better business decisions: strategic and operational
 Effective marketing, customer satisfaction, increased revenue
 Data Storage (a standard disk is 1 TB)
 Data Processing
 Data Transfer (a standard disk reads at roughly 100 MB/s; see the quick arithmetic below)
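To see why data transfer is the binding constraint, a quick back-of-the-envelope calculation (using the 1 TB disk and 100 MB/s figures above) helps: scanning one 1 TB disk sequentially takes 1,000,000 MB ÷ 100 MB/s = 10,000 s, or roughly 2.8 hours. Split the same terabyte across 100 disks read in parallel and the scan drops to about 100 s, which is exactly the strategy described next.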
• Fragment data into small pieces
• Process the pieces in parallel
• Collect the results (see the sketch below)
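As a toy illustration of this strategy on a single machine (plain Java threads standing in for cluster nodes; all names here are illustrative and not part of Hadoop), consider a minimal fragment-process-collect sketch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FragmentAndCollect {
    public static void main(String[] args) throws Exception {
        int[] data = new int[1_000_000];   // stand-in for a "big" data set
        Arrays.fill(data, 1);

        int pieces = 8;                    // 1. fragment into small pieces
        int chunk = data.length / pieces;
        ExecutorService pool = Executors.newFixedThreadPool(pieces);

        List<Future<Long>> partials = new ArrayList<>();
        for (int i = 0; i < pieces; i++) {
            final int from = i * chunk;
            final int to = (i == pieces - 1) ? data.length : from + chunk;
            Callable<Long> task = () -> {  // 2. process each piece in parallel
                long sum = 0;
                for (int j = from; j < to; j++) sum += data[j];
                return sum;
            };
            partials.add(pool.submit(task));
        }

        long total = 0;
        for (Future<Long> f : partials) {
            total += f.get();              // 3. collect the partial results
        }
        pool.shutdown();
        System.out.println("total = " + total);  // prints total = 1000000
    }
}
```

Hadoop applies the same pattern, but across machines rather than threads, and with the framework handling fragmentation, scheduling, and failures.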
 Open-source software framework for
distributed storage and distributed
processing of very large data sets on
computer clusters.
 Google File System – 2003
 MapReduce – 2004
 Hadoop 0.1.0 released – 2006
 Hadoop Release 2.6.4 – 2016
 Storage part:
◦ Hadoop Distributed File System (HDFS)
 Processing part:
◦ MapReduce
 Distributed
 Scalable
 Portable file-system
 Written in Java
 An HDFS cluster consists of:
◦ A single NameNode: a master server that manages the file system namespace and regulates access to files by clients.
◦ A number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on.
 An HDFS file consists of a number of blocks.
 Each block is typically 64 MB (the default in early Hadoop releases; later 2.x releases default to 128 MB).
 Each block is replicated a specified number of times (the sketch after this list shows how both values can be set).
 The replicas of a block are stored on different DataNodes, chosen to balance load across DataNodes and to provide both fast transfer and resiliency if a rack fails.
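Both the block size and the replication factor are configurable. A minimal sketch using Hadoop's standard Java FileSystem API (dfs.replication and dfs.blocksize are the usual HDFS configuration keys; the path and the chosen values are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettingsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide defaults (normally set in hdfs-site.xml):
        conf.set("dfs.replication", "3");                   // three copies of each block
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);  // 128 MB blocks

        FileSystem fs = FileSystem.get(conf);

        // Per-file override: this create() overload takes the buffer size,
        // replication factor, and block size explicitly.
        FSDataOutputStream out = fs.create(
                new Path("/user/demo/big.log"),  // illustrative path
                true,                            // overwrite if present
                4096,                            // I/O buffer size
                (short) 2,                       // replication for this file only
                64L * 1024 * 1024);              // 64 MB blocks for this file only
        out.writeUTF("hello HDFS");
        out.close();
        fs.close();
    }
}
```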
 A standard directory structure is used in HDFS.
 HDFS files exist in directories that may in turn be
sub-directories of other directories, and so on.
 There is no concept of a current directory within
HDFS.
 The NameNode executes HDFS file system
namespace operations like opening, closing, and
renaming files and directories. It also determines
the mapping of blocks to DataNodes.
 The list of blocks belonging to each HDFS file, the current locations of the block replicas on the DataNodes, the state of the file, and the access control information make up the metadata for the cluster, which is managed by the NameNode.
 DataNodes are responsible for serving read and write requests from the file system's clients. DataNodes also perform block replica creation, deletion, and replication upon instruction from the NameNode (a client-side sketch follows).
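To make this division of labor concrete, here is a hedged sketch of a client asking the NameNode for a file's metadata, including where each block replica lives, and then streaming the bytes themselves from DataNodes, using Hadoop's standard FileSystem API (the path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/user/demo/big.log");  // illustrative path

        // Metadata comes from the NameNode: the file's status and the
        // DataNodes holding each block replica.
        FileStatus st = fs.getFileStatus(p);
        for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println("offset " + b.getOffset()
                    + ", length " + b.getLength()
                    + ", hosts " + String.join(",", b.getHosts()));
        }

        // The bytes themselves are streamed directly from DataNodes:
        try (FSDataInputStream in = fs.open(p)) {
            System.out.println("first byte: " + in.read());
        }
        fs.close();
    }
}
```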
 MapReduce is the heart of Hadoop.
 It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.
 Provides a parallel programming model.
 Moves computation to where the data is.
 Handles scheduling, fault tolerance.
 Status reporting and monitoring.
 Introduced by Google.
 The data set should be big enough that splitting it up improves overall performance rather than hurting it.
 The computations are generally not dependent on external input.
 The calculations/processing that run on one subset of the data can be merged with the results from another subset.
 The resultant data set should be smaller than the initial data set.
 Map
◦ Takes an input pair and produces intermediate key/value pairs.
◦ All intermediate pairs are then grouped according to a common intermediate key.
 Reduce
◦ Accepts an intermediate key and the set of values for that key.
◦ Merges these values together to form a possibly smaller set of values.
◦ The Reduce function typically produces zero or one output value per invocation (the toy example after this list makes the contract concrete).
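As a toy, in-memory illustration of this contract in plain Java (not Hadoop code; the input strings are made up): map each input to intermediate (key, value) pairs, group by key, then reduce each group to a single value.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceToy {
    public static void main(String[] args) {
        List<String> lines = List.of("SQL DW SQL", "BI SQL");

        // Map: each line -> intermediate (word, 1) pairs.
        // Group: collect all values that share an intermediate key.
        Map<String, List<Integer>> grouped = lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(
                        word -> word,
                        Collectors.mapping(word -> 1, Collectors.toList())));

        // Reduce: merge each key's values into one output value.
        grouped.forEach((word, ones) -> System.out.println(
                word + " -> " + ones.stream().mapToInt(Integer::intValue).sum()));
        // prints: SQL -> 3, DW -> 1, BI -> 1 (in some order)
    }
}
```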
 User Program
 Map Workers
 Reduce Workers
 Return to the User Program
 Execution typically begins with the user program.
 The MapReduce libraries imported into the program split the input data set.
 Every machine in the cluster has a separate instance of the program running on it.
 One of the copies of the program is special: it is called the Master.
 The rest of the instances work under the master and are referred to as Workers.
 There are X map tasks and Y reduce tasks to perform. The Master picks idle workers and assigns each of them a map task or a reduce task.
 The worker that is assigned a Map task takes its split of the input data and generates a key/value pair for each segment of input data.
 The worker then invokes the user-defined Map function.
 The resulting values of the Map function are buffered in memory. The data in these temporary buffers is later written to disk.
 The on-disk locations of these contents are passed to the Master.
 The Master then finds idle workers and passes these locations to them to perform the Reduce task.
 A reduce worker, when notified by the Master, uses remote procedure calls to read the buffered data from the Map workers.
 When a reduce worker has read all the intermediate data, it groups together all the data with the same intermediate key.
 Many different keys may map to the same reduce task because of the parallel nature of the processing; hence the sorting step above is required.
 Each unique key and its data are passed by the reduce worker to the user's Reduce function.
 The output of the Reduce function is written to an output file, usually on a distributed file system.
 After all Map and Reduce tasks have completed, the Master returns control to the user program.
(Diagram slides: the word-count example's data flow from input to output.)
 Input: In this step, the sample file is fed to MapReduce.
 Split: In this step, Hadoop splits our sample input file into four parts, each part made up of one line from the input file.
 Map: In this step, each split is fed to a mapper: the map() function containing the logic for how to process the input data, which in our case is the line of text present in the split.
 Combine: This is an optional step, often used to improve performance by reducing the amount of data transferred across the network. It is essentially the same as the reducer (the reduce() function) and acts on the output of each mapper. In our example, the key/value pairs from the first mapper, "(SQL, 1), (DW, 1), (SQL, 1)", are combined, and the output of the corresponding combiner becomes "(SQL, 2), (DW, 1)".
 Shuffle and Sort: In this step, the output of all the mappers is collected, shuffled, sorted, and arranged to be sent to the reducers.
 Reduce: In this step, the collective data from the various mappers, after being shuffled and sorted, is aggregated, and the word counts are produced as (key, value) pairs such as (BI, 1), (DW, 2), (SQL, 5), and so on.
 Output: In this step, the output of the reducer is written to a file on HDFS. A complete sketch of this word-count job follows.
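The classic Hadoop word-count job, written against the org.apache.hadoop.mapreduce API, exercises all of these steps. This is a minimal sketch in the style of the standard Hadoop tutorial, not code taken from the original slides:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce step (also reused as the optional Combine step): sum counts per word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // the optional Combine step
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Compiled into a jar, it would be launched with something like `hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output` (paths illustrative); the output directory on HDFS then holds the (word, count) pairs described above.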
Thank You