Big Data
What is Big Data?
 ‘Big Data’ is similar to ‘small data’, but bigger in size.
 Because the data is bigger, it requires different approaches:
- techniques, tools, and architecture
 Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them.
Sources of Big Data
Social Media Data
Black Box Data
Stock Exchange Data
Transport Data
Power Grid Data
Search Engine Data
 Social Media Data: Social media such as Facebook and Twitter hold information and views posted by millions of people across the globe.
 Black Box Data: The black box is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings from microphones and earphones, and the performance information of the aircraft.
 Stock Exchange Data: Stock exchange data holds information about the ‘buy’ and ‘sell’ decisions that customers make on the shares of different companies.
 Transport Data: Transport data includes the model, capacity, distance, and availability of a vehicle.
 Search Engine Data: Search engines retrieve lots of data from different databases.
 Power Grid Data: Power grid data holds information about the power consumed by a particular node with respect to a base station.
Three Vs of Big Data
• Velocity: data speed
• Volume: data quantity
• Variety: data types
Velocity
 High-frequency stock trading algorithms reflect market changes within microseconds.
 Machine-to-machine processes exchange data between billions of devices.
 Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
Volume
• A typical PC might have had 10 gigabytes of storage in 2000.
• Today, Facebook ingests 600 terabytes of new data every day.
• Smartphones, with the data they create and consume, and sensors embedded into everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.
Variety
 Big Data isn't just numbers, dates, and strings. Big Data is
also geospatial data, 3D data, audio and video, and
unstructured text, including log files and social media.
 Traditional database systems were designed to address smaller volumes of structured data, with fewer updates and a predictable, consistent data structure.
 Big Data analysis includes many different types of data.
Challenges
Storage
Searching
Sharing
Transfer
Analysis
Hadoop
History of Hadoop
 Hadoop was created by computer scientists Doug Cutting and
Mike Cafarella in 2005.
 It was inspired by Google's MapReduce, a software framework
in which an application is broken down into numerous small
parts.
 Doug named it after his son’s toy elephant.
 In November 2016 Apache Hadoop became a registered
trademark of the Apache Software Foundation.
What is Hadoop?
 Hadoop is an open source, Java-based programming framework
that supports the processing and storage of extremely large data
sets in a distributed computing environment.
 Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes.
 Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating in case of a node failure.
 Hadoop can perform complete statistical analysis on huge amounts of data.
Hadoop Architecture
[Diagram: the Hadoop stack]
• MapReduce (distributed computation)
• HDFS (distributed storage)
• YARN framework
• Common
HADOOP COMMON:
 Common refers to the collection of common utilities and libraries that support the other Hadoop modules.
 These libraries provide file system and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
HADOOP YARN:
 Yet Another Resource Negotiator
 A resource-management platform responsible for managing computing resources in clusters and using them to schedule users' applications.
HDFS
 Hadoop Distributed File System.
 A Hadoop file system that runs on top of the existing native file system.
 Designed to handle very large files with streaming data access patterns.
 Uses blocks to store a file or parts of a file.
HDFS - Blocks
File Blocks
 64MB (default), 128MB (recommended) – compare to 4 KB in
UNIX
 Behind the scenes, 1 HDFS block is supported by multiple
operating system (OS) blocks
 Fits well with replication to provide fault tolerance and
availability
[Diagram: one 128 MB HDFS block backed by many smaller OS blocks]
Advantages of blocks
 Fixed size – easy to calculate how many fit on a disk
 A file can be larger than any single disk in the network
 If a file or a chunk of the file is smaller than the block size, only the needed space is used. E.g., a 420 MB file is split as:
128 MB + 128 MB + 128 MB + 36 MB
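The split arithmetic above is easy to check in code. A minimal sketch in plain Java, with the block size assumed to be 128 MB as in the example:

public class BlockMath {
    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long blockSize = 128 * mb;                      // 128 MB HDFS block
        long fileSize = 420 * mb;                       // 420 MB file from the slide
        long fullBlocks = fileSize / blockSize;         // 3 full blocks
        long lastBlockMb = (fileSize % blockSize) / mb; // 36 MB: only the space needed
        System.out.println(fullBlocks + " x 128 MB + " + lastBlockMb + " MB");
    }
}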
HDFS - Replication
 Blocks with data are replicated to multiple nodes
 This allows for node failure without data loss
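Replication can also be controlled per file through the HDFS Java API. A minimal sketch (the path is hypothetical; 3 is HDFS's usual default factor):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        // Picks up cluster settings from the classpath (core-site.xml etc.).
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask HDFS to keep 3 copies of each block of this file.
        fs.setReplication(new Path("/user/demo/hello.txt"), (short) 3);
        fs.close();
    }
}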
Writing a file to HDFS
[Diagram sequence illustrating the steps of writing a file to HDFS]
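A minimal sketch of the same operation through the HDFS Java API (the NameNode address and file path are hypothetical; real values come from the cluster configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/hello.txt");
        // Write: the client streams bytes; HDFS splits them into blocks
        // and replicates each block across DataNodes behind the scenes.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("hello, HDFS");
        }
        // Read the file back.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}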
HADOOP
MapReduce
COMPONENTS OF HADOOP
• HDFS
• MapReduce
• YARN Framework
• Libraries
A DEFINITION
 MapReduce is the heart of Hadoop. It is this programming
paradigm that allows for massive scalability across hundreds
or thousands of servers in a Hadoop cluster.
 MapReduce is the original framework for writing applications
that process large amounts of structured and unstructured
data stored in the Hadoop Distributed File System (HDFS).
 MapReduce is a framework patented by Google to support distributed computing on large data sets.
INSPIRATION
 The name MapReduce comes from functional programming
 map is the name of a higher-order function that applies a
given function to each element of a list.
 reduce is the name of a higher-order function that analyzes a recursive data structure and recombines the results of recursively processing its constituent parts, using a given combining operation to build up a return value.
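As an illustration of the two names, here is a minimal sketch using plain Java streams (nothing Hadoop-specific):

import java.util.List;

public class MapReduceNames {
    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3, 4);
        int sumOfSquares = xs.stream()
                .map(x -> x * x)          // map: apply a function to each element -> [1, 4, 9, 16]
                .reduce(0, Integer::sum); // reduce: combine results with a given operation -> 30
        System.out.println(sumOfSquares);
    }
}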
[Diagram: master/slave architecture]
• Master node: NAME NODE, JOB TRACKER
• Each slave node: DATA NODE, TASK TRACKER
HOW MapReduce WORKS?
 Init - Hadoop divides the input file stored on HDFS into splits (typically the size of an HDFS block) and assigns every split to a different mapper, trying to assign each split to the mapper where the split physically resides.
 Mapper - Hadoop reads the mapper's split line by line and calls the mapper's map() method for every line, passing it as the key/value parameters. The mapper computes its application logic and emits other key/value pairs.
 Shuffle and sort - Hadoop's partitioner divides the emitted output of the mappers into partitions, each of which is sent to a different reducer. Hadoop collects all the partitions received from the mappers and sorts them by key.
 Reducer - Hadoop reads the aggregated partitions line by line and calls the reduce() method on the reducer for every line of the input. The reducer computes its application logic and emits other key/value pairs; Hadoop writes the emitted pairs to HDFS.
COMMON JOBS FOR MapReduce
• Text mining
• Index building
• Graphs
• Patterns
• Filtering
• Prediction
• Risk analysis
WORD COUNT USING MapReduce
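The original slide shows word count as a picture; written out, it is the canonical Hadoop example. A minimal sketch in Java (input and output paths are passed on the command line):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // emit (word, 1) for every token
            }
        }
    }
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get(); // sum the 1s per word
            context.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The mapper emits (word, 1) for every token, the shuffle groups all pairs by word, and the reducer sums the 1s: exactly the flow described on the previous slides.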
BENEFITS
• Simplicity
• Scalability
• Speed
• Recovery
• Minimal data motion
JAQL
INTRODUCTION
 Functional data processing and query language.
 Commonly used for JSON query processing on Big Data.
 Started as an open source project at Google.
 Taken over by IBM as the primary data processing language for their Hadoop software package, BigInsights.
 Supports a variety of data sources such as JSON, CSV, TSV, and XML.
 Loosely typed functional language with lazy evaluation, so expressions are only materialized when needed.
 Can process structured and non-traditional data.
 Inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig.
Jaql Allows Users To:
 Access and load data from different sources (local file system, web, Twitter, HDFS, HBase, etc.)
 Query data (databases)
 Transform, aggregate and filter data
 Write data to different places (local file system, HDFS, HBase, databases, etc.)
Jaql environment
• Jaql shell from a command prompt
• Eclipse environment
KEY GOALS
 Flexibility
 Scalability
 Physical Transparency
 Modularity
JAQL I/O
[Diagram: data flows from a source through the I/O layer as JSON into the JAQL interpreter for execution, then back out through the I/O layer as JSON to a destination.]
Starting the Jaql shell:
• cd $BIGINSIGHTS_HOME/jaql/bin
• ./jaqlshell
ADVANTAGES
 Easy to create user-defined functions written entirely in Jaql.
 Simplicity:
- facilitates development
- makes it easier to distribute the program between nodes
 MapReduce jobs can be called directly.
HIVE
 Originated as an internal project at Facebook.
 A data warehouse infrastructure built on top of Hadoop.
 Provides an SQL-like query interface called HiveQL.
 Compiles queries into MapReduce jobs and runs them on the cluster.
 Structures data into well-defined database concepts such as tables, rows, and columns.
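To illustrate the SQL-like interface, here is a minimal sketch of submitting a HiveQL query from Java over JDBC (it assumes a running HiveServer2 on a hypothetical host; the tweets table is likewise hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hive's JDBC driver, shipped with Hive.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // HiveQL: Hive compiles this into MapReduce jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT language, COUNT(*) FROM tweets GROUP BY language")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}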
HIVE ARCHITECTURE
[Architecture diagram]
Apache PIG
What is Pig?
 Pigs Eat Anything
Pig can operate on data whether it has metadata or not. It
can operate on data that is relational, nested, or
unstructured. And it can easily be extended to operate on
data beyond files, including key/value stores, databases, etc.
 Pigs Live Anywhere
Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework. It was implemented first on Hadoop, but it is not intended to be Hadoop-only.
 Pigs Are Domestic Animals
Pig is designed to be easily controlled and modified by its users.
 Pig Latin was designed to fit in a sweet spot between the
declarative style of SQL, and the low-level, procedural style
of MapReduce.
 Apache Pig is a platform for analyzing large data sets that
consists of a high-level language for expressing data analysis
programs, coupled with infrastructure for evaluating these
programs.
 Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs.
 Pig's language layer currently consists of a textual language called Pig Latin.
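A minimal sketch of driving Pig from Java with the PigServer API (file names are hypothetical; each registerQuery line is a Pig Latin statement that Pig's compiler turns into MapReduce stages):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // LOCAL mode for illustration; ExecType.MAPREDUCE runs on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        // Triggers execution and writes the result.
        pig.store("counts", "wordcount-out");
        pig.shutdown();
    }
}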
KEY PROPERTIES OF PIG LATIN
Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks consisting of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities. The way in which tasks
are encoded permits the system to optimize their
execution automatically, allowing the user to focus on
semantics rather than efficiency.
Extensibility. Users can create their own functions to
do special-purpose processing.
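The extensibility point is concrete: a user-defined function is just a Java class. A minimal sketch (the class name is hypothetical):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Usable from Pig Latin after REGISTER, e.g.:
//   upper = FOREACH lines GENERATE ToUpper(line);
public class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}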
Starting the Grunt shell:
• cd $PIG_HOME/bin
• ./pig -x local
Twitter data analytics
USING HADOOP
Twitter Data
 Real Twitter data purchased by IBM
 Available in JSON (JavaScript Object Notation) format
Objective
 Use Jaql core operators to manipulate JSON data found in Twitter feeds
 Filter arrays to remove values
 Sort arrays in either ascending or descending sequence
 Write data to HDFS
Procedure
 Start the Jaql shell
 cd $BIGINSIGHTS_HOME/jaql/bin
 ./jaqlshell
 Load the JSON Twitter records
 tweets = read(file("file:///home/labfiles/SampleData/Twitter%20Search.json"));
 tweets;
 Retrieve selected fields using transform
 tweets -> transform {$.created_at, $.from_user, $.iso_language_code, $.text};
 Create a new record, tweetrecs
 tweetrecs = tweets -> transform {$.created_at, $.from_user, language: $.iso_language_code, $.text};
 Use the filter operator to see all non-English language records from tweetrecs
 tweetrecs -> filter $.language != 'en';
 Aggregate the data and count the number of tweets for each language
 tweetrecs -> group by key = $.language into {language: key, num: count($)};
 Create the target directory in HDFS
 hdfsShell('-mkdir /user/biadmin/jaql');
 Write the results of the previous aggregation to a JSON file in HDFS
 tweetrecs -> group by key = $.language into {language: key, num: count($)} -> write(seq("hdfs:/user/biadmin/jaql/twittercount.seq"));
THANK YOU!
-MISHIKA BHARADWAJ