Big Data
(Hadoop)
Instructor: Thanh Binh Nguyen
September 1st, 2019
S3 Lab
Smart Software System Laboratory
1
S3LAB.
“Big data is at the foundation of all the
megatrends that are happening today, from
social to mobile to cloud to gaming.”
– Chris Lynch, Vertica Systems
Big Data 2
S3LAB.
Introduction
● Hadoop is a software framework for distributed processing of large
datasets (terabytes or petabytes of data) across large clusters (thousands
of nodes) of computers. It includes the following key components:
○ Hadoop Common: common utilities
○ Hadoop Distributed File System (HDFS) (Storage Component): A distributed file system
that provides high-throughput access
○ Hadoop YARN (Scheduling): A framework for job scheduling & cluster resource
management (available from Hadoop 2.x)
○ Hadoop MapReduce (Processing): A YARN-based system for parallel processing of large
data sets
3
Big Data S3LAB.
Introduction
● Hadoop is a large and active ecosystem.
● Hadoop emerged as a solution for big data problems.
● Open source under the friendly Apache License
● Originally built as infrastructure for the “Nutch” project.
● Based on Google’s MapReduce and the Google File System.
4
Big Data S3LAB.
Core components
Introduction
5
Big Data S3LAB.
Features
6
Big Data S3LAB.
Architecture
Multi-Node Cluster
7
Big Data S3LAB.
● Data locality and Shared Nothing: Moving computation to data, instead
of moving data to computation. Each node can independently process a
much smaller subset of the entire dataset without needing to
communicate with other nodes.
● Simplified programming model: allows users to quickly write and test distributed programs
● Schema-on-read system (like NoSQL platforms), as opposed to a
schema-on-write system
● Automatic distribution of data and work across machines
What makes Hadoop unique
8
Big Data S3LAB.
Architecture
Architecture in different perspective
9
Big Data S3LAB.
● Hadoop Distributed File System (HDFS) is designed to reliably store very
large files across machines in a large cluster. It is inspired by the Google File
System.
● Distributes large data files into blocks
● Blocks are managed by different nodes in the cluster
● Each block is replicated on multiple nodes
HDFS
10
Big Data S3LAB.
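As a concrete illustration (not from the original slides): a minimal sketch of writing and reading an HDFS file through Hadoop's Java FileSystem API. Block splitting and replication happen transparently; the path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // client for the configured file system
    Path path = new Path("/user/demo/hello.txt");  // hypothetical path

    // Write: HDFS splits the stream into blocks and replicates them across DataNodes.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.writeUTF("hello hdfs");
    }

    // Read: the client fetches blocks from whichever DataNodes hold replicas.
    try (FSDataInputStream in = fs.open(path)) {
      System.out.println(in.readUTF());
    }
  }
}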
● NameNode:
○ Master of the system, daemon runs on the master machine
○ Maintains, monitors, and manages the blocks that are present on the DataNodes
○ Records file metadata such as block locations, file size, permissions, hierarchy, etc.
○ Captures all changes to the metadata (file deletion, creation, renaming) in edit
logs.
○ It regularly receives heartbeats and block reports from the DataNodes.
HDFS
11
Big Data S3LAB.
● All of the Hadoop server processes (daemons) serve a web UI. For the
NameNode, it is on port 50070 (9870 in Hadoop 3.x).
HDFS
12
Big Data S3LAB.
● DataNode:
○ DataNode runs on the slave machines.
○ It stores the actual business data.
○ It serves read/write requests from clients.
○ The DataNode does the ground work of creating, replicating and deleting blocks on the
command of the NameNode.
○ Every 3 seconds, by default, it sends a heartbeat to the NameNode reporting the health of
HDFS.
HDFS
13
Big Data S3LAB.
● Distributed file system for redundant storage
● Designed to reliably store data on commodity hardware
● Built to expect hardware failures
● Intended for
○ Large files
○ Batch inserts
HDFS
14
Big Data S3LAB.
HDFS
Architecture
15
Big Data S3LAB.
HDFS
Architecture
16
Big Data S3LAB.
HDFS
Write & Read files
17
Big Data S3LAB.
● Removes tight coupling of Block Storage and Namespace
● Scalability & Isolation
● High Availability
● Increased performance
HDFS
HDFS 2 - Federation
18
Big Data S3LAB.
HDFS
HDFS 2 - Federation
19
Big Data S3LAB.
HDFS
HDFS 2 (Federation): Quorum based storage
20
Big Data S3LAB.
● Programming model for distributed computations at a massive scale
● Execution framework for organizing and performing such computations
● Data locality is king
● JobTracker:
○ Takes care of all job scheduling and assigns tasks to TaskTrackers
● TaskTracker:
○ A node in the cluster that accepts tasks - Map, Reduce & Shuffle operations - from
the JobTracker
Map Reduce
21
Big Data S3LAB.
Map Reduce
flow
22
Big Data S3LAB.
● The Mapper:
○ Each block is processed in isolation by a map task called mapper
○ Map task runs on the node where the block is stored
○ Iterate over a large number of records
○ Extract something of interest from each
● Shuffle and sort intermediate results
● The Reducer:
○ Consolidate results from different mappers
○ Aggregate intermediate results
○ Produce final output
Map Reduce
23
Big Data S3LAB.
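To make the mapper and reducer roles above concrete, here is a minimal word-count sketch (not from the original slides; class names are hypothetical) using Hadoop's standard org.apache.hadoop.mapreduce API:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: runs in isolation on one split/block, emits (word, 1) per token.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      ctx.write(word, ONE);  // intermediate results are shuffled & sorted by key
    }
  }
}

// Reducer: consolidates intermediate results from all mappers for one key.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    ctx.write(word, new IntWritable(sum));  // final output
  }
}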
Map Reduce
flow
24
Big Data S3LAB.
Map Reduce
Combined Hadoop architecture
25
Big Data S3LAB.
● Embarrassingly parallel algorithms
● Summing, grouping, filtering, joining
● Off-line batch jobs on massive data sets
● Analyzing an entire large dataset
Map Reduce
Good for ...
26
Big Data S3LAB.
● Iterative jobs (e.g., graph algorithms)
○ Each iteration must read/write data to disk
○ I/O and latency cost of an iteration is high
Map Reduce
OK for ...
27
Big Data S3LAB.
● Jobs that need shared state/coordination
○ Tasks are shared-nothing
○ Shared-state requires scalable state store
● Low-latency jobs
● Jobs on small datasets
● Finding individual records
Map Reduce
Not good for ...
28
Big Data S3LAB.
● Scalability
○ Maximum cluster size ~ 4,500 nodes, concurrent tasks – 40,000
○ Coarse synchronization in JobTracker
● Availability
○ Failure kills all queued and running jobs
● Hard partition of resources into map & reduce slots
○ Low resource utilization
● Lacks support for alternate paradigms and services
○ Iterative applications implemented using MapReduce are 10x slower
Map Reduce
limitations
29
Big Data S3LAB.
● Essentially a port of MapReduce to the YARN architecture
● MapReduce becomes a user-land library
● No need to rewrite MapReduce jobs
● Increased scalability & availability
● Better cluster utilization
Map Reduce
V2
30
Big Data S3LAB.
● Originally conceived & architected by the team at Yahoo!
○ Arun Murthy created the original JIRA in 2008 and is now the YARN release
manager
● The team at Hortonworks has been working on YARN for 4 years:
○ 90% of code from Hortonworks & Yahoo!
● Hadoop 2.0 based architecture running at scale at Yahoo!
○ Deployed on 35,000 nodes for 6+ months
Hadoop 2.0
history
31
Big Data S3LAB.
Hadoop 2.0
Next-gen platform
32
Big Data S3LAB.
Hadoop 2.0
Taking Hadoop beyond batch
33
Big Data S3LAB.
Yarn
Architecture
34
Big Data S3LAB.
● Resource Manager
○ Resource Manager runs on the master node.
○ It knows the location of the slaves (Rack Awareness).
○ It is aware of how many resources each slave has.
○ The Resource Scheduler is one of the important services run by the Resource Manager.
○ The Resource Scheduler decides how resources get assigned to various tasks.
○ The Application Manager is another service run by the Resource Manager.
○ The Application Manager negotiates the first container for an application.
○ The Resource Manager keeps track of the heartbeats from the Node Managers.
Yarn
Architecture
35
Big Data S3LAB.
● Resource Manager serves an embedded Web UI on port 8088
Yarn
Architecture
36
Big Data S3LAB.
● Node Manager
○ It runs on slave machines.
○ It manages containers. A container is simply a fraction of the Node Manager’s
resource capacity.
○ Node manager monitors resource utilization of each container.
○ It sends heartbeat to Resource Manager.
Yarn
Architecture
37
Big Data S3LAB.
● Job submitter
○ The client submits the job to the Resource Manager.
○ The Resource Manager contacts the Resource Scheduler and allocates a container.
○ The Resource Manager then contacts the relevant Node Manager to launch the container.
○ The container runs the Application Master.
Yarn
Architecture
38
Big Data S3LAB.
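As an illustration of the submission flow above, a minimal client-side sketch (hypothetical names, reusing the word-count classes from the earlier MapReduce sketch). The waitForCompletion() call is what triggers the Resource Manager / Node Manager / Application Master sequence:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(SubmitJob.class);
    job.setMapperClass(TokenMapper.class);   // from the earlier sketch
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Submits to the Resource Manager and blocks until the MapReduce
    // Application Master has run the job to completion.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}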
● Application Master
○ Per-application
○ Manages application scheduling and task execution
○ e.g. MapReduce Application Master
Yarn
Architecture
39
Big Data S3LAB.
Yarn
Architecture
40
Big Data S3LAB.
Ecosystem
41
Big Data S3LAB.
Ecosystem
42
Big Data S3LAB.
Ecosystem
● HDFS -> Hadoop Distributed File System
● YARN -> Yet Another Resource Negotiator
● MapReduce -> Data processing via a programming model
● PIG -> High-level data processing engine
● HIVE -> Data warehouse on top of Hadoop with an SQL-like query language
● HBase -> Column oriented NoSQL Database
Core
Components
High-level data
processing
components
NoSQL
43
Big Data S3LAB.
Ecosystem
● Apache Drill: Schema-free SQL Query Engine for Hadoop
● Solr & Lucene: High-performance text search engine
● Mahout, Spark MLlib -> Data analysis and machine learning
● Avro -> Data serialization framework
● Thrift -> Interface definition language and binary communication protocol
● Oozie -> Server-based workflow scheduling system
● HCatalog -> Table and storage management layer
Hadoop Data
Analysis
components
Data
serialization
Management
Components
44
Big Data S3LAB.
Ecosystem
● Flume -> Data collection & aggregation system
● Sqoop -> Tool designed for efficiently transferring bulk data between
Hadoop and RDBMSs
● Chukwa -> Data collection system for monitoring large distributed systems
● Ambari -> Hadoop deployment, management & monitoring tool
● Zookeeper -> Highly reliable distributed coordination system
● Hue (Hadoop user experience) -> Open-source hadoop web interface
Hadoop Data
transfer (ingest)
components
Monitoring
components
45
Big Data S3LAB.
Ecosystem
● Spark -> In-memory dataflow engine
● Kafka, Storm -> Stream data support
46
Big Data S3LAB.
Network Topology In Hadoop
● Network topology affects:
○ Performance
○ Availability & handling of failures
● Distance levels within a Hadoop cluster:
○ Processes on the same node
○ Different nodes on the same rack
○ Nodes on different racks of the same
data center
○ Nodes in different data centers
47
Big Data S3LAB.
Real-time Analytics
When Not to Use Hadoop
● Hadoop works on batch processing.
● Processing speed with MapReduce on large data sets is slow.
● Hadoop is not efficient for iterative processing, as Hadoop does not
support cyclic data flow.
● Hadoop is not efficient for caching: MapReduce cannot cache
intermediate data in memory for later reuse, which diminishes
Hadoop's performance.
48
Big Data S3LAB.
Real-time Analytics
When Not to Use Hadoop
● Solution:
○ Store the big data in HDFS
○ Run Spark (in-memory data processing)
over HDFS (up to 100x faster than MapReduce)
○ Or even use Flink, which can process data faster than Spark.
49
Big Data S3LAB.
Not a Replacement for Existing Infrastructure
When Not to Use Hadoop
● Hadoop is not a replacement for your existing data processing
infrastructure. However, you can use Hadoop along with it.
● Solution:
○ Hadoop is not going to replace your database, but your database isn’t likely to replace
Hadoop either.
○ Different tools for different jobs, as simple as that.
50
Big Data S3LAB.
Multiple Smaller Datasets
When Not to Use Hadoop
● Hadoop is not recommended for small structured datasets or small files
(smaller than the default block size, 128 MB). For small-data analytics,
Hadoop can be costlier than other tools.
● Solution:
○ Merge the small files into bigger files and then copy the bigger files to HDFS
51
Big Data S3LAB.
Multiple Smaller Datasets
When Not to Use Hadoop
● Solution:
○ Using HAR files (Hadoop archives)
○ Sequence files: use the filename as the key and the file contents as the value. Pack the small
files into a single sequence file with a small program, then process them in a streaming
fashion operating on the sequence file (as sketched after this slide). MapReduce can break the
sequence file into chunks and operate on each chunk independently because the sequence
file is splittable.
○ Store the files in HBase. We are not actually storing millions of small files in HBase; rather, we
add the binary content of each file to a cell.
52
Big Data S3LAB.
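A minimal sketch of the sequence-file approach above (hypothetical paths; assumes the Hadoop client libraries). Each small local file becomes one (filename, contents) record in a single splittable file on HDFS:

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("/user/demo/packed.seq")),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {
      for (String name : args) {  // each argument is one small local file
        byte[] content = Files.readAllBytes(Paths.get(name));
        writer.append(new Text(name), new BytesWritable(content));  // filename -> contents
      }
    }
  }
}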
Novice Hadoopers
When Not to Use Hadoop
● Hadoop is a technology which should come with a disclaimer: “Handle
with care”. You should understand it before you use it.
● It is not easy to use, offers no high-level abstraction, and MapReduce has
no interactive mode.
53
Big Data S3LAB.
Novice Hadoopers
When Not to Use Hadoop
● Solutions:
○ Use Spark, which offers the RDD (Resilient Distributed Dataset) abstraction for batch
processing, or Flink, which has a DataSet abstraction.
○ Spark has an interactive mode; Flink has high-level operators.
54
Big Data S3LAB.
Where Security is the primary Concern?
When Not to Use Hadoop
● Many enterprises — especially within highly regulated industries dealing
with sensitive data — aren’t able to move as quickly as they would like
towards implementing Big Data projects and Hadoop.
● Hadoop lacks built-in encryption.
● It supports Kerberos authentication, which is hard to manage.
55
Big Data S3LAB.
Security
When Not to Use Hadoop
● Solution:
○ Use Spark; Spark can also use HDFS ACLs and file-level permissions in HDFS.
○ Spark can run on YARN -> Kerberos authentication
56
Big Data S3LAB.
NoSQL Database
● HBase is an open source, non-relational, distributed database
modeled after Google's BigTable.
● It runs on top of Hadoop and HDFS, providing BigTable-like
capabilities for Hadoop.
57
HBase
Big Data S3LAB.
NoSQL Database
58
HBase
Big Data S3LAB.
● A type of NoSQL database (column-oriented)
● Strongly consistent read and write
● Automatic sharding
● Automatic RegionServer failover
● Hadoop / HDFS Integration
HBase - Features
59
NoSQL Database
Big Data S3LAB.
● HBase supports massively parallelized processing via MapReduce
for using HBase as both source and sink.
● HBase supports an easy-to-use Java API for programmatic access (see the
sketch after this slide).
● HBase also supports Thrift and REST for non-Java front-ends.
HBase - Features
60
NoSQL Database
Big Data S3LAB.
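A minimal sketch of the Java API mentioned above, using the HBase 1.x+ client (table, column family, and row names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // Write one cell: row key, column family, qualifier, value.
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);
      // Random read by row key - the access pattern HBase is built for.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}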
● When there is really big data: millions or billions of rows; in other words,
data that cannot be stored on a single node.
● When random read/write access to big data is needed
● When thousands of operations on big data are required
● When there is no need for the extra features of an RDBMS, such as typed columns,
secondary indexes, transactions, advanced query languages, etc.
● When there is enough hardware.
HBase - When to use
61
NoSQL Database
Big Data S3LAB.
HBase - When to use
62
NoSQL Database
Big Data S3LAB.
Difference between HBase and HDFS
● HDFS: good for storing large files. HBase: built on top of HDFS; good for hosting very large
tables (billions of rows x millions of columns).
● HDFS: write once (appending to files is possible in some recent versions but not commonly
used). HBase: read/write many.
● HDFS: no random read/write. HBase: random read/write.
● HDFS: no individual record lookup; rather, read all data. HBase: fast record lookup (and update).
63
NoSQL Database
Big Data S3LAB.
● Create on-demand HBase clusters
● Configure different HBase instances differently
● Better isolation
● Create (transient) HBase clusters from MapReduce jobs
● Elasticity of clusters for analytic / batch workload processing
● Better cluster resources utilization
Hoya: HBase on Yarn
64
NoSQL Database
Big Data S3LAB.
High-Level Data Process Components
● An SQL-like interface to Hadoop.
● Data warehouse infrastructure built on top of Hadoop
● Provides data summarization, query and analysis
● Query execution via MapReduce
● The Hive interpreter converts queries into MapReduce jobs.
● Open source project.
● Developed by Facebook
● Also used by Netflix, CNET, Digg, eHarmony, etc.
65
Hive
Big Data S3LAB.
Hive - architecture
66
High-Level Data Process Components
Big Data S3LAB.
● HiveQL example:
SELECT customerId, max(total_cost) FROM hive_purchases GROUP BY
customerId HAVING count(*) > 3;
67
High-Level Data Process Components
Hive
Big Data S3LAB.
● A scripting platform for processing and analyzing large data sets
● Apache Pig allows users to write complex MapReduce programs using a
simple scripting language.
● Made of two components:
○ High-level language: Pig Latin (a data flow language).
○ Pig translates Pig Latin scripts into MapReduce jobs that execute within Hadoop.
● Open source project
● Developed by Yahoo
68
High-Level Data Process Components
Pig
Big Data S3LAB.
● Pig Latin example:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int,
gpa:float);
X = FOREACH A GENERATE name,$2;
DUMP X;
69
High-Level Data Process Components
Pig
Big Data S3LAB.
● Both require a compiler to generate MapReduce jobs
● Hence high-latency queries when used for real-time responses to
ad-hoc queries
● Both are good for batch processing and ETL jobs
● Fault tolerant
70
High-Level Data Process Components
Hive & Pig
Big Data S3LAB.
● Cloudera Impala is a query engine that runs on Apache Hadoop.
● Similar to HiveQL.
● Does not use MapReduce
● Optimized for low-latency queries
● Open source Apache project
● Developed by Cloudera
● Much faster than Hive or Pig
71
High-Level Data Process Components
Impala
Big Data S3LAB.
Feature comparison (Pig / Hive / Impala):
● SQL-based query language: no / yes / yes
● Schema: optional / required / required
● Process data with external scripts: yes / yes / no
● Extensible file format support: yes / yes / no
● Query speed: slow / slow / fast
● Accessible via ODBC/JDBC: no / yes / yes
Impala, Pig and Hive
72
High-Level Data Process Components
Big Data S3LAB.
Data transfer components
● Command-line interface for transferring data between RDBMSs & Hadoop
● Parallelized data transfer with MapReduce
● Supports incremental imports
● Imports are used to populate tables in Hadoop
● Exports are used to put data from Hadoop into a relational database
● Sqoop2 -> Sqoop-as-a-Service: server-based implementation of Sqoop
73
Sqoop (Sql-to-hadoop)
Big Data S3LAB.
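A hedged illustration of a typical Sqoop import (the connection string, table, and paths are hypothetical):

sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4

The table rows land in HDFS as files under /data/orders, with four mappers transferring the data in parallel.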
● The dataset being transferred is broken into small blocks.
● A map-only job is launched.
● Each individual mapper is responsible for transferring one block of the
dataset.
Sqoop - How it works
74
Data transfer components
Big Data S3LAB.
Sqoop - How it works
75
Data transfer components
Big Data S3LAB.
● Apache Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts of
streaming data (ingest) into the Hadoop Distributed File System
(HDFS).
76
Flume
Data transfer components
Big Data S3LAB.
Flume - How it works
77
Data transfer components
Big Data S3LAB.
● Data flows through three tiers:
Agent tier -> Collector tier -> Storage tier
● Agent nodes are typically installed on the machines that generate
the logs and are the data’s initial point of contact with Flume. They
forward data to the next tier of collector nodes, which aggregate the
separate data flows and forward them to the final storage tier.
78
Data transfer components
Flume - How it works
Big Data S3LAB.
Flume - Agent architecture
79
● Sources:
○ HTTP, Syslog, JMS, Kafka,
Avro, Twitter - stream api
for tweets download, …
● Sink:
○ HDFS, Hive, HBase, Kafka,
Solr, …
● Channel:
○ File, JDBC, Kafka, ...
Data transfer components
Big Data S3LAB.
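A minimal hypothetical Flume agent configuration wiring one source, channel, and sink together (property names follow Flume's standard agent configuration format; the values are illustrative):

# agent "a1": syslog source -> file channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = syslogudp
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1

a1.channels.c1.type = file

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1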
● A scalable system for collecting logs and other monitoring data and
processing the data with MapReduce
80
Data transfer components
Chukwa
Big Data S3LAB.
● Agents that run on each machine and emit data.
○ Adaptor: dynamically controllable data sources. Emit data in Chunks (a sequence
of bytes, with some metadata).
● Collectors that receive data from the agent and write it to stable
storage.
● MapReduce jobs for parsing and archiving the data.
● HICC, the Hadoop Infrastructure Care Center; a web-portal style
interface for displaying data.
81
Data transfer components
Chukwa
Big Data S3LAB.
Chukwa - HICC
82
Data transfer components
Big Data S3LAB.
Data Serialization Components
● A language-neutral data serialization system.
● Avro uses JSON-based schemas
● Uses RPC calls to send data
● During data exchange, the schema is sent along with the data.
83
AVRO
Big Data S3LAB.
Avro - features
84
Data Serialization Components
Big Data S3LAB.
Avro - General working
● Step 1: Create schemas. Here you need to design an Avro schema
according to your data.
● Step 2: Read the schemas into your program. This is done in two
ways:
○ By generating a class corresponding to the schema: compile the schema using
Avro. This generates a class file corresponding to the schema.
○ By using the parsers library: you can directly read the schema using the parsers library.
85
Data Serialization Components
Big Data S3LAB.
Avro - General working
● Step 3: Serialize the data using the serialization API provided for
Avro, which is found in the package org.apache.avro.specific.
● Step 4: Deserialize the data using the deserialization API provided for
Avro, which is found in the package org.apache.avro.specific.
86
Data Serialization Components
Big Data S3LAB.
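A minimal sketch of steps 1-4 in Java, using the parsers library and the generic API (rather than classes generated from the schema); the inline schema is a hypothetical example:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    // Steps 1-2: define a JSON schema and read it with the parsers library.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":"
        + "[{\"name\":\"name\",\"type\":\"string\"}]}");

    // Step 3: serialize a record into an Avro container file.
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Alice");
    DatumWriter<GenericRecord> dw = new GenericDatumWriter<>(schema);
    try (DataFileWriter<GenericRecord> out = new DataFileWriter<>(dw)) {
      out.create(schema, new File("users.avro"));  // the schema is embedded in the file
      out.append(user);
    }

    // Step 4: deserialize; the embedded schema travels with the data.
    DatumReader<GenericRecord> dr = new GenericDatumReader<>(schema);
    try (DataFileReader<GenericRecord> in =
        new DataFileReader<>(new File("users.avro"), dr)) {
      while (in.hasNext()) System.out.println(in.next().get("name"));
    }
  }
}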
● Define data types and service interfaces with an IDL (interface definition
language)
● The IDL definitions are stored in .thrift files.
● Thrift provides clean abstractions for data transport, data
serialization, and application level processing.
87
Data Serialization Components
Thrift
Big Data S3LAB.
● Apache Thrift is a set of code-generation tools
that allows developers to build RPC clients and
servers by just defining the data types and
service interfaces in a simple definition file.
Given this file as an input, code is generated to
build RPC clients and servers that communicate
seamlessly across programming languages
88
Data Serialization Components
Thrift
Big Data S3LAB.
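A tiny hypothetical .thrift definition file as an illustration; running thrift --gen java user.thrift would generate the RPC client and server stubs:

// user.thrift - hypothetical example
struct User {
  1: i32 id,
  2: string name
}

service UserService {
  User getUser(1: i32 id)
}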
89
Data Serialization Components
Thrift
Big Data S3LAB.
Monitoring Components
● Graphical front end to the cluster.
● Open source web interface.
● Makes the Hadoop platform (HDFS, YARN, Solr, Pig, MapReduce, Oozie,
Hive, Sqoop, Impala, etc.) easy to use
90
Hue
Big Data S3LAB.
91
Monitoring Components
Hue
Big Data S3LAB.
● Because coordinating distributed systems is a Zoo.
● ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and
providing group services.
92
Monitoring Components
ZooKeeper
Big Data S3LAB.
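A minimal sketch of ZooKeeper's Java client storing and reading a piece of shared configuration (the ensemble address and znode path are hypothetical; a production client would wait for the connection event before issuing requests):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
  public static void main(String[] args) throws Exception {
    // Connect to the ensemble; the lambda is a no-op Watcher.
    ZooKeeper zk = new ZooKeeper("zk1:2181", 3000, event -> {});
    // Publish a small piece of configuration as a persistent znode.
    zk.create("/config", "v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    // Any process in the cluster can read it back (and watch it for changes).
    byte[] data = zk.getData("/config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}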
● An open-source web-based
management tool that provisions,
manages, and monitors the
health of Hadoop clusters
93
Monitoring Components
Ambari
Big Data S3LAB.
94
Monitoring Components
Ambari
Big Data S3LAB.
● A low-latency schema-free query engine for big data
● Users can query the data using standard SQL and BI tools, without
needing to create and manage schemas
● Drill uses a JSON document model internally, which allows it to query
data of any structure.
● Drill works with a variety of non-relational data stores, including
Hadoop, NoSQL databases (MongoDB, HBase), Local files, NAS and
cloud storage like Amazon S3, Azure Blob Storage, etc
95
Data Analysis Components
Drill
Big Data S3LAB.
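A hypothetical Drill query illustrating schema-free access: the JSON file is queried in place through the dfs storage plugin, with no table definition or metadata registration beforehand:

SELECT t.name, t.age
FROM dfs.`/data/users.json` t
WHERE t.age > 30
LIMIT 10;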
96
Data Analysis Components
Drill
Big Data S3LAB.
● A mahout is one who drives an elephant as its master.
● Mahout is an open-source project that is primarily used for creating scalable
machine learning algorithms, implemented on top of Apache
Hadoop® using the MapReduce paradigm. It implements
popular machine learning techniques such as:
○ Recommendation
○ Classification
○ Clustering
97
Data Analysis Components
Mahout
Big Data S3LAB.
● Lucene is an open-source Java full-text search library which makes
it easy to add search functionality to an application or website.
98
Data Analysis Components
Lucene
Big Data S3LAB.
● Workflow scheduler to manage Hadoop and related jobs
● First developed in Bangalore by Yahoo
● DAG (Directed Acyclic Graph): acyclic means a graph cannot have any
loops, and action members of the graph provide control dependency.
Control dependency means a second job cannot run until the first
action is completed
● Oozie definitions are written in the Hadoop Process Definition Language
(hPDL) and coded as an XML file (workflow.xml)
99
Management Components
Oozie
Big Data S3LAB.
● Workflow contains:
○ Control flow nodes (define the start, end, and execution path of the workflow):
START, FORK, JOIN, DECISION, KILL, END
○ Action nodes (trigger execution of tasks): Java MapReduce, Streaming
MapReduce, Pig, Hive, Sqoop, FileSystem tasks, Distributed copy, Java programs,
Shell scripts, Http, Email, Oozie sub workflows, ...
100
Management Components
Oozie
Big Data S3LAB.
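A minimal hypothetical hPDL sketch showing the control flow and action nodes listed above (the map-reduce action's mapper/reducer configuration is omitted):

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="count-step"/>
  <action name="count-step">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- mapper/reducer configuration omitted -->
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>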
101
Management Components
Oozie
Big Data S3LAB.
● A table storage management tool for Hadoop.
● It exposes the tabular data of Hive metastore to other Hadoop
applications.
● It enables users with different data processing tools (Pig,
MapReduce) to easily write data onto a grid. It ensures that users
don’t have to worry about where or in what format their data is
stored.
102
Management Components
HCatalog
Big Data S3LAB.
103
Management Components
HCatalog
Big Data S3LAB.
Use case
Classical enterprise platform
104
Big Data S3LAB.
Use case
With big data
105
Big Data S3LAB.
Use case
Refine data
106
Big Data S3LAB.
Use case
Explore data
107
Big Data S3LAB.
Use case
Enrich data
108
Big Data S3LAB.
An Example
● 6 billion ad deliveries per day
● Reports (and bills) for the advertising companies were needed
● The in-house C++ solution did not scale
● Adding functions was a nightmare
Digital Advertising
109
Big Data S3LAB.
An Example
Digital Advertising
110
Big Data S3LAB.
Q & A
Thank you for your attention.
We hope to reach success together.
111
Big Data S3LAB.