SlideShare a Scribd company logo
Getting Started with Hadoop
Getting Started with Hadoop
3
Agenda
• 1999 the Database Story
• Intro to Hadoop
• Basic Hadoop Eco-system
• Why CDH is Awesome
• QA
Getting Started with Hadoop
5
6
Indexing the Web
• Web is huge
• Hundreds of millions of pages in 1999
• How do you index it?
• Crawl all the pages
• Rank pages based on relevance metrics
• Build search index of keywords to pages
• Do it in real-time!
7
8
Databases in 1999
1. Buy a really big machine
2. Install an expensive DBMS on it
3. Point your workload at it
4. Hope it doesn’t fail
5. Ambitious: buy another really big machine as
a backup
9
Database Limitations
• Didn’t scale horizontally
• High marginal cost ($$$)
• No real fault-tolerance story
• Vendor lock in ($$$)
• SQL unsuited for search ranking
• Complex analysis (PageRank)
• Unstructured data
10
Google Does Something Different
• Designed their own storage and processing
infrastructure
• Google File System and MapReduce
• Goals:
• Cheap
• Scalable
• Reliable
11
Google Does Something Different
• It worked!
• Powered Google Search for many years
• General framework for large-scale batch
computation tasks
• Still used internally at Google to this day
Getting Started with Hadoop
Getting Started with Hadoop
14
15
Google’s messages from the future
• Google was benevolent enough to publish
• 2003: Google File System (GFS) paper
• 2004: MapReduce paper
• Already mature technologies at this point
16
Google’s messages from the future
• Community didn’t get it immediately
• DB people thought it was silly
• Non-Google weren’t at the same scale yet
• Google had little interest in releasing GFS and
MapReduce
• Business was ads, not infrastructure
17
Birth of Hadoop
• Doug Cutting and Mike Cafarella
• Nutch
• Open-source search platform
• Ran into scaling issues
• 4 nodes
• Hard to program
• Hard to manage
• Immediate application for GFS and MR
18
Birth of Hadoop
• 2004-2006:
Implemented GFS/MR
and ported Nutch to it
• 2006: Spun out into
Apache Hadoop
• Name of Doug’s son’s
stuffed elephant
Birth of Hadoop
20
Summary
• The web is huge and unstructured
• Databases didn’t fit the problem
• Didn’t scale, expensive, SQL limitations
• Google did their own thing: GFS + MR
• Hadoop is based on the Google papers
21
HDFS and MapReduce
22
HDFS
• Based on GFS
• Distributed, fault-tolerant filesystem
• Primarily designed for cost and scale
• Works on commodity hardware
• 20PB / 4000 node cluster at Facebook
23
HDFS design assumptions
• Failures are common
• Massive scale means more failures
• Disks, network, node
• Files are append-only
• Files are large (GBs to TBs)
• Accesses are large and sequential
24
Quick primers
• Filesystems
• Hard drives
• Datacenter networking
25
Quick filesystem primer
• Same concepts as the FS on your laptop
• Directory tree
• Create, read, write, delete files
• Filesystems store metadata and data
• Metadata: filename, size, permissions, …
• Data: contents of a file
• Other concerns
• Data integrity, durability, management
26
Quick disk primer
• Disk does a seek for each I/O operation
• Seeks are expensive (~10ms)
• Throughput / IOPS tradeoff
• 100 MB/s and 10 IOPS
• 10MB/s and 100 IOPS
• Big I/Os mean better throughput
27
Quick networking primer
Rack
Top-of-rack switch
Core switch
28
Quick networking primer
40 Gbit
1, 2, 4, 10
Gbit
10 Gbit
29
HDFS Architecture Overview
Secondary
Namenode
Host 2
Namenode
Host 1
DataNode
Host 3
DataNode
Host 4
DataNode
Host 5
DataNode
Host n
30
Block Size = 64MB
Replication Factor = 3
HDFS Block Replication
1
2
3
4
5 2
3
4
2
4
5
1
3
5
1
2
5
1
3
4
HDFS
Node 1 Node 2
Node 3
Node 4
Node 5
Blocks
31
HDFS Write Path
• Talk to NameNode
• Store metadata for new file
• Get topology-aware list of DataNodes
• Setup the write pipeline
• Stream data to pipeline
• Tell NameNode when done
32
HDFS Fault-tolerance
• Many different failure modes
• Disk corruption, node failure, switch failure
• Primary concern
• Data is safe!!!
• Secondary concerns
• Keep accepting reads and writes
• Do it transparently to clients
33
MapReduce – Map
• Records from the data source (lines out of files, rows of a database, etc)
are fed into the map function as key*value pairs: e.g., (filename, line).
• map() produces one or more intermediate values along with an output
key from the input.
Map
Task
(key 1,
values)
(key 2,
values)
(key 3,
values)
Shuffle
Phase
(key 1, int.
values)
(key 1, int.
values)
(key 1, int.
values)
Reduce
Task
Final (key,
values)
34
MapReduce – Reduce
• After the map phase is over, all the intermediate values for a
given output key are combined together into a list
• reduce() combines those intermediate values into one or
more final values for that same output key
Map
Task
(key 1,
values)
(key 2,
values)
(key 3,
values)
Shuffle
Phase
(key 1, int.
values)
(key 1, int.
values)
(key 1, int.
values)
Reduce
Task
Final (key,
values)
35
MapReduce – Shuffle and Sort
36
Word Count Example
The cat sat on the mat
The aardvark sat on the sofa
The, 1
cat, 1
sat, 1
on, 1
the, 1
mat, 1
The, 1
aardvark, 1
sat, 1
on, 1
the, 1
sofa, 1
Mapper Input
Mapping
aardvark, 1
cat, 1
mat, 1
on, 2
sat, 2
sofa, 1
the, 4
aardvark, 1
cat, 1
mat, 1
on, 2
sat, 2
sofa, 1
the, 4
aardvark, 1
cat, 1
mat, 1
on [1, 1]
sat [1, 1]
sofa, 1
the [1, 1, 1, 1]
Shuffling Reducing
Final Result
37
Summary
• GFS and MR co-design
• Cheap, simple, effective at scale
• Fault-tolerance baked in
• Replicate data 3x
• Incrementally re-execute computation
• Avoid single points of failure
38
Hadoop Ecosystem Overview
What Are All These Things?
39
Sqoop
Performs bidirectional
data transfers between
Hadoop and almost
any SQL database with
a JDBC driver
40
Flume
Client
Client
Client
Client
Agent
Agent
Agent
A streaming data
collection and
aggregation system
for massive volumes
of data, such as RPC
services, Log4J,
Syslog, etc.
41
Hive
SELECT
s.word, s.freq, k.freq
FROM shakespeare
JOIN ON (s.word=
k.word)
WHERE s.freq >= 5;
• Relational database
abstraction using a SQL
like dialect called HiveQL
• Statements are executed
as one or more
MapReduce Jobs
42
Pig
• High-level scripting language
for for executing one or more
MapReduce jobs
• Created to simplify authoring
of MapReduce jobs
• Can be extended with user
defined functions
emps = LOAD 'people.txt’ AS
(id,name,salary);
rich = FILTER emps BY salary >
200000;
sorted_rich = ORDER rich BY
salary DESC;
STORE sorted_rich INTO
’rich_people.txt';
43
HBase
• Low-latency, distributed,
columnar key-value store
• Based on BigTable
• Efficient random
reads/writes on HDFS
• Useful for frontend
applications
44
Oozie
A workflow engine
and scheduler built
specifically for large-
scale job orchestration
on a Hadoop cluster
45
Hue
• Hue is an open source web-
based application for making it
easier to use Apache Hadoop.
• Hue features
• File Browser for HDFS
• Job Designer/Browser for MapReduce
• Query editors for Hive, Pig and
Cloudera Impala
• Oozie
46
Zookeeper
• Zookeeper is a distributed consensus
engine
• Provides well-defined concurrent
access semantics:
• Leader election
• Service discovery
• Distributed locking / mutual
exclusion
• Message board / mailboxes
47
Hadoop Ecosystem
INGEST STORE EXPLORE PROCESS ANALYZE SERVE
CONNECTORS
STORAGE
RESOURCE MGMT
& COORDINATION
USER INTERFACE WORKFLOW MGMT METADATACLOUD
INTEGRATION
YA
YARN
ZO
ZOOKEEPER
HDFS
HADOOP DFS
HB
HBASE
HU
HUE
OO
OOZIE
WH
WHIRR
SQ
SQOOP
FL
FLUME
FILE
FUSE-DFS
REST
WEBHDFS / HTTPFS
SQL
ODBC / JDBC
MS
META STORE
AC
ACCESS
BI ETL RDBMS
BATCH COMPUTE
BATCH PROCESSING REAL-TIME ACCESS
& COMPUTE
MR
MAPREDUCE
MR2
MAPREDUCE2
HI
HIVE
PI
PIG
MA
MAHOUT
DF
DATAFU
IM
IMPALA
MANAGEMENT SOFTWARE &
TECHNICAL SUPPORT
SUBSCIPTION OPTIONS
CLOUDERA NAVIGATOR
CLOUDERA MANAGER
CORE
(REQUIRED)
RTD RTQ
BDR
AUDIT
(v1.0)
LINEAGE
ACCESS
(v1.0)
LIFECYCLE
EXPLORE
CORE
48
The Cloudera Advantage
49
Cloudera Impala
Cost-effective, ad hoc query environment that offloads the
data warehouse for:
• Interactive BI/analytics on more data
• Asking new questions
• Data processing with tight SLAs
• Query-able archive w/full fidelity
50
Cloudera Impala
Interactive SQL for Hadoop
• Responses in seconds
• Nearly ANSI-92 standard SQL with Hive SQL
Native MPP Query Engine
• Purpose-built for low-latency queries
• Separate runtime from MapReduce
• Designed as part of the Hadoop ecosystem
Open Source
• Apache-licensed
51
Impala Key Features
Fast Flexible Secure
Easy to Implement Easy to Use Simple to Manage
• In-memory data transfers
• Partitioned joins
• Fully distributed aggregations
• Query data in HDFS & HBase
• Supports multiple file formats &
compression algorithms
• Integrated with Hadoop security
• Kerberos authentication
• Authorization (Sentry)
• Leverages Hive’s ODBC/JDBC
connectors, metastore & SQL syntax
• Open source
• Interact with data via SQL
• Certified with leading BI tools
• Deploy, configure & monitor with
Cloudera Manager
• Integrated with Hadoop resource
management
52
The Impala Advantage
BI Partners: Building on the Enterprise Standard
53
Cloudera Search
Offer easy access to non-technical resources
Explore data prior to processing and modeling
Gain immediate access and find correlations in
mission-critical data
Powerful, proven search capabilities
that let organizations:
54
Cloudera Search
Interactive Search for All Data
• Full-text and faceted navigation
• Batch, near real-time, and on-demand indexing
Apache Solr Integrated with CDH
• Established, mature search with vibrant community
• Separate runtime like MapReduce, Impala
• Incorporated as part of the Hadoop ecosystem
Open Source
• 100% Apache, 100% Solr
• Standard Solr APIs
55
Search Key Features
Scalable Flexible Timely
Mature Simple to Use Easy to Manage
• Index storage & retrieval on HDFS
• Indexing with MapReduce and
Flume
• Shard management with Zookeeper
• Indexing and query of any data in
HDFS and HBase
• Support for multiple file formats
• Field mapping and matching with
Morphlines
• Indexing in batch, on-demand, and
in near real-time
• Scalable extraction and mapping
with built-in Solr sink for Flume
• Proven, enterprise-ready
technology
• Rich ecosystem and knowledge
within community
• Familiar full-text search and faceted
navigation
• Out-of-the-box Search GUI
• Known, readily available standard
Solr APIs
• Integrated with Cloudera Manager
and Apache Sentry
• Integrated coordination and
execution of jobs
• GoLive for incremental changes
56
The Search Advantage
Search Partners:
Building on the
Cloudera Enterprise
Data Hub
57
Sentry
Unlocks Key RBAC Requirements
Secure, fine-grained, role-based authorization
Multi-tenant administration
Open Source
Submitted to ASF
Open Source authorization module for Impala & Hive
58
Defining Security Functions
Perimeter
Guarding access to the cluster
itself
Technical Concepts:
Authentication
Network isolation
Data
Protecting data in the cluster
from unauthorized visibility
Technical Concepts:
Encryption, Tokenization
Data masking
Access
Defining what users and
applications can do with data
Technical Concepts:
Permissions
Authorization
Visibility
Reporting on where data
came from and how it’s being
used
Technical Concepts:
Auditing
Lineage
59
Enabling Enterprise Security
Perimeter
Guarding access to the cluster
itself
Technical Concepts:
Authentication
Network isolation
Data
Protecting data in the cluster
from unauthorized visibility
Technical Concepts:
Encryption, Tokenization
Data masking
Access
Defining what users and
applications can do with data
Technical Concepts:
Permissions
Authorization
Visibility
Reporting on where data
came from and how it’s being
used
Technical Concepts:
Auditing
Lineage
Apache SentryKerberos, AD/LDAP Cloudera NavigatorCertified Partners
60
Authorization Requirements
Secure Authorization
Ability to control access to data and/or privileges on data for authenticated users
Fine-Grained Authorization
Ability to give users access to a subset of data (e.g. column) in a database
Role-Based Authorization
Ability to create/apply templatized privileges based on functional roles
Multitenant Administration
Ability for central admin group to empower lower-level admins to manage
security for each database/schema
61
Key Capabilities of Sentry
Fine-Grained Authorization
Specify security for SERVERS, DATABASES, TABLES & VIEWS
Role-Based Authorization
SELECT privilege on views & tables
INSERT privilege on tables
TRANSFORM privilege on servers
ALL privilege on the server, databases, tables & views
ALL privilege is needed to create/modify schema
Multitenant Administration
Separate policies for each database/schema
Can be maintained by separate admins
62
Challenges with Hadoop without Management
Hadoop is more than a dozen services running across many machines
• Hundreds of hardware components
• Thousands of settings
• Limitless permutations
Complexity
Hadoop is a system, not just a collection of parts
• Everything is interrelated
• Raw data about individual pieces is not enough
• Must extract what’s important
Context
Managing Hadoop with multiple tools and manual process takes longer
• Complicated, error-prone workflows
• Longer issue resolution
• Lack of consistent and repeatable processes
Efficiency
63
Cloudera Manager
End-to-End Administration for Your Enterprise Data Hub
Manage
Easily deploy, configure & optimize clusters1
Monitor
Maintain a central view of all activity2
Diagnose
Easily identify and resolve issues3
Integrate
Use Cloudera Manager with existing tools4
64
One Tool For Everything
Managing Complexity
+
DEPLOYMENT &
CONFIGURATION
MONITORING WORKFLOWS EVENTS & ALERTS LOG SEARCH DIAGNOSTICS REPORTING ACTIVITY MONITORING
DO-IT-YOURSELF
WITH CLOUDERA
65
$7M
$5M
$3M
$1.8M
$852K
$3.5M
$1.7M
$8.4M
$5.4M
SELF
MANAGE
CLOUDERA
ENTERPRISE
SELF
MANAGE
CLOUDERA
ENTERPRISE
SELF
MANAGE
CLOUDERA
ENTERPRISE
25 NODES
$948K SAVINGS
50 NODES
$1.8M SAVINGS
100 NODES
$3M SAVINGS
Three-Year TCO Comparison
Maximizing Efficiency
66
Why Cloudera Manager
End-to-end administration for the Enterprise Data Hub in a single tool
Simple
Manages Hadoop at a system level – Cloudera’s experience realized in software
Intelligent
Simplifies complex workflows and makes administrators more productive
Efficient
The only enterprise-grade Hadoop management application available
Best-in-Class
67
Why Backup and Disaster Recovery?
Cloudera Enterprise is a Mission-Critical Part
of the Data Management Infrastructure
• Stores valuable data and runs important workloads
• Business continuity is a MUST HAVE
1
Managing Business Continuity for Hadoop is
Complex
• Different services that store data – HDFS, HBase, Hive
• Backup and disaster recovery is configured separately for each
• Processes are manual
2
68
BDR in Cloudera Enterprise
Simplified Management of Backup & DR Policies
Central Configuration
• HDFS - Select files & directories to replicate
• Hive - Select tables to replicate
• Schedule replication jobs for optimal times
HDFS
HIVE
NODES
Monitoring & Alerting
• Track progress of replication jobs
• Get notified when data is out of sync
Performance & Reliability
• High performance replication using MapReduce
• CDH-optimized version of DistCP
SITE A SITE B
HDFS
HIVE
NODES
69
Benefits of BDR
Reduce Complexity • Centrally manage backup and DR workflows
• Simple setup via an intuitive user interface
Maximize Efficiency
• Simplify processes to meet or exceed SLAs and
Recovery Time Objectives (RTOs)
• Optimize system performance and network
impact through scheduling
Reduce Risk & Exposure
• Eliminate error-prone manual processes
• Get notified when issues occur
• The only solution for metadata replication (Hive)
70
One Tool For Everything
Managing Complexity
+
DEPLOYMENT &
CONFIGURATION
MONITORING WORKFLOWS EVENTS & ALERTS LOG SEARCH DIAGNOSTICS REPORTING
ACTIVITY
MONITORING
DO-IT-YOURSELF
WITH CLOUDERA
71
1 2 3Find Nodes Install Components Assign Roles
Enter the names of the hosts which will be
included in the Hadoop cluster. Click Continue.
Cloudera Manager automatically installs the CDH
components on the hosts you specified.
Verify the roles of the nodes within your cluster.
Make changes as necessary.
Install a Cluster in Three Simple Steps
Cloudera Manager Key Features
72
View Service Health and Performance
72
Cloudera Manager Key Features
73
Monitor and Diagnose Cluster Workloads
Cloudera Manager Key Features
74
Rolling Upgrades
Cloudera Manager Key Features
75
Why You Need Cloudera Navigator
Lots of Data Landing in Cloudera Enterprise
• Huge quantities
• Many different sources – structured and unstructured
• Varying levels of sensitivity
1
Many Users Working with the Data
• Administrators and compliance officers
• Analysts and data scientists
• Business users
2
Need to Effectively Control and Consume Data
• Get visibility and control over the environment
• Discover and explore data
3
76
Cloudera Navigator
Data Management Layer for Cloudera Enterprise
Audit & Access Control
Ensuring appropriate permissions and reporting on
data access for compliance
Discovery & Exploration
Finding out what data is available and what it
looks like
Lineage
Tracing data back to its original source
Lifecycle Management
Migration of data based on policies
Enterprise Metadata Repository
• Business metadata
• Lineage metadata
• Operational metadata
Audit &
Access
Control
Discovery &
Exploration
Lineage
Lifecycle
Mgmt.
HDFS HBASE HIVE
CLOUDERA NAVIGATOR
CDH
77
Cloudera Navigator
Data Audit & Access Control
Verify Permissions
View which users and groups have access to files and
directories
Audit Configuration
Configuration of audit tracking for HDFS, HBase
and Hive
Audit Dashboard
Simple, queryable interface to view data access
Information Export
Export audit information for integration with SIEM tools
HDFS
HBASE
HIVE
IAM / LDAP SYSTEM
3rd PARTY SIEM / GRC SYSTEM
ACCESS
SERVICE
AUDIT LOG
SERVICE
VIEW PERMISSIONS AUDIT LOG CONFIG
AUDIT LOG
COLLECTION
CLOUDERA NAVIGATOR 1.0
78
Benefits of Cloudera Navigator
Control • Store sensitive data
• Maintain full audit history
• The first and only centralized audit tool for Hadoop
Visibility
• Verify access permissions to files and directories
• Report on data access by user and type
Integration • View permissions for LDAP/IAM users
• Export audit data for integration with third-party SIEM tools
Getting Started with Hadoop

More Related Content

PPTX
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
PPTX
A Survey of HBase Application Archetypes
PPTX
SQL Server 2012 and Big Data
PDF
HBase Status Report - Hadoop Summit Europe 2014
PPTX
Apache HBase™
PPTX
HBase in Practice
PDF
HBase and Impala Notes - Munich HUG - 20131017
PPTX
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
A Survey of HBase Application Archetypes
SQL Server 2012 and Big Data
HBase Status Report - Hadoop Summit Europe 2014
Apache HBase™
HBase in Practice
HBase and Impala Notes - Munich HUG - 20131017
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

What's hot (20)

PPTX
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
PDF
Cisco connect toronto 2015 big data sean mc keown
PPTX
HBaseCon 2013: Full-Text Indexing for Apache HBase
PDF
Data Evolution in HBase
PDF
Efficient in situ processing of various storage types on apache tajo
PDF
Realtime Analytics with Hadoop and HBase
PPTX
Apache drill
PDF
Hadoop meets Agile! - An Agile Big Data Model
PDF
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version)
PPTX
Microsoft's Big Play for Big Data
PPTX
Big data and hadoop anupama
PDF
Apachecon Europe 2012: Operating HBase - Things you need to know
PPTX
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
PPT
NoSQL_Night
PPTX
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
PPTX
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
PPTX
Hadoop hbase mapreduce
PPTX
Apache Tez - A unifying Framework for Hadoop Data Processing
PPT
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
PPTX
Apache HBase - Introduction & Use Cases
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Cisco connect toronto 2015 big data sean mc keown
HBaseCon 2013: Full-Text Indexing for Apache HBase
Data Evolution in HBase
Efficient in situ processing of various storage types on apache tajo
Realtime Analytics with Hadoop and HBase
Apache drill
Hadoop meets Agile! - An Agile Big Data Model
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version)
Microsoft's Big Play for Big Data
Big data and hadoop anupama
Apachecon Europe 2012: Operating HBase - Things you need to know
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
NoSQL_Night
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Hadoop hbase mapreduce
Apache Tez - A unifying Framework for Hadoop Data Processing
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Apache HBase - Introduction & Use Cases
Ad

Viewers also liked (10)

PPTX
Marco tulio
PPTX
Auditoria
PDF
Независимая аналитика Инвесткафе
PDF
5 Tips for Making Charitable Donations Count
PDF
Letter of reference
PDF
Malt Shovel Article
DOCX
Resilience Program Development
DOCX
Ecological Effects of CSA
DOCX
DOCX
Majemite Kelvin Esisi-CV Nigeria
Marco tulio
Auditoria
Независимая аналитика Инвесткафе
5 Tips for Making Charitable Donations Count
Letter of reference
Malt Shovel Article
Resilience Program Development
Ecological Effects of CSA
Majemite Kelvin Esisi-CV Nigeria
Ad

Similar to Getting Started with Hadoop (20)

PPTX
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
PDF
Big Data Developers Moscow Meetup 1 - sql on hadoop
PPTX
Hadoop and Big data in Big data and cloud.pptx
PPTX
Hadoop.pptx
PPTX
Hadoop.pptx
PPTX
List of Engineering Colleges in Uttarakhand
PDF
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
PPTX
Hadoop ppt1
PDF
Scaling Storage and Computation with Hadoop
PDF
Apache Hadoop 1.1
PPT
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
PPTX
Big data - Online Training
PPTX
Hadoop ppt on the basics and architecture
PPTX
SQL on Hadoop
PPTX
Hadoop Training in Hyderabad
PPTX
Hadoop Training in Hyderabad
PDF
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
PPTX
Scaling etl with hadoop shapira 3
PDF
Rapid Cluster Computing with Apache Spark 2016
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
Big Data Developers Moscow Meetup 1 - sql on hadoop
Hadoop and Big data in Big data and cloud.pptx
Hadoop.pptx
Hadoop.pptx
List of Engineering Colleges in Uttarakhand
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Hadoop ppt1
Scaling Storage and Computation with Hadoop
Apache Hadoop 1.1
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Big data - Online Training
Hadoop ppt on the basics and architecture
SQL on Hadoop
Hadoop Training in Hyderabad
Hadoop Training in Hyderabad
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
Scaling etl with hadoop shapira 3
Rapid Cluster Computing with Apache Spark 2016

More from Cloudera, Inc. (20)

PPTX
Partner Briefing_January 25 (FINAL).pptx
PPTX
Cloudera Data Impact Awards 2021 - Finalists
PPTX
2020 Cloudera Data Impact Awards Finalists
PPTX
Edc event vienna presentation 1 oct 2019
PPTX
Machine Learning with Limited Labeled Data 4/3/19
PPTX
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
PPTX
Introducing Cloudera DataFlow (CDF) 2.13.19
PPTX
Introducing Cloudera Data Science Workbench for HDP 2.12.19
PPTX
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
PPTX
Leveraging the cloud for analytics and machine learning 1.29.19
PPTX
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
PPTX
Leveraging the Cloud for Big Data Analytics 12.11.18
PPTX
Modern Data Warehouse Fundamentals Part 3
PPTX
Modern Data Warehouse Fundamentals Part 2
PPTX
Modern Data Warehouse Fundamentals Part 1
PPTX
Extending Cloudera SDX beyond the Platform
PPTX
Federated Learning: ML with Privacy on the Edge 11.15.18
PPTX
Analyst Webinar: Doing a 180 on Customer 360
PPTX
Build a modern platform for anti-money laundering 9.19.18
PPTX
Introducing the data science sandbox as a service 8.30.18
Partner Briefing_January 25 (FINAL).pptx
Cloudera Data Impact Awards 2021 - Finalists
2020 Cloudera Data Impact Awards Finalists
Edc event vienna presentation 1 oct 2019
Machine Learning with Limited Labeled Data 4/3/19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Leveraging the cloud for analytics and machine learning 1.29.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Leveraging the Cloud for Big Data Analytics 12.11.18
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 1
Extending Cloudera SDX beyond the Platform
Federated Learning: ML with Privacy on the Edge 11.15.18
Analyst Webinar: Doing a 180 on Customer 360
Build a modern platform for anti-money laundering 9.19.18
Introducing the data science sandbox as a service 8.30.18

Recently uploaded (20)

PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
Big Data Technologies - Introduction.pptx
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Encapsulation theory and applications.pdf
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPTX
Understanding_Digital_Forensics_Presentation.pptx
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Chapter 3 Spatial Domain Image Processing.pdf
Big Data Technologies - Introduction.pptx
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Encapsulation_ Review paper, used for researhc scholars
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
20250228 LYD VKU AI Blended-Learning.pptx
Encapsulation theory and applications.pdf
Review of recent advances in non-invasive hemoglobin estimation
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Advanced methodologies resolving dimensionality complications for autism neur...
Understanding_Digital_Forensics_Presentation.pptx
The AUB Centre for AI in Media Proposal.docx
NewMind AI Weekly Chronicles - August'25 Week I
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx

Getting Started with Hadoop

  • 3. 3 Agenda • 1999 the Database Story • Intro to Hadoop • Basic Hadoop Eco-system • Why CDH is Awesome • QA
  • 5. 5
  • 6. 6 Indexing the Web • Web is huge • Hundreds of millions of pages in 1999 • How do you index it? • Crawl all the pages • Rank pages based on relevance metrics • Build search index of keywords to pages • Do it in real-time!
  • 7. 7
  • 8. 8 Databases in 1999 1. Buy a really big machine 2. Install an expensive DBMS on it 3. Point your workload at it 4. Hope it doesn’t fail 5. Ambitious: buy another really big machine as a backup
  • 9. 9 Database Limitations • Didn’t scale horizontally • High marginal cost ($$$) • No real fault-tolerance story • Vendor lock in ($$$) • SQL unsuited for search ranking • Complex analysis (PageRank) • Unstructured data
  • 10. 10 Google Does Something Different • Designed their own storage and processing infrastructure • Google File System and MapReduce • Goals: • Cheap • Scalable • Reliable
  • 11. 11 Google Does Something Different • It worked! • Powered Google Search for many years • General framework for large-scale batch computation tasks • Still used internally at Google to this day
  • 14. 14
  • 15. 15 Google’s messages from the future • Google was benevolent enough to publish • 2003: Google File System (GFS) paper • 2004: MapReduce paper • Already mature technologies at this point
  • 16. 16 Google’s messages from the future • Community didn’t get it immediately • DB people thought it was silly • Non-Google weren’t at the same scale yet • Google had little interest in releasing GFS and MapReduce • Business was ads, not infrastructure
  • 17. 17 Birth of Hadoop • Doug Cutting and Mike Cafarella • Nutch • Open-source search platform • Ran into scaling issues • 4 nodes • Hard to program • Hard to manage • Immediate application for GFS and MR
  • 18. 18 Birth of Hadoop • 2004-2006: Implemented GFS/MR and ported Nutch to it • 2006: Spun out into Apache Hadoop • Name of Doug’s son’s stuffed elephant
  • 20. 20 Summary • The web is huge and unstructured • Databases didn’t fit the problem • Didn’t scale, expensive, SQL limitations • Google did their own thing: GFS + MR • Hadoop is based on the Google papers
  • 22. 22 HDFS • Based on GFS • Distributed, fault-tolerant filesystem • Primarily designed for cost and scale • Works on commodity hardware • 20PB / 4000 node cluster at Facebook
  • 23. 23 HDFS design assumptions • Failures are common • Massive scale means more failures • Disks, network, node • Files are append-only • Files are large (GBs to TBs) • Accesses are large and sequential
  • 24. 24 Quick primers • Filesystems • Hard drives • Datacenter networking
  • 25. 25 Quick filesystem primer • Same concepts as the FS on your laptop • Directory tree • Create, read, write, delete files • Filesystems store metadata and data • Metadata: filename, size, permissions, … • Data: contents of a file • Other concerns • Data integrity, durability, management
  • 26. 26 Quick disk primer • Disk does a seek for each I/O operation • Seeks are expensive (~10ms) • Throughput / IOPS tradeoff • 100 MB/s and 10 IOPS • 10MB/s and 100 IOPS • Big I/Os mean better throughput
  • 28. 28 Quick networking primer 40 Gbit 1, 2, 4, 10 Gbit 10 Gbit
  • 29. 29 HDFS Architecture Overview Secondary Namenode Host 2 Namenode Host 1 DataNode Host 3 DataNode Host 4 DataNode Host 5 DataNode Host n
  • 30. 30 Block Size = 64MB Replication Factor = 3 HDFS Block Replication 1 2 3 4 5 2 3 4 2 4 5 1 3 5 1 2 5 1 3 4 HDFS Node 1 Node 2 Node 3 Node 4 Node 5 Blocks
  • 31. 31 HDFS Write Path • Talk to NameNode • Store metadata for new file • Get topology-aware list of DataNodes • Setup the write pipeline • Stream data to pipeline • Tell NameNode when done
  • 32. 32 HDFS Fault-tolerance • Many different failure modes • Disk corruption, node failure, switch failure • Primary concern • Data is safe!!! • Secondary concerns • Keep accepting reads and writes • Do it transparently to clients
  • 33. 33 MapReduce – Map • Records from the data source (lines out of files, rows of a database, etc) are fed into the map function as key*value pairs: e.g., (filename, line). • map() produces one or more intermediate values along with an output key from the input. Map Task (key 1, values) (key 2, values) (key 3, values) Shuffle Phase (key 1, int. values) (key 1, int. values) (key 1, int. values) Reduce Task Final (key, values)
  • 34. 34 MapReduce – Reduce • After the map phase is over, all the intermediate values for a given output key are combined together into a list • reduce() combines those intermediate values into one or more final values for that same output key Map Task (key 1, values) (key 2, values) (key 3, values) Shuffle Phase (key 1, int. values) (key 1, int. values) (key 1, int. values) Reduce Task Final (key, values)
  • 36. 36 Word Count Example The cat sat on the mat The aardvark sat on the sofa The, 1 cat, 1 sat, 1 on, 1 the, 1 mat, 1 The, 1 aardvark, 1 sat, 1 on, 1 the, 1 sofa, 1 Mapper Input Mapping aardvark, 1 cat, 1 mat, 1 on, 2 sat, 2 sofa, 1 the, 4 aardvark, 1 cat, 1 mat, 1 on, 2 sat, 2 sofa, 1 the, 4 aardvark, 1 cat, 1 mat, 1 on [1, 1] sat [1, 1] sofa, 1 the [1, 1, 1, 1] Shuffling Reducing Final Result
  • 37. 37 Summary • GFS and MR co-design • Cheap, simple, effective at scale • Fault-tolerance baked in • Replicate data 3x • Incrementally re-execute computation • Avoid single points of failure
  • 38. 38 Hadoop Ecosystem Overview What Are All These Things?
  • 39. 39 Sqoop Performs bidirectional data transfers between Hadoop and almost any SQL database with a JDBC driver
  • 40. 40 Flume Client Client Client Client Agent Agent Agent A streaming data collection and aggregation system for massive volumes of data, such as RPC services, Log4J, Syslog, etc.
  • 41. 41 Hive SELECT s.word, s.freq, k.freq FROM shakespeare JOIN ON (s.word= k.word) WHERE s.freq >= 5; • Relational database abstraction using a SQL like dialect called HiveQL • Statements are executed as one or more MapReduce Jobs
  • 42. 42 Pig • High-level scripting language for for executing one or more MapReduce jobs • Created to simplify authoring of MapReduce jobs • Can be extended with user defined functions emps = LOAD 'people.txt’ AS (id,name,salary); rich = FILTER emps BY salary > 200000; sorted_rich = ORDER rich BY salary DESC; STORE sorted_rich INTO ’rich_people.txt';
  • 43. 43 HBase • Low-latency, distributed, columnar key-value store • Based on BigTable • Efficient random reads/writes on HDFS • Useful for frontend applications
  • 44. 44 Oozie A workflow engine and scheduler built specifically for large- scale job orchestration on a Hadoop cluster
  • 45. 45 Hue • Hue is an open source web- based application for making it easier to use Apache Hadoop. • Hue features • File Browser for HDFS • Job Designer/Browser for MapReduce • Query editors for Hive, Pig and Cloudera Impala • Oozie
  • 46. 46 Zookeeper • Zookeeper is a distributed consensus engine • Provides well-defined concurrent access semantics: • Leader election • Service discovery • Distributed locking / mutual exclusion • Message board / mailboxes
  • 47. 47 Hadoop Ecosystem INGEST STORE EXPLORE PROCESS ANALYZE SERVE CONNECTORS STORAGE RESOURCE MGMT & COORDINATION USER INTERFACE WORKFLOW MGMT METADATACLOUD INTEGRATION YA YARN ZO ZOOKEEPER HDFS HADOOP DFS HB HBASE HU HUE OO OOZIE WH WHIRR SQ SQOOP FL FLUME FILE FUSE-DFS REST WEBHDFS / HTTPFS SQL ODBC / JDBC MS META STORE AC ACCESS BI ETL RDBMS BATCH COMPUTE BATCH PROCESSING REAL-TIME ACCESS & COMPUTE MR MAPREDUCE MR2 MAPREDUCE2 HI HIVE PI PIG MA MAHOUT DF DATAFU IM IMPALA MANAGEMENT SOFTWARE & TECHNICAL SUPPORT SUBSCIPTION OPTIONS CLOUDERA NAVIGATOR CLOUDERA MANAGER CORE (REQUIRED) RTD RTQ BDR AUDIT (v1.0) LINEAGE ACCESS (v1.0) LIFECYCLE EXPLORE CORE
  • 49. 49 Cloudera Impala Cost-effective, ad hoc query environment that offloads the data warehouse for: • Interactive BI/analytics on more data • Asking new questions • Data processing with tight SLAs • Query-able archive w/full fidelity
  • 50. 50 Cloudera Impala Interactive SQL for Hadoop • Responses in seconds • Nearly ANSI-92 standard SQL with Hive SQL Native MPP Query Engine • Purpose-built for low-latency queries • Separate runtime from MapReduce • Designed as part of the Hadoop ecosystem Open Source • Apache-licensed
  • 51. 51 Impala Key Features Fast Flexible Secure Easy to Implement Easy to Use Simple to Manage • In-memory data transfers • Partitioned joins • Fully distributed aggregations • Query data in HDFS & HBase • Supports multiple file formats & compression algorithms • Integrated with Hadoop security • Kerberos authentication • Authorization (Sentry) • Leverages Hive’s ODBC/JDBC connectors, metastore & SQL syntax • Open source • Interact with data via SQL • Certified with leading BI tools • Deploy, configure & monitor with Cloudera Manager • Integrated with Hadoop resource management
  • 52. 52 The Impala Advantage BI Partners: Building on the Enterprise Standard
  • 53. 53 Cloudera Search Offer easy access to non-technical resources Explore data prior to processing and modeling Gain immediate access and find correlations in mission-critical data Powerful, proven search capabilities that let organizations:
  • 54. 54 Cloudera Search Interactive Search for All Data • Full-text and faceted navigation • Batch, near real-time, and on-demand indexing Apache Solr Integrated with CDH • Established, mature search with vibrant community • Separate runtime like MapReduce, Impala • Incorporated as part of the Hadoop ecosystem Open Source • 100% Apache, 100% Solr • Standard Solr APIs
  • 55. 55 Search Key Features Scalable Flexible Timely Mature Simple to Use Easy to Manage • Index storage & retrieval on HDFS • Indexing with MapReduce and Flume • Shard management with Zookeeper • Indexing and query of any data in HDFS and HBase • Support for multiple file formats • Field mapping and matching with Morphlines • Indexing in batch, on-demand, and in near real-time • Scalable extraction and mapping with built-in Solr sink for Flume • Proven, enterprise-ready technology • Rich ecosystem and knowledge within community • Familiar full-text search and faceted navigation • Out-of-the-box Search GUI • Known, readily available standard Solr APIs • Integrated with Cloudera Manager and Apache Sentry • Integrated coordination and execution of jobs • GoLive for incremental changes
  • 56. 56 The Search Advantage Search Partners: Building on the Cloudera Enterprise Data Hub
  • 57. 57 Sentry Unlocks Key RBAC Requirements Secure, fine-grained, role-based authorization Multi-tenant administration Open Source Submitted to ASF Open Source authorization module for Impala & Hive
  • 58. 58 Defining Security Functions Perimeter Guarding access to the cluster itself Technical Concepts: Authentication Network isolation Data Protecting data in the cluster from unauthorized visibility Technical Concepts: Encryption, Tokenization Data masking Access Defining what users and applications can do with data Technical Concepts: Permissions Authorization Visibility Reporting on where data came from and how it’s being used Technical Concepts: Auditing Lineage
  • 59. 59 Enabling Enterprise Security Perimeter Guarding access to the cluster itself Technical Concepts: Authentication Network isolation Data Protecting data in the cluster from unauthorized visibility Technical Concepts: Encryption, Tokenization Data masking Access Defining what users and applications can do with data Technical Concepts: Permissions Authorization Visibility Reporting on where data came from and how it’s being used Technical Concepts: Auditing Lineage Apache SentryKerberos, AD/LDAP Cloudera NavigatorCertified Partners
  • 60. 60 Authorization Requirements Secure Authorization Ability to control access to data and/or privileges on data for authenticated users Fine-Grained Authorization Ability to give users access to a subset of data (e.g. column) in a database Role-Based Authorization Ability to create/apply templatized privileges based on functional roles Multitenant Administration Ability for central admin group to empower lower-level admins to manage security for each database/schema
  • 61. 61 Key Capabilities of Sentry Fine-Grained Authorization Specify security for SERVERS, DATABASES, TABLES & VIEWS Role-Based Authorization SELECT privilege on views & tables INSERT privilege on tables TRANSFORM privilege on servers ALL privilege on the server, databases, tables & views ALL privilege is needed to create/modify schema Multitenant Administration Separate policies for each database/schema Can be maintained by separate admins
  • 62. 62 Challenges with Hadoop without Management Hadoop is more than a dozen services running across many machines • Hundreds of hardware components • Thousands of settings • Limitless permutations Complexity Hadoop is a system, not just a collection of parts • Everything is interrelated • Raw data about individual pieces is not enough • Must extract what’s important Context Managing Hadoop with multiple tools and manual process takes longer • Complicated, error-prone workflows • Longer issue resolution • Lack of consistent and repeatable processes Efficiency
  • 63. 63 Cloudera Manager End-to-End Administration for Your Enterprise Data Hub Manage Easily deploy, configure & optimize clusters1 Monitor Maintain a central view of all activity2 Diagnose Easily identify and resolve issues3 Integrate Use Cloudera Manager with existing tools4
  • 64. 64 One Tool For Everything Managing Complexity + DEPLOYMENT & CONFIGURATION MONITORING WORKFLOWS EVENTS & ALERTS LOG SEARCH DIAGNOSTICS REPORTING ACTIVITY MONITORING DO-IT-YOURSELF WITH CLOUDERA
  • 66. 66 Why Cloudera Manager End-to-end administration for the Enterprise Data Hub in a single tool Simple Manages Hadoop at a system level – Cloudera’s experience realized in software Intelligent Simplifies complex workflows and makes administrators more productive Efficient The only enterprise-grade Hadoop management application available Best-in-Class
  • 67. 67 Why Backup and Disaster Recovery? Cloudera Enterprise is a Mission-Critical Part of the Data Management Infrastructure • Stores valuable data and runs important workloads • Business continuity is a MUST HAVE 1 Managing Business Continuity for Hadoop is Complex • Different services that store data – HDFS, HBase, Hive • Backup and disaster recovery is configured separately for each • Processes are manual 2
  • 68. 68 BDR in Cloudera Enterprise Simplified Management of Backup & DR Policies Central Configuration • HDFS - Select files & directories to replicate • Hive - Select tables to replicate • Schedule replication jobs for optimal times HDFS HIVE NODES Monitoring & Alerting • Track progress of replication jobs • Get notified when data is out of sync Performance & Reliability • High performance replication using MapReduce • CDH-optimized version of DistCP SITE A SITE B HDFS HIVE NODES
  • 69. 69 Benefits of BDR Reduce Complexity • Centrally manage backup and DR workflows • Simple setup via an intuitive user interface Maximize Efficiency • Simplify processes to meet or exceed SLAs and Recovery Time Objectives (RTOs) • Optimize system performance and network impact through scheduling Reduce Risk & Exposure • Eliminate error-prone manual processes • Get notified when issues occur • The only solution for metadata replication (Hive)
  • 70. 70 One Tool For Everything Managing Complexity + DEPLOYMENT & CONFIGURATION MONITORING WORKFLOWS EVENTS & ALERTS LOG SEARCH DIAGNOSTICS REPORTING ACTIVITY MONITORING DO-IT-YOURSELF WITH CLOUDERA
  • 71. 71 1 2 3Find Nodes Install Components Assign Roles Enter the names of the hosts which will be included in the Hadoop cluster. Click Continue. Cloudera Manager automatically installs the CDH components on the hosts you specified. Verify the roles of the nodes within your cluster. Make changes as necessary. Install a Cluster in Three Simple Steps Cloudera Manager Key Features
  • 72. 72 View Service Health and Performance 72 Cloudera Manager Key Features
  • 73. 73 Monitor and Diagnose Cluster Workloads Cloudera Manager Key Features
  • 75. 75 Why You Need Cloudera Navigator Lots of Data Landing in Cloudera Enterprise • Huge quantities • Many different sources – structured and unstructured • Varying levels of sensitivity 1 Many Users Working with the Data • Administrators and compliance officers • Analysts and data scientists • Business users 2 Need to Effectively Control and Consume Data • Get visibility and control over the environment • Discover and explore data 3
  • 76. 76 Cloudera Navigator Data Management Layer for Cloudera Enterprise Audit & Access Control Ensuring appropriate permissions and reporting on data access for compliance Discovery & Exploration Finding out what data is available and what it looks like Lineage Tracing data back to its original source Lifecycle Management Migration of data based on policies Enterprise Metadata Repository • Business metadata • Lineage metadata • Operational metadata Audit & Access Control Discovery & Exploration Lineage Lifecycle Mgmt. HDFS HBASE HIVE CLOUDERA NAVIGATOR CDH
  • 77. 77 Cloudera Navigator Data Audit & Access Control Verify Permissions View which users and groups have access to files and directories Audit Configuration Configuration of audit tracking for HDFS, HBase and Hive Audit Dashboard Simple, queryable interface to view data access Information Export Export audit information for integration with SIEM tools HDFS HBASE HIVE IAM / LDAP SYSTEM 3rd PARTY SIEM / GRC SYSTEM ACCESS SERVICE AUDIT LOG SERVICE VIEW PERMISSIONS AUDIT LOG CONFIG AUDIT LOG COLLECTION CLOUDERA NAVIGATOR 1.0
  • 78. 78 Benefits of Cloudera Navigator Control • Store sensitive data • Maintain full audit history • The first and only centralized audit tool for Hadoop Visibility • Verify access permissions to files and directories • Report on data access by user and type Integration • View permissions for LDAP/IAM users • Export audit data for integration with third-party SIEM tools