© Cloudera, Inc. All rights reserved.
Cloudera training: secure your Cloudera cluster
The demand for skills is high and Hadoop is the future. Customers
cannot afford to move slowly in staffing their Big Data projects.
Customers are building plans to ensure projects are staffed with
skilled employees, and supported by a qualified services provider.
Job Trends from Indeed.com
What are you most concerned about when it comes to your readiness for big data and Hadoop?
Cloudera MDP webinar poll results, July 2016
Why Cloudera training?
Aligned to best practices and the pace of change
1 Broadest range of courses
Learning paths for Developer, Admin, Analyst
2 Most experienced instructors
More than 40,000 trained since 2009
3 Leader in certification
Over 12,000 accredited Cloudera professionals
4 Trusted source for training
100,000+ people have attended online courses
5 State of the art curriculum
Courses updated as Hadoop evolves
6 Widest geographic coverage
Most classes offered: 50 cities worldwide plus online
7 Most relevant platform & community
CDH deployed more than all other distributions combined
8 Depth of training material
Hands-on labs and VMs support live instruction
9 Ongoing learning
Video tutorials and e-learning complement training
10 Commitment to big data education
University partnerships to teach Hadoop in colleges
Creating leaders in the field
Training enables Big Data solutions and innovation
94% Would recommend or highly recommend Cloudera training to friends or colleagues
66% Draw on lessons from Cloudera training on at least a monthly basis
40% Develop new apps or perform business-critical analyses as a result of training alone
88% Indicate Cloudera training provided the Hadoop expertise their roles require
Sources: Cloudera Past Public Training Participant Study, December 2012. Cloudera Customer Satisfaction Study, January 2013.
What is available from Cloudera University?
• Private training: Courses delivered at a location of the customer's choice to an internal audience
• Public training: Courses regularly scheduled around the globe; schedule available on the web
• Virtual training: Live training accessed via the internet; available for public and private courses
• OnDemand training: Pre-recorded lectures with content and exercises identical to the live training options
• Certification: Rigorously developed and meaningful bodies of knowledge
Delivery formats: OnDemand, virtual live classroom, private onsite, public live classroom
Suggested Cloudera University curricula
Developers
• Python/Scala Training
• Developer for Spark and Hadoop
• CCA: Spark and Hadoop Developer
• Spark ML & Kafka modules
• Topic-specific training (Search, HBase)
• Hands-on practice
• CCP: Data Engineer
Administrators
• Cloudera Administration training
• CCA: Administrator
• Cloudera Security OnDemand
Data Analysts/Data Scientists
• Data Analyst: Using Hive, Pig & Impala
• CCA: Data Analyst
• Cloudera Data Science
Security for Hadoop
Carlo Lazzaris | Technical Instructor
Security Webinar Agenda
1. The need for Hadoop Security
Hacker news and legal regulations
2. Cloudera Security Implementation
Five levels of security
3. How to secure your Cloudera cluster
Cloudera Documentation
Cloudera professional services
Cloudera OnDemand security course
The need for Hadoop security
Unguarded data stores are the victims
Regulatory Compliance
Organizations can be fined up to 4% of annual global turnover or €20 million, whichever is higher, for breaching GDPR
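The ceiling is the higher of the two amounts. A minimal sketch of that arithmetic, assuming turnover is given in whole euros; the function name `gdpr_max_fine` is illustrative and not part of any Cloudera tool:

```shell
# Upper bound of a GDPR fine: the greater of EUR 20 million
# or 4% of annual global turnover (whole euros, integer arithmetic).
gdpr_max_fine() {
  turnover_eur=$1
  pct_cap=$(( turnover_eur * 4 / 100 ))
  flat_cap=20000000
  if [ "$pct_cap" -gt "$flat_cap" ]; then
    echo "$pct_cap"
  else
    echo "$flat_cap"
  fi
}

gdpr_max_fine 100000000     # turnover of EUR 100 million: the flat cap applies
gdpr_max_fine 2000000000    # turnover of EUR 2 billion: the 4% cap applies
```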
Cloudera security implementation
Cloudera Enterprise CDH
The modern platform for machine learning and analytics optimized for the cloud
[Architecture diagram labels: Extensible Services; Core Services; Data Engineering; Operational Database; Analytic Database; Data Science; Data Catalog; Ingest & Replication; Security; Governance; Workload Management; Storage Services: S3, ADLS, HDFS, Kudu]
Open platform services
Built for multi-function analytics | Optimized for cloud
• Unified security – protects sensitive data with consistent controls, even for transient and recurring workloads
• Consistent governance – enables secure self-service access to all relevant data and increases compliance
• Easy workload management – increases user productivity and boosts job predictability
• Flexible ingest and replication – aggregates a single copy of all data, provides disaster recovery, and eases migration
• Shared catalog – defines and preserves structure and business context of data for new applications and partner solutions
Cloudera Enterprise-Grade Security and Governance
Identity
Validate users by membership in enterprise directory
Technical concepts: authentication, user/group mapping
Access
Defining what users and applications can do with data
Technical concepts: permissions, authorization
Data Protection
Shielding data in the cluster from unauthorized visibility
Technical concepts: encryption at rest & in motion
Visibility
Reporting on where data came from and how it’s being used
Technical concepts: auditing, lineage
Products shown: Cloudera Manager, Apache Sentry, Cloudera Navigator, Navigator Encrypt & Key Trustee
Cloudera Certified Technology Partners
[Partner logo grid, by category: Data Sources (connected machines/data sources, other data sources); Data Ingest; Process, Refine & Prep; Data Discovery; Advanced Analytics]
A certified product integrates securely
• Authentication: authenticates via Kerberos or LDAP
• Authorization: works with Apache Sentry for Hive, Impala, Search, and HDFS
• Encryption: supports HDFS transport encryption and at-rest encryption; supports SSL/TLS connection encryption
Vulnerability Response and Process
[Flow diagram: vulnerability reports (upstream, internal, external), then fix, then publish]
Cluster Security Levels
Cloudera Enterprise
The modern platform for machine learning and analytics optimized for the cloud
Enterprise Encryption Performance
Disclaimer
This talk serves as a general guideline for security implementation on Hadoop.
The actual implementation procedures and scope of implementation vary on a case-by-case basis, and should be assessed by Cloudera’s Professional Services team or certified Cloudera SI Partners.
Non-secure #0
Data Free for All
[Diagram labels: Datacenter; Firewall; Active Directory/KDC; Hadoop cluster; Cloudera Manager; Gateway node; Worker nodes; Applications]
4 modes of Identity Management
1. Simple Authentication
2. Kerberos
3. LDAP
4. SAML
File group ownership
• AD integration
• SSSD or Centrify
Consideration in large enterprises.
Simple authentication detects the user
[Diagram labels: Firewall; Active Directory; Cloudera Manager; Master (two nodes); Worker (three nodes); SSSD/Centrify]
Simple authentication = no authentication
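Under simple authentication, Hadoop accepts whatever user name the client reports: the value of the `HADOOP_USER_NAME` environment variable if it is set, otherwise the local OS account. The sketch below only models that lookup order in shell to show why nothing is actually verified; the function name `claimed_hadoop_user` is illustrative and the code does not call Hadoop.

```shell
# Models how a Hadoop client chooses its identity under simple authentication:
# HADOOP_USER_NAME wins if set, otherwise the local OS account name is used.
# The cluster does not verify either value.
claimed_hadoop_user() {
  if [ -n "${HADOOP_USER_NAME:-}" ]; then
    echo "$HADOOP_USER_NAME"
  else
    id -un
  fi
}

claimed_hadoop_user                                 # the local account name
( HADOOP_USER_NAME=hdfs; claimed_hadoop_user )      # anyone can claim to be "hdfs"
```

On an unsecured cluster the same trick, setting `HADOOP_USER_NAME=hdfs` before an `hdfs dfs` command, is enough to act as the HDFS superuser, which is why Kerberos is the first step in the levels that follow.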
Minimal Security #1
Reduce Risk Exposure
How it works: Authentication
Web UIs
• LDAP and SAML authentication options
SQL Access
• LDAP/AD and Kerberos authentication options
Command Lines
• Kerberos authentication
• Automation provided by Cloudera Manager to leverage Active Directory (AD)
Authentication flow
1. User authenticates to AD or KDC (User [ssmith], Password [*****])
2. Authenticated user gets Kerberos ticket
3. Ticket grants access to services, e.g. Impala
Kerberos
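Strong Authentication
KDC: Key Distribution Center
• MIT
• Active Directory (more common)
Principal format: primary@REALM, for example user@EXAMPLE.COM
[Diagram: user@EXAMPLE.COM authenticates to the KDC for realm EXAMPLE.COM; Hadoop maps the principal user@EXAMPLE.COM to the user name "user"]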
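A Kerberos principal has the form primary[/instance]@REALM; user principals usually omit the instance, while service principals carry a host name there. A small sketch that splits a fully qualified principal into its parts, using only shell parameter expansion. The function name and the host `worker1.example.com` are illustrative assumptions.

```shell
# Splits a fully qualified Kerberos principal into primary, instance, and realm.
# Assumes the principal contains an "@REALM" suffix.
parse_principal() {
  principal=$1
  realm=${principal##*@}      # text after the last "@"
  name=${principal%@*}        # text before the realm
  primary=${name%%/*}         # text before the first "/"
  case $name in
    */*) instance=${name#*/} ;;
    *)   instance= ;;
  esac
  echo "primary=$primary instance=$instance realm=$realm"
}

parse_principal user@EXAMPLE.COM
parse_principal hdfs/worker1.example.com@EXAMPLE.COM
```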
Kerberos
Considerations in large corporations
Time synchronization
CM Kerberos Wizard
• Configure AD to create a Kerberos principal for the CM server, and to delegate to CM the ability to create and manage Kerberos principals
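Time synchronization matters because Kerberos rejects requests when the client and KDC clocks drift too far apart; the default tolerance is 300 seconds in MIT Kerberos (the `clockskew` setting) and 5 minutes in Active Directory. A minimal sketch of the check, assuming both clocks are available as epoch seconds; the function name is illustrative, and how the KDC time is obtained is left open.

```shell
# Succeeds if two epoch timestamps differ by no more than the allowed skew.
# The third argument is optional; 300 seconds is the MIT Kerberos default.
within_clock_skew() {
  local_epoch=$1
  kdc_epoch=$2
  max_skew=${3:-300}
  diff=$(( local_epoch - kdc_epoch ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( 0 - diff ))
  fi
  [ "$diff" -le "$max_skew" ]
}

if within_clock_skew 1700000000 1700000120; then
  echo "clock skew OK"
else
  echo "fix NTP before enabling Kerberos"
fi
```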
Kerberos Authentication
* LDAP over SSL
Authorization/Access Control
Access Control Lists (ACLs)
• HDFS file ACLs
• YARN job submission ACLs
• HBase ACLs
• Oozie ACLs
Sentry managed (RBAC)
• Hive
• Impala
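With Sentry's role-based model, privileges are granted to roles and roles are granted to groups, never directly to users. A hedged sketch that prints the three statements for read access to one database; the role, group, and database names are hypothetical, and the output is meant to be run through beeline or impala-shell by a Sentry administrator.

```shell
# Prints Sentry RBAC statements: privilege -> role -> group.
# Arguments: role name, group name, database name (all illustrative).
sentry_read_grant() {
  role=$1
  group=$2
  database=$3
  cat <<EOF
CREATE ROLE $role;
GRANT ROLE $role TO GROUP $group;
GRANT SELECT ON DATABASE $database TO ROLE $role;
EOF
}

sentry_read_grant analyst_role analysts sales
```

Group membership itself still comes from the directory (for example AD via SSSD), so adding a user to the `analysts` group is what grants the access.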
Auditing
Backup/Disaster Recovery
Cloudera Backup/Disaster Recovery (BDR)
• A high-performance data replicator
• Copies incremental changes from the source cluster on a specified schedule
Supports
• Kerberos
• Data encryption
• HDFS replication to cloud
Kerberized BDR Best Practice
[Diagram: Cloudera BDR replicates from the Production cluster (realm PROD.EXAMPLE.COM) to the DR cluster (realm DR.EXAMPLE.COM); the two KDCs are linked by a cross-realm trust]
More Security #2
Managed, Secure, Protected
Data In-Motion Encryption
RPC encryption
Data transport encryption
• Supports AES-CTR, up to 256-bit key length
HTTP TLS/SSL encryption
• No self-signed certificates in production
[Diagram: an application, master nodes, and worker nodes connected by RPC encryption, transport encryption, and TLS/SSL]
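One way to catch self-signed certificates before they reach production is to compare a certificate's issuer with its subject. The sketch below is a heuristic under that assumption, using the standard `openssl x509` options; the function name is illustrative and the certificate path is supplied by the caller.

```shell
# Succeeds if the PEM certificate's issuer equals its subject,
# the usual sign of a self-signed certificate.
is_self_signed() {
  cert=$1
  subject=$(openssl x509 -in "$cert" -noout -subject | sed 's/^subject= *//')
  issuer=$(openssl x509 -in "$cert" -noout -issuer | sed 's/^issuer= *//')
  [ -n "$subject" ] && [ "$subject" = "$issuer" ]
}

# Usage (the path is a placeholder):
#   is_self_signed /path/to/server.pem && echo "self-signed: replace before production"
```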
Data At-Rest Encryption
Transparent encryption
Supports any Hadoop application
Encryption Zone
$ hadoop key create mykey
$ hadoop fs -mkdir /zone
$ hdfs crypto -createZone -keyName mykey -path /zone
[Diagram: directory tree rooted at / with /tmp and /zone and the files foo and bar; /zone is marked as the encryption zone]
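The three commands above can be wrapped so the sequence stops at the first failure and finishes by listing the zones. This is a sketch, not Cloudera's procedure: the `run` helper and the `DRY_RUN` variable are assumptions added here so the commands can be previewed without a cluster. On a real cluster a KMS must already be configured, the target directory must be empty, and `hdfs crypto` must be run as the HDFS superuser.

```shell
# Executes a command, or only prints it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Creates a key, an empty directory, and an encryption zone on it,
# then lists the zones to confirm the result.
create_encryption_zone() {
  key=$1
  zone=$2
  run hadoop key create "$key" &&
  run hadoop fs -mkdir -p "$zone" &&
  run hdfs crypto -createZone -keyName "$key" -path "$zone" &&
  run hdfs crypto -listZones
}

# Preview only; drop DRY_RUN to run against a cluster.
( DRY_RUN=1; create_encryption_zone mykey /zone )
```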
Key Management Server Deployment (non-prod)
[Diagram labels: Client; HDFS NameNode; KMS; Java Keystore; Keystore file]
Separation of duties
• The Encryption Zone Key (EZK) is stored in the KMS server
• The HDFS superuser cannot decrypt files
Key Management Server/Key Trustee Server Deployment
[Diagram labels: Client; HDFS NameNode; Key Trustee KMS (two or more); Firewall; Key Trustee Server (Active); Key Trustee Server (Passive); synchronization between the two Key Trustee Servers]
KMS+KTS+HSM Deployment
[Diagram labels: Client; HDFS NameNode; HSM KMS (two or more); Firewall; Key Trustee Server (Active); Key Trustee Server (Passive); synchronization between the two Key Trustee Servers; Key HSM (two); HSM (two)]
Troubleshooting: Encryption Performance Anomaly
• Configuration
• AES-NI Hardware acceleration
• OpenSSL library
• Entropy
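Two of these checks can be scripted on a Linux x86 host: whether the CPU advertises the `aes` flag (AES-NI) and how much entropy the kernel reports. The sketch below takes the file paths as optional arguments so it can be tried against sample files; the function names and the 1000 threshold are illustrative assumptions, and newer kernels report `entropy_avail` differently from the CentOS 7 era kernels this deck targets.

```shell
# True if the CPU flags in the given cpuinfo file include "aes" (AES-NI).
has_aes_ni() {
  grep '^flags' "${1:-/proc/cpuinfo}" | grep -qw aes
}

# True if the kernel entropy estimate is at or above a threshold.
# Arguments: file to read (optional), threshold (optional, illustrative default).
entropy_ok() {
  avail=$(cat "${1:-/proc/sys/kernel/random/entropy_avail}")
  [ "$avail" -ge "${2:-1000}" ]
}

if [ -r /proc/cpuinfo ]; then
  if has_aes_ni; then echo "AES-NI available"; else echo "no AES-NI flag found"; fi
fi
```

On a cluster host, `hadoop checknative` additionally reports whether Hadoop loaded the native OpenSSL library.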
Fine Grained Access Control with Apache Sentry
Most Security #3
Secure Data Vault
Level 3 Secure Data Vault
• All data, both at rest and in transit, is encrypted
• Key management system is fault-tolerant
• Auditing mechanisms comply with industry, government, and regulatory standards (PCI, HIPAA, NIST, for example)
• Auditing extends from the EDH to the other systems that integrate with it
• Cluster administrators are well-trained
• Security procedures have been certified by an expert
• Cluster can pass technical review
Data Redaction
Personally Identifiable Information (PII)
• PCI-DSS, HIPAA
Best practices followed
Passwords
• Stored in credential files, not in configuration
Logs, queries
• Cloudera Manager
Full Encryption
Encrypt Data Spills
• MapReduce
• Impala
• Hive
• Flume
OS-level encryption
• Navigator Encrypt
How to secure your Cloudera cluster
Cloudera Documentation
Cloudera Professional Services security engagement
• Review security requirements and provide an overview of data security policies
• Audit architecture and current systems for security policies and best practices
• Custom tailor a security reference architecture
• Optimize OS and Java to take advantage of hardware-based crypto-acceleration
• Install and configure Kerberos with MIT Kerberos KDC or Active Directory
• Install and configure Sentry and Cloudera Navigator (license required)
• Install and configure Navigator Encrypt and Key Trustee with an HSM root of trust
• Review fine-grain permissions on sample data using Sentry
• Review audit and lineage on sample data using Navigator
• Use Cloudera Manager and Hue to review security integration for users
• Enable and configure HDFS encryption
https://www.cloudera.com/more/services-and-support/professional-services/security-integration-pilot.html
Cloudera online OnDemand security course
• Online self-paced training course: https://ondemand.cloudera.com
• Launch planned for mid-February 2018
• An estimated 3 days' worth of content, covering Cloudera security levels 1 and 2
• Currently about 375 slides, with 9 detailed chapters and 16 instructor demonstrations:
1. Security overview
2. Security Architecture
3. Host Security
4. Encrypting Data in motion
5. Authentication
6. Authorization
7. Encrypting Data at Rest
8. Auditing
9. Additional Considerations: Data Governance
Ondemand security course instructor guided demos
1. Potential Attack vectors
2. Securing the cluster hosts
3. Generating and managing keys for TLS
4. Configuring Cloudera Manager for TLS
5. Encrypting Data in Motion
6. Hadoop default authentication
7. Kerberizing Cluster with MIT Kerberos
8. Kerberizing Cluster with Active Directory
9. Configuring Authorization with Cloudera Manager
10. Controlling access to YARN
11. Controlling access to HDFS
12. Controlling access to Tables
13. Enabling HDFS Encryption
14. Protecting local data with NavEncrypt
15. Using Navigator for auditing
16. Reassessing cluster security
Ondemand security course disclaimer
THIS IS REALLY IMPORTANT:
The examples in this course are based on CM/CDH 5.12, running in a cloud-based deployment on a
cluster using the CentOS 7.2 operating system.
Given the almost limitless permutations of possible configurations, including different versions of CDH,
Cloudera Manager, operating systems, directory servers, Kerberos servers, web browsers, and other
tools, as well as variations in policies, laws, and practices that affect each organization differently, it's
impossible for a training course to cover all aspects of security.
This course is meant to provide a background that will help you to understand many important concepts
and techniques, but is not intended as a replacement for the relevant documentation or a consulting
engagement with an expert who can provide advice based on your specific requirements.
• Disclaimer needed because of the variety and permutations of security configurations
• Versions used: CDH 5.12 and CentOS 7.2
Ondemand security course scenario
• Many of our demonstrations are based on a hypothetical scenario
• However, the concepts should apply to nearly any organization
• Loudacre Mobile is a fast-growing wireless carrier
• Employees serving in a variety of roles
• Data ingested from many sources, in many formats
• Data processed by many tools
Ondemand security course environment
Comprehensive demonstration cluster
Sample chapter structure: Encrypting Data in Motion
• Encryption Fundamentals
• Certificates
• Key Management
  - Instructor-Led Demonstration: Generating and Managing Keys for TLS
• Configuring Cloudera Manager for TLS
  - Instructor-Led Demonstration: Configuring Cloudera Manager for TLS
• Encrypting Hadoop’s Data in Motion
  - Instructor-Led Demonstration: Encrypting Hadoop’s Data in Motion
• Essential Points
Register your interest for
OnDemand security course:
peter.rizvi@cloudera.com
Thank you

More Related Content

PDF
Introduction to Apache Hive
PDF
Introduction to spark
PPT
Hadoop Security Architecture
PDF
Hudi architecture, fundamentals and capabilities
PDF
Apache Hadoop and HBase
PDF
Iceberg: a fast table format for S3
ODP
Presto
PDF
MongodB Internals
Introduction to Apache Hive
Introduction to spark
Hadoop Security Architecture
Hudi architecture, fundamentals and capabilities
Apache Hadoop and HBase
Iceberg: a fast table format for S3
Presto
MongodB Internals

What's hot (20)

PPTX
Achieving 100k Queries per Hour on Hive on Tez
PPTX
Securing Hadoop with Apache Ranger
PPTX
Apache Knox setup and hive and hdfs Access using KNOX
PDF
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
PPTX
Hdp security overview
PPTX
Introduction to Hadoop and Hadoop component
PDF
Building an open data platform with apache iceberg
PDF
HDFS Selective Wire Encryption
PDF
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
PPTX
The Impala Cookbook
PPTX
Elastic Stack Introduction
PPT
Introduction to redis
PDF
Cassandra Introduction & Features
PPTX
Apache HBase™
PDF
Airflow Best Practises & Roadmap to Airflow 2.0
PPTX
Introduction to Redis
PPTX
Apache Tez: Accelerating Hadoop Query Processing
ODP
Google's Dremel
PDF
Apache Iceberg Presentation for the St. Louis Big Data IDEA
PDF
Intro to HBase
Achieving 100k Queries per Hour on Hive on Tez
Securing Hadoop with Apache Ranger
Apache Knox setup and hive and hdfs Access using KNOX
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Hdp security overview
Introduction to Hadoop and Hadoop component
Building an open data platform with apache iceberg
HDFS Selective Wire Encryption
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
The Impala Cookbook
Elastic Stack Introduction
Introduction to redis
Cassandra Introduction & Features
Apache HBase™
Airflow Best Practises & Roadmap to Airflow 2.0
Introduction to Redis
Apache Tez: Accelerating Hadoop Query Processing
Google's Dremel
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Intro to HBase
Ad

Similar to Cloudera training: secure your Cloudera cluster (20)

PDF
Hadoop security implementationon 20171003
PPTX
Security implementation on hadoop
PPTX
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
PDF
Cloud-Native Machine Learning: Emerging Trends and the Road Ahead
PPTX
Seeking Cybersecurity--Strategies to Protect the Data
PPTX
The 5 Biggest Data Myths in Telco: Exposed
PPTX
Hadoop security @ Philly Hadoop Meetup May 2015
PPTX
Part 2: A Visual Dive into Machine Learning and Deep Learning 

PPTX
Comprehensive Security for the Enterprise II: Guarding the Perimeter and Cont...
PPTX
Introducing Cloudera Data Science Workbench for HDP 2.12.19
PPTX
Road to Cloudera certification
PPTX
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
PDF
Hadoop on Cloud: Why and How?
PPTX
Optimize your cloud strategy for machine learning and analytics
PPTX
Five Tips for Running Cloudera on AWS
PDF
Machine Learning in the Enterprise 2019
PPTX
Turning Data into Business Value with a Modern Data Platform
PDF
One Hadoop, Multiple Clouds - NYC Big Data Meetup
PDF
One Hadoop, Multiple Clouds
PPTX
How Big Data Can Enable Analytics from the Cloud (Technical Workshop)
Hadoop security implementationon 20171003
Security implementation on hadoop
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloud-Native Machine Learning: Emerging Trends and the Road Ahead
Seeking Cybersecurity--Strategies to Protect the Data
The 5 Biggest Data Myths in Telco: Exposed
Hadoop security @ Philly Hadoop Meetup May 2015
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Comprehensive Security for the Enterprise II: Guarding the Perimeter and Cont...
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Road to Cloudera certification
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Hadoop on Cloud: Why and How?
Optimize your cloud strategy for machine learning and analytics
Five Tips for Running Cloudera on AWS
Machine Learning in the Enterprise 2019
Turning Data into Business Value with a Modern Data Platform
One Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds
How Big Data Can Enable Analytics from the Cloud (Technical Workshop)
Ad

More from Cloudera, Inc. (20)

PPTX
Partner Briefing_January 25 (FINAL).pptx
PPTX
Cloudera Data Impact Awards 2021 - Finalists
PPTX
2020 Cloudera Data Impact Awards Finalists
PPTX
Edc event vienna presentation 1 oct 2019
PPTX
Machine Learning with Limited Labeled Data 4/3/19
PPTX
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
PPTX
Introducing Cloudera DataFlow (CDF) 2.13.19
PPTX
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
PPTX
Leveraging the cloud for analytics and machine learning 1.29.19
PPTX
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
PPTX
Leveraging the Cloud for Big Data Analytics 12.11.18
PPTX
Modern Data Warehouse Fundamentals Part 3
PPTX
Modern Data Warehouse Fundamentals Part 2
PPTX
Modern Data Warehouse Fundamentals Part 1
PPTX
Extending Cloudera SDX beyond the Platform
PPTX
Federated Learning: ML with Privacy on the Edge 11.15.18
PPTX
Analyst Webinar: Doing a 180 on Customer 360
PPTX
Build a modern platform for anti-money laundering 9.19.18
PPTX
Introducing the data science sandbox as a service 8.30.18
PPTX
Cloudera SDX
Partner Briefing_January 25 (FINAL).pptx
Cloudera Data Impact Awards 2021 - Finalists
2020 Cloudera Data Impact Awards Finalists
Edc event vienna presentation 1 oct 2019
Machine Learning with Limited Labeled Data 4/3/19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Leveraging the cloud for analytics and machine learning 1.29.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Leveraging the Cloud for Big Data Analytics 12.11.18
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 1
Extending Cloudera SDX beyond the Platform
Federated Learning: ML with Privacy on the Edge 11.15.18
Analyst Webinar: Doing a 180 on Customer 360
Build a modern platform for anti-money laundering 9.19.18
Introducing the data science sandbox as a service 8.30.18
Cloudera SDX

Recently uploaded (20)

PPTX
CkgxkgxydkydyldylydlydyldlyddolydyoyyU2.pptx
PPTX
5 Stages of group development guide.pptx
PPTX
Belch_12e_PPT_Ch18_Accessible_university.pptx
PDF
WRN_Investor_Presentation_August 2025.pdf
PDF
SIMNET Inc – 2023’s Most Trusted IT Services & Solution Provider
PDF
BsN 7th Sem Course GridNNNNNNNN CCN.pdf
PDF
DOC-20250806-WA0002._20250806_112011_0000.pdf
PDF
Training And Development of Employee .pdf
DOCX
Euro SEO Services 1st 3 General Updates.docx
PDF
kom-180-proposal-for-a-directive-amending-directive-2014-45-eu-and-directive-...
PPTX
ICG2025_ICG 6th steering committee 30-8-24.pptx
PDF
Traveri Digital Marketing Seminar 2025 by Corey and Jessica Perlman
PPTX
Probability Distribution, binomial distribution, poisson distribution
DOCX
Business Management - unit 1 and 2
PPT
Chapter four Project-Preparation material
PDF
Chapter 5_Foreign Exchange Market in .pdf
PDF
How to Get Funding for Your Trucking Business
PDF
Types of control:Qualitative vs Quantitative
PDF
Power and position in leadershipDOC-20250808-WA0011..pdf
PPTX
Amazon (Business Studies) management studies
CkgxkgxydkydyldylydlydyldlyddolydyoyyU2.pptx
5 Stages of group development guide.pptx
Belch_12e_PPT_Ch18_Accessible_university.pptx
WRN_Investor_Presentation_August 2025.pdf
SIMNET Inc – 2023’s Most Trusted IT Services & Solution Provider
BsN 7th Sem Course GridNNNNNNNN CCN.pdf
DOC-20250806-WA0002._20250806_112011_0000.pdf
Training And Development of Employee .pdf
Euro SEO Services 1st 3 General Updates.docx
kom-180-proposal-for-a-directive-amending-directive-2014-45-eu-and-directive-...
ICG2025_ICG 6th steering committee 30-8-24.pptx
Traveri Digital Marketing Seminar 2025 by Corey and Jessica Perlman
Probability Distribution, binomial distribution, poisson distribution
Business Management - unit 1 and 2
Chapter four Project-Preparation material
Chapter 5_Foreign Exchange Market in .pdf
How to Get Funding for Your Trucking Business
Types of control:Qualitative vs Quantitative
Power and position in leadershipDOC-20250808-WA0011..pdf
Amazon (Business Studies) management studies

Cloudera training: secure your Cloudera cluster

  • 1. © Cloudera, Inc. All rights reserved. Cloudera training: secure your Cloudera cluster
  • 2. © Cloudera, Inc. All rights reserved. The demand for skills is high and Hadoop is the future. Customers cannot afford to move slowly in staffing their Big Data projects. Customers are building plans to ensure projects are staffed with skilled employees, and supported by a qualified services provider. Job Trends from Indeed.com What are you most concerned about when it comes to your readiness for big data and hadoop? Cloudera MDP webinar poll results, July 2016
  • 3. © Cloudera, Inc. All rights reserved. Why Cloudera training? Aligned to best practices and the pace of change 1 Broadest range of courses Learning paths for Developer, Admin, Analyst 2 Most experienced instructors More than 40,000 trained since 2009 6 Widest geographic coverage Most classes offered: 50 cities worldwide plus online 7 Most relevant platform & community CDH deployed more than all other distributions combined 3 Leader in certification Over 12,000 accredited Cloudera professionals Trusted source for training 100,000+ people have attended online courses4 8 Depth of training material Hands-on labs and VMs support live instruction 9 Ongoing learning Video tutorials and e-learning complement training State of the art curriculum Courses updated as Hadoop evolves5 10Commitment to big data education University partnerships to teach Hadoop in colleges
  • 4. © Cloudera, Inc. All rights reserved. Creating leaders in the field Training enables Big Data solutions and innovation 94% 66% Would recommend or highly recommend Cloudera training to friends or colleagues Draw on lessons from Cloudera training on at least a monthly basis 40% Develop new apps or perform business-critical analyses as a result of training alone Sources: Cloudera Past Public Training Participant Study, December 2012. Cloudera Customer Satisfaction Study, January 2013. 88% Indicate Cloudera training provided the Hadoop expertise their roles require
  • 5. © Cloudera, Inc. All rights reserved. What is available from Cloudera University? • Private training: Course delivered at location of customer choice to internal audience • Public training: Courses regularly scheduled around the globe. Schedule available on web • Virtual training: Live training accessed via the internet; available for public and private courses • OnDemand training: Pre-recorded lecture with identical content/exercises as live training options • Certification: Rigorously developed and meaningful bodies of knowledge OnDemand Virtual live classroom Private onsitePublic live classroom
  • 6. © Cloudera, Inc. All rights reserved. Suggested Cloudera University curricula Developers • Python/Scala Training • Developer for Spark and Hadoop • CCA: Spark and Hadoop Developer • Spark ML & Kafka modules • Topic specific training (Search, HBase) • Hands on practice • CCP: Data Engineer Administrators • Cloudera Administration training • CCA: Administrator • Cloudera Security OnDemand Data Analysts/Data Scientists • Data Analyst: Using Hive, Pig & Impala • CCA: Data Analyst • Cloudera Data Science
  • 7. 7© Cloudera, Inc. All rights reserved. Security for Hadoop Carlo Lazzaris | Technical Instructor
  • 8. 8© Cloudera, Inc. All rights reserved. Security Webinar Agenda 1. The need for Hadoop Security Hacker news and legal regulations 2. Cloudera Security Implementation Five levels of security 3. How to secure your Cloudera cluster Cloudera Documentation Cloudera professional services Cloudera OnDemand security course
  • 9. 9© Cloudera, Inc. All rights reserved. The need for Hadoop security
  • 10. 10© Cloudera, Inc. All rights reserved. Unguarded data stores are the victims
  • 11. 11© Cloudera, Inc. All rights reserved. Regulatory Compliance Organizations can be fined up to 4% of annual global turnover for breaching GDPR or €20 Million
  • 12. 12© Cloudera, Inc. All rights reserved. Cloudera security implementation
  • 13. 13© Cloudera, Inc. All rights reserved. Cloudera Enterprise CDH 13 The modern platform for machine learning and analytics optimized for the cloud EXTENSIBLE SERVICES CORE SERVICES DATA ENGINEERING OPERATIONAL DATABASE ANALYTIC DATABASE DATA CATALOG INGEST & REPLICATION SECURITY GOVERNANCE WORKLOAD MANAGEMENT DATA SCIENCE S3 ADLS HDFS KUDU STORAGE SERVICES
  • 14. 14© Cloudera, Inc. All rights reserved. • Unified security – protects sensitive data with consistent controls, even for transient and recurring workloads • Consistent governance – enables secure self-service access to all relevant data and increases compliance • Easy workload management – increases user productivity and boosts job predictability • Flexible ingest and replication – aggregates a single copy of all data, provides disaster recovery, and eases migration • Shared catalog – defines and preserves structure and business context of data for new applications and partner solutions Open platform services Built for multi-function analytics | Optimized for cloud
  • 15. 15© Cloudera, Inc. All rights reserved. Cloudera Enterprise-Grade Security and Governance Access Defining what users and applications can do with data Technical Concepts: Permissions Authorization Data Protection Shielding data in the cluster from unauthorized visibility Technical Concepts: Encryption at rest & in motion Visibility Reporting on where data came from and how it’s being used Technical Concepts: Auditing Lineage Cloudera Manager Apache Sentry Cloudera Navigator Navigator Encrypt & Key Trustee Identity Validate users by membership in enterprise directory Technical Concepts: Authentication User/group mapping
  • 16. 16© Cloudera, Inc. All rights reserved. Cloudera Certified Technology Partners Data Sources Data Ingest Process, Refine & Prep Data Discovery Advanced Analytics Connected Machines/Data sources Other Data Sources
  • 17. 17© Cloudera, Inc. All rights reserved. A certified product ensures it integrates securely • Authenticate via Kerberos or LDAP Authentication • Handle Apache Sentry with Hive, Impala, Search, HDFS Authorization • Support HDFS transport encryption, at-rest encryption; support SSL/TLS connection encryption Encryption
  • 18. 18© Cloudera, Inc. All rights reserved. Vulnerability Response and Process Vulnerability reports Upstream Internal External Fix Publish
  • 19. 19© Cloudera, Inc. All rights reserved. Cluster Security Levels
  • 20. 20© Cloudera, Inc. All rights reserved. Cloudera Enterprise 20 The modern platform for machine learning and analytics optimized for the cloud
  • 21. 21© Cloudera, Inc. All rights reserved. Enterprise Encryption Performance
  • 22. 23© Cloudera, Inc. All rights reserved. Disclaimer This talk serves as a general guideline for security implementation on Hadoop. The actual implementation procedures and scope of implementation vary on a case-by- case basis, and should be assessed by Cloudera’s Professional Services team or certified Cloudera SI Partners.
  • 23. Non-secure #0: Data Free for All
  • 24. [Diagram: Firewall; ActiveDirectory/KDC; Hadoop cluster with Cloudera Manager, gateway node, and worker nodes; datacenter applications]
  • 25. 4 modes of Identity Management: 1. Simple Authentication 2. Kerberos 3. LDAP 4. SAML • File group ownership • AD integration • SSSD or Centrify • Consideration in large enterprises. via SSSD via
  • 26. Simple Authentication: detect the user [Diagram: Firewall; ActiveDirectory; Cloudera Manager; two Master nodes and three Worker nodes (SSSD/Centrify)]
  • 27. Simple authentication = no authentication
  • 28. Minimal Security #1: Reduce Risk Exposure
  • 29. How it works: Authentication • Web UIs: LDAP and SAML authentication options • SQL Access: LDAP/AD and Kerberos authentication options • Command Lines: Kerberos authentication; automation provided by Cloudera Manager to leverage Active Directory (AD) • Flow: user authenticates to AD or KDC (User [ssmith] Password [*****]); authenticated user gets Kerberos ticket; ticket grants access to services, e.g. Impala
  • 30. Kerberos: Strong Authentication • Principal user@EXAMPLE.COM: primary is "user", realm is "EXAMPLE.COM" • Hadoop maps user@EXAMPLE.COM → user • KDC: Key Distribution Center • MIT • ActiveDirectory (more common)
  • 31. Kerberos: Considerations in large enterprises • Time synchronization • CM Kerberos Wizard • Configure AD to create a Kerberos principal for the CM server, and to delegate to CM the ability to create and manage Kerberos principals
  • 32. Kerberos: Considerations in large enterprises • Time synchronization • CM Kerberos Wizard • Configure AD to create a Kerberos principal for the CM server, and to delegate to CM the ability to create and manage Kerberos principals
  • 33. Kerberos Authentication * LDAP over SSL
  • 34. Authorization/Access Control • Access Control Lists (ACLs): HDFS file ACLs, YARN job submission, HBase ACLs, Oozie ACLs • Sentry-managed (RBAC): Hive, Impala
  • 35. Auditing
  • 36. Backup/Disaster Recovery: Cloudera Backup/Disaster Recovery (BDR) • A high-performance data replicator • Copies incremental data on the source cluster at specified schedules • Supports: ✓ Kerberos ✓ Data encryption ✓ HDFS replication to cloud
  • 37. Kerberized BDR Best Practice [Diagram: Production cluster with KDC, realm PROD.EXAMPLE.COM; DR cluster with KDC, realm DR.EXAMPLE.COM; Cloudera BDR between them; cross-realm trust]
  • 38. More Security #2: Managed, Secure, Protected
  • 39. Data In-Motion Encryption • RPC encryption • Data transport encryption: supports AES CTR, up to 256-bit key length • HTTP TLS/SSL encryption: no self-signed certificates in production [Diagram: Application, Master and Worker nodes, linked by RPC encryption, transport encryption, and TLS/SSL]
  • 40. Data At-Rest Encryption • Transparent encryption • Supports any Hadoop applications • Encryption Zone: $ hadoop key create mykey $ hadoop fs -mkdir /zone $ hdfs crypto -createZone -keyName mykey -path /zone [Diagram: directory tree with /, /tmp, and /zone (encryption zone) containing foo and bar]
  • 41. Key Management Server Deployment (non-prod) [Diagram: Client, HDFS NameNode, Java Keystore KMS, keystore file] • Separation of duties • Encryption Zone Key (EZK) is stored in the KMS server • HDFS superuser cannot decrypt files
  • 42. Key Management Server/Key Trustee Server Deployment [Diagram: Client, HDFS NameNode, two (or more) Key Trustee KMS hosts, firewall, Key Trustee Server (Active) and Key Trustee Server (Passive) with synchronization]
  • 43. KMS+KTS+HSM Deployment [Diagram: Client, HDFS NameNode, two (or more) HSM KMS hosts, firewall, Key Trustee Server (Active) and Key Trustee Server (Passive) with synchronization, Key HSM on each, HSMs]
  • 44. Troubleshooting: Encryption Performance Anomaly • Configuration • AES-NI hardware acceleration • OpenSSL library • Entropy
  • 45. Fine Grained Access Control with Apache Sentry
  • 46. Most Security #3: Secure Data Vault
  • 47. Level 3 Secure Data Vault • All data, both data-at-rest and data-in-transit, is encrypted • Key management system is fault-tolerant • Auditing mechanisms comply with industry, government, and regulatory standards (PCI, HIPAA, NIST, for example) • Auditing extends from EDH to the other systems that integrate with it • Cluster administrators are well-trained • Security procedures have been certified by an expert • Cluster can pass technical review
  • 48. Data Redaction • Personally Identifiable Information: PCI-DSS, HIPAA • Best practices followed • Passwords: stored in credential files, not in configuration • Logs, queries: Cloudera Manager
  • 49. Full Encryption • Encrypt data spills: MapReduce, Impala, Hive, Flume • OS-level encryption: Navigator Encrypt
  • 50. How to secure your Cloudera cluster
  • 51. Cloudera Documentation
  • 52. Cloudera Professional Services security engagement • Review security requirements and provide an overview of data security policies • Audit architecture and current systems for security policies and best practices • Custom tailor a security reference architecture • Optimize OS and Java to take advantage of hardware-based crypto-acceleration • Install and configure Kerberos with MIT Kerberos KDC or Active Directory • Install and configure Sentry and Cloudera Navigator (license required) • Install and configure Navigator Encrypt and Key Trustee with an HSM root of trust • Review fine-grained permissions on sample data using Sentry • Review audit and lineage on sample data using Navigator • Use Cloudera Manager and Hue to review security integration for users • Enable and configure HDFS encryption https://www.cloudera.com/more/services-and-support/professional-services/security-integration-pilot.html
  • 53. Cloudera online OnDemand security course • Online self-paced training course https://ondemand.cloudera.com • Launch planned for mid-February 2018 • An estimated 3 days' worth of content covering Cloudera security levels 1 and 2 • Currently about 375 slides, with 9 detailed chapters and 16 instructor demonstrations: 1. Security Overview 2. Security Architecture 3. Host Security 4. Encrypting Data in Motion 5. Authentication 6. Authorization 7. Encrypting Data at Rest 8. Auditing 9. Additional Considerations: Data Governance
  • 54. OnDemand security course instructor-guided demos 1. Potential Attack Vectors 2. Securing the cluster hosts 3. Generating and managing keys for TLS 4. Configuring Cloudera Manager for TLS 5. Encrypting Data in Motion 6. Hadoop default authentication 7. Kerberizing Cluster with MIT Kerberos 8. Kerberizing Cluster with Active Directory 9. Configuring Authorization with Cloudera Manager 10. Controlling access to YARN 11. Controlling access to HDFS 12. Controlling access to Tables 13. Enabling HDFS Encryption 14. Protecting local data with NavEncrypt 15. Using Navigator for auditing 16. Reassessing cluster security
  • 55. OnDemand security course disclaimer. THIS IS REALLY IMPORTANT: The examples in this course are based on CM/CDH 5.12, running in a cloud-based deployment on a cluster using the CentOS 7.2 operating system. Given the almost limitless permutations of possible configurations, including different versions of CDH, Cloudera Manager, operating systems, directory servers, Kerberos servers, web browsers, and other tools, as well as variations in policies, laws, and practices that affect each organization differently, it's impossible for a training course to cover all aspects of security. This course is meant to provide a background that will help you to understand many important concepts and techniques, but is not intended as a replacement for the relevant documentation or a consulting engagement with an expert who can provide advice based on your specific requirements. • Disclaimers: due to the variety and permutations of security setups • Versions used: CDH 5.12 and CentOS 7.2
  • 56. OnDemand security course scenario • Many of our demonstrations are based on a hypothetical scenario • However, the concepts should apply to nearly any organization • Loudacre Mobile is a fast-growing wireless carrier • Employees serving in a variety of roles • Data ingested from many sources, in many formats • Data processed by many tools
  • 57. OnDemand security course environment
  • 58. Comprehensive demonstration cluster
  • 59. Sample chapter structure: Encrypting Data in Motion • Encryption Fundamentals • Certificates • Key Management ➢ Instructor-Led Demonstration: Generating and Managing Keys for TLS • Configuring Cloudera Manager for TLS ➢ Instructor-Led Demonstration: Configuring Cloudera Manager for TLS • Encrypting Hadoop’s Data in Motion ➢ Instructor-Led Demonstration: Encrypting Hadoop’s Data in Motion • Essential Points
  • 60. Register your interest for OnDemand security course: peter.rizvi@cloudera.com
  • 61. Thank you

Editor's Notes

  • #3: Markets, and customers, can only expand as quickly as the human element is able to support them. Right now demand is very much outpacing the supply of qualified big data professionals. Maintaining a training function is critical for Cloudera because we need to maintain a capable delivery ecosystem that allows our customers to thrive within the Hadoop environment. Recruitment is one option for organizations to overcome this barrier, but that path comes with an additional challenge: finding the right candidates. When it comes to emerging technology skills, it’s a seller’s market. There is significant competition for a finite pool of skilled technologists, and this competition will only increase as the use of this technology increases. Faced with an ever-tightening supply of qualified job applicants, organizations are finding that the costs to recruit new employees far exceed the cost to train existing ones, and also that current employees are more than willing to be trained. The need for IT talent is only going to increase in an ever-expanding range of industries. Consider that by 2020, GE, known primarily as a manufacturer, expects to generate $15 billion from software, which would make it one of the top 10 software companies in the world. Or consider that 70 percent of Monsanto’s total jobs are already in science, technology, engineering, or math. Certainly many of those are in chemical and crop engineering, but increasingly, many are in IT, analytics, the Internet of Things, and digital operations. Monsanto is competing for skills not just with other agribusinesses but with companies in all industries. Organizations need to consider the cost of recruitment and attrition. Most analyses of training confirm that employees who receive training are more likely to remain with their current employers. Training allows them to learn new skills and shows that their employers are investing in them. 
For technologists, Hadoop, Spark, and the other projects that compose our platform open up a world of possibilities and curiosity. It is challenging and rewarding. We have several customers that build out robust Hadoop training plans as a benefit to their employees, and the returns they see in innovation on the platform and in employee retention make the cost of training a major value when viewed across the spectrum of both short- and long-term returns. The evolution of the data center in the past few decades means that IT decisions are now critical not just for back-office operations but in nearly every aspect of a business. With regard to “big data”, the technologies leveraged are closely linked to an organization’s customers and markets. As such, business leaders are tasked with transforming their business to accommodate the realities of the “data-driven” market. In some cases this means updating hardware and implementing new software, but also upgrading the skills of their internal staff. If the talent of your staff is a concern, you are not alone. Cloudera, and analyst firms such as IDC, have polled organizations about enterprise software deployments. Not surprisingly, one of the primary areas of concern for Cloudera prospects and customers is the skills of their staff. This is a new way of computing, and harnessing the benefits of a Cloudera subscription requires employees familiar with the tools included in the platform, and an understanding of how to best leverage them for their use case. IDC looked at projects more generally, but solicited input from over 500 managers implementing IT projects on what were the critical factors in the success of a project. Since we are discussing training, and building out a team of experts on this call, I’m guessing you are assuming it was not the software, not clearly defined business objectives, or a solid project plan which predicated success. 
Overwhelmingly, managers ranked the skill and dedication of the project team as the factor which played the largest part in the success of their project. We want to make sure that customers include the human element needed to roll out a successful project as they consider a Cloudera subscription.
  • #6: I’ve alluded to some of these options earlier in the presentation, but to be clear about our delivery options: we offer both public and private training. Public training courses are scheduled around the globe by Cloudera and by our Authorized Training Partners. Authorized Training Partner instructors go through the same procedures as Cloudera instructors, regularly also provide field services in their regions, and allow for local-language delivery in areas where we do not have direct coverage. Public training schedules can be found on Cloudera’s website, where you can search by course title and/or location of interest. Public training is a nice option if you have just a few team members that need training, or you need to get someone ramped up in a short timeframe. Students are able to interact with a live instructor and with their peers from other organizations implementing Cloudera solutions. Private training is reserved for a customer who wants their entire team to be trained. Normally we say if you have seven or more students who need the same training class, it’s worth your while to explore our private training option. We’ll send an instructor to a location of your choice to deliver training specific to your needs. Usually the training is one of the courses that I’ve described earlier in this presentation, but if needed, we can also customize the content to align it with your business objectives. To be clear, “customization” is not new content creation; it is creating an agenda from our portfolio of content that makes sense for the customer. Some examples would be adding Spark ML or JEP to Spark and Hadoop training to make it a five-day course, or cutting Pig from Data Analyst training to make it a three-day course. We generally recommend not trying to customize a course by looking at disparate topics across many classes: it usually ends up having no flow or connection, and the students leave with more questions than answers. 
Our courses build on concepts throughout the duration of the class. Customization is encouraged, but shouldn’t be abused. Private training courses are available for “up to 10” or “up to 16” students. Virtual training is live training that is delivered over the internet. Both public and private classes can be delivered in this manner. From a public perspective, it’s a popular option for individuals who are not local to one of our training locations. Private customers with geographically dispersed teams also find this a means to save on the travel costs it would take to bring the team to a central location. OnDemand training is a library of pre-recorded training classes, which allows for 24x7, self-paced training in a searchable environment. Our entire portfolio of content is available in this format, and students leverage a cloud-based lab environment to complete the same hands-on exercises we deliver in the live classrooms. Courses can be bought as a library, or by individual title. Certification I’ve touched on earlier. Certifications may be bought in bulk via PO, or purchased directly via our website. Certification candidates are remotely monitored, and are not required to go into a testing center to complete the exam. All you need is an internet connection. Prices range from $295 for CCA-level exams to $400 for CCP: Data Engineer, or $600 per CCP: Data Science exam.
  • #7: … and here is what I talked about in the past three slides, in summary. Over time, we will be adding courses to the Administrator training path focused on Security, Cloud, and Architecture – look for those in the next calendar year. We also have plans to iterate and/or augment our Developer, Data Analyst, and Data Scientist content to reflect the evolution of the technology.
  • #8: This talk is mainly about security implementation from both an engineering and a support perspective.
  • #11: Data breach incidents are increasing year by year. This year alone there have been a number of high-profile breaches. Security is built deep into Hadoop, but it does not work out of the box. Rome was not built in a day. As you will learn during your security implementation process, it takes a lot of configuration and best practices to make a secure Hadoop cluster. Good news: Cloudera Manager and Navigator are there to the rescue! Cloudera’s platform is built on top of Apache Hadoop technology. It is the first Hadoop platform to achieve PCI compliance.
  • #12: New York State Department of Financial Services. Breach Notification; Right to Access; Right to be Forgotten; Data Portability; Privacy by Design; Data Protection Officers.
  • #14: But obviously it takes more than good people and processes. You need the right technology. Let’s get down to brass tacks on what the software is about. We’re based on an open source core: a complete, integrated enterprise platform leveraging open source. HOSS business model: there is a core set of platform capabilities, and we contribute actively into that community. We layer value-added software on top; that’s how we run our business. But what’s truly differentiating about our platform is the enterprise experience you get. It’s why we’re able to claim 7 of the top ten banks and 9 of the top ten telcos are our customers. For regulated industries, the enterprise experience is critical. Multi-cloud: no vendor lock-in; work in the environment of your choice; better pricing leverage. Managed TCO: multiple pricing and deployment options. Integrated: integrated components with shared metadata, security and operations. Secure: protect sensitive data from unauthorized access with encryption and key management. Compliance: full auditing and visibility. Governance: ensure data veracity.
  • #15: Apps share data, rather than data being replicated for apps. Lower costs because there is less data to replicate. More secure because data is in one central location. Easier to build apps because data is easily accessible. Open architecture to share data with other teams and workloads, including data science.
  • #16: Apps share data, rather than data being replicated for apps. Lower costs because there is less data to replicate. More secure because data is in one central location. Easier to build apps because data is easily accessible. Open architecture to share data with other teams and workloads, including data science.
  • #17: As a customer, you will most likely not interact with Cloudera’s platform directly. Typically customers access Cloudera’s platform indirectly through partner products. To ensure the security protocol is not breached, we certify partner products with security in mind. For the purpose of this talk, I am going to briefly mention Cloudera’s certification process from a security perspective. Customers should also hire Cloudera-certified administrators, or hire professional services from Cloudera SI partners.
  • #18: A little bit on partner product certification https://docs.google.com/a/cloudera.com/document/d/1XwRV_bVZrM90JsPhHxLYAgd6vCdvT7qQ-k8eIQ2QYsk/edit?usp=sharing
  • #19: Upstream = reports coming from an Apache project. Each Apache project has a private security@ mailing alias. We obey Apache’s security policy. Internal = reports coming from within Cloudera. Cloudera Engineering runs several security weakness detection tools looking for security issues in the software. External = reports coming from a third party or a customer.
  • #20: Cloudera works hard to provide security on top of the big data platform. In this talk, I will present the best practices and common pitfalls of security implementation on Hadoop, based on my experience working with customers. Source: https://www.cloudera.com/documentation/enterprise/latest/topics/sg_edh_overview.html#topic_ads_t2q_1r Achieving data security is costly. Depending on use cases and sensitivity of data, an enterprise may decide which level of security is desired. Typically, enterprises choose to implement security on Hadoop step by step, or hire Cloudera PS to make a custom security implementation plan and complete these steps in one shot.
  • #22: https://cloudera.app.box.com/files/0/f/6321638305/1/f_56252438130 TPC-DS: the impact is very small. This was tested with Key Trustee; HSM is currently very slow. AES-NI. As the results show, the overhead of using encryption on the system was 2% in terms of query execution time and 3.1% in CPU time.
  • #23: A secure system takes more than just a good product. It also requires experienced people to integrate it and operate it. These people must receive the proper training. Technology: Cloudera’s platform and certified partners’ products, post-sales support. People: Cloudera PS team or SI partners, consulting firms, customer’s admins, users. Process: SOP, documentation, regular audits, compliance plan (not covered in this talk).
  • #25: Depend on existing firewalls.
  • #26: Leverage existing firewall mechanisms in the enterprise to set up a perimeter. This is the first line of defense. The firewall exposes only gateway nodes for submitting jobs, and the CM and CN interfaces. System chart: CM, master node (HA), worker nodes, firewalls.
  • #27: Cloudera’s platform does not manage user authentication. Instead, it relies on an external authentication mechanism for that purpose, such as Kerberos, LDAP or AD. With simple authentication, it gets the user name from the local operating system user name, but it is too much effort to keep accounts consistent, so use AD + SSSD/Centrify. CDH is composed of many open source projects, and as a result, not all of them support the same set of authentication mechanisms. Four mechanisms are supported: simple, Kerberos, LDAP, and SAML. AD integration: it is likely your enterprise is already using ActiveDirectory for user identity control. Recommendations: use SSSD instead of LdapGroupsMapping; create a dedicated OU for the cluster; use LDAP over SSL. Select a good search base so that AD returns quickly; a slow lookup can stop all operations. LDAP authentication can be used for CM, Hue, Hive and Impala. The latency of LDAP request/response is critical for cluster performance.
  • #29: User identity can be forged easily. It is okay to have an unsecured dev cluster or PoC cluster.
  • #32: This should be the _minimal_ security requirement for any production cluster. Kerberos is a cryptographic authentication mechanism. KDC: Key Distribution Center. Kerberos principal to user name mapping. Simple authentication = no authentication. Time synchronization: NTP. Keytab handling: a keytab stores the password and is required for Hadoop services. https://www.cloudera.com/documentation/enterprise/latest/topics/cm_sg_s3_cm_principal.html CM makes it extremely easy.
  • #33: This should be the _minimal_ security requirement for any production cluster. Kerberos is a cryptographic authentication mechanism. Kerberos principal to user name mapping. Simple authentication = no authentication. Time synchronization: NTP. Keytab handling: a keytab stores the password and is required for Hadoop services. https://www.cloudera.com/documentation/enterprise/latest/topics/cm_sg_s3_cm_principal.html CM makes it extremely easy.
  • #34: This should be the _minimal_ security requirement for any production cluster. Kerberos is a cryptographic authentication mechanism. Kerberos principal to user name mapping. Simple authentication = no authentication. Time synchronization: NTP. Keytab handling: a keytab stores the password and is required for Hadoop services. https://www.cloudera.com/documentation/enterprise/latest/topics/cm_sg_s3_cm_principal.html CM makes it extremely easy.
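The Kerberos-to-user-name mapping mentioned in these notes (for example, user@EXAMPLE.COM becoming user) can be illustrated with a minimal Python sketch. It assumes the simplest possible rule: drop the realm and any instance component, and only for principals in the cluster's own realm. Real clusters configure this mapping through Hadoop's auth_to_local rules, which this sketch does not implement; the names `Principal`, `parse_principal`, and `short_name` are hypothetical.

```python
from typing import NamedTuple, Optional


class Principal(NamedTuple):
    primary: str
    instance: Optional[str]
    realm: str


def parse_principal(principal: str) -> Principal:
    """Split 'primary[/instance]@REALM' into its parts."""
    name, sep, realm = principal.rpartition("@")
    if not sep or not name or not realm:
        raise ValueError(f"not a full principal: {principal!r}")
    primary, _, instance = name.partition("/")
    return Principal(primary, instance or None, realm)


def short_name(principal: str, default_realm: str) -> str:
    """Assumed default-style mapping: keep only the primary, and only
    for principals whose realm is the cluster's own realm."""
    p = parse_principal(principal)
    if p.realm != default_realm:
        raise ValueError(f"no mapping rule for realm {p.realm!r}")
    return p.primary
```

Rejecting foreign realms by default mirrors the idea that a principal from another realm should only be mapped once an explicit rule (and trust) has been set up for it.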
  • #36: Authentication is a prerequisite of authorization. Access control lists (ACLs) restrict who can submit work to dynamic resource pools and administer them.
  • #37: Cloudera Navigator → Enable Audit Collection. Audit log retention. Provenance use case: a number of business decisions and transactions rely on the verifiability of the data used in those decisions and transactions. Data-verification questions might include: How was this mortgage credit score computed? How can I prove that this number on a sales report is correct? What data sources were used in this calculation? Auditing use case: What was a specific user doing on a specific day? Who deleted a particular directory? What happened to data in a production database, and why is it no longer available?
  • #38: A backup/DR cluster that is purely for DR purpose (replicates between multiple untrusted Kerberos realms) https://blog.cloudera.com/blog/2016/08/considerations-for-production-environments-running-cloudera-backup-and-disaster-recovery-for-apache-hive-and-hdfs/
  • #39: One Kerberos realm per cluster. BDR runs from the destination. You must configure the destination realm to trust the source realm. The DR cluster should not be used for any purposes other than DR.
  • #41: AES/CTR/NoPadding is a cipher transformation: the AES algorithm in counter (CTR) mode with no padding.
  • #42: At-rest encryption is required by PCI-DSS, FISMA, HIPAA. Separation of duties: NameNode vs. KMS. The HDFS superuser cannot decrypt keys. At-rest encryption is more complex than in-transit encryption because the key is typically not updated for a long time, so a more complex mechanism is needed to protect keys. An encryption zone can only be created for an empty directory. There’s a workaround: run hadoop distcp to copy files into the EZ. Supports at most 256-bit encryption. “Always-on encryption zone”/“nested encryption zone” support is in CDH 5.7, but there is no CM support, i.e. it doesn’t work end-to-end.
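Because an encryption zone can only be created on an empty directory, the order of the three commands shown on the Data At-Rest Encryption slide matters: create the key, create the empty directory, then create the zone. The following Python sketch only assembles and prints those commands for review; it does not run them. The helper names `encryption_zone_commands` and `render` are hypothetical, and the key name and path are the slide's example values.

```python
import shlex
from typing import List


def encryption_zone_commands(key_name: str, path: str) -> List[List[str]]:
    """Return the three commands from the slide, in order, as argv lists."""
    return [
        ["hadoop", "key", "create", key_name],       # 1. create the zone key
        ["hadoop", "fs", "-mkdir", path],            # 2. create an empty directory
        ["hdfs", "crypto", "-createZone",            # 3. turn it into a zone
         "-keyName", key_name, "-path", path],
    ]


def render(commands: List[List[str]]) -> str:
    """Render argv lists as shell lines so they can be reviewed first."""
    return "\n".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    print(render(encryption_zone_commands("mykey", "/zone")))
```

Keeping the commands as argv lists avoids quoting mistakes if a key name or path ever contains unusual characters.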
  • #43: https://www.cloudera.com/documentation/enterprise/latest/topics/encryption_ref_arch.html Deployment considerations: at least 2 KMS proxies and at least 2 Key Trustee servers. KTS should be a separate cluster; the two clusters are protected by a firewall. Key Trustee servers are active-passive: if the active is down, the passive is able to serve reads, but not writes. Each Key Trustee server should be on its own box. KTS HA: if either one fails, only reads are allowed. This does not affect reading/writing encrypted files, but encryption zones can’t be created. There may be more than 2 KMS proxies for load-balancing purposes. KMS is CPU-intensive, so use hardware equivalent to the NameNode. Hardware security module (HSM).
  • #44: Resource planning and requirements. Deployment considerations: at least 2 KMS proxies and at least 2 Key Trustee servers (a total of 4 hosts). KTS should be a separate cluster; the two clusters are protected by a firewall. Key Trustee servers are active-passive: if the active is down, the passive is able to serve reads, but not writes. Each Key Trustee server should be on its own box. KTS HA: if either one fails, only reads are allowed. This does not affect reading/writing encrypted files, but encryption zones can’t be created. There may be more than 2 KMS proxies for load-balancing purposes. KMS is CPU-intensive, so use hardware equivalent to the NameNode. Hardware security module (HSM).
  • #45: Deployment considerations: at least 2 KMS proxies and at least 2 Key Trustee servers. KTS should be a separate cluster; the two clusters are protected by a firewall. Key Trustee servers are active-passive: if the active is down, the passive is able to serve reads, but not writes. Each Key Trustee server should be on its own box. KTS HA: if either one fails, only reads are allowed. This does not affect reading/writing encrypted files, but encryption zones can’t be created. There may be more than 2 KMS proxies for load-balancing purposes. KMS is CPU-intensive, so use hardware equivalent to the NameNode. Hardware security module (HSM).
  • #46: https://cloudera.app.box.com/files/0/f/6321638305/1/f_56252438130 TPC-DS. Misconfiguration: use AES/CTR/NoPadding (Data Transfer Encryption Algorithm); default is 128-bit/256-bit (managed by CM). Low entropy: check /proc/sys/kernel/random/entropy_avail. Hardware acceleration. OpenSSL library. Entropy configuration.
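Two of the checks in this troubleshooting list, low entropy and missing AES-NI hardware acceleration, can be sketched in Python for an x86 Linux host. The entropy path comes from the notes; reading the `aes` flag from /proc/cpuinfo is the usual way to detect AES-NI on x86 Linux. The threshold of 1000 is an assumption for illustration only (what counts as "low" depends on the kernel version), and the function names are hypothetical.

```python
from pathlib import Path

ENTROPY_PATH = Path("/proc/sys/kernel/random/entropy_avail")  # path from the notes
CPUINFO_PATH = Path("/proc/cpuinfo")
LOW_ENTROPY_THRESHOLD = 1000  # assumed cutoff, for illustration only


def entropy_is_low(entropy_text: str, threshold: int = LOW_ENTROPY_THRESHOLD) -> bool:
    """True if the reported entropy pool size is below the threshold."""
    return int(entropy_text.strip()) < threshold


def has_aes_ni(cpuinfo_text: str) -> bool:
    """True if any 'flags' line in /proc/cpuinfo lists the 'aes' CPU flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "aes" in value.split():
            return True
    return False


if __name__ == "__main__":
    # Only read the files when they exist, so the sketch is safe off-Linux.
    if ENTROPY_PATH.exists():
        print("low entropy:", entropy_is_low(ENTROPY_PATH.read_text()))
    if CPUINFO_PATH.exists():
        print("AES-NI flag present:", has_aes_ni(CPUINFO_PATH.read_text()))
```

Both functions take the file contents as text, so they can be exercised with sample input on a machine that is not part of the cluster.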
  • #47: One of the characteristics of the Hadoop platform is that a variety of tools are capable of accessing the same set of data. For example, MapReduce, Hive, Impala, Pig and 3rd-party software can all access HDFS, so unified access control is crucial. Pig, Sqoop and Kafka are also supported by Sentry. If Impala is used, Sentry is a must. By default, Impala can be accessed by user impala. 3rd-party BI tools may not support Sentry, in which case access must be enforced through HiveServer2. Migrating from no Sentry to Sentry is a tremendous amount of work, and is hard to roll back.
  • #49: In regulated industries, regulations such as PCI or HIPAA require redaction of PII (such as SSNs). https://www.cloudera.com/documentation/enterprise/latest/topics/sg_redaction.html https://blog.cloudera.com/blog/2015/06/new-in-cdh-5-4-sensitive-data-redaction/
  • #50: In regulated industries, regulations such as PCI or HIPAA require redaction of PII (such as SSNs). https://www.cloudera.com/documentation/enterprise/latest/topics/sg_redaction.html https://blog.cloudera.com/blog/2015/06/new-in-cdh-5-4-sensitive-data-redaction/
  • #51: Intermediate files. Certain services may write spilled data outside HDFS, on local disk, so additional configuration is required to ensure it is encrypted as well. Navigator Encrypt is a kernel module that intercepts I/O requests to encrypted datastores, including log files, config files, temp files, and databases.
  • #53: Other references: https://cloudera.app.box.com/files/0/s/firewall/1/f_202846938208 Ben and Joey were both long time Cloudera Solution Architects