Secure Hadoop Application Ecosystem
Boston Application Security Conference
Oct 3 2015
[Figures: Google Trends – Big Data; Big Data Job Trends]
Hadoop Ecosystem
• Flume
• Sqoop
• ZooKeeper
• HBase
• Hive
• Pig
• MapReduce
• Spark
• YARN – Resource Manager
• HDFS – Distributed File System
• Kafka
• Storm
Why
• Hadoop is a storage/processing infrastructure
– Useful regardless of whether Big Data is hype
• Fits a lot of use cases well
• Inherently distributed storage/processing
– Provides scalability at a relatively low cost
• There is a lot of backing
– IBM, Microsoft, Amazon, Google, Intel …
• Various distributions and companies
Hadoop Distributed File System
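[Diagram: the HDFS directory on the master host compared with the local file system directories on three hosts]
• The HDFS Directory on the Master Host (NN) maps each file to its blocks and the hosts storing them
– FileA: H1:blk0, H2:blk1
– FileB: H3:blk0, H1:blk1
– FileC: H2:blk0, H3:blk1
• Each host stores its blocks as local file system files, tracked by inodes in the local FS directory on its own disk
– Host 1: FileA0 (Inode-x), FileB1 (Inode-y)
– Host 2: FileA1 (Inode-a), FileC0 (Inode-n)
– Host 3: FileB0 (Inode-r), FileC1 (Inode-c)
• Files created on the hosts are equal in size to the HDFS block size (the last block of a file may be smaller)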
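The directory layout on this slide can be read as two lookups: which host holds a given block of a file, and which local block files a host stores. The Java sketch below is a toy model of that idea only. The class and method names are hypothetical (they are not Hadoop APIs), and it assumes one replica per block, as drawn on the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the HDFS directory from the slide; not a Hadoop API. */
final class BlockMapSketch {
    // file name -> ordered list of hosts; index i is the host of block i
    private final Map<String, List<String>> directory = new LinkedHashMap<>();

    /** Appends the next block of a file and records the host that stores it. */
    void addBlock(String file, String host) {
        directory.computeIfAbsent(file, f -> new ArrayList<>()).add(host);
    }

    /** Host storing the given block of the file. */
    String hostOf(String file, int blockIndex) {
        return directory.get(file).get(blockIndex);
    }

    /** Names of the local block files a host stores, e.g. "FileA0". */
    List<String> localFilesOn(String host) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : directory.entrySet()) {
            List<String> hosts = e.getValue();
            for (int i = 0; i < hosts.size(); i++) {
                if (hosts.get(i).equals(host)) {
                    result.add(e.getKey() + i);
                }
            }
        }
        return result;
    }

    /** Builds the mapping shown on the slide. */
    static BlockMapSketch fromSlide() {
        BlockMapSketch m = new BlockMapSketch();
        m.addBlock("FileA", "H1"); m.addBlock("FileA", "H2");
        m.addBlock("FileB", "H3"); m.addBlock("FileB", "H1");
        m.addBlock("FileC", "H2"); m.addBlock("FileC", "H3");
        return m;
    }
}
```

Calling `BlockMapSketch.fromSlide().localFilesOn("H1")` lists the block files drawn under Host 1 on the slide.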
HDFS - Write Flow
[Diagram: Client, Name Node (namespace, metadata, block map; fsimage and edit files) and three Data Nodes, with arrows numbered 1–8 matching the steps below]
1. Client requests to open a file for writing through the fs.create() call. By default this overwrites an existing file.
2. Name Node responds with a lease on the file path.
3. Client buffers writes locally and, when the data reaches the block size, asks the Name Node for a block to write to.
4. Name Node responds with a new block ID and the destination Data Nodes for the write and its replication.
5. Client sends the first Data Node the data together with the checksum generated over it.
6. First Data Node writes the data and checksum and, in parallel, pipelines the replicas to the other Data Nodes.
7. Each Data Node that receives a replica reports success or failure back to the first Data Node.
8. First Data Node in turn informs the Name Node that the write for the block is complete, and the Name Node updates its block map.
Note: A file can have only one writer at a time.
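Steps 5 to 8 of the write flow can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not Hadoop code: `WritePipelineSketch`, `DataNode`, and `writeBlock` are hypothetical names, a single CRC32 over the whole block stands in for the checksum, and each node forwards after storing rather than truly in parallel.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

/** Toy in-memory model of steps 5-8 of the write flow; not Hadoop code. */
final class WritePipelineSketch {

    /** A data node that stores a block with its checksum and forwards it downstream. */
    static final class DataNode {
        final Map<Long, byte[]> blocks = new HashMap<>();
        final Map<Long, Long> checksums = new HashMap<>();
        boolean healthy = true; // set to false to simulate a failed replica

        /** Stores the block, forwards it down the pipeline, and returns the combined ack. */
        boolean receive(long blockId, byte[] data, long checksum, List<DataNode> downstream) {
            if (!healthy || crc(data) != checksum) {
                return false;
            }
            blocks.put(blockId, data.clone());
            checksums.put(blockId, checksum);
            if (downstream.isEmpty()) {
                return true;
            }
            return downstream.get(0)
                    .receive(blockId, data, checksum, downstream.subList(1, downstream.size()));
        }
    }

    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }

    /** Client side: returns true and updates the block map only if every node acked. */
    static boolean writeBlock(long blockId, byte[] data, List<DataNode> pipeline,
                              Map<Long, List<DataNode>> blockMap) {
        long checksum = crc(data); // step 5: checksum generated by the client
        boolean acked = pipeline.get(0)
                .receive(blockId, data, checksum, pipeline.subList(1, pipeline.size())); // steps 6-7
        if (acked) {
            blockMap.put(blockId, new ArrayList<>(pipeline)); // step 8: block map updated
        }
        return acked;
    }
}
```

In this model a failure anywhere in the pipeline leaves the block map untouched, which mirrors the slide's point that the Name Node records the block only after the first Data Node reports completion.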
HDFS - Read Flow
[Diagram: Client, Name Node (namespace, metadata, block map; fsimage and edit files) and three Data Nodes, with arrows numbered 1–6]
1. Client requests to open a file for reading through the fs.open() call.
2. Name Node validates the request; unlike a write, a read does not take a lease on the file path.
3. Client requests to read the data in the file.
4. Name Node responds with the block IDs in sequence and the Data Nodes holding each block.
5. Client contacts the Data Nodes directly for each block of data in the file.
6. When a Data Node sends back data along with its checksum, the client verifies the data by recomputing the checksum.
7. If the checksum verification fails, the client reaches out to another Data Node that holds a replica.
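The checksum verification and replica fallback in steps 6 and 7 can be sketched as follows. The names `ChecksumReadSketch`, `Replica`, and `readBlock` are hypothetical, and a single CRC32 over the whole block is a simplifying assumption; this only illustrates the idea and is not how the HDFS client is implemented.

```java
import java.util.List;
import java.util.zip.CRC32;

/** Toy model of steps 5-7 of the read flow; not Hadoop code. */
final class ChecksumReadSketch {

    /** What a data node returns: the block bytes plus the checksum stored at write time. */
    static final class Replica {
        final byte[] data;
        final long storedChecksum;

        Replica(byte[] data, long storedChecksum) {
            this.data = data;
            this.storedChecksum = storedChecksum;
        }
    }

    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }

    /** Returns the data of the first replica whose recomputed checksum matches the stored one. */
    static byte[] readBlock(List<Replica> replicas) {
        for (Replica r : replicas) {
            if (crc(r.data) == r.storedChecksum) {
                return r.data;
            }
            // Mismatch: fall through and try the next data node holding a replica.
        }
        throw new IllegalStateException("no replica passed checksum verification");
    }
}
```

A corrupted first replica is skipped silently here; a real client would typically also report the bad replica so it can be re-replicated.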
Authorization
• POSIX model for file and directory permissions
– Associated with an owner and a group
– Permission for owner, group and others
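– r to read files, w to write or append to files
– r to list a directory, w to create or delete files in it
– x to access child directories
– Sticky bit on a directory prevents anyone other than the superuser, the directory owner, or the file owner from deleting or moving files in it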
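The owner/group/others evaluation described above can be sketched with the JDK's own POSIX permission types. `PermissionSketch` and `allowed` are hypothetical names, the user and group names in the usage note are made up, and the sketch ignores the superuser bypass and the sticky bit.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

/** Toy POSIX-style permission check; ignores the superuser and the sticky bit. */
final class PermissionSketch {
    enum Action { READ, WRITE, EXECUTE }

    /**
     * Picks exactly one permission class for the caller (owner, then group, then others)
     * and tests the requested action against that class only.
     */
    static boolean allowed(String owner, String group, String mode,
                           String user, Set<String> userGroups, Action action) {
        Set<PosixFilePermission> perms = PosixFilePermissions.fromString(mode); // e.g. "rwxr-x---"
        String scope;
        if (user.equals(owner)) {
            scope = "OWNER";
        } else if (userGroups.contains(group)) {
            scope = "GROUP";
        } else {
            scope = "OTHERS";
        }
        // Enum constants are named OWNER_READ, GROUP_WRITE, OTHERS_EXECUTE, ...
        return perms.contains(PosixFilePermission.valueOf(scope + "_" + action.name()));
    }
}
```

For a path owned by `alice:analysts` with mode `rwxr-x---`, a member of `analysts` can read but not write, and a user outside the group gets no access.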
Kerberos
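[Diagram: KDC made up of the Authentication Server (AS), Ticket Granting Server (TGS) and Kerberos database (KDB); User; Service]
1. Create principal
2. User runs kinit
3. User receives a TGT
4. User requests a service ticket
5. User receives the service ticket
For service principals, keytabs are used.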
Secure HDFS Cluster - Authentication
[Diagram: the Namenode on the master and the Datanode on each of three slaves each hold a keytab and authenticate against the KDC]
Secure HDFS - Client Authentication
[Diagram: HDFS Client, KDC, Namenode and three slave Datanodes, with steps numbered 1–4. Labels: KRB Token, Deleg Token, Block Tokens; the Namenode and each Datanode hold a Key]
Authentication Configuration
• Set up Kerberos infrastructure
– It may already be available through Active Directory (AD)
• Define service principals
• Create keytabs for service principals
– E.g. HDFS, YARN
• Copy keytabs to the master and slave nodes
• Update the *-site.xml files
• Restart the services
HDFS Data Encryption
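[Diagram: HDFS Client, Key Mgmt Server, Key Trusty, Namenode, Datanode. Labels as extracted: 1 – EZ; 2 – EZ Key; 2 – Create EZ, EDEK; 3 – EDEK; 4 – R/W; 5. EZ is the encryption zone and EDEK the encrypted data encryption key]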
YARN
[Diagram: Resource Manager and three Node Managers, each holding a keytab; the client submits a MapReduce job; the Node Managers run the App Master and containers]
Controlling Resource Usage
• Schedulers
– Fair
– Capacity
• Queues defined to use a percentage of resources
– Hierarchy within queues
• Users and groups attached to queues
– Administer
– Submit
YARN Queue
• Root 100%
– Sec 70%: sadmin, suser
– Adhoc 30%: Aadmin, auser
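With hierarchical queues, a queue's share of the whole cluster is its percentage multiplied up through its ancestors. The sketch below models the Root/Sec/Adhoc tree from the slide. `QueueCapacitySketch` and `Queue` are hypothetical names, not YARN classes, and any deeper sub-queue you add is your own assumption rather than something the slide shows.

```java
/** Toy model of hierarchical queue capacities; not a YARN API. */
final class QueueCapacitySketch {

    static final class Queue {
        final String name;
        final int percentOfParent; // capacity as a percentage of the parent queue
        final Queue parent;        // null for the root queue

        Queue(String name, int percentOfParent, Queue parent) {
            this.name = name;
            this.percentOfParent = percentOfParent;
            this.parent = parent;
        }

        /** Share of the whole cluster in percent, multiplied up through the hierarchy. */
        double absolutePercent() {
            double parentShare = (parent == null) ? 100.0 : parent.absolutePercent();
            return parentShare * percentOfParent / 100.0;
        }
    }

    public static void main(String[] args) {
        Queue root = new Queue("Root", 100, null);
        Queue sec = new Queue("Sec", 70, root);
        Queue adhoc = new Queue("Adhoc", 30, root);
        System.out.println(sec.name + " -> " + sec.absolutePercent() + "% of the cluster");
        System.out.println(adhoc.name + " -> " + adhoc.absolutePercent() + "% of the cluster");
    }
}
```

A sub-queue defined as 50% of Sec would resolve to 35% of the cluster under this model, which is why sibling percentages are normally chosen to add up to 100.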
Hadoop Cluster - Secure Perimeter
[Diagram: clients reach the master and three slave nodes, which sit in a DMZ or separate network, through two IPS/IDS/firewall layers]
HDFS Services & Ports
HDFS Service              Port
Name Node                 8020
Name Node UI              50070
Secondary Name Node UI    50090
Data Node                 50020
Data Node UI              50075
Journal Node              8480, 8485
HttpFS                    14000, 14001
Principle of Least Privilege
• hdfs-site.xml
– dfs.permissions.superusergroup
– dfs.cluster.administrators
• core-site.xml
– Set hadoop.security.authorization to true
• hadoop-policy.xml
– security.client.protocol.acl
– security.client.datanode.protocol.acl
– security.get.user.mappings.protocol.acl
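The ACL properties in hadoop-policy.xml take a value of the form "user1,user2 group1,group2": a comma-separated user list, a space, then a comma-separated group list, with "*" meaning everyone. The sketch below shows how such a value could be evaluated. `ServiceAclSketch` and `isAllowed` are hypothetical names and this is not Hadoop's own parser.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy evaluator for "users groups" ACL strings; not Hadoop's implementation. */
final class ServiceAclSketch {

    /** True if the ACL is "*", names the user, or names one of the user's groups. */
    static boolean isAllowed(String acl, String user, Set<String> userGroups) {
        if (acl.trim().equals("*")) {
            return true;
        }
        // Text before the first space is the user list; text after it is the group list.
        String[] parts = acl.split(" ", 2);
        Set<String> users = toSet(parts[0]);
        Set<String> groups = parts.length > 1 ? toSet(parts[1]) : new HashSet<String>();
        if (users.contains(user)) {
            return true;
        }
        for (String g : userGroups) {
            if (groups.contains(g)) {
                return true;
            }
        }
        return false;
    }

    private static Set<String> toSet(String csv) {
        Set<String> out = new HashSet<>();
        for (String s : csv.split(",")) {
            String t = s.trim();
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }
}
```

A value with a leading space, such as " datausers", names groups only, so least privilege usually means replacing the default "*" with a short, explicit list.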
Application Code Change
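Secure Hadoop
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://NN:PORT/user/hbase");
// Tell the client that the cluster uses Kerberos authentication
conf.set("hadoop.security.authentication", "kerberos");
// Hand the security settings to UserGroupInformation before logging in
UserGroupInformation.setConfiguration(conf);
// Log in as the principal using its keytab file
UserGroupInformation.loginUserFromKeytab("ubuntu/hostname@REALM", "ubuntu.keytab");
FileSystem fs = FileSystem.get(conf);

Unsecure Hadoop
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://NN:PORT/user/hbase");
// As on the slide; a cluster without Kerberos uses the default value "simple"
conf.set("hadoop.security.authentication", "Kerberos");
FileSystem fs = FileSystem.get(conf);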
Key Takeaways
• New infrastructure will be part of enterprises
– May not be as big as the hype
• Adherence to application security principles
– Complexity and maturity may be a roadblock
• Constant follow-up on latest developments
References & Acknowledgements
• Hadoop Security
– https://issues.apache.org/jira/browse/HADOOP-4487
– Hadoop Project – Securing Hadoop Page
• HDFS Encryption
– https://issues.apache.org/jira/browse/HDFS-6134
– Hadoop Project Transparent Encryption Page
– http://www.slideshare.net/Hadoop_Summit/transparent-encryption-in-hdfs
• Hadoop service level authorization
• YARN
– Fair Scheduler
– Capacity Scheduler
• Hadoop Security Book
Thank You!!
bnair@asquareb.com
blog.asquareb.com
https://github.com/bijugs
@gsbiju
http://www.slideshare.net/bijugs
