1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Fine-Grained Security for Spark and Hive
Carter Shanklin - Director PM
Don Bosco Durai - Security Architect
June 29, 2016
Agenda
● Current security options and challenges
● Apache Ranger Overview
● LLAP Overview
● Use Cases and Demo
● Apache Atlas Integration
Current Options and Challenges
⬢ Limited to storage-level access control for Spark, Pig and MR
⬢ Column-level access via HiveServer2
⬢ Row-level filtering needs Hive views
– Multiple Hive views need to be created and managed
– Explicit permissions need to be given for each view/user
– Users need to know which view to use
⬢ Masking needs a custom UDF
– Needs to be wrapped using views
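To make the maintenance cost of the view-based approach concrete, here is a minimal Python sketch that generates the statements such a setup would need: one view and one grant per region. The table name `users`, the `customer_state` column (borrowed from the demo schema later in the deck), the view naming scheme and the role names are illustrative assumptions, and the exact GRANT syntax depends on the Hive authorization mode in use.

```python
# Hypothetical sketch: emulating row filtering with one Hive view per region.
def view_statements(table, filter_column, region_by_role):
    """Return the HiveQL statements needed for a view-per-region setup."""
    statements = []
    for role, region in sorted(region_by_role.items()):
        view = f"{table}_{region.lower()}"  # assumed naming convention
        statements.append(
            f"CREATE VIEW {view} AS "
            f"SELECT * FROM {table} WHERE {filter_column} = '{region}'"
        )
        # Each view needs its own explicit permission.
        statements.append(f"GRANT SELECT ON TABLE {view} TO ROLE {role}")
    return statements


if __name__ == "__main__":
    roles = {"marketing_ca": "CA", "marketing_ny": "NY"}
    for stmt in view_statements("users", "customer_state", roles):
        print(stmt + ";")
```

Every additional region adds another view and another grant, and users still have to know which view is theirs.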
Apache Ranger Overview
Apache Ranger
Authorization
• Centralized platform to define, administer and manage security policies consistently
• Enforce policies within each component
Auditing
• Central audit location for all access requests
• Support multiple destination sources (HDFS, Solr, etc.)
• Real-time visual query interface
Ranger KMS
• Store and manage encryption keys
• Support HDFS TDE
• Integration with HSM
Ranger Architecture
[Diagram. Labels: Ranger Administration Portal; Ranger Policy Server; Ranger Audit Server; Integration API; Enterprise Users; Legacy Tools and Data Governance; RDBMS; Hadoop components shown with Ranger plugins: HDFS, HBase, HiveServer2, Knox, NiFi, Solr, Kafka, YARN, Storm, Atlas]
Ranger Audits - Data Access
Ranger Audits - Admin Actions
LLAP Overview
Hive 2.0 and LLAP
⬢ At a High Level:
– 2000+ features, improvements and bug fixes in Hive since HDP 2.4.
– 600+ of these from outside of Hortonworks.
⬢ Major Improvements:
– Preview: Hive LLAP: Persistent query servers with intelligent in-memory caching.
– ACID GA: Hardened and proven at scale.
– Expanded SQL Compliance: More capable integration with BI tools.
– Performance: Interactive query, 2x faster ETL.
– Security: Row / Column security extending to views, Column level security for Spark.
Hive 2 with LLAP: Architecture Overview
Hive 2 with LLAP: Open Interfaces
Ranger Integration with Hive and LLAP
Hive / LLAP Security Capabilities with Ranger
⬢ Ranger Hive plugin provides authorization / access controls.
⬢ Column Masking:
– Inject Hive UDFs that mask characters or hash values.
– Dynamic, per-user.
⬢ Dynamic Row Filtering:
– Query is analyzed and policies applied.
– Dynamic, per-user.
⬢ All operations run as ordinary SQL queries:
– Masking policies become expressions in the SQL SELECT clause.
– Filters become predicates in the SQL WHERE clause.
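The effect of the bullets above can be sketched as a small query builder: masked columns are wrapped in a UDF call in the SELECT list, and the row filter is appended as a WHERE clause. This only illustrates the shape of the resulting SQL; it is not how Hive or Ranger perform the rewrite internally, and the function name `rewrite_query` and its parameters are hypothetical.

```python
# Hypothetical sketch of the SQL that results from applying policies.
def rewrite_query(table, columns, masks=None, row_filter=None):
    """Build a SELECT where masks wrap columns and the filter becomes WHERE."""
    masks = masks or {}
    select_items = []
    for col in columns:
        if col in masks:
            # Masked column: wrap it in the UDF call, keep the original name.
            select_items.append(f"{masks[col]}({col}) AS {col}")
        else:
            select_items.append(col)
    sql = f"SELECT {', '.join(select_items)} FROM {table}"
    if row_filter:
        sql += f" WHERE {row_filter}"
    return sql


if __name__ == "__main__":
    print(rewrite_query(
        "users",
        ["customer_name", "customer_ccn"],
        masks={"customer_ccn": "mask_hash"},
        row_filter="customer_state = 'CA'",
    ))
```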
Native Hive Masking Capabilities
UDF | Purpose | Example Start | Example Result
mask | Convert letters to X/x and numbers to n. | 123 Fake St. | nnn Xxxx Xx.
mask_first_n | Mask only the first n characters. | 433-54-3937 | nnn-54-3937
mask_last_n | Mask only the last n characters. | 433-54-3937 | 433-54-nnnn
mask_show_first_n | Mask, showing only the first n characters. | 555-233-1234 | 555-nnn-nnnn
mask_show_last_n | Mask, showing only the last n characters. | 433-54-3937 | nnn-nn-3937
mask_hash | Produce a consistent hash of the field. | CA | 21f241cccaa5cfa33190f56ff1510e37
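As a rough illustration of the behavior in the table, the functions below emulate the masking rules in plain Python. They are a sketch, not Hive's implementation: `n` must be passed explicitly here, punctuation is left untouched, and MD5 is assumed for `mask_hash` only because the example result has 32 hex digits. The digest Hive actually uses may differ by version.

```python
import hashlib


def _mask_char(ch):
    """Map one character: upper -> X, lower -> x, digit -> n, else unchanged."""
    if ch.isupper():
        return "X"
    if ch.islower():
        return "x"
    if ch.isdigit():
        return "n"
    return ch


def mask(value):
    return "".join(_mask_char(c) for c in value)


def mask_first_n(value, n):
    return mask(value[:n]) + value[n:]


def mask_last_n(value, n):
    cut = max(len(value) - n, 0)  # index where the masked tail starts
    return value[:cut] + mask(value[cut:])


def mask_show_first_n(value, n):
    return value[:n] + mask(value[n:])


def mask_show_last_n(value, n):
    cut = max(len(value) - n, 0)  # index where the visible tail starts
    return mask(value[:cut]) + value[cut:]


def mask_hash(value):
    # Assumption: MD5 hex digest; chosen only to match the 32-digit example.
    return hashlib.md5(value.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(mask("123 Fake St."))
    print(mask_show_last_n("433-54-3937", 4))
```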
Delivering Spark Security
Key Features: Spark Column Security with LLAP
⬢ Fine-grained column-level access control for SparkSQL.
⬢ Fully dynamic policies per user. Doesn’t require views.
⬢ Use standard Ranger policies and tools to control access and masking.
Flow:
1. SparkSQL gets data locations, known as “splits”, from HiveServer2 and plans the query.
2. HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied.
3. Spark gets a modified query plan based on the dynamic security policy.
4. Spark reads data from LLAP. Filtering / masking is guaranteed by the LLAP server.
Example: Per-User Row Filtering by Region in SparkSQL
Use Cases
Demo Setup
⬢ Customer Users and Sales data in ORC (metadata in the Metastore)
⬢ Data can be accessed via SparkSQL or HiveServer2
⬢ Marketing needs access to Sales and Users data for analytics
⬢ Fraud Investigation team needs access to data for fraud detection
⬢ Billing team needs access to Sales and Users data for billing
Tables
Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
Sales: customer_id, product_id, promotion_id, cookie_id, tracking_id
Group | Users
Fraud | frank
Marketing | mark
Billing | bill
Use Case 1: Restricting Column Access
This is a simple use case where certain groups or users don’t have permission to view certain columns.
⬢ Billing group has access to all columns in table Users
⬢ Marketing group can’t access the credit card column in table Users
Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
User | customer_phone | customer_ccn
bill (Billing) | 😀 | 😀
mark (Marketing) | 😀 | 😡
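The access matrix above can be modeled as a toy column-level check. This is only a sketch of the decision being made; the dictionaries and the `denied_columns` helper are hypothetical and do not reflect Ranger's policy format or plugin API.

```python
# Hypothetical in-memory model of the column-access policy for table Users.
USERS_COLUMNS = [
    "customer_id", "customer_name", "customer_email", "customer_phone",
    "customer_ccn", "customer_state", "customer_zip",
]
ALLOWED_COLUMNS = {
    "Billing": set(USERS_COLUMNS),
    "Marketing": set(USERS_COLUMNS) - {"customer_ccn"},
}
USER_GROUP = {"bill": "Billing", "mark": "Marketing"}


def denied_columns(user, requested_columns):
    """Return the requested columns the user may not read (empty = allowed)."""
    group = USER_GROUP.get(user)
    allowed = ALLOWED_COLUMNS.get(group, set())  # unknown users get nothing
    return sorted(set(requested_columns) - allowed)


if __name__ == "__main__":
    query_columns = ["customer_phone", "customer_ccn"]
    for user in ("bill", "mark"):
        print(user, denied_columns(user, query_columns))
```

An empty list means the query is authorized; any entry means the whole query would be rejected for that user.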
Ranger Policy - Restrict Columns
Ranger Policy - Restrict Columns - Results
bill from Billing
mark from Marketing
Ranger Policy - Restrict Columns - Audit Screen
Use Case 2: Column Masking
In this use case, certain groups or users can't see the real values of certain columns.
⬢ Billing group can see the real/raw values for all columns in table Users
⬢ Fraud group can only see masked values of PII and PCI fields from table Users
Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
User | customer_email, customer_phone, customer_ccn
bill (Billing) | 😀
frank (Fraud) | 😎
Ranger Policies - Mask Fields
Ranger Policy - Column Masking - Results
bill from Billing
frank from Fraud
Ranger Policy - Column Masking - Audit Screen
Use Case 3: Row Filtering
In this use case, certain groups or users can't see all the rows in certain tables.
⬢ Billing group can see all the rows in the table Users
⬢ Marketing can only see rows/data from their region in the table Users
Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
User | Rows in Users table
bill (Billing) | 😀
mark (Marketing-CA) | Only CA users
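A per-user row filter like this one can be tried out with Python's built-in `sqlite3` module standing in for Hive. The filter expression `customer_state = 'CA'`, the `ROW_FILTERS` mapping and the sample rows are assumptions made for illustration; the actual policy is defined in the Ranger UI and applied by HiveServer2.

```python
import sqlite3

# Hypothetical mapping from user to a row-filter expression; None = no filter.
ROW_FILTERS = {"bill": None, "mark": "customer_state = 'CA'"}


def demo_connection():
    """Create an in-memory stand-in for the Users table with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (customer_id INTEGER, customer_state TEXT)"
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)", [(1, "CA"), (2, "NY"), (3, "CA")]
    )
    return conn


def query_users(conn, user):
    """Run the same query for any user, appending that user's row filter."""
    sql = "SELECT customer_id, customer_state FROM users"
    row_filter = ROW_FILTERS[user]
    if row_filter:
        sql += f" WHERE {row_filter}"
    return conn.execute(sql + " ORDER BY customer_id").fetchall()


if __name__ == "__main__":
    conn = demo_connection()
    print("bill:", query_users(conn, "bill"))
    print("mark:", query_users(conn, "mark"))
```

Both users issue the same query against the same table; only the appended predicate differs.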
Ranger Policies - Row Filtering
Ranger Policy - Row Filtering - Results
bill from Billing
mark from Marketing
Use Case 4: Row Filtering - Cross Table
This is an extension of the previous use case, where the context information for filtering the rows is in another table.
⬢ Billing group can see all the rows in the table Sales
⬢ Marketing can only see rows/data from their region in the table Sales. However, the Sales table doesn’t have the customer geographic information, so it needs to be derived from the Users table
Users: customer_id, customer_name, customer_email, customer_phone, customer_ccn, customer_state, customer_zip
Sales: customer_id, product_id, promotion_id, cookie_id, tracking_id
User | Rows in Sales table
bill (Billing) | 😀
mark (Marketing-CA) | Only CA users
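One plausible way to express such a cross-table filter is a subquery on the shared `customer_id` key. The sketch below runs that idea against an in-memory SQLite database as a stand-in for Hive; the filter text, the helper names and the sample rows are assumptions, since the actual policy appears only as a screenshot in the deck.

```python
import sqlite3

# Assumed filter for Marketing-CA: keep sales whose customer lives in CA.
CROSS_TABLE_FILTER = (
    "customer_id IN "
    "(SELECT customer_id FROM users WHERE customer_state = 'CA')"
)


def build_demo():
    """Create in-memory stand-ins for the Users and Sales tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (customer_id INTEGER, customer_state TEXT)"
    )
    conn.execute(
        "CREATE TABLE sales (customer_id INTEGER, product_id INTEGER)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "CA"), (2, "NY")])
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)", [(1, 100), (2, 200), (1, 300)]
    )
    return conn


def sales_for(conn, row_filter=None):
    """Select sales rows, optionally restricted by a row-filter predicate."""
    sql = "SELECT customer_id, product_id FROM sales"
    if row_filter:
        sql += f" WHERE {row_filter}"
    return conn.execute(sql + " ORDER BY product_id").fetchall()


if __name__ == "__main__":
    conn = build_demo()
    print("Billing:  ", sales_for(conn))
    print("Marketing:", sales_for(conn, CROSS_TABLE_FILTER))
```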
Ranger Policies - Row Filtering - Cross Table
Apache Atlas Integration
Cross Product Symbiosis
[Diagram. Labels: Apache Atlas; Apache Ranger; LLAP; Classification/Tagging; Governance; Lineage; Tag Based Policies; Dynamic Custom Policies; Enforcement hooks; HDFS; S3; Metastore]
* Column Masking and Row Filtering not yet supported by tag based policy
Ranger - Tag Based Policies
Q & A
