Magnus.Runesson@svenskaspel.se
DataWorks Summit 2018-04-19
Practical experiences using Atlas and Ranger to implement GDPR
2018-04-25 1
Who is talking?
Magnus Runesson
Data Engineer @ Svenska Spel
Developer
Ops
RDBMS
BigData
High performance
Gaming is for everyone's enjoyment
• This talk does not cover all we have done around GDPR
• This is NOT a claim that doing this makes you GDPR compliant.
• Some details are left out or simplified
Disclaimer
• Why?
• Svenska Spel’s data warehouse
• Atlas & Ranger
• How did we implement it?
• Experiences and conclusions
Agenda
GDPR requires
• clear purpose for PII data
• privacy by design
• clear consent or legal ground
• not to use/store PII if not needed
• people own their own data.
• penalty if not followed
Why?
• Our customers' and partners' integrity is protected
• Users only have access to data intended for the current purpose
• Keep doing our required processing
• Adaptable for new requirements
• Maintainable solution
Goals
Svenska Spel’s data warehouse
• Moved from classic Cognos + Oracle
• HDP 2.6 using Hive
• Includes Personally Identifiable Information (PII)
• 300+ event streams in
• 150 published tables and views
Svenska Spel’s data warehouse
• The data we use is
• Understood
• Documented
• Modelled
• Modelled with Data Vault
• Oracle SQL Developer Data Modeler
• SQL code generated from model
Model based development
• History tracking
• Uniquely linked
• Pattern based
• Easy to generate code
Data Vault
[Diagram: one Link, two Hubs, three Satellites]
[Architecture diagram. Labels: Hadoop, DataLake, Integration, DataVault, Dimension mart, BI mart, CRM mart, ETL, Anonymization, Whitelisting, Role-based access, Presentation, Exasol, Tableau, CRM, …]
Apache Atlas and Ranger
• Metadata about resources
• A resource is
• Table
• Column
• Schema
• File on HDFS
• …
• Lineage
Apache Atlas
• Tags have no meaning themselves
• Your business vocabulary defines the meaning
• Example of tags:
• Business entity owning the data
• Indication of sensitive data
• The rules in Ranger enforce the policy
• Separate metadata from policy implementation
Atlas tags
PII
• Is user U allowed to do operation O on resource R?
• Access
• Row based filtering
• Masking
• Audit logging
• Resources are referred to by tags
Apache Ranger
customer
Customer_id  Name   Postal_code  Has_phone  Marketing
1            Steve  12345        False      False
2            Bill   54321        True       False
3            Paul   54672        False      True
Table in Hive before we started our work
customer (table tag: PII_table; column tag: PII)
Customer_id  Name   Postal_code  Has_phone  Marketing
1            Steve  12345        False      False
2            Bill   54321        True       False
3            Paul   54672        False      True
Add PII tags on table and columns in Atlas.
No behaviour change.
customer (table tag: PII_table; column tag: PII)
Customer_id  Name  Postal_code  Has_phone  Marketing
17           ABC   12345        False      False
42           DEF   54321        True       False
13           BDE   54672        False      True
We set a rule in Ranger to mask PII columns
Analyst view
customer (table tag: PII_table; column tag: PII)
Customer_id  Name  Postal_code  Has_phone  Marketing
3            Paul  54672        False      True
Ranger restricts our CRM user to see only rows with Marketing = True
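The effect of the masking rule and the row filter on the sample table can be mimicked in plain Python. This is a toy model for illustration only, not how Ranger works internally: the hash-based `mask` function, the helper names and the choice of `customer_id` and `name` as the PII columns are assumptions (the latter inferred from which columns change in the analyst view), and the masked values differ from those shown on the slide.

```python
import hashlib

# Sample rows from the customer table shown on the slides.
ROWS = [
    {"customer_id": 1, "name": "Steve", "postal_code": "12345",
     "has_phone": False, "marketing": False},
    {"customer_id": 2, "name": "Bill", "postal_code": "54321",
     "has_phone": True, "marketing": False},
    {"customer_id": 3, "name": "Paul", "postal_code": "54672",
     "has_phone": False, "marketing": True},
]

# Assumption: these are the columns carrying the PII tag.
PII_COLUMNS = {"customer_id", "name"}


def mask(value):
    """Replace a value with a short, stable hash (hypothetical masking scheme)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:8]


def analyst_view(rows):
    """All rows are visible, but PII columns are masked."""
    return [
        {col: mask(val) if col in PII_COLUMNS else val for col, val in row.items()}
        for row in rows
    ]


def crm_view(rows):
    """Only rows with marketing set to True are visible; values are unmasked."""
    return [row for row in rows if row["marketing"]]
```

Calling `crm_view(ROWS)` keeps only the row for Paul, while `analyst_view(ROWS)` keeps all three rows with the two PII columns replaced by hashes.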
How did we implement this?
Development process
[Diagram: Change model, Store model, Generate code, Deploy; Add rules (PII)]
• In-house tool
• Template based generation of SQL/HQL
• Generates files with tag information
• One file for tables and one for columns
HQL generator
[Diagram: HQL generator, CSV and SQL files, PII tag]
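Template-based generation of this kind can be illustrated with a small sketch. The in-house generator is not shown in the talk, so everything here is hypothetical: the model layout, the column types, the function names `generate_ddl` and `generate_tag_rows`, and the use of Python's `string.Template`.

```python
from string import Template

# Hypothetical model entry; the real model is kept in a data modelling tool.
MODEL = {
    "schema": "dim_mart",
    "table": "customer_d",
    "columns": [
        {"name": "customer_id", "type": "BIGINT", "tags": ["PII"]},
        {"name": "has_phone", "type": "BOOLEAN", "tags": []},
    ],
}

DDL_TEMPLATE = Template("CREATE TABLE IF NOT EXISTS $schema.$table (\n$columns\n)")


def generate_ddl(model):
    """Render a CREATE TABLE statement from the model entry."""
    columns = ",\n".join(f"  {c['name']} {c['type']}" for c in model["columns"])
    return DDL_TEMPLATE.substitute(
        schema=model["schema"], table=model["table"], columns=columns
    )


def generate_tag_rows(model):
    """Render the lines of a semicolon-separated column tag file."""
    rows = ["schema;table;attribute;tags"]
    for c in model["columns"]:
        rows.append(
            ";".join([model["schema"], model["table"], c["name"], ",".join(c["tags"])])
        )
    return rows


ddl = generate_ddl(MODEL)
tag_rows = generate_tag_rows(MODEL)
```

Generating the DDL and the tag file from the same model entry is what keeps the two from drifting apart.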
schema;table;attribute;tags
dim_mart;customer_d;customer_id;PII,Sensitive
dim_mart;customer_d;has_phone;
A corresponding file exists for tables, without the attribute (column) field
Tag file for columns
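A tag file in this layout can be read with Python's standard `csv` module. The following is a minimal sketch, assuming a header row, semicolon delimiters and comma-separated tags as in the example above; the function name `read_column_tags` is hypothetical and not taken from the in-house tool.

```python
import csv
import io


def read_column_tags(text):
    """Map (schema, table, attribute) to a list of tags; an empty field means no tags."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    result = {}
    for row in reader:
        key = (row["schema"], row["table"], row["attribute"])
        tags = [t.strip() for t in (row["tags"] or "").split(",") if t.strip()]
        result[key] = tags
    return result


SAMPLE = """schema;table;attribute;tags
dim_mart;customer_d;customer_id;PII,Sensitive
dim_mart;customer_d;has_phone;
"""

column_tags = read_column_tags(SAMPLE)
```

Here `customer_id` maps to the tags `PII` and `Sensitive`, and `has_phone` maps to an empty list.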
• Hand-coded rules per tag
• Policy tool applies the rule to all tables with the tag
• Can be different rules for different users
• Ranger appends the filter to the WHERE condition
• Used for
• Row-based filtering (access)
• Masking (anonymization)
• Catch-all rule to deny access to tables not in our model
Ranger rules
tag;groups;users;filter
PII_table;CRM_USERS;;EXISTS (SELECT 1
FROM customer_whitelist_crm whitelist_crm
WHERE whitelist_crm.customer_id = $table.customer_id)
$table gets replaced at deployment time
Example tag_row_policies.csv
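The `$table` substitution can be sketched with Python's `string.Template`. This is an illustration only: the function name `expand_filter` and the qualified table name `dim_mart.customer_d` used as the replacement value are assumptions, since the talk does not show how the policy tool performs the replacement.

```python
from string import Template

# Filter expression from the example policy file, joined onto one line.
FILTER = (
    "EXISTS (SELECT 1 "
    "FROM customer_whitelist_crm whitelist_crm "
    "WHERE whitelist_crm.customer_id = $table.customer_id)"
)


def expand_filter(filter_template, tables):
    """Return one concrete filter expression per table by replacing $table."""
    template = Template(filter_template)
    return {table: template.substitute(table=table) for table in tables}


# Hypothetical list of tables carrying the PII_table tag.
filters = expand_filter(FILTER, ["dim_mart.customer_d"])
```

Because the placeholder is resolved per table, a single hand-written rule for a tag can yield one concrete filter for every table that carries the tag.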
Deployment process
[Diagram. Inputs: *.sql, table_tags.csv, column_tags.csv, tag_row_policies.csv. Steps: Apply *.sql DDL; Policy tool - tag files; Policy tool - policy file]
• Makes it easy to manage
• Atlas tags
• Ranger policy rules
• Command line tool
• Consumes tags and policy CSV files
• Calls Atlas and Ranger API
• Fewer than 1,000 lines of Python
Policytool
[Diagram: Policytool and CSV files]
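What such a tool sends to the two services might look like the sketch below, which only builds request data and makes no network calls. It assumes the Atlas v2 REST endpoint for adding classifications to an entity and the Ranger public v2 policy format for Hive row-filter policies. The function names, the host name, the service name `hive_service` and the policy naming scheme are hypothetical, and the field names should be checked against the API documentation for the versions in use.

```python
def atlas_classification_request(base_url, entity_guid, tags):
    """URL and JSON body for adding classifications (tags) to one Atlas entity."""
    url = f"{base_url}/api/atlas/v2/entity/guid/{entity_guid}/classifications"
    body = [{"typeName": tag} for tag in tags]
    return url, body


def ranger_row_filter_policy(service, database, table, groups, users, filter_expr):
    """JSON body for a Ranger Hive row-filter policy on a single table."""
    return {
        "service": service,
        "name": f"row_filter_{database}_{table}",  # hypothetical naming scheme
        "policyType": 2,  # assumed: 2 denotes a row-level filter policy
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
        },
        "rowFilterPolicyItems": [
            {
                "groups": groups,
                "users": users,
                "accesses": [{"type": "select", "isAllowed": True}],
                "rowFilterInfo": {"filterExpr": filter_expr},
            }
        ],
    }


url, body = atlas_classification_request(
    "http://atlas.example:21000", "hypothetical-guid", ["PII"]
)
policy = ranger_row_filter_policy(
    "hive_service", "dim_mart", "customer_d", ["CRM_USERS"], [], "marketing = true"
)
```

Keeping payload construction separate from the HTTP calls makes the tool easy to test without a running Atlas or Ranger instance.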
Put everything together
Development process
[Diagram: Change model, Store model, Generate code, Deploy; Add rules (PII). Files: *.sql, column_tags.csv, table_tags.csv, tag_row_policies.csv]
Deployment process
[Diagram. Inputs: *.sql, column_tags.csv, table_tags.csv, tag_row_policies.csv. Steps: Apply *.sql DDL; Policy tool - tag files; Policy tool - policy file]
Change in view of an Analyst
[Figure panels: Before, CRM, Analyst]
• Simple and easy model
• Limited performance penalty
• Tag on table with masking rule => all columns masked
• API documentation is hard to understand
• Limitation: Ranger row-based filtering cannot be defined on tags
• Row-based filtering and masking do not apply to direct file access
Experiences of Atlas and Ranger
• Our customers' and partners' integrity is protected
• Users only have access to data intended for the current purpose
• Keep doing our required processing
• Adaptable for new requirements
• Maintainable solution
Reached Goals
• Goals reached
• No SQL changes
• Scales when new datasets are added
• Our data model is guaranteed to be in sync
• Better comments in Hive
• Minimal impact on ETL developers' workflow
Conclusions
• Make it as simple as possible
• Automate
• Know your tool
• Be clear on your authorization model
• Know your data
Takeaways
Magnus.Runesson@svenskaspel.se
@MRunesson
Thank you!
karriar.svenskaspel.se
