Scaling Privacy with Apache Spark
Aaron Colcord
Sr. Director Engineering, Northwestern Mutual
Don Durai Bosco
CTO and Co-Founder, Privacera
Agenda
▪ Our background
▪ Why privacy, security, and compliance?
▪ Approaches
▪ The ideal solution
▪ Real life meets ideal life
Backgrounds
• Northwestern Mutual
▪ Building an enterprise-scale unified framework
▪ Very long, respected history (~160 years)
▪ Compliance is extremely important to us
▪ Agile data vs. compliant data
• Privacera
▪ Founded in 2016 by the creators of Apache Ranger & Apache Atlas
▪ Extends Ranger's capabilities beyond traditional Big Data environments to cloud (Databricks, AWS, Azure, GCP, and more)
▪ Specializes in democratizing data for analytics, while ensuring compliance with privacy regulations (GDPR, CCPA, LGPD, HIPAA, & more)
Why do we suddenly care about privacy?
• You care if you are regulated in any form
• Simple: you need to show you can pass an audit
• You care if you store any information about your users
• Simple: governments have woken up with GDPR and CCPA
• You care if you want to democratize your data
• Simple: the use of your data can be scrutinized
We always cared, but technology got ahead of privacy. Privacy is often an assumed competency, and technology has shown how important it really is.
Have you ever...
Gone to a website and read its privacy policy, clicked accept cookies, or accepted terms of service or a EULA?
• Clicked ‘accept all’ on a website, or used a digital assistant?
• Collecting information about your customers can
• Improve the experience
• Allow the company to understand its business better
• At the core, privacy is a policy and legal obligation
• You have the data; it used to be your business just to secure it.
• Do you want your information monetized? Sold? Traded?
• Most companies don’t do this, but the privacy policy is there for you.
And it’s only going to pick up speed.
• More regulations around privacy are arriving
• Increasing your ability to execute against data means respecting your users’ rights
• Part of maturity is being able to manage governance
More importantly, why do we care so much?
• Technology like Apache Spark opens up the ability to democratize your data.
• Nearly every company wants the marketplace to enrich and share their data.
• Who inside that company can view it? Do we have the controls to protect your information? Can we verify that the information is used for the right purposes?
What is the difference between these?
• Security
▪ Preventing unauthorized usage of systems
▪ Ensuring users don’t see the wrong information
▪ Creating boundaries to enforce the right actions of the system
• Compliance
▪ The process of making sure your company and employees follow all laws, regulations, standards, and ethical practices that apply to your organization
• Privacy
▪ “Data privacy may be defined as the authorized, fair, and legitimate processing of personal information”
▪ Consent rights
▪ Do not share
▪ Slippery space
Examine strategies to scale agile data with privacy
• Build a metadata layer that defines PII in its schema
• Users and developers can and will change where PII is stored
• You can literally chase people to do the ‘right thing’ forever
• You could build views with permissions for certain users
• Not very scalable
• Plus, you always need to show who accessed the data and why
• Are these security scenarios?
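To make the two strategies above concrete, here is a minimal sketch in plain Python, not the speakers' implementation. The `Column` class, the `pii` flag, the `customers` table, and the `view_sql` helper are all hypothetical names chosen for illustration. The schema carries a PII marker per column, and a per-audience view is generated from it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Column:
    name: str
    pii: bool = False  # hypothetical flag kept in the metadata layer


# Hypothetical schema, for illustration only.
CUSTOMER_SCHEMA = [
    Column("customer_id"),
    Column("email", pii=True),
    Column("ssn", pii=True),
    Column("state"),
]


def view_sql(table: str, schema: List[Column], may_see_pii: bool) -> str:
    """Build a SELECT that omits PII columns unless the caller may see them."""
    visible = [c.name for c in schema if may_see_pii or not c.pii]
    return f"SELECT {', '.join(visible)} FROM {table}"


analyst_view = view_sql("customers", CUSTOMER_SCHEMA, may_see_pii=False)
steward_view = view_sql("customers", CUSTOMER_SCHEMA, may_see_pii=True)
```

Every new audience, and every column whose PII status changes, means another generated view to maintain, which is the scaling concern the slide raises.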
Challenges to that strategy
• Is the metadata layer flexible enough or should we think in policies?
• Privacy is inherently your organization’s position, which may evolve based on regulation
• Can your development keep up with views?
• When you discover the extra 10,000 fields, can you keep up?
• Implement a framework that scales
• Security is not Privacy.
• Security has a different domain and set of principles.
• Remember we are protecting the usage of your data.
How can we solve it?
Ideal scalable system
• User Rights
▪ Revocation of Consent
▪ Portability
▪ Erasure
▪ Rectification
▪ How is data used?
▪ Rights follow Data Reuse
▪ Flexible to change
• Classification
▪ Should align with a Data Governance program
▪ Should adapt to changing data
▪ Proactive
▪ Reclassification
• Audit/Governance
▪ How was it used?
▪ How was it accessed?
▪ How was it protected?
▪ Did it cross borders?
• Access
▪ Authorization of User may change
▪ Supports Agile Access
▪ Business Use is preserved
▪ Automated Systems obey Privacy
User Rights at Scale
▪ Revocation of Consent / Right To Be Forgotten
▪ Portability
▪ Erasure
▪ Rectification
▪ How is data used?
▪ Rights follow Data Reuse
▪ Flexible to change
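Two of the rights listed above, revocation of consent and erasure, can be sketched as follows. This is a toy illustration in plain Python with in-memory data; the `consent` registry, the dataset names, and the helper functions are assumptions, not anything shown in the talk. The derived dataset is included to show that rights have to follow data as it is reused.

```python
from typing import Dict, List, Set

# Hypothetical consent registry: subject id -> purposes consented to.
consent: Dict[str, Set[str]] = {
    "u1": {"marketing", "analytics"},
    "u2": {"analytics"},
}

# Hypothetical datasets, including a derived one that reuses the same subjects.
datasets: Dict[str, List[dict]] = {
    "orders": [{"subject": "u1", "total": 40}, {"subject": "u2", "total": 15}],
    "order_features": [{"subject": "u1", "avg_total": 40.0}],
}


def revoke_consent(subject: str, purpose: str) -> None:
    """Withdraw one purpose; later reads for that purpose skip the subject."""
    consent.get(subject, set()).discard(purpose)


def readable_rows(name: str, purpose: str) -> List[dict]:
    """Return only rows whose subject currently consents to this purpose."""
    return [r for r in datasets[name] if purpose in consent.get(r["subject"], set())]


def erase_subject(subject: str) -> Dict[str, int]:
    """Remove the subject from every dataset and report rows removed per dataset."""
    removed = {}
    for name, rows in list(datasets.items()):
        kept = [r for r in rows if r["subject"] != subject]
        removed[name] = len(rows) - len(kept)
        datasets[name] = kept
    consent.pop(subject, None)
    return removed


revoke_consent("u1", "marketing")
marketing_rows = readable_rows("orders", "marketing")
erased = erase_subject("u1")
```

In a real deployment the hard part is knowing every place a subject's data has been copied to, which is why the later slides lean on discovery and tag propagation.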
Privacy Challenges in Open Data Ecosystem
[Diagram: users reaching data through many tools]
▪ Storage: S3, ADLS, Redshift, Snowflake, Synapse
▪ SQL Engines: Athena, Databricks, HDInsight, EMR
▪ Data Virtualization: Dremio, Trino, PrestoDB
▪ BI Tools: PowerBI, Tableau
▪ Users: Marketing, Data Analyst, Data Scientist/Architect
Governance blind spot
Tools & Technology
● Automated Data Discovery
● Centralized Access Control
● Audit Collection and Reporting
Automated Data Discovery
● Automatically detect and catalog sensitive data
● Detailed classification, e.g. EMAIL, SSN, GENDER, CC, PHONE_NUMBER, etc.
● Eliminate manual processes
● Catalog data as it is ingested
● Track data movement and propagate tags
● Catalog data across multiple cloud services
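As a rough illustration of automated detection, the sketch below samples column values and assigns a tag when most of them match a pattern. It is a simplified stand-in written in plain Python, not how Privacera's discovery works; the regular expressions, the 0.8 threshold, and the sample table are assumptions, and real classifiers use far more than regexes.

```python
import re

# Deliberately simple patterns, for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "PHONE_NUMBER": re.compile(r"^\(?\d{3}\)?[ .-]?\d{3}[ .-]\d{4}$"),
}


def classify_column(values, threshold=0.8):
    """Return the first tag matching at least `threshold` of non-empty values, else None."""
    sample = [v.strip() for v in values if v and v.strip()]
    if not sample:
        return None
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in sample if pattern.match(v))
        if hits / len(sample) >= threshold:
            return tag
    return None


def discover(table):
    """Map each column name to a detected tag, skipping untagged columns."""
    tags = {}
    for column, values in table.items():
        tag = classify_column(values)
        if tag is not None:
            tags[column] = tag
    return tags


# Hypothetical sampled values per column.
SAMPLE_TABLE = {
    "contact": ["ann@example.com", "bob@example.org", "cy@example.net"],
    "tax_id": ["123-45-6789", "987-65-4321", ""],
    "notes": ["called twice", "prefers email", "n/a"],
}

tags = discover(SAMPLE_TABLE)
```

Note that the tag comes from the values, not the column name, so a field renamed or moved by a developer is still caught on the next scan.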
Centralized Access Control
● Global Tag/Classification-based policies
● Purpose and Persona based policies
● Dynamic row filters vs. views
● Dynamic masking or decryption
● Approval workflows with time and purpose constraints
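The idea of tag-based policies with dynamic masking and row filters can be sketched in a few lines of plain Python. This is an illustrative toy, not Apache Ranger's or Privacera's policy model; the `Policy` class, the tag assignments, the purposes, and the masking functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional

Row = Dict[str, object]

# Hypothetical column tags, as a discovery step might have produced them.
COLUMN_TAGS = {"email": "EMAIL", "ssn": "SSN"}


@dataclass(frozen=True)
class Policy:
    tag: str
    allowed_purposes: FrozenSet[str]  # purposes that may see the clear value
    mask: Callable[[str], str]        # applied for every other purpose


def mask_email(value: str) -> str:
    _, _, domain = value.partition("@")
    return "***@" + domain


def mask_ssn(value: str) -> str:
    return "***-**-" + value[-4:]


# One policy per tag, written once and applied to every tagged column.
POLICIES = {
    "EMAIL": Policy("EMAIL", frozenset({"support"}), mask_email),
    "SSN": Policy("SSN", frozenset({"fraud_review"}), mask_ssn),
}


def apply_policies(
    rows: List[Row],
    purpose: str,
    row_filter: Optional[Callable[[Row], bool]] = None,
) -> List[Row]:
    """Filter rows, then mask tagged columns the purpose is not cleared for."""
    result = []
    for row in rows:
        if row_filter is not None and not row_filter(row):
            continue
        out = {}
        for column, value in row.items():
            policy = POLICIES.get(COLUMN_TAGS.get(column, ""))
            if policy is not None and purpose not in policy.allowed_purposes:
                out[column] = policy.mask(str(value))
            else:
                out[column] = value
        result.append(out)
    return result


ROWS = [
    {"id": 1, "state": "WI", "email": "ann@example.com", "ssn": "123-45-6789"},
    {"id": 2, "state": "CA", "email": "bob@example.org", "ssn": "987-65-4321"},
]

support_rows = apply_policies(ROWS, "support", row_filter=lambda r: r["state"] == "WI")
```

Because the policy is attached to the tag and evaluated at read time, a newly tagged column is covered without anyone writing another view.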
Centralized Auditing and Reporting
● Centralize auditing
● Monitor data access by classification
● Track usage by Purpose
● Generate attestation reports
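A small sketch of what centralized audit reporting could look like, again in plain Python with made-up data. The `AccessEvent` fields and the sample events are assumptions; the point is only that once every access is recorded with its classification and purpose, the reports named above become simple aggregations.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AccessEvent:
    user: str
    resource: str
    classification: str  # tag of the data touched, e.g. EMAIL or SSN
    purpose: str
    allowed: bool


# Hypothetical events collected from several engines into one place.
EVENTS = [
    AccessEvent("ana", "customers.email", "EMAIL", "support", True),
    AccessEvent("ana", "customers.ssn", "SSN", "support", False),
    AccessEvent("raj", "customers.ssn", "SSN", "fraud_review", True),
    AccessEvent("raj", "customers.email", "EMAIL", "fraud_review", True),
]


def access_by_classification(events: List[AccessEvent]) -> Counter:
    """Count successful accesses per data classification."""
    return Counter(e.classification for e in events if e.allowed)


def usage_by_purpose(events: List[AccessEvent]) -> Counter:
    """Count successful accesses per declared purpose."""
    return Counter(e.purpose for e in events if e.allowed)


def denied(events: List[AccessEvent]) -> List[AccessEvent]:
    """List attempts that policy blocked, useful for attestation."""
    return [e for e in events if not e.allowed]


by_class = access_by_classification(EVENTS)
by_purpose = usage_by_purpose(EVENTS)
blocked = denied(EVENTS)
```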