Automating Data Pipeline Security
OSDC 2019 | Automating Security in Your Data Pipeline by Troy Harvey
Carta’s Data Team is Hiring 🎉
Automating Data Pipeline Security
Privacy
3 Big Ideas
1. Privacy has a strange history.
2. Privacy-first systems are designed by people with a professional ethic.
3. Privacy can be automated away.
Automating security in your data pipeline
privacy
1. Strange History of Privacy
“The actio iniuriarum was, in Roman law,
a delict which served to protect the
non-patrimonial aspects of a person's
existence – who a person is rather than
what a person has.”
©1979 "The Invention of the Right to Privacy" by Dorothy J. Glancy
2. Privacy-first Ethic
Software is eating
the world.
“Audit defensibility is too low a
bar when it comes to our
customers’ privacy.”
GDPR
EU General Data Protection Regulation
● Right of access
● Pseudonymisation
● Right of erasure
● Records of processing activities
● Privacy by design
CCPA
California Consumer Privacy Act
● Know what personal information is being
collected
● Right to erasure
● Know whether their personal information is
being shared, and if so, with whom
● Opt-out of the sale of their personal
information
Privacy Regulation
3. Automate Privacy
“The security posture of your
weakest vendor is the security
posture of your entire
organization.”
● Airflow DAGs to move data into S3
and Redshift
● DAG: Directed Acyclic Graph
● Operator/Task: A node in the graph
● Airflow runs dbt
Workflow manager from Airbnb
Apache Airflow
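To make the DAG idea concrete, here is a minimal sketch that uses only Python's standard library rather than Airflow itself. The task names are hypothetical and only mirror the steps mentioned above (move data into S3 and Redshift, then run dbt); this is not how the Carta DAGs are actually defined.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each key is a task (a node in the graph) and its
# value is the set of tasks that must finish before it can start (the edges).
pipeline = {
    "extract_postgres": set(),
    "load_s3": {"extract_postgres"},
    "copy_redshift": {"load_s3"},
    "dbt_run": {"copy_redshift"},
    "dbt_test": {"dbt_run"},
}


def run_order(dag):
    """Return one valid execution order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
print(order)
```

Because the graph is acyclic, a scheduler can always find an order in which every task runs after its upstream tasks. A workflow manager adds scheduling, retries, and logging on top of that basic guarantee.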
Apache Airflow
● Open source boilerplate for running Airflow
in Docker
● Used at Carta
Dockerized Airflow
How do we keep up with the sensitive
columns being added in source data?
Automating the blacklist updates
Stale Blacklist
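One way to catch a stale blacklist is to compare the columns that exist in the source against the columns already reviewed. The sketch below is an illustration under assumptions: the name patterns, the column names, and the function are hypothetical, and the talk does not say how the actual check is implemented.

```python
import re

# Hypothetical name patterns that suggest a column holds personal data.
SENSITIVE_PATTERNS = [r"ssn", r"birth", r"email", r"phone", r"tax_id"]


def unreviewed_sensitive_columns(source_columns, blacklist):
    """Return columns that look sensitive but are not yet on the blacklist."""
    pattern = re.compile("|".join(SENSITIVE_PATTERNS), re.IGNORECASE)
    return sorted(
        col for col in source_columns
        if pattern.search(col) and col not in blacklist
    )


# Hypothetical source schema and current blacklist.
source_columns = {"users.id", "users.email", "users.birth_date", "users.plan"}
blacklist = {"users.email"}
print(unreviewed_sensitive_columns(source_columns, blacklist))
```

Run on a schedule, a check like this turns a newly added sensitive column into a visible failure instead of a silent leak. Name matching is only a heuristic, so it complements human review rather than replacing it.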
● dbt tests fail when the result set is
not empty.
● The records returned by dbt test
are the offending records.
Automated data tests
dbt test
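The pass/fail rule described above can be sketched without dbt. The example below uses an in-memory SQLite database as a stand-in for the warehouse; the table, the query, and the helper name are hypothetical.

```python
import sqlite3


def run_data_test(conn, sql):
    """Mimic the rule above: the test passes only if the query returns no rows."""
    offending = conn.execute(sql).fetchall()
    return len(offending) == 0, offending


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, None)],
)

# Hypothetical test: no plaintext email should reach this table.
passed, offending = run_data_test(
    conn, "SELECT id, email FROM users WHERE email IS NOT NULL"
)
print(passed, offending)
```

Writing the test as a query for bad rows means a failure comes with its own evidence: the rows returned are exactly the records to fix.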
We have a custom access management
system called Gatekeeper.
Tools for requesting and granting access
Automating Access
This example uses our IAM Service
Account custom Terraform module to
create a new Revenue Service account
user with access to a single S3 data lake
bucket.
Automate Data Lake access
Terraform Modules
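The Terraform module itself is internal and its code did not survive in these slides. As a rough illustration of the kind of policy such a module might attach to a service account, the sketch below builds an IAM policy document scoped to one bucket. The bucket name is hypothetical, and read-only access is an assumption.

```python
import json


def single_bucket_policy(bucket_name):
    """Build an IAM policy document granting read access to one S3 bucket only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


# Hypothetical bucket name.
policy = single_bucket_policy("example-revenue-data-lake")
print(json.dumps(policy, indent=2))
```

Generating the policy from a single parameter keeps every service account on the same least-privilege template, which is the main benefit of wrapping access in a reusable module.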
Data Warehouse Migrations
● sql-migrate: an excellent CLI and
migrations library written in Go.
● Extended to support Jinja
templating.
We can rebuild the Warehouse from code.
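A templated migration might look like the sketch below. It keeps the `-- +migrate Up` and `-- +migrate Down` markers that sql-migrate uses, but substitutes Python's `string.Template` for Jinja so that the example needs no third-party packages. The schema and group names are hypothetical.

```python
from string import Template

# string.Template stands in for Jinja here; the idea is the same: one
# migration file, rendered with different values per environment.
MIGRATION = Template("""\
-- +migrate Up
CREATE SCHEMA IF NOT EXISTS $schema;
GRANT USAGE ON SCHEMA $schema TO GROUP $group;

-- +migrate Down
REVOKE USAGE ON SCHEMA $schema FROM GROUP $group;
DROP SCHEMA IF EXISTS $schema;
""")


def render_migration(schema, group):
    """Fill in the template; raises KeyError if a placeholder is left unset."""
    return MIGRATION.substitute(schema=schema, group=group)


sql = render_migration("revenue", "analysts")
print(sql)
```

Keeping schemas and grants in versioned migrations is what makes it possible to rebuild the warehouse from code, and it leaves an audit trail of who was given access and when.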
Pseudonymity
Disguised identity or “false name”
©2019 Alex Ewerlöf "GDPR pseudonymization techniques"
Pseudonymity: Obfuscation
👍 Easy to do in any language.
👍 No impact to downstream systems.
👎 Can be unscrambled.
Scrambling or mixing up data
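A minimal sketch of why scrambling is weak: if the shuffle is driven by a seed, anyone who learns the seed can rebuild the original. The functions and the seed value are hypothetical and are not taken from the talk.

```python
import random


def scramble(value, seed):
    """Shuffle the characters of value with a seeded permutation."""
    order = list(range(len(value)))
    random.Random(seed).shuffle(order)
    return "".join(value[i] for i in order)


def unscramble(scrambled, seed):
    """Rebuild the original by regenerating the same permutation."""
    order = list(range(len(scrambled)))
    random.Random(seed).shuffle(order)
    original = [""] * len(scrambled)
    for position, source_index in enumerate(order):
        original[source_index] = scrambled[position]
    return "".join(original)


secret = "123-45-6789"
hidden = scramble(secret, seed=7)
print(hidden, unscramble(hidden, seed=7) == secret)
```

The scrambled value has the same length and character set as the original, which is why downstream systems are unaffected, and also why this offers little real protection.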
Pseudonymity: Masking
👍 Simple.
👍 Owner can verify the last 4 digits.
👎 Some pieces of the real data are stored.
Obscure part of the data
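Masking can be sketched in a few lines. The function name and the sample value are hypothetical; the rule shown (keep the last four digits, hide the rest) follows the slide.

```python
def mask_last4(value, mask_char="*"):
    """Mask every digit except the last four; keep separators such as dashes."""
    total_digits = sum(ch.isdigit() for ch in value)
    digits_seen = 0
    out = []
    for ch in value:
        if ch.isdigit():
            digits_seen += 1
            # Only the final four digits stay readable.
            out.append(ch if digits_seen > total_digits - 4 else mask_char)
        else:
            out.append(ch)
    return "".join(out)


print(mask_last4("123-45-6789"))  # ***-**-6789
```

Note that values with four or fewer digits pass through unmasked, which is one concrete form of the drawback above: part of the real data is always stored.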
Pseudonymity: Tokenization
👍 Popular libraries like Faker.
👍 All original data is replaced.
👎 No way to recover the original data.
Replace real data with fake data
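The sketch below shows the idea with random tokens from Python's standard library instead of Faker, so it runs without extra packages. The function and the `tok_` prefix are hypothetical. Repeated values get the same token within one batch, so joins and counts still work, but the mapping is discarded afterwards.

```python
import secrets


def tokenize(values):
    """Replace each distinct value with a random token; the mapping is not kept."""
    tokens = {}
    for value in values:
        if value not in tokens:
            tokens[value] = "tok_" + secrets.token_hex(8)
    return [tokens[value] for value in values]


emails = ["alice@example.com", "bob@example.com", "alice@example.com"]
tokenized = tokenize(emails)
print(tokenized)
```

Because the tokens are random and the mapping is thrown away, there is no way back to the original data, which is both the strength and the limitation noted above.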
Pseudonymity: Blurring
👍 95% of this image is left unblurred.
👎 Possible to reverse blurring.
Blur a subset of the data
Pseudonymity: Encryption
👍 The original data can be recovered.
👍 Manage fewer permissions downstream.
👎 Asymmetric vs Symmetric trade-offs.
Two-way transformation of the data
AWS Key Management Service
● Generate a new data key for encrypting and
decrypting data protected by a master key.
● Or manually rotate the master key and
re-encrypt the data.
Automate key creation and rotation
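Requesting a data key can be sketched as follows. The `generate_data_key` call and its `KeyId` and `KeySpec` parameters are part of the AWS KMS API as exposed by boto3; the wrapper function, the key alias, and the `FakeKms` stand-in are hypothetical and exist only so the example runs without AWS credentials.

```python
def new_data_key(kms_client, master_key_id):
    """Request a fresh data key that is protected by the given master key.

    Returns (plaintext_key, encrypted_key). Encrypt data with the plaintext
    key, then store only the encrypted copy next to the data.
    """
    response = kms_client.generate_data_key(KeyId=master_key_id, KeySpec="AES_256")
    return response["Plaintext"], response["CiphertextBlob"]


class FakeKms:
    """Stand-in for boto3.client("kms") so the sketch runs without AWS access."""

    def generate_data_key(self, KeyId, KeySpec):
        return {
            "KeyId": KeyId,
            "Plaintext": b"\x00" * 32,  # a real key would be random bytes
            "CiphertextBlob": b"encrypted-key",
        }


plaintext_key, encrypted_key = new_data_key(FakeKms(), "alias/example-master-key")
print(len(plaintext_key))
```

In real use, pass `boto3.client("kms")` instead of `FakeKms()`. This pattern is often called envelope encryption: the master key never leaves KMS, and only the data keys it protects are handled by the pipeline.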
Encrypted Columns
● pgcrypto allows us to encrypt sensitive
columns before the data lands in our S3
data lake.
● This example is encrypting the birth_date
column in Postgres.
Postgres pgcrypto
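The slide's original example did not survive extraction. As a stand-in, here is a minimal sketch that assumes symmetric encryption with pgcrypto's `pgp_sym_encrypt` function. The table name, the `id` column, and the helper function are hypothetical; only the `birth_date` column comes from the slide.

```python
def encrypt_column_sql(table, column):
    """Build a SELECT that encrypts one column with pgcrypto's pgp_sym_encrypt.

    The table and column names are interpolated directly, so they must be
    trusted constants from code, never user input. The key is passed as a
    query parameter and does not appear in the SQL text.
    """
    return (
        f"SELECT id, pgp_sym_encrypt({column}::text, %(key)s) AS {column} "
        f"FROM {table}"
    )


sql = encrypt_column_sql("users", "birth_date")
print(sql)
# With a driver such as psycopg2: cursor.execute(sql, {"key": data_key})
```

Encrypting inside the extraction query means the plaintext value never leaves Postgres, so nothing downstream of the database has to be trusted with it.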
“Last Mile” Decryption
● Access to encrypted columns is limited to
analysts with the encryption key.
● This example is decrypting the birth_date
column in Redshift.
Decrypt sensitive data at query time
Encrypted Column Problems
Some things to consider...
1. Symmetric or Asymmetric encryption scheme?
2. Should we manually rotate our master key?
3. How many keys should we use and how should they be organized?
4. Should our analysts and data scientists need to think about keys?
5. When and how do we re-encrypt data? For example, when an employee
with access to keys leaves the company?
3 Big Ideas
1. Privacy has a strange history.
2. Privacy-first systems are designed by people with a professional ethic.
3. Privacy can be automated away.
Automating security in your data pipeline
privacy
carta.com/jobs
@troyharvey
troy.harvey@carta.com

