Protecting the Data Lake
A bit about Ash Narkar!
@ashtalk
Agenda
● Data Lake Overview
● Open Policy Agent
○ Community
○ Features
○ Use Cases
● Use case deep dive
○ Ceph Data Protection
Data is King!
● Pervasive
● Abundant
● Customer Experience
● Revenue Growth
● Cyber Attacks
● Breaches
● Fines
● Loss of Customer Trust
What Is A Data Lake?
Data Lake Features
● Centralized Content
● Scalability
● Multiple data type support
● Resource optimization
Data Lake Platform
[Diagram: data flows from Sources into the data lake platform and out to Consumers.]
Data Lake Platform: Kafka
Features
● Distributed streaming platform
● Building real-time streaming data pipelines and applications
Security Challenges
● Authorization relies on Access Control Lists (ACLs)
● How to authorize requests based on context, such as user, IP address, or the common name in a certificate?
Security Policies
● Consumers of topics containing PII must be whitelisted
● Producers to topics with high fanout must be whitelisted
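
A minimal sketch of how the two Kafka policies above could be expressed as a single check. The topic names, tags, and whitelists are hypothetical and do not come from the talk; in the talk's setting this logic would live in an OPA policy rather than in application code.

```python
# Hypothetical topic metadata and whitelists, for illustration only.
TOPIC_TAGS = {
    "customer-profiles": {"pii"},
    "click-stream": {"high-fanout"},
    "build-logs": set(),
}
PII_CONSUMER_WHITELIST = {"billing-service"}
HIGH_FANOUT_PRODUCER_WHITELIST = {"web-frontend"}


def allow_kafka(operation: str, principal: str, topic: str) -> bool:
    """Return True if the principal may perform the operation on the topic."""
    tags = TOPIC_TAGS.get(topic, set())
    # Consumers of topics containing PII must be whitelisted.
    if operation == "consume" and "pii" in tags:
        return principal in PII_CONSUMER_WHITELIST
    # Producers to topics with high fanout must be whitelisted.
    if operation == "produce" and "high-fanout" in tags:
        return principal in HIGH_FANOUT_PRODUCER_WHITELIST
    # Assumption: topics without a restricted tag stay open in this sketch.
    return True
```

Keeping the whitelists as data, separate from the check itself, mirrors the policy/data split that the rest of the talk builds on.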
Data Lake Platform: Ceph
Features
● Unified distributed storage system
● Delivers object, block, and file storage
Security Challenges
● Security protocol covers only Ceph clients and servers, NOT human users or applications 😔
Security Policies
● Users can access only buckets in the same geographical region as themselves
● Access based on a user’s Business Unit, Department, etc.
Data Lake Platform: Elasticsearch
Features
● Full-text search and analytics engine
● Store, search and analyze
Security Challenges
● Authorization is not considered part of Elasticsearch's job
● Users are responsible for implementing access control
Security Policies
● Access control policies for a patient’s PHI
Security Challenge Overview
● Distinct systems
● Changing security requirements
❌ Hardcoding policy
❌ Tight coupling
✅ Expressiveness
✅ Speed and performance
✅ Unified Solution
Who can solve the Security Challenge?
What Is OPA?
OPA: Community
Inception
● Project started in 2016 at Styra.
Goal
● Unify policy enforcement across the stack.
Use Cases
● Admission control
● Authorization
● ACLs
● RBAC
● IAM
● ABAC
● Risk management
● Data Protection
● Data Filtering
Users
● Netflix, Medallia, Chef, Cloudflare, State Street, Pinterest, Intuit, Capital One, ...and many more.
Today
● CNCF project (Incubating)
● 59 contributors
● 800+ Slack members
● 2000+ stars
● 20+ integrations
OPA: General-purpose policy engine
[Diagram: a Request reaches the Service, which sends a Query to OPA and receives a Decision. OPA evaluates Policy (Rego) against Data (JSON).]
OPA: Features
● Declarative Policy Language (Rego)
○ Can user X do operation Y on resource Z?
○ What invariants does workload W violate?
○ Which records should bob be allowed to see?
● Library, sidecar, host-level daemon
○ Policy and data are kept in-memory
○ Zero decision-time dependencies
● Management APIs for control & observability
○ Bundle service API for sending policy & data to OPA
○ Status service API for receiving status from OPA
○ Log service API for receiving audit log from OPA
● Tooling to build, test, and debug policy
○ opa run, opa test, opa fmt, opa deps, opa check, etc.
○ VS Code plugin, Tracing, Profiling, etc.
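
A sketch of what the Query/Decision exchange might look like when OPA runs as a sidecar or daemon and the service talks to it over OPA's REST Data API (`POST /v1/data/<path>` with an `input` document). The address assumes OPA's default port 8181 on localhost, and the policy path `httpapi/authz/allow` is a hypothetical name, not one from the talk.

```python
import json
import urllib.request

# Assumptions: OPA listens on localhost:8181 (its default port) and the
# policy decision lives at the hypothetical path "httpapi/authz/allow".
OPA_URL = "http://localhost:8181"
POLICY_PATH = "httpapi/authz/allow"


def build_query(input_doc: dict) -> urllib.request.Request:
    """Build a POST /v1/data/<path> request carrying the input document."""
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        url=f"{OPA_URL}/v1/data/{POLICY_PATH}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_decision(response_body: bytes) -> bool:
    """Treat a missing or non-true "result" as a deny."""
    decision = json.loads(response_body)
    return decision.get("result") is True
```

To send the query, the service would call `urllib.request.urlopen(build_query(...)).read()` and pass the bytes to `parse_decision`. Denying when `result` is absent is a design choice in this sketch: OPA omits the key when the decision is undefined, and failing closed is the safer default.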
How does OPA work?
[Diagram: a Request reaches Salary Service V1, which sends a Query to OPA and receives a Decision. OPA evaluates Policy (Rego) against Data (JSON).]
Example policy
"Employees can read their own salary and the salary of anyone they manage."
Input Data
method: "GET"
path: ["salary", "bob"]
user: "bob"
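
The example policy can be sketched in plain Python to show the decision logic OPA would evaluate against this input. The real policy would be written in Rego; the reporting data below (`MANAGES`) is a hypothetical stand-in for the JSON data loaded into OPA, since the slide shows only the input document.

```python
# Hypothetical reporting data: manager -> employees they manage.
MANAGES = {"alice": ["bob"]}


def allow_salary_read(input_doc: dict, manages: dict) -> bool:
    """Employees can read their own salary and that of anyone they manage."""
    method = input_doc.get("method")
    path = input_doc.get("path", [])
    user = input_doc.get("user")
    # Only GET /salary/<employee> is covered by this policy.
    if method != "GET" or len(path) != 2 or path[0] != "salary":
        return False
    employee = path[1]
    if employee == user:
        return True
    return employee in manages.get(user, [])


# The input from the slide: bob reads his own salary.
slide_input = {"method": "GET", "path": ["salary", "bob"], "user": "bob"}
decision = allow_salary_read(slide_input, MANAGES)
```

With the slide's input the decision is an allow, because the requested salary belongs to the requesting user.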
3 Steps to OPA
Step 1: Clone OPA Repo
Step 2: Build OPA binary
Step 3: Execute OPA binary
Use Cases
[Diagram: a cloud deployment labeled CLOUD, Host, DB, sshd, App, Container, HTTP API, Orchestrator, and Linux, annotated with the OPA use cases below.]
● Microservice APIs
● Admission Control
● Container Execution, SSH, sudo
● Risk Management
● Data Protection and Data Filtering
OPA Use Case: Ceph Data Protection
Ceph Architecture
[Diagram: RBD, CEPH FS, and RADOSGW sit on top of LIBRADOS, which sits on top of the CEPH STORAGE CLUSTER (RADOS).]
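Ceph Data Protection: Setup
[Diagram: an incoming HTTP request hits the S3 API of RADOSGW, which fronts RADOS. RADOSGW sends the user, method, and bucket name to OPA, exposed through a Node Port Service, and receives Allow / Deny. OPA evaluates Policy (Rego) against Data (JSON).]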
Example policy
"Users can access only those buckets belonging to the same geographical region as them."
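
A sketch of the decision this policy implies, written in Python for illustration. In the setup described in the talk the rule would be a Rego policy inside OPA; the input field names (`user`, `method`, `bucket`) follow the slide's "user, method, bucket name", while the region tables and their contents are hypothetical.

```python
# Hypothetical region data that would be loaded into OPA as JSON.
USER_REGIONS = {"alice": "us", "bob": "eu"}
BUCKET_REGIONS = {"us-reports": "us", "eu-reports": "eu"}


def allow_bucket_access(input_doc: dict, user_regions: dict, bucket_regions: dict) -> bool:
    """Allow only when the user's region matches the bucket's region."""
    user_region = user_regions.get(input_doc.get("user"))
    bucket_region = bucket_regions.get(input_doc.get("bucket"))
    # Unknown users or buckets are denied rather than guessed at.
    if user_region is None or bucket_region is None:
        return False
    return user_region == bucket_region
```

The method is carried in the input but unused here, because the policy as stated restricts access by region regardless of the operation.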
Demo: Ceph Data Protection
https://katacoda.com/styra
Data is King!
● Pervasive
● Abundant
● Customer Experience
● Revenue Growth
● Cyber Attacks
● Breaches
● Fines
● Loss of Customer Trust
Thank You!
github.com/open-policy-agent/opa
openpolicyagent.org
slack.openpolicyagent.org
Booth S20
