Do Le Quoc, Franz Gregor, Jatinder Singh and Christof Fetzer
SGX-PySpark: Secure Distributed Data Analytics
Motivation
• Data analytics has become an important component
of modern cloud-based data-driven services
• Large-scale datasets processed by the service may
contain customers' sensitive information
• Customers need to trust both service providers and
cloud providers
• How can sensitive data be protected while preserving the utility of
data analytics?
Key idea
• Ensure confidentiality and integrity for both code and
data using trusted hardware, i.e., Intel Software Guard
Extensions (SGX)
• Execute only sensitive parts of data analytics inside
enclaves
• Encrypt input data; decrypt and securely process it
inside enclaves
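The encrypt-outside, decrypt-inside flow can be sketched in a few lines of Python. This is a minimal illustration only, not the SGX-PySpark implementation: the names `seal`, `unseal`, and `enclave_sum` are hypothetical, the "enclave" is an ordinary function, and the cipher is a standard-library stand-in (an HMAC-SHA256 keystream with encrypt-then-MAC) for an authenticated cipher such as AES-GCM that a real deployment would presumably use.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from HMAC-SHA256 in counter mode."""
    blocks = []
    counter = 0
    while 32 * len(blocks) < length:
        msg = nonce + counter.to_bytes(8, "big")
        blocks.append(hmac.new(key, msg, hashlib.sha256).digest())
        counter += 1
    return b"".join(blocks)[:length]


def seal(enc_key: bytes, mac_key: bytes, plaintext: bytes) -> bytes:
    """Data-owner side: encrypt a record, then append an integrity tag."""
    nonce = os.urandom(16)
    stream = _keystream(enc_key, nonce, len(plaintext))
    ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
    tag = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    return nonce + ciphertext + tag


def unseal(enc_key: bytes, mac_key: bytes, blob: bytes) -> bytes:
    """Trusted side: verify the tag first, then decrypt."""
    nonce, ciphertext, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    stream = _keystream(enc_key, nonce, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


def enclave_sum(enc_key: bytes, mac_key: bytes, sealed_rows: list) -> int:
    """Hypothetical stand-in for code running inside an enclave:
    plaintext exists only within this function."""
    return sum(int(unseal(enc_key, mac_key, row).decode()) for row in sealed_rows)
```

In this sketch the untrusted side only ever handles the output of `seal`; a modified record is rejected by `unseal` before any processing happens.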
Implementation
• PySpark: widely used in industry for big data analytics
• SCONE: enables unmodified applications to run inside
Intel SGX enclaves
• Execute Spark Driver and Python processes of PySpark
inside enclaves using SCONE
SGX-PySpark
• Objectives:
• Support complex operations for big data analytics
• Provide strong security guarantees
• Minimize performance overhead
• Support Python
• Architecture:
Evaluation
• Dataset: TPC-H Benchmark
• ~22% latency overhead compared to native PySpark execution
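An overhead figure of this kind is typically a relative slowdown, (secure latency - native latency) / native latency. The sketch below shows that arithmetic. The poster does not say whether its figure is a per-query mean or a ratio of total runtimes, so the averaging choice here is an assumption, and the latencies in `hypothetical_latencies` are made-up placeholders, not the measured results.

```python
def relative_overhead(secure_s: float, native_s: float) -> float:
    """Relative slowdown of the secure run over the native run."""
    if native_s <= 0:
        raise ValueError("native latency must be positive")
    return (secure_s - native_s) / native_s


def mean_overhead(latencies: dict) -> float:
    """Average the per-query overheads (one possible aggregation, assumed here)."""
    overheads = [relative_overhead(s, n) for s, n in latencies.values()]
    return sum(overheads) / len(overheads)


# Hypothetical (secure, native) latencies in seconds; NOT values from the poster.
hypothetical_latencies = {
    "Q1": (61.0, 50.0),
    "Q6": (24.0, 20.0),
}

for query, (secure_s, native_s) in hypothetical_latencies.items():
    print(f"{query}: {relative_overhead(secure_s, native_s):.1%}")
print(f"mean: {mean_overhead(hypothetical_latencies):.1%}")
```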
Demo
• GitHub repository: https://github.com/doflink/sgx-pyspark-demo
• Demo video: https://youtu.be/yI3iEFWUWbU
[Figure: latency in seconds (y-axis, 0 to 100) for TPC-H queries Q1, Q3, Q4, Q5, Q6, Q7, Q10, Q12, Q13, Q14, Q16, Q18, and Q19 (x-axis), comparing SGX-PySpark with native PySpark]


WWW19: SGX-PySpark: Secure Distributed Data Analytics
