High-accuracy ML & AI
over sensitive data
Simeon Simeonov, Swoop
@simeons / sim at swoop dot com
omni-channel marketing for your ideal population
supported by privacy-preserving ML/AI
e.g., we improve health outcomes by increasing the
diagnosis rate of rare diseases through doctor/patient education
Swoop & IPM.ai data for 300+M people
• Anonymized patient data
• Online activity
• Imprecise location data
• Demographics, psychographics, purchase behavior, …
Privacy by design: HIPAA-compliant prAIvacy™ platform.
Trusted by the largest pharma companies. GDPR compliant.
Privacy-preserving computation frontiers
• Stochastic
– Differential privacy
• Encryption-based
– Fully homomorphic encryption
• Protocol-based
– Secure multi-party computation (SMC)
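As a concrete instance of the stochastic frontier, a differentially private count can be sketched with the Laplace mechanism. This is a minimal illustration, not part of the talk: the names `LaplaceSketch`, `laplace` and `noisyCount` are hypothetical, and it assumes a counting query with sensitivity 1.

```scala
import scala.util.Random

object LaplaceSketch {
  // Draws Laplace(0, scale) noise as the difference of two Exponential(1) samples.
  def laplace(scale: Double, rng: Random): Double = {
    // nextDouble() is in [0, 1), so the log argument stays in (0, 1].
    val e1 = -math.log(1.0 - rng.nextDouble())
    val e2 = -math.log(1.0 - rng.nextDouble())
    scale * (e1 - e2)
  }

  // A counting query has sensitivity 1 (one person changes the count by at most 1),
  // so the noise scale is 1 / epsilon. Smaller epsilon means more noise.
  def noisyCount(trueCount: Long, epsilon: Double, rng: Random): Double =
    trueCount + laplace(1.0 / epsilon, rng)
}
```

Each individual answer is perturbed, while the average over many independent answers stays close to the true count.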
When privacy-preserving algorithms are immature,
sanitize the data the algorithms are trained on
Privacy concerns stem from identifiability
• Direct (via personally-identifiable information)
• Indirect (via quasi-identifiers)
Sim Simeonov; Male; July 7, 1977
One Swoop Way, Cambridge, MA 02140
Addressing identifiability in a single dataset
• Direct
– Generate secure pseudonymous identifiers
– Often uses clean room to process PII
• Indirect
– Sanitize quasi-identifiers to desired anonymity trade-offs
– Control data enhancement to maintain anonymity
anonymity == indistinguishability
Sanitizing quasi-identifiers
• Deterministic
– Generalize or suppress quasi-identifiers
– k-anonymity + derivatives
• any given record maps onto at least k-1 other records
• Stochastic
– Add noise to data
– (k, ε)-anonymity
• Domain-specific
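The k-anonymity condition above can be checked by grouping records on their quasi-identifier values. The sketch below is illustrative only: the record shape `Rec` and the choice of gender, year of birth and 3-digit zip as quasi-identifiers are assumptions, not the talk's schema.

```scala
object KAnonymitySketch {
  // Hypothetical record: three quasi-identifiers plus one sensitive attribute.
  final case class Rec(gender: String, yob: Int, zip3: String, diagnosis: String)

  // Sizes of the groups of records that share identical quasi-identifier values.
  def groupSizes(records: Seq[Rec]): Iterable[Int] =
    records.groupBy(r => (r.gender, r.yob, r.zip3)).values.map(_.size)

  // k-anonymous if every record is indistinguishable from at least k-1 others,
  // i.e. every group has at least k members.
  def isKAnonymous(records: Seq[Rec], k: Int): Boolean =
    groupSizes(records).forall(_ >= k)
}
```

A single record with a unique combination of quasi-identifiers is enough to break k-anonymity for any k greater than 1, which is why generalization or suppression is applied until every group is large enough.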
Addressing identifiability across datasets
• Centralized approach
– Join all data + sanitize the whole
– Big increase in dimensionality
• Federated approach
– Keep data separate + sanitize operations across data
– Smallest possible increase in dimensionality
We show that when the data contains a large number of attributes which may be
considered quasi-identifiers, it becomes difficult to anonymize the data without an
unacceptably high amount of information loss. ... we are faced with ... either
completely suppressing most of the data or losing the desired level of anonymity.
On k-Anonymity and the Curse of Dimensionality
2005 Aggarwal, C. @ IBM T. J. Watson Research Center
Centralized sanitization hurts ML/AI accuracy
We find that for privacy budgets effective at preventing attacks,
patients would be exposed to increased risk of stroke,
bleeding events, and mortality.
Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing
2014 Fredrikson, M. et al. @ UW Madison and Marshfield Clinic Research Foundation
Centralized sanitization increases risk
[Figure: k-anonymizing Titanic passenger survivability. Normalized Certainty Penalty (NCP), 0% to 40%, plotted against k = 2 to 10 for two quasi-identifier sets: age, and gender & age.]
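For a numeric quasi-identifier such as age, one common definition of NCP is the width of the generalized interval divided by the width of the attribute's full domain. The sketch below assumes that definition and a single numeric attribute; the names `NcpSketch`, `Interval`, `ncp` and `meanNcp` are hypothetical.

```scala
object NcpSketch {
  // A generalized numeric value: the interval that replaced the original value.
  final case class Interval(lo: Double, hi: Double)

  // NCP of one generalized value: interval width relative to the full domain width.
  // 0 means no information loss; 1 means the value was generalized to the whole domain.
  def ncp(value: Interval, domain: Interval): Double =
    if (domain.hi == domain.lo) 0.0
    else (value.hi - value.lo) / (domain.hi - domain.lo)

  // Dataset-level NCP for a single attribute: the mean over all records.
  def meanNcp(values: Seq[Interval], domain: Interval): Double =
    if (values.isEmpty) 0.0 else values.map(ncp(_, domain)).sum / values.size
}
```

For example, with ages spanning 0 to 80, generalizing an age to the range 20 to 30 costs 10 / 80 = 12.5%, while leaving an age untouched costs 0%.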
Federated sanitization: Swoop’s prAIvacy™
• Secure, isolated data pools
• Automated sanitization
• Min dimensionality growth
• Deterministic + stochastic
• Optimal + often lossless
Model condition X
score on other data
Putting it all into practice (using Spark)
• Pre-process data
• Generate secure pseudonymous identifiers
• Sanitize quasi-identifiers
dirty quasi-identifiers increase distinguishability:
clean data before sanitization
to prevent increased sanitization loss
no anonymization framework for unstructured data:
suppress or structure
Word embedding for text anonymization
• Text ➞ high-dimensionality vector
– Capture semantics
“Texas” + “Milwaukee” – “Wisconsin” ≃ “Dallas”
– ML/AI-friendly representation
– word2vec, doc2vec, GloVe, …
• Anonymizing embeddings
– Train secret embeddings model
– Add noise to vectors
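The "add noise to vectors" step can be sketched as follows. This assumes independent Gaussian noise per component, which is one possible choice and not necessarily what the talk's platform uses; `EmbeddingNoiseSketch`, `addNoise` and `cosine` are hypothetical names.

```scala
import scala.util.Random

object EmbeddingNoiseSketch {
  // Adds independent Gaussian noise with standard deviation sigma to every component.
  def addNoise(vector: Array[Double], sigma: Double, rng: Random): Array[Double] =
    vector.map(x => x + sigma * rng.nextGaussian())

  // Cosine similarity, used here to see how much of the direction survives the noise.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }
}
```

The trade-off is controlled by sigma: a small value keeps the noisy vector close to the original in cosine similarity, and a larger value hides more at the cost of model accuracy.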
Secure pseudonymous ID generation
Sim Simeonov; Male; July 7, 1977
One Swoop Way, Cambridge, MA 02140
Sim Simeonov; M; 1977-07-07
One Swoop Way, Suite 305, Cambridge, MA 02140
...
Sim|Simeonov|M|1977-07-07|02140 // consistent serialization
8daed4fa67a07d7a5 … 6f574021 // secure destructive hashing (SHA-xxx)
gPGIoVw … wnNpij1LveZRtKeWU= // master encryption (AES-xxx)
Vw50jZjh6BCWUzSVu … mfUFtyGZ3q // partner A encryption
6ykWEv7A2lisz8KUi … VT2ZddaOeML // partner B encryption
Multiple IDs for dirty data
Sim|Simeonov|M|1977-07-07|02140 // full entry when data is clean
S|S551|M|1977-07-07|02140 // fuzzify names to handle limited entry & typos
Sim|Simeonov|M|1977-07|02140 // also may reduce dob/geo accuracy
tune fuzzification to use cases & desired FP/FN rates
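The three serializations above can be reproduced without Spark. The sketch below is a plain-Scala illustration: `FuzzyIdSketch`, `Pii` and `candidateKeys` are hypothetical names, and the `soundex` function is a hand-written American Soundex that is assumed (not verified) to agree with Spark's built-in `soundex` on ordinary ASCII names.

```scala
object FuzzyIdSketch {
  // Soundex digit for each coded consonant; vowels, Y, H and W have no code.
  private val codes: Map[Char, Char] =
    Seq("BFPV" -> '1', "CGJKQSXZ" -> '2', "DT" -> '3', "L" -> '4', "MN" -> '5', "R" -> '6')
      .flatMap { case (letters, digit) => letters.toSeq.map(c => c -> digit) }
      .toMap

  def soundex(name: String): String = {
    val letters = name.toUpperCase.filter(c => c >= 'A' && c <= 'Z')
    if (letters.isEmpty) ""
    else {
      val sb = new StringBuilder().append(letters.head)
      var last: Option[Char] = codes.get(letters.head)
      for (c <- letters.tail if sb.length < 4) {
        codes.get(c) match {
          case Some(d) =>
            if (!last.contains(d)) sb.append(d) // collapse adjacent identical codes
            last = Some(d)
          case None =>
            if (c != 'H' && c != 'W') last = None // vowels separate codes; H and W do not
        }
      }
      sb.toString.padTo(4, '0')
    }
  }

  final case class Pii(firstName: String, lastName: String,
                       gender: String, dob: String, zip: String)

  // One key per rule: full entry, fuzzified names, month-level date of birth.
  def candidateKeys(p: Pii): Seq[String] = Seq(
    Seq(p.firstName, p.lastName, p.gender, p.dob, p.zip),
    Seq(p.firstName.take(1), soundex(p.lastName), p.gender, p.dob, p.zip),
    Seq(p.firstName, p.lastName, p.gender, p.dob.take(7), p.zip)
  ).map(_.mkString("|"))
}
```

With the fuzzified rule, a misspelling such as "Simeonof" still yields `S|S551|...`, so the two entries match. The price is more false positives, which is the FP/FN trade-off to tune per use case.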
Build pseudonymous IDs with Spark
(and sanitize PII-based quasi-identifiers)
We need a few user-defined functions
• Strong secure hash function with very few collisions
– sha256(data) computes SHA-256
• Strong symmetric key encryption
– aes_encrypt(data, secret) in Hive but not ported to Spark
– aes__encrypt(data, secret) is a UDF to avoid name conflict
• Demo sugar to build secrets from pass phrases
– secret(pass_phrase)
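The helpers listed above can be approximated with the JDK alone. The sketch below is not the talk's UDF code: the object name, the key derivation in `secret` (first 16 bytes of the SHA-256 of the pass phrase) and the use of AES in ECB mode are assumptions chosen so that equal inputs always produce equal outputs, which is what keeps pseudonymous IDs joinable.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest
import java.util.Base64
import javax.crypto.Cipher
import javax.crypto.spec.SecretKeySpec

object PseudoIdSketch {
  // SHA-256 digest rendered as lowercase hex.
  def sha256(data: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(data.getBytes(UTF_8))
      .map(b => "%02x".format(b))
      .mkString

  // Demo-only key derivation (assumption): first 16 bytes of SHA-256(passPhrase) as an AES-128 key.
  def secret(passPhrase: String): SecretKeySpec = {
    val digest = MessageDigest.getInstance("SHA-256").digest(passPhrase.getBytes(UTF_8))
    new SecretKeySpec(digest.take(16), "AES")
  }

  // Deterministic AES (ECB mode, assumption): the same input and key always give the same output.
  def aesEncrypt(data: Array[Byte], key: SecretKeySpec): Array[Byte] = {
    val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")
    cipher.init(Cipher.ENCRYPT_MODE, key)
    cipher.doFinal(data)
  }

  // Serialize -> hash -> encrypt with the master secret.
  def masterId(serialized: String, masterPassPhrase: String): Array[Byte] =
    aesEncrypt(sha256(serialized).getBytes(UTF_8), secret(masterPassPhrase))

  // Re-encrypt the master ID with a partner-specific secret and Base64-encode it.
  def partnerId(master: Array[Byte], partnerPassPhrase: String): String =
    Base64.getEncoder.encodeToString(aesEncrypt(master, secret(partnerPassPhrase)))
}
```

Determinism is deliberate here, but ECB mode leaks equality of plaintext blocks and is a poor fit for general-purpose encryption. A production system would also use a proper key derivation function and managed keys rather than pass phrases.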
Let’s create some PII
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

case class PII(firstName: String, lastName: String,
               gender: String, dob: String, zip: String)
val sim = PII("Sim", "Simeonov", "M", "1977-07-07", "02140")
val ids = spark.createDataset(Seq(sim))
Consistent serialization
import org.apache.spark.sql.functions._ // lit, concat, upper, soundex, ...

val p = lit("|") // just a pipe symbol to save us typing
lazy val idRules = Seq(
  // Rule 1: Use all PII
  concat(upper('firstName), p, upper('lastName), p, 'gender, p, 'dob, p, 'zip),
  // Rule 2: Use only first initial of first name and soundex of last name
  concat(upper('firstName.substr(1, 1)), p, soundex(upper('lastName)), p,
    'gender, p, 'dob, p, 'zip)
)
Hash & encrypt
// The pseudonymous ID columns built from the rules
lazy val psids = {
  val masterPassword = "Master Password" // master password to encrypt IDs with
  // Serialize -> Hash -> Encrypt
  idRules.zipWithIndex.map { case (serialization, idx) =>
    aes__encrypt(sha256(serialization), secret(lit(masterPassword)))
      .as(s"psid${idx + 1}")
  }
}
PII-based quasi-identifiers
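import org.apache.spark.sql.Column
import org.apache.spark.sql.types.IntegerType

// Generalization of quasi-identifying columns
lazy val quasiIdCols: Seq[Column] = Seq(
  'gender,
  'dob.substr(1, 4).cast(IntegerType).as("yob"), // only year of birth
  // only first 3 digits of zip (the integer cast drops leading zeros: 021 becomes 21)
  'zip.substr(1, 3).cast(IntegerType).as("zip3")
)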
Generate master IDs
// Master pseudonymous IDs
lazy val masterIds = ids.select(quasiIdCols ++ psids: _*)
Generate per partner IDs
val partnerPasswords = Map("A" -> "A Password", "B" -> "B Password")
val partnerIds = spark.createDataset(partnerPasswords.toSeq)
  .toDF("partner_name", "pwd").withColumn("pwd", secret('pwd))
  .crossJoin(masterIds)
  .transform { df =>
    psids.indices.foldLeft(df) { case (current, idx) =>
      val colName = s"psid${idx + 1}"
      current.withColumn(colName, base64(aes__encrypt(col(colName), 'pwd)))
    }
  }
  .drop("pwd")
The end result
Sanitizing quasi-identifiers in Spark
• Optimal k-anonymity is an NP-hard problem
– Mondrian algorithm: greedy O(n log n) approximation
• https://github.com/eubr-bigsea/k-anonymity-mondrian
• Active research
– Locality-sensitive hashing (LSH) improvements
– Risk-based approaches (e.g., LBS algorithm)
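The greedy idea behind Mondrian can be sketched in a few lines: recursively cut the data at the median of the widest quasi-identifier, as long as both halves keep at least k records. The code below is a simplified, single-machine illustration with hypothetical names (`MondrianSketch`, `Region`, `anonymize`). It handles numeric attributes only and compares raw rather than normalized spans, and it is unrelated to the linked Spark implementation.

```scala
object MondrianSketch {
  type Point = Vector[Double]

  // A generalized group: one (min, max) range per quasi-identifier, plus the group size.
  final case class Region(ranges: Vector[(Double, Double)], size: Int)

  def anonymize(points: Seq[Point], k: Int): Seq[Region] = {
    require(k >= 1 && points.size >= k, "need at least k points")
    val dims = points.head.indices

    def span(ps: Seq[Point], d: Int): Double = ps.map(_(d)).max - ps.map(_(d)).min

    def split(ps: Seq[Point]): Seq[Region] = {
      // Try dimensions from widest to narrowest and take the first allowable median cut.
      val cut = dims.sortBy(d => -span(ps, d)).iterator.map { d =>
        val sorted = ps.sortBy(_(d))
        val median = sorted(sorted.size / 2)(d)
        sorted.partition(_(d) < median)
      }.find { case (left, right) => left.size >= k && right.size >= k }

      cut match {
        case Some((left, right)) => split(left) ++ split(right)
        case None =>
          // No allowable cut: publish the bounding ranges of this group.
          val ranges = dims.map(d => (ps.map(_(d)).min, ps.map(_(d)).max)).toVector
          Seq(Region(ranges, ps.size))
      }
    }

    split(points)
  }
}
```

Every returned region contains at least k records, so replacing each record's quasi-identifiers with its region's ranges yields a k-anonymous table.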
Interested in challenging data engineering, ML & AI on petabytes of data?
I’d love to hear from you. @simeons / sim at swoop dot com
https://databricks.com/session/great-models-with-great-privacy-optimizing-ml-ai-under-gdpr
https://databricks.com/session/the-smart-data-warehouse-goal-based-data-production
https://swoop-inc.github.io/spark-records/
Privacy matters. Thank you for caring.
