Data Privacy Conference, Rootconf, 23-29 April 2021
Synthetic data generation
Sandeep Joshi
[ needl.ai ]
https://www.linkedin.com/in/sanjoshi/
Agenda
1. Introduction to the problem
2. Capturing variation within a column
3. Capturing dependence between columns
4. Masking fields
5. Summary
Background on needl.ai
Privacy requirement
Engineers should not access customers' data
But how do we test new features, especially ML-related ones?
Aim : Generate synthetic data from production data
Real data --> Synthetic data
Real data:
Name           Age  Gender  Response
Tsunami Singh  34   F       Yes
Pappu Pager    23   M       Maybe
Khokha Singh   53   F
Vasooli Bhai   21   M
Jagat Sahni    66   F       No
Synthetic data:
Name           Age  Gender  Response
John Smith     45   M       Yes
Jack Ryan      45   M
Jill Reacher   34   F       Maybe
Myles Togo     23   F
Bill Melater   18   M       No
Capture variation within a column
Name           Age  Gender  Response
               34
               23
               53
               21
               66
Capture correlation between columns
Name           Age  Gender  Response
               34   F
               23   M
               53   F
               21   M
               66   F
Age and Gender may be correlated
Mask actual names
Name           Age  Gender  Response
John May
April Smith
August Ryan
June Jackson
Money Spinner
SDV : Synthetic data vault
Package from MIT Data-to-AI group (https://sdv.dev/)
Single column variation
Capture variation within column
Name Age Gender Response
Tsunami Singh 34 F Yes
Pappu Pager 23 M Maybe
Khokha Singh 53 F
Vasooli Bhai 21 M
Jagat Sahni 66 F No
Statistics : density estimation problem
Age: 34, 23, 53, 21
Which distribution generated these values?
Parametric and non-parametric methods have been invented for this problem.
Density estimation
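Using scipy
from scipy.stats import gaussian_kde, beta
data = [5, 12, 22, 400, 800, ... ]
# Parametric: fit a beta distribution with fixed bounds (loc and scale must be chosen to cover the data)
beta.fit(data, floc=loc, fscale=scale)
# Non-parametric: kernel density estimate using Silverman's bandwidth rule
gaussian_kde(data, bw_method='silverman')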
Demo coming up...
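The two scipy calls on this slide can be tried end to end on the Age column from the earlier table. This is a minimal sketch, not the code from the demo notebook: the bounds loc=0 and scale=100 are assumptions chosen so that every age lies strictly inside the beta distribution's support, and the sample size and seeds are arbitrary.

```python
import numpy as np
from scipy.stats import beta, gaussian_kde

# Age column from the example table in the slides
ages = np.array([34, 23, 53, 21, 66], dtype=float)

# Assumed bounds (not from the slides): the beta distribution has finite
# support, so the data must lie strictly inside (loc, loc + scale).
loc, scale = 0.0, 100.0

# Parametric: estimate the two shape parameters with loc and scale held fixed
a, b, fitted_loc, fitted_scale = beta.fit(ages, floc=loc, fscale=scale)
beta_samples = beta.rvs(a, b, loc=fitted_loc, scale=fitted_scale,
                        size=1000, random_state=0)

# Non-parametric: Gaussian kernel density estimate, Silverman bandwidth.
# Note that KDE samples are not bounded and can fall outside the observed range.
kde = gaussian_kde(ages, bw_method='silverman')
kde_samples = kde.resample(1000, seed=0)[0]  # resample returns shape (1, n)
```

Either set of samples can stand in for the real Age column. The parametric fit respects the assumed bounds, while the KDE follows the observed values more closely but may produce implausible ages that need clipping.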
Dependence between fields
Capture correlation between columns
Name Age Gender Response
Tsunami Singh 34 F Yes
Pappu Pager 23 M Maybe
Khokha Singh 53 F
Vasooli Bhai 21 M
Jagat Sahni 66 F No
Find how Age and Gender are correlated
Correlation
Two fields can change in the same direction, change in opposite directions, or be unrelated
https://brianwhitworth.com/research-correlations/
Finding correlation with Pandas
import pandas as pd
columns_dict = {'height': [160, 156, 175, 180, 165, 143],
                'age': [64, 55, 46, 23, 22, 19]}
df = pd.DataFrame.from_dict(columns_dict)
# Pairwise Pearson correlation matrix of the numeric columns
print(df.corr().values)
Use Pandas df.corr()
Correlation : graphical view
correlation  age  height
age          1    0.7
height       0.7  1
How to generate correlated data
Use copulas (Gaussian copula, CopulaGAN, etc.)
https://en.wikipedia.org/wiki/Copula_(probability_theory)
Gaussian Copulas
Capture dependence between columns (Age vs Gender)
Transform individual columns (e.g. Age) while retaining their dependency
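The idea of transforming each column while retaining the dependency can be sketched with numpy and scipy alone. This is an illustrative stand-in, not SDV's implementation and not the demo notebook: the function names are hypothetical, the age/income data is made up for the example, and empirical quantiles are assumed for the per-column distributions.

```python
import numpy as np
from scipy import stats

def fit_gaussian_copula(data):
    """Estimate the copula correlation matrix from an (n, d) array."""
    n = data.shape[0]
    # Rank-transform each column to uniforms in (0, 1), then to normal scores
    u = stats.rankdata(data, axis=0) / (n + 1)
    z = stats.norm.ppf(u)
    return np.corrcoef(z, rowvar=False)

def sample_gaussian_copula(data, corr, size, seed=0):
    """Draw synthetic rows that keep each column's shape and the dependence."""
    rng = np.random.default_rng(seed)
    # Correlated standard normals carry the dependence structure
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=size)
    u = stats.norm.cdf(z)
    # Map each uniform back through that column's empirical quantile function
    cols = [np.quantile(data[:, j], u[:, j]) for j in range(data.shape[1])]
    return np.column_stack(cols)

# Made-up "real" data: income loosely increases with age
rng = np.random.default_rng(42)
age = rng.uniform(20, 70, size=500)
income = 1000 * age + rng.normal(0, 8000, size=500)
real = np.column_stack([age, income])

corr = fit_gaussian_copula(real)
synthetic = sample_gaussian_copula(real, corr, size=1000)
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
```

Comparing real_corr with synth_corr shows whether the dependence survived. Because the columns are mapped back through empirical quantiles, synthetic values never leave the observed range, which may or may not be desirable.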
Demo
https://github.com/sanjosh/machine_learning/blob/main/copulas/synthetic%20data%20generation.ipynb
1. Capture variation within a column
2. Capture dependence between columns
3. Generate data for all columns
Masking fields
Masking emails
SDV uses Faker library https://faker.readthedocs.io/
Masking names
SDV uses Faker library https://faker.readthedocs.io/
SDV (Synthetic Data Vault)
SDV model for a single table
SDV : custom constraints
https://github.com/sdv-dev/SDV/blob/master/tutorials/single_table_data/05_Handling_Constraints.ipynb
SDV : other features
1. Transformers for fields (convert categorical data to numbers)
2. Capture relations between tables
3. Time series
4. Metrics to compare synthetic with actual data
https://sdv.dev/SDV/index.html
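To give a feel for item 4, comparing synthetic with actual data, here is a small sketch based on the two-sample Kolmogorov-Smirnov test from scipy. It is not SDV's metrics API; the function name ks_similarity and all three data sets are made up for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Made-up columns: one "real", one synthetic copy that matches its
# distribution, and one that clearly does not
real_age = rng.normal(40, 12, size=500)
good_synth = rng.normal(40, 12, size=500)
bad_synth = rng.uniform(18, 90, size=500)

def ks_similarity(real, synthetic):
    """Return 1 - KS statistic; values near 1.0 mean similar distributions."""
    return 1.0 - ks_2samp(real, synthetic).statistic

good_score = ks_similarity(real_age, good_synth)
bad_score = ks_similarity(real_age, bad_synth)
```

A per-column score like this only checks single-column variation. Dependence between columns needs a separate check, for example comparing correlation matrices of the real and synthetic tables.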
Conclusion
SDV is a versatile tool
Good
1. Modular : can use parts of the framework (different git repos)
2. Usable with less data, unlike “deep learning”-based solutions (SDV does support GANs)
3. It's explainable (can debug or modify the output)
Issues
1. Difficult to add a custom transformer (no code samples)
2. Does not solve synthetic text generation problem (NLG)
3. Does not solve synthetic graph generation
Questions?
