Privacy-Preserving Schema Reuse 
Nguyen Quoc Viet Hung, Do Son Thanh, Nguyen Thanh Tam, and Karl Aberer 
EPFL, Switzerland
Schema Reuse 
[Figure: Query / Output / Contribute interactions with schema repositories (schema.org, factual.com)]
Traditional approach: shows all original schemas
Our approach: shows an anonymized (unified) schema
DASFAA Security, privacy & trust DASFAA | 04.2014 2
Motivation 
• Schema Reuse offers many benefits: 
– Reduce development complexity: 
• New schemas require small modifications → copy and adapt existing schemas
• Large repositories exist: schema.org, freebase.com, factual.com, niem.gov 
– Increase the interoperability: 
• Share common standard 
• But, privacy needs to be considered: 
– Leak schema information → Potential attack (e.g. SQL injection)
– Maintain competitiveness: some parts of schemas are the source of 
revenue and business strategy. 
Challenges 
• How to define privacy constraints? 
• How to define an anonymized schema 
from multiple schemas? 
• How to define a utility function for a 
certain anonymized schema? 
• How to find an anonymized schema 
that satisfies privacy constraints and 
maximizes the utility function? 
[Figure: Query, Anonymized Schema, Privacy constraints, Contributors]
Our approach: shows an anonymized (unified) schema
Challenge 1 – Define privacy constraints 
• Need to identify two elements 
– Sensitive information 
• Attributes 
– Privacy requirement 
• Prevent leaking provenance of sensitive attributes 
• Use presence constraint: 
A presence constraint $\gamma$ is a triple $\langle s, D, \theta \rangle$, where $s$ is a schema, $D$ is a set of attributes, and $\theta$ is a specified threshold. An anonymized schema $\hat{S}$ satisfies the presence constraint $\gamma$ if $\Pr(D \in s \mid \hat{S}) \le \theta$.
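The presence constraint can be sketched as a small data structure plus a check. This is a minimal illustration, not the authors' implementation: the slides do not say how $\Pr(D \in s \mid \hat{S})$ is computed, so the estimator is passed in as a parameter, and `visible_fraction` is a hypothetical stand-in used only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

AnonymizedSchema = FrozenSet[FrozenSet[str]]  # a set of abstract attributes


@dataclass(frozen=True)
class PresenceConstraint:
    schema: str                 # s: the contributing schema to protect
    attributes: FrozenSet[str]  # D: the sensitive attributes
    threshold: float            # theta: maximum tolerated probability


# Assumed signature for an estimator of Pr(D in s | anonymized schema).
Estimator = Callable[[str, FrozenSet[str], AnonymizedSchema], float]


def satisfies(c: PresenceConstraint, anonymized: AnonymizedSchema,
              estimate: Estimator) -> bool:
    """Return True if the estimated presence probability is at most theta."""
    return estimate(c.schema, c.attributes, anonymized) <= c.threshold


def visible_fraction(schema: str, attrs: FrozenSet[str],
                     anonymized: AnonymizedSchema) -> float:
    """Hypothetical stand-in estimator (NOT the probability model of the talk):
    the fraction of sensitive attributes still visible in the anonymized schema."""
    if not attrs:
        return 0.0
    visible = frozenset().union(*anonymized) if anonymized else frozenset()
    return len(attrs & visible) / len(attrs)
```

Keeping the estimator as a parameter means the check itself stays valid whatever probability model is eventually plugged in.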
Challenge 2 – Define anonymized schema 
• How to define “anonymized 
schema” given a set of schemas 
– Enough information to understand 
but not overwhelming 
• Anonymized schema contains a 
set of “abstract” attributes 
– An abstract attribute is a set of similar attributes
[Figure: original schemas with attributes Name, Num, Name, CC, Holder, CC, … are unified into an anonymized schema with the abstract attributes {Name, Holder} and {CC, Num}]
Challenge 3 – Define utility function 
• How to define utility function for a 
certain “anonymized schema” 
– Importance: sum of popularity of 
attributes 
• A schema that contains more popular 
attributes is better 
• An attribute that appears in more schemas is 
more popular 
– Completeness: number of abstract 
attributes 
• The more abstract attributes, the better 
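Let $\Sigma$ be the set of all possible anonymized schemas. The utility function $u: \Sigma \to \mathbb{R}$ measures the amount of information of each anonymized schema.
Utility function: $u(\hat{S}) = \mathit{importance}(\hat{S}) + \mathit{weight} \cdot \mathit{completeness}(\hat{S})$
[Figure: candidate anonymized schemas S1, S2, S3 built from the abstract attributes {Holder}, {CC}, {Name, Holder}, {CC, Num}, compared by importance and completeness]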
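The utility function above can be sketched in a few lines. This is an illustrative reading of the slide, under two assumptions: the popularity of an attribute is taken as the raw number of original schemas that contain it, and the three example schemas are a guess at the figure's content rather than data from the talk.

```python
from typing import Collection, FrozenSet, Sequence, Set

Schema = Set[str]
AbstractAttribute = FrozenSet[str]


def popularity(attribute: str, schemas: Sequence[Schema]) -> int:
    """Number of original schemas that contain the attribute (assumed definition)."""
    return sum(1 for s in schemas if attribute in s)


def importance(anonymized: Collection[AbstractAttribute],
               schemas: Sequence[Schema]) -> int:
    """Sum of the popularity of every attribute kept in the anonymized schema."""
    return sum(popularity(a, schemas) for abstract in anonymized for a in abstract)


def completeness(anonymized: Collection[AbstractAttribute]) -> int:
    """Number of abstract attributes in the anonymized schema."""
    return len(anonymized)


def utility(anonymized: Collection[AbstractAttribute],
            schemas: Sequence[Schema], weight: float = 1.0) -> float:
    """u(S) = importance(S) + weight * completeness(S)."""
    return importance(anonymized, schemas) + weight * completeness(anonymized)


# Hypothetical original schemas, loosely following the running example.
schemas = [{"Name", "Num"}, {"Name", "CC"}, {"Holder", "CC"}]
rich = [frozenset({"Name", "Holder"}), frozenset({"CC", "Num"})]
poor = [frozenset({"Holder"}), frozenset({"CC"})]
print(utility(rich, schemas), utility(poor, schemas))
```

With `weight = 1.0`, the richer schema scores higher on importance while both score the same on completeness, which matches the intuition that popular attributes carry more information.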
Challenge 4 – Optimization problem (1) 
Maximizing Anonymized Schema 
Given a schema group $S$ and a set of privacy constraints $\Gamma$, construct an anonymized schema $S^*$ such that $S^*$ satisfies all constraints in $\Gamma$ and has the highest utility value.
• NP‐Hard problem 
… 
Challenge 4 – Optimization problem (2) 
• Problem modeling 
– Schema group: Affinity matrix 
– Anonymized schema: Affinity instance 
• Affinity instance is an affinity matrix with some empty cells
[Figure: affinity matrix for schemas $s_1$ = (a1, a2), $s_2$ = (b1, b2), $s_3$ = (c1, c2), with rows (a1, b1, c1) and (a2, b2, c2). Example affinity instances and the anonymized schemas they represent: rows (a1, b1, empty) and (a2, b2, c2) = {a1, b1}, {a2, b2, c2}; rows (a1, b1, c1) and (empty, b2, empty) = {a1, b1, c1}, {b2}; …]
→ Need to find an affinity instance satisfying the privacy constraints and having the highest utility value
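The modeling step can be illustrated with a short sketch. It assumes, following the figure, that each row of the affinity matrix groups corresponding attributes and each column belongs to one schema; the function names and the use of `None` for an empty cell are choices made here, not taken from the talk.

```python
from typing import FrozenSet, List, Optional, Set

Row = List[Optional[str]]  # one cell per schema; None marks an empty cell


def is_instance_of(instance: List[Row], matrix: List[Row]) -> bool:
    """An affinity instance has the matrix's shape, and every cell is either
    empty or identical to the corresponding matrix cell."""
    if len(instance) != len(matrix):
        return False
    for inst_row, mat_row in zip(instance, matrix):
        if len(inst_row) != len(mat_row):
            return False
        if any(c is not None and c != m for c, m in zip(inst_row, mat_row)):
            return False
    return True


def to_anonymized_schema(instance: List[Row]) -> Set[FrozenSet[str]]:
    """Each non-empty row becomes one abstract attribute."""
    result: Set[FrozenSet[str]] = set()
    for row in instance:
        abstract = frozenset(cell for cell in row if cell is not None)
        if abstract:
            result.add(abstract)
    return result


matrix = [["a1", "b1", "c1"], ["a2", "b2", "c2"]]
instance = [["a1", "b1", None], ["a2", "b2", "c2"]]
print(is_instance_of(instance, matrix), to_anonymized_schema(instance))
```

Emptying a cell hides one attribute of one schema, so the search space of the optimization problem is the set of all ways to empty cells of the matrix.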
Challenge 4 – Optimization problem (4) 
• Overall solution: 
– Meta‐heuristic with 2 steps 
• Greedy algorithm: find a possible solution 
• Randomized local search: find optimal solution 
– Improve performance 
• Divide and conquer: partition the set of constraints into independent sets → satisfy each set independently
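The two-step meta-heuristic can be sketched generically. This is not the authors' algorithm, only one plausible shape for it: an affinity instance is modeled as the set of kept cells, and `feasible` and `utility` are hypothetical callbacks standing in for the privacy-constraint check and the utility function. The greedy step assumes that emptying enough cells always yields a feasible instance.

```python
import random
from typing import Callable, FrozenSet, Iterable, Tuple

Cell = Tuple[int, int]      # (row, column) of a kept cell in the affinity matrix
Instance = FrozenSet[Cell]  # an affinity instance = the set of non-empty cells


def greedy_feasible(all_cells: Iterable[Cell],
                    feasible: Callable[[Instance], bool],
                    utility: Callable[[Instance], float]) -> Instance:
    """Step 1: empty one cell at a time, always the one whose removal leaves
    the highest utility, until the privacy check passes."""
    current = set(all_cells)
    while current and not feasible(frozenset(current)):
        drop = max(sorted(current),
                   key=lambda c: utility(frozenset(current - {c})))
        current.remove(drop)
    return frozenset(current)


def local_search(start: Instance, all_cells: Iterable[Cell],
                 feasible: Callable[[Instance], bool],
                 utility: Callable[[Instance], float],
                 steps: int = 1000, seed: int = 0) -> Instance:
    """Step 2: toggle a random cell and keep the change only if the result
    is still feasible and strictly improves utility."""
    rng = random.Random(seed)
    cells = sorted(all_cells)
    best = start
    if not cells:
        return best
    for _ in range(steps):
        candidate = best ^ {rng.choice(cells)}  # add or remove one cell
        if feasible(candidate) and utility(candidate) > utility(best):
            best = candidate
    return best
```

A local search of this kind only returns the best instance it happens to visit, so the result is a heuristic answer rather than a guaranteed optimum for the NP-hard problem.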
Experiments - Setting 
Datasets: 
• Real data: 117 schemas 
• Synthetic data: vary the number of schemas and the number of attributes 
Evaluation Metrics: 
– Utility loss: measures the amount of utility reduction w.r.t. the existence of privacy constraints
• $\Delta u = \frac{u_\emptyset - u_\Gamma}{u_\emptyset}$, where $u_\emptyset$ is the utility without constraints and $u_\Gamma$ is the utility with a set of constraints $\Gamma$
– Privacy loss: measures the amount of disagreement between the actual privacy $P = \{p_i\}$ and the expected privacy $\Theta = \{\theta_i\}$
• $\Delta p = KL(P \parallel \Theta) = \sum_i p_i \log \frac{p_i}{\theta_i}$
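Both evaluation metrics are direct to compute. The sketch below assumes the natural logarithm for the KL divergence (the slide does not state the base) and treats a zero expected value with a non-zero actual value as infinite loss; both are conventions chosen here.

```python
import math
from typing import Sequence


def utility_loss(u_unconstrained: float, u_constrained: float) -> float:
    """Relative utility reduction: (u_empty - u_gamma) / u_empty."""
    if u_unconstrained == 0:
        raise ValueError("utility without constraints must be non-zero")
    return (u_unconstrained - u_constrained) / u_unconstrained


def privacy_loss(actual: Sequence[float], expected: Sequence[float]) -> float:
    """KL(P || Theta) = sum_i p_i * log(p_i / theta_i)."""
    if len(actual) != len(expected):
        raise ValueError("P and Theta must have the same length")
    total = 0.0
    for p, theta in zip(actual, expected):
        if p == 0:
            continue          # 0 * log(0 / theta) is taken as 0
        if theta == 0:
            return math.inf   # assumed convention for p > 0, theta = 0
        total += p * math.log(p / theta)
    return total


print(utility_loss(8.0, 6.0))            # utility dropped from 8 to 6
print(privacy_loss([0.5, 0.5], [0.5, 0.5]))
```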
Experiments – Computation Time 
• 100 schemas, 50 attributes, 1500 constraints → running time is about 6 s
[Figure: computation time (log2 of msec.)]
Experiment – Privacy & Utility 
• Validate the trade‐off between privacy and utility 
• Evaluation procedure 
– Relax constraint: increase the privacy threshold $\theta$ to $(1 + r)\theta$, where $r$ is the relaxing ratio
• Observation
– The more privacy is enforced, the greater the utility loss.
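Both utility loss and privacy loss are normalized to [0,1]:
$\Delta u = \frac{\Delta u - \min_{\Delta u}}{\max_{\Delta u} - \min_{\Delta u}}$
$\Delta p = \frac{\Delta p - \min_{\Delta p}}{\max_{\Delta p} - \min_{\Delta p}}$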
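The evaluation procedure combines threshold relaxation with min-max normalization of the measured losses. A minimal sketch, assuming the losses from all runs are collected in a list first; mapping a constant list to all zeros is a convention chosen here, since the formula is undefined when the maximum equals the minimum.

```python
from typing import List, Sequence


def relax(theta: float, r: float) -> float:
    """Relax a privacy threshold: theta -> (1 + r) * theta."""
    return (1 + r) * theta


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1] via (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumed convention for constant input
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw utility losses from three runs with different relaxing ratios.
print(min_max_normalize([2.0, 4.0, 6.0]))
print(relax(0.5, 0.2))
```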
Conclusion
• Introduced schema reuse with privacy constraints
• Defined privacy constraints
• Defined an anonymized schema from multiple schemas
• Defined a utility function for a certain anonymized schema
• Constructed an anonymized schema that satisfies privacy constraints and maximizes the utility function
Thank you! 
Questions
