Privacy-Preserving Schema Reuse

Nguyen Quoc Viet Hung, Do Son Thanh, Nguyen Thanh Tam, and Karl Aberer
EPFL, Switzerland

Schema Reuse
[Figure: users query a schema repository (e.g. schema.org, factual.com), receive output, and contribute schemas.]
• Traditional approach: shows all original schemas
• Our approach: shows an anonymized (unified) schema
DASFAA | Security, privacy & trust | 04.2014

Motivation
• Schema Reuse offers many benefits:
– Reduce development complexity:
• New schemas require only small modifications
→ copy and adapt existing schemas
• Large repositories exist: schema.org, freebase.com, factual.com, niem.gov
– Increase interoperability:
• Share common standards
• But privacy needs to be considered:
– Schema information may leak
→ potential attacks (e.g. SQL injection)
– Maintain competitiveness: some parts of schemas are the source of
revenue and business strategy.

Challenges
• How to define privacy constraints?
• How to define an anonymized schema
from multiple schemas?
• How to define a utility function for a
certain anonymized schema?
• How to find an anonymized schema
that satisfies privacy constraints and
maximizes the utility function?
[Figure: contributors supply schemas and privacy constraints; a query returns an anonymized (unified) schema.]

Challenge 1 – Define privacy constraints
• Need to identify two elements
– Sensitive information
• Attributes
– Privacy requirement
• Prevent leaking provenance of sensitive attributes
• Use presence constraint:
A presence constraint $\gamma$ is a triple $\langle s, D, \theta \rangle$, where $s$ is a schema, $D$ is a
set of attributes, and $\theta$ is a specified threshold. An anonymized schema $\hat{S}$
satisfies the presence constraint $\gamma$ if $\Pr(D \in s \mid \hat{S}) \le \theta$.
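
The definition above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the slides do not say how $\Pr(D \in s \mid \hat{S})$ is computed, so the estimator is passed in as a caller-supplied function, and `PresenceConstraint`, `satisfies`, and `stub_probability` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

# An anonymized schema is modelled here as a list of abstract attributes,
# each abstract attribute being a set of attribute names (assumption).
AnonymizedSchema = List[FrozenSet[str]]


@dataclass(frozen=True)
class PresenceConstraint:
    schema: str                 # s: the schema whose provenance is protected
    attributes: FrozenSet[str]  # D: the sensitive attributes
    threshold: float            # theta: maximum tolerated probability


def satisfies(
    constraint: PresenceConstraint,
    anonymized: AnonymizedSchema,
    presence_probability: Callable[[PresenceConstraint, AnonymizedSchema], float],
) -> bool:
    """Return True if Pr(D in s | anonymized) <= theta.

    The probability model is NOT given in the slides; it is injected here
    as `presence_probability` (hypothetical).
    """
    return presence_probability(constraint, anonymized) <= constraint.threshold


def stub_probability(constraint: PresenceConstraint,
                     anonymized: AnonymizedSchema) -> float:
    # Placeholder value for illustration only, not a real estimate.
    return 0.3


gamma = PresenceConstraint("s1", frozenset({"CC"}), threshold=0.5)
s_hat = [frozenset({"Name", "Holder"}), frozenset({"CC", "Num"})]
print(satisfies(gamma, s_hat, stub_probability))
```

Keeping the estimator separate from the check means the comparison against $\theta$ stays the same whatever probability model is plugged in.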

Challenge 2 – Define anonymized schema
• How to define “anonymized
schema” given a set of schemas
– Enough information to understand
but not overwhelming
• Anonymized schema contains a
set of “abstract” attributes
– An abstract attribute is a set of similar
attributes
[Figure: several original schemas with attributes Name, Num, CC, and Holder are merged into an anonymized schema whose abstract attributes are {Name, Holder} and {CC, Num}.]
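
One way to picture how abstract attributes arise is to group attributes that are known to correspond to each other. The sketch below assumes the similarity information is given as pairs of attribute names and that similarity is treated as transitive; both are assumptions for illustration, and `abstract_attributes` is a hypothetical helper, not part of the presented method.

```python
from typing import Dict, Iterable, List, Set, Tuple


def abstract_attributes(
    attributes: Iterable[str],
    correspondences: Iterable[Tuple[str, str]],
) -> List[Set[str]]:
    """Group attributes into abstract attributes (sets of similar attributes).

    Uses a simple union-find over the given correspondence pairs.
    """
    parent: Dict[str, str] = {a: a for a in attributes}

    def find(a: str) -> str:
        # Follow parent links to the representative, compressing the path.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in correspondences:
        # Register attributes that only appear in a correspondence pair.
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for a in parent:
        groups.setdefault(find(a), set()).add(a)
    # Sort for a deterministic result.
    return sorted(groups.values(), key=lambda g: sorted(g))


result = abstract_attributes(
    ["Name", "Num", "CC", "Holder"],
    [("Name", "Holder"), ("CC", "Num")],
)
print(result)
```

With the attribute names from the slide, this yields the two abstract attributes {CC, Num} and {Holder, Name}.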

Challenge 3 – Define utility function
• How to define utility function for a
certain “anonymized schema”
– Importance: sum of popularity of
attributes
• A schema that contains more popular
attributes is better
• An attribute that appears in more schemas is
more popular
– Completeness: number of abstract
attributes
• The more abstract attributes, the better
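Let $\Sigma$ be the set of all possible anonymized schemas. The utility
function $u: \Sigma \to \mathbb{R}$ measures the amount of information of each
anonymized schema.
Utility function:
$u(\hat{S}) = \mathit{importance}(\hat{S}) + \mathit{weight} \cdot \mathit{completeness}(\hat{S})$
[Figure: example anonymized schemas S1, S2, S3 (built from abstract attributes such as {Holder}, {CC}, {Name, Holder}, {CC, Num}) compared by importance and completeness.]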
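
The utility function can be sketched directly from the two ingredients named on this slide. The code below is an illustration under assumptions: popularity is taken as the raw count of original schemas containing an attribute name (the slides do not give a normalization), and the function names and example schemas are hypothetical.

```python
from typing import List, Set


def popularity(attribute: str, schemas: List[Set[str]]) -> int:
    # Assumption: popularity = number of original schemas containing the attribute.
    return sum(attribute in schema for schema in schemas)


def importance(anonymized: List[Set[str]], schemas: List[Set[str]]) -> int:
    # Sum of the popularity of every attribute shown in the anonymized schema.
    return sum(popularity(a, schemas) for group in anonymized for a in group)


def completeness(anonymized: List[Set[str]]) -> int:
    # Number of (non-empty) abstract attributes.
    return sum(1 for group in anonymized if group)


def utility(anonymized: List[Set[str]], schemas: List[Set[str]],
            weight: float = 1.0) -> float:
    # u(S^) = importance(S^) + weight * completeness(S^)
    return importance(anonymized, schemas) + weight * completeness(anonymized)


# Hypothetical original schemas, for illustration only.
schemas = [{"Name", "Num"}, {"Name", "CC"}, {"Holder", "CC"}]
small = [{"Holder"}, {"CC"}]
large = [{"Name", "Holder"}, {"CC", "Num"}]
print(utility(small, schemas, weight=0.5), utility(large, schemas, weight=0.5))
```

Both example schemas have two abstract attributes, so they tie on completeness; the larger one scores higher only because it exposes more popular attributes.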

Challenge 4 – Optimization problem (1)
Maximizing Anonymized Schema
Given a schema group $S$ and a set of privacy constraints $\Gamma$, construct
an anonymized schema $S^*$ such that $S^*$ satisfies all constraints in $\Gamma$ and
has the highest utility value.
• NP-Hard problem
…

• Problem modeling
– Schema group: Affinity matrix
– Anonymized schema: Affinity instance
• Affinity instance is an affinity matrix with some empty cells

Affinity matrix (one column per schema, one row per group of corresponding attributes):
  s1  s2  s3
  a1  b1  c1
  a2  b2  c2

Affinity instances (_ marks an empty cell) and the anonymized schemas they represent:
  a1 b1 _  / a2 b2 c2  =  {a1, b1}, {a2, b2, c2}
  a1 b1 c1 / _  b2 _   =  {a1, b1, c1}, {b2}
  …
→ Need to find an affinity instance that satisfies the privacy constraints and has the
highest utility value
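
The modeling step can be made concrete with a small sketch. It assumes an affinity matrix is stored row by row, with one column per schema and `None` for an empty cell; this representation and the helper names are illustrative choices, not taken from the slides.

```python
from typing import List, Optional, Set

# One row per group of corresponding attributes, one column per schema.
Matrix = List[List[Optional[str]]]


def is_instance_of(instance: Matrix, matrix: Matrix) -> bool:
    """Check that `instance` is `matrix` with some cells emptied (set to None)."""
    if len(instance) != len(matrix):
        return False
    for inst_row, row in zip(instance, matrix):
        if len(inst_row) != len(row):
            return False
        if any(c is not None and c != m for c, m in zip(inst_row, row)):
            return False
    return True


def to_anonymized_schema(instance: Matrix) -> List[Set[str]]:
    """Each non-empty row becomes one abstract attribute."""
    return [
        {cell for cell in row if cell is not None}
        for row in instance
        if any(cell is not None for cell in row)
    ]


affinity_matrix: Matrix = [["a1", "b1", "c1"], ["a2", "b2", "c2"]]
instance: Matrix = [["a1", "b1", None], ["a2", "b2", "c2"]]
print(is_instance_of(instance, affinity_matrix))
print(to_anonymized_schema(instance))
```

Emptying a cell hides one attribute of one schema, so searching over affinity instances is the same as choosing which attributes to hide.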

• Overall solution:
– Meta-heuristic with 2 steps
• Greedy algorithm: find a feasible solution
• Randomized local search: improve it toward an optimal solution
– Improve performance
• Divide and conquer: partition the set of constraints into independent sets
→ satisfy each set independently
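
The two-step meta-heuristic can be illustrated with a generic greedy-plus-local-search loop over affinity instances. This is only a sketch of the general pattern, not the authors' algorithm: the privacy check is a caller-supplied predicate `feasible`, the empty instance is assumed to be feasible, and `utility` is a toy stand-in (visible cells plus weighted non-empty rows).

```python
import random
from typing import Callable, List, Optional

Instance = List[List[Optional[str]]]


def utility(instance: Instance, weight: float = 1.0) -> float:
    # Toy utility: visible cells + weight * number of non-empty rows.
    kept = sum(cell is not None for row in instance for cell in row)
    rows = sum(any(cell is not None for cell in row) for row in instance)
    return kept + weight * rows


def greedy(matrix: Instance, feasible: Callable[[Instance], bool]) -> Instance:
    """Step 1: start from the empty instance and reveal cells one by one,
    keeping a cell only if the constraints still hold."""
    inst: Instance = [[None] * len(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, cell in enumerate(row):
            if cell is None:
                continue
            inst[i][j] = cell
            if not feasible(inst):
                inst[i][j] = None  # undo: this cell would violate a constraint
    return inst


def local_search(matrix: Instance, start: Instance,
                 feasible: Callable[[Instance], bool],
                 steps: int = 200, seed: int = 0) -> Instance:
    """Step 2: randomly toggle single cells; accept a move only if it stays
    feasible and does not lower the utility."""
    rng = random.Random(seed)
    best = [row[:] for row in start]
    for _ in range(steps):
        cand = [row[:] for row in best]
        i = rng.randrange(len(matrix))
        j = rng.randrange(len(matrix[i]))
        cand[i][j] = matrix[i][j] if cand[i][j] is None else None
        if feasible(cand) and utility(cand) >= utility(best):
            best = cand
    return best


matrix = [["a1", "b1", "c1"], ["a2", "b2", "c2"]]
# Hypothetical constraint: at most one attribute of the first schema is visible.
at_most_one_s1 = lambda inst: sum(row[0] is not None for row in inst) <= 1
start = greedy(matrix, at_most_one_s1)
print(local_search(matrix, start, at_most_one_s1))
```

Because the search only ever accepts feasible candidates, the result always satisfies the constraints, but it carries no guarantee of global optimality.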

Experiments - Setting
Datasets:
• Real data: 117 schemas
• Synthetic data: vary the number of schemas and the number of attributes
Evaluation Metrics:
– Utility loss: measures the amount of utility reduction w.r.t. the existence
of privacy constraints
• $\Delta u = \frac{u_\emptyset - u_\Gamma}{u_\emptyset}$, where $u_\emptyset$ is the utility without constraints and $u_\Gamma$ is the utility with a
set of constraints $\Gamma$
– Privacy loss: measures the amount of disagreement between actual
privacy $P = \{p_i\}$ and expected privacy $\Theta = \{\theta_i\}$.
• $\Delta p = KL(P \parallel \Theta) = \sum_i p_i \log \frac{p_i}{\theta_i}$
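
Both evaluation metrics are short formulas, so a direct transcription may help. The sketch below follows the definitions on this slide; the function names are hypothetical, the natural logarithm is an assumption (the slides do not state the base), and terms with $p_i = 0$ are taken to contribute zero, as is conventional for KL divergence.

```python
import math
from typing import Sequence


def utility_loss(u_unconstrained: float, u_constrained: float) -> float:
    # Delta_u = (u_empty - u_Gamma) / u_empty
    return (u_unconstrained - u_constrained) / u_unconstrained


def privacy_loss(actual: Sequence[float], expected: Sequence[float]) -> float:
    # Delta_p = KL(P || Theta) = sum_i p_i * log(p_i / theta_i)
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same length")
    # Terms with p_i == 0 contribute 0 by convention.
    return sum(p * math.log(p / t) for p, t in zip(actual, expected) if p > 0)


print(utility_loss(10.0, 8.0))
print(privacy_loss([0.5, 0.2], [0.25, 0.2]))
```

The privacy loss is zero for every constraint whose actual probability equals its threshold, and each term grows as the two diverge.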

Experiments – Computation Time
• 100 schemas, 50 attributes, 1500 constraints
→ running time is about 6 s
[Figure: computation time (log2 of msec.)]

Experiment – Privacy & Utility
• Validate the trade‐off between privacy and utility
• Evaluation procedure
– Relax constraint: increase privacy threshold $\theta$ to $(1 + r)\theta$, where $r$ is the relaxing ratio
• Observation
– The more privacy is enforced, the higher the utility loss.
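Both utility loss and privacy loss are normalized to [0,1]:
$\Delta u = \frac{\Delta u - \min_{\Delta u}}{\max_{\Delta u} - \min_{\Delta u}}$
$\Delta p = \frac{\Delta p - \min_{\Delta p}}{\max_{\Delta p} - \min_{\Delta p}}$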
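
The relaxation and normalization steps of this evaluation can be sketched as follows. The helper names are hypothetical, and returning 0.0 when all values are equal is an assumption made here to avoid division by zero; the slides do not cover that case.

```python
from typing import List, Sequence


def relax_threshold(theta: float, r: float) -> float:
    # Relaxed threshold: (1 + r) * theta, where r is the relaxing ratio.
    return (1.0 + r) * theta


def min_max_normalize(values: Sequence[float]) -> List[float]:
    # (x - min) / (max - min), mapping the observed range onto [0, 1].
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant series is mapped to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(relax_threshold(0.4, 0.5))
print(min_max_normalize([2.0, 4.0, 6.0]))
```

Normalizing both losses to the same range lets them be plotted on one axis as the relaxing ratio varies.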

Conclusion
• Introduced schema reuse with privacy constraints
• Defined privacy constraints
• Defined an anonymized schema from multiple schemas
• Defined a utility function for a certain anonymized schema
• Constructed an anonymized schema that satisfies privacy
constraints and maximizes the utility function
