1 
Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Karl Aberer 
École Polytechnique Fédérale de Lausanne, Switzerland 
Zoltán Miklós 
Université de Rennes 1, IRISA, France 
DASFAA 2013, Part II, LNCS 7826, pp. 139 – 154, 2013
2 
Database schema matching is an active research field: 
Surveys: [1], [2] 
Applications: data transformation, data migration, data alignment, … 
Automatic Matching Tools: COMA++, AMC, OpenII, Falcon, … 
Schema matching is the task of establishing correspondences that connect related 
attributes in two (independently developed) database schemas. 
[Figure: example schemas SA and SB with attributes BirthName, BirthDate and Address, and the correspondences between them]
[1] Rahm, E. et al. “A Survey of Approaches to Automatic Schema Matching”. JVLDB, 2001 
[2] Bernstein, P.A. et al. “Generic Schema Matching, Ten Years Later”. PVLDB, 2011
3 
Automatic schema matchers will 
(sometimes) fail to identify the correct 
correspondences 
There is a need for post‐matching 
reconciliation through human input 
This effort is the « real cost » in the company 
Schemas do not appear alone; they are 
part of a matching network 
The network‐level consistency constraints 
are very important for business users
4 
Real‐world scenario: a repository of schemas in the same domain 
Schema matching network: connect schemas by pair‐wise matchings 
Network‐level consistency constraints 
Automatic tools produce incorrect correspondences, so validation by humans is needed
5
6
7 
DASFAA’2013, BDA’2013: On Leveraging 
Crowdsourcing Techniques for Schema 
Matching Networks 
ER’2013: Minimizing Human Effort in 
Reconciling Match Networks 
CoopIS’2013: Collaborative Schema Matching 
Reconciliation 
ICDE’2014: Pay‐as‐you‐go Reconciliation in 
Schema Matching Networks
8 
“Crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions 
from a large group of people, and especially from an online community, rather than from traditional 
employees or suppliers.” ‐ Wiki 
Our context: employ many workers (users) to validate the same correspondences and 
combine their answers. 
Surveys: [1], [2] 
A wide range of applications (e.g. CrowdSearch) have been developed on top of 
more than 70 crowdsourcing platforms (e.g. Amazon Mechanical Turk). 
Our contribution: 
Define network‐level constraints in schema matching networks 
Design questions for workers to validate correspondences 
Leverage network‐level constraints to reduce user effort 
[1] E. Law et al. “Human Computation”. Morgan & Claypool Publishers, 2011 
[2] A. Doan et al. “Crowdsourcing systems on the World Wide Web”. CACM, 2011
9
10
11 
Three elements of questions: 
Asking object: correspondence 
Possible choices: simple YES/NO question 
Support Information: alternatives, constraint satisfactions, constraint 
violations
12 
User feedback: 
User   Question   Answer 
U1     C          Yes 
U2     C          Yes 
U3     C          No 
User quality: 
User   Reliability 
U1     r1 
U2     r2 
U3     r3 
where r1 = Pr (C=true | U1=yes) = Pr (C=false | U1=no) 
Answer aggregation: a probabilistic model (*) takes the user feedback and the user reliabilities, estimates Pr(C), and computes <a,e>, the aggregated value and its error rate: 
Corr   Aggregation   Error Rate 
C      True          0.19 
(*) Majority Voting, Expectation Maximization, … 
See full paper for details
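The aggregation step above can be sketched as follows. This is a minimal illustration and not the model from the paper: it assumes a uniform prior on C, independent workers, and treats each reliability as the probability that the worker's answer is correct. The function name `aggregate` and the value 0.6 used for all three reliabilities are hypothetical, so the output is not meant to reproduce the 0.19 shown on the slide.

```python
from math import prod


def aggregate(answers, reliabilities, prior=0.5):
    """Combine yes/no answers about one correspondence C.

    answers:       dict worker -> bool (True means "Yes")
    reliabilities: dict worker -> assumed probability that the worker is right
    Returns (aggregated_value, error_rate).
    """
    # Likelihood of the observed answers if C is true / if C is false.
    like_true = prod(
        reliabilities[w] if a else 1 - reliabilities[w] for w, a in answers.items()
    )
    like_false = prod(
        1 - reliabilities[w] if a else reliabilities[w] for w, a in answers.items()
    )
    p_true = prior * like_true / (prior * like_true + (1 - prior) * like_false)
    value = p_true >= 0.5
    error_rate = 1 - max(p_true, 1 - p_true)
    return value, error_rate


answers = {"U1": True, "U2": True, "U3": False}
reliabilities = {"U1": 0.6, "U2": 0.6, "U3": 0.6}  # illustrative values only
value, error_rate = aggregate(answers, reliabilities)
print(value, round(error_rate, 2))
```

With equal reliabilities this reduces to a weighted form of majority voting; unequal reliabilities let a trusted worker outweigh several unreliable ones.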
13 
To achieve higher accuracy, we need more answers: a cost‐accuracy tradeoff 
[Figure: plot for worker reliability r = 0.6, with the goal marked] 
Solution: leverage constraints to reduce the error rate
14 
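Idea: correspondences support each other if they satisfy a constraint 
1‐1 constraint: ONE source attribute matches only ONE target attribute 
Example: attribute a in schema S has two candidate correspondences, ab1 and ab2, to attributes b1 and b2 in schema T. 
Pr(ab1=true) = 0.8 
Pr(ab2=false) = 0.6 
By independence, the joint probabilities are products of the marginals (e.g. 0.8 x 0.6 = 0.48): 
ab1   ab2   Prob   1‐1 constraint 
T     T     0.32   not satisfied 
T     F     0.48   satisfied 
F     T     0.08   satisfied 
F     F     0.12   satisfied 
Conditioning on the 1‐1 constraint \gamma_{1\text{-}1}: 
\[ \Pr(ab_2 = \text{false} \mid \gamma_{1\text{-}1}) = \frac{0.48 + 0.12}{0.48 + 0.08 + 0.12} \approx 0.88 \] 
Without constraint: 
Corr   Aggregation   Error Rate 
ab2    False         0.4 (*) 
With constraint: 
Corr   Aggregation   Error Rate 
ab2    False         0.12 (**) 
0.4 > 0.12: the constraint lowers the error rate. 
(*) Error Rate = 1 – Pr(ab2=false)   (**) Error Rate = 1 – Pr(ab2=false | \gamma_{1\text{-}1})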
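The 1‐1 example can be checked by enumerating the four truth assignments and renormalising over those that satisfy the constraint. The sketch below assumes the two correspondences are independent, as the slide does; the function and variable names are hypothetical.

```python
from itertools import product


def posterior_under_constraint(marginals, satisfies, target, target_value):
    """Pr(target = target_value | constraint) for independent correspondences.

    marginals: dict name -> Pr(name = True)
    satisfies: function(assignment dict) -> bool, True if the constraint holds
    """
    names = list(marginals)
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        # Joint probability of this assignment under independence.
        joint = 1.0
        for name in names:
            joint *= marginals[name] if assignment[name] else 1 - marginals[name]
        if satisfies(assignment):
            denominator += joint
            if assignment[target] == target_value:
                numerator += joint
    return numerator / denominator


def one_to_one(assignment):
    # Attribute a may match b1 or b2, but not both.
    return not (assignment["ab1"] and assignment["ab2"])


# Pr(ab2 = false) = 0.6 is expressed as Pr(ab2 = true) = 0.4.
p = posterior_under_constraint({"ab1": 0.8, "ab2": 0.4}, one_to_one, "ab2", False)
print(round(p, 2))      # 0.88
print(round(1 - p, 2))  # 0.12, the error rate with the constraint
```

Enumeration is exponential in the number of correspondences involved, so it is only practical for small constraint scopes like this one.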
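15 
Circle constraint: a sequence of correspondences forms a closed circle 
Δ: probability of compensating errors along the circle (*) 
[Figure: schemas S1, S2, S3 with attributes a, b, c connected in a circle by the correspondences ab, bc, ac] 
Pr(ab=T) = 0.8 
Pr(ac=T) = 0.8 
Pr(bc=T) = 0.8 
By independence, the joint probabilities are products of the marginals (e.g. 0.8 x 0.8 x 0.8 = 0.512). The last column is the probability that the assignment satisfies the circle constraint: 
ab   bc   ac   Prob    Pr(satisfied) 
T    T    T    0.512   1.0 
T    T    F    0.128   0.0 
T    F    T    0.128   0.0 
T    F    F    0.032   Δ 
F    T    T    0.128   0.0 
F    T    F    0.032   Δ 
F    F    T    0.032   Δ 
F    F    F    0.008   Δ 
Conditioning on the circle constraint \gamma_{circle}: 
\[ \Pr(ab = T \mid \gamma_{circle}) = \frac{0.512 + \Delta \times 0.032}{0.512 + 3 \times \Delta \times 0.032 + \Delta \times 0.008} \approx 0.973 \quad \text{with } \Delta = 0.2 \] 
Without constraint: 
Corr   Aggregation   Error Rate 
ab     True          0.2 (**) 
With constraint: 
Corr   Aggregation   Error Rate 
ab     True          0.027 (***) 
0.2 > 0.027: the constraint lowers the error rate. 
(**) Error Rate = 1 – Pr(ab=T)   (***) Error Rate = 1 – Pr(ab=T | \gamma_{circle}) 
(*) Cudré-Mauroux, et al. Probabilistic message passing in peer data management systems. ICDE 2006.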
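The circle example follows the same pattern, except that each assignment is weighted by the probability that it satisfies the constraint: 1 when all three correspondences hold, 0 when exactly one is wrong, and Δ when two or three are wrong (errors that may compensate each other). A minimal sketch, assuming independence and a hypothetical function name:

```python
from itertools import product


def circle_posterior(p_ab, p_bc, p_ac, delta):
    """Pr(ab = True | circle constraint) for three independent correspondences."""
    marginals = (p_ab, p_bc, p_ac)
    numerator = denominator = 0.0
    for values in product([True, False], repeat=3):
        # Joint probability of this assignment under independence.
        joint = 1.0
        for p, value in zip(marginals, values):
            joint *= p if value else 1 - p
        wrong = values.count(False)
        if wrong == 0:
            weight = 1.0    # the circle closes
        elif wrong == 1:
            weight = 0.0    # a single error cannot be compensated
        else:
            weight = delta  # several errors may compensate each other
        denominator += joint * weight
        if values[0]:       # values[0] is the truth value of ab
            numerator += joint * weight
    return numerator / denominator


p = circle_posterior(0.8, 0.8, 0.8, delta=0.2)
print(round(p, 3))      # 0.973
print(round(1 - p, 3))  # 0.027, the error rate with the constraint
```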
16 
Settings: 
Real‐world schemas. Use ground truth to simulate users/workers. 
Error Threshold = 0.1 : make decision when error rate < 0.1; otherwise, 
continue to ask users. 
Metric: Cost = 
Observation: Cost (With Constraints) is lower than Cost (Without Constraints)
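The stopping rule in the settings (decide once the error rate falls below 0.1, otherwise keep asking) can be sketched as a loop over incoming answers. This is an illustration under assumptions, not the experimental setup of the paper: all workers share one reliability value, the prior is uniform, and the number of answers consumed stands in for cost. All names are hypothetical.

```python
import math
import random


def answers_until_confident(answers, reliability, threshold=0.1):
    """Consume yes/no answers until the error rate drops below the threshold.

    answers: iterable of bools (True means "Yes, the correspondence holds")
    Returns (decision, error_rate, answers_used).
    """
    step = math.log(reliability / (1 - reliability))
    log_odds = 0.0  # uniform prior: Pr(C = true) = 0.5
    error_rate = 0.5
    used = 0
    for answer in answers:
        used += 1
        log_odds += step if answer else -step
        # Error rate is the probability of the less likely value.
        error_rate = 1 / (1 + math.exp(abs(log_odds)))
        if error_rate < threshold:
            break
    return log_odds >= 0, error_rate, used


def simulated_answers(truth, reliability, rng, limit=1000):
    """Simulate workers from the ground truth: each is right with Pr = reliability."""
    for _ in range(limit):
        yield truth if rng.random() < reliability else not truth


# Six unanimous "Yes" answers are needed at reliability 0.6.
print(answers_until_confident([True] * 10, reliability=0.6))

# A simulated run against a known ground truth.
rng = random.Random(0)
print(answers_until_confident(simulated_answers(True, 0.6, rng), reliability=0.6))
```

Plugging a constraint-aware posterior into the error-rate computation would let the loop stop earlier, which is the effect the observation above reports.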
17 
We model a crowdsourcing process for schema 
matching networks that addresses two optimization 
goals: minimize monetary cost and 
maximize accuracy (minimize error rate). 
We design a variety of questions with different support 
information. 
We leverage consistency constraints to reduce the error 
rate, which in turn reduces the monetary cost. 
18

On Leveraging Crowdsourcing Techniques for Schema Matching Networks
