Pay-as-you-go Reconciliation in Schema Matching Networks
Nguyen Quoc Viet Hung¹, Nguyen Thanh Tam¹, Zoltán Miklós², Karl Aberer¹, Avigdor Gal³, and Matthias Weidlich⁴
¹ École Polytechnique Fédérale de Lausanne
² Université de Rennes 1
³ Technion – Israel Institute of Technology
⁴ Imperial College London
ICDE | 2014 2 
Schema Matching - Where? 
Schema matching is the process of establishing correspondences between the attributes of schemas for the purpose of data integration.
Large enterprises 
Cloud 
WWW 
Collaborative Systems 
P2P Networks
Private PhD Thesis Defense | 12.2013 3 
Schema Matching Network 
A network of schemas that are matched against each other 
Traditional approach: 
Mediated schema 
Our approach: 
Schema Matching Network 
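[Figure: schemas S1, S2, S3 connected through a mediated schema (traditional approach) and matched directly against each other (schema matching network)]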
Require consensus on schema 
Updated Frequently
Pay-as-you-go Reconciliation
• Reconciliation is the process of asking a human user to give feedback on correspondences.
• Need for reconciliation: automatic techniques use heuristics → results are inherently uncertain.
[Figure: example network of three schemas (s1: EoverI, s2: BBC, s3: DVDizzy) with attributes a1: releaseDate, a2: screeningDate, a3: availabilityDate, a4: productionDate, linked by candidate correspondences c1 to c5]
Attribute names are quite similar → automatic matching tools often fail to identify the correct correspondences.
Pay-as-you-go reconciliation has two parts:
• Uncertainty reduction: incrementally improve matching quality with minimal user effort
• Instantiation (selective matching): instantiate a single trusted set of correspondences
System Overview 
General approach: 
1. Develop a probabilistic schema matching network (pSMN), which makes it possible to measure the overall uncertainty of the network.
2. Reduce network uncertainty: guide user feedback so that it requires minimal effort.
3. Instantiate a selective matching: maintain a good set of attribute correspondences so that the system is usable at any time.
Outline
• Probabilistic Schema Matching Network (pSMN):
  • Model
  • Computation
• Uncertainty Reduction
• Instantiation of the selective matching
• Experimental results
• Conclusion and future work
pSMN - Modeling
• A schema matching network is modeled as a tuple $N = \langle S, G_S, \Gamma, C, P \rangle$:
  • $S$: set of schemas $s$
  • $G_S$: interaction graph, which represents the connections in the network
  • $C$: set of attribute correspondences
  • $\Gamma$: set of integrity constraints
    • An integrity constraint formalizes a natural property of the network:
      • 1-1 constraint
      • Cycle constraint (transitivity)
      • Etc.
  • $P = \{p_c\}$: a set of probabilities. Each probability $p_c$ is associated with a correspondence $c \in C$.
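As a rough illustration of the model, the sketch below represents each correspondence as an unordered pair of (schema, attribute) endpoints and checks the 1-1 constraint over a candidate set. The representation, the helper names `correspondence`, `violates_one_to_one` and `satisfies_one_to_one`, and the schema and attribute names are assumptions made for this example, not definitions from the slides. The cycle constraint is not covered.

```python
from itertools import combinations

# A correspondence is modeled (by assumption) as a frozenset of two
# (schema, attribute) endpoints, so its direction does not matter.
def correspondence(schema_a, attr_a, schema_b, attr_b):
    return frozenset({(schema_a, attr_a), (schema_b, attr_b)})

def violates_one_to_one(x, y):
    """True if x and y map one attribute to two different attributes of the same schema."""
    shared = x & y
    if len(shared) != 1:
        return False  # identical or disjoint correspondences cannot clash
    (other_x,) = x - shared
    (other_y,) = y - shared
    return other_x[0] == other_y[0]  # remaining endpoints lie in the same schema

def satisfies_one_to_one(correspondences):
    """Check the 1-1 constraint over every pair in a set of correspondences."""
    return not any(violates_one_to_one(x, y)
                   for x, y in combinations(correspondences, 2))

# Hypothetical example: one attribute of SA matched into SB twice and into SC once.
c1 = correspondence("SA", "date", "SB", "releaseDate")
c2 = correspondence("SA", "date", "SB", "screeningDate")
c3 = correspondence("SA", "date", "SC", "availabilityDate")

print(satisfies_one_to_one({c1, c3}))  # True: the targets are in different schemas
print(satisfies_one_to_one({c1, c2}))  # False: SA.date has two matches in SB
```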
pSMN - Computing
• Probability of a correspondence
  • Semantics: indicates how likely the correspondence is to be correct
  • Source: integrity constraints and user input. Idea: a correspondence that is involved in many violations has a high chance of being problematic.
  • Computation:
    • Step 1: construct all possible matching instances $\Omega = \{I_1, \dots, I_n\}$. A matching instance is a maximal set of correspondences that satisfies all integrity constraints and user input.
    • Step 2: compute $p_c$ as the fraction of matching instances that contain $c$:
      $p_c = \frac{|\{I \in \Omega : c \in I\}|}{|\Omega|}$
  • Challenge: probability computation has a high complexity → we use non-uniform sampling and a view-maintenance technique to approximate the probabilities efficiently.
• Network uncertainty: quantify the uncertainty of the pSMN based on entropy:
  $H(C) = -\sum_{c \in C} \left[ p_c \log p_c + (1 - p_c) \log(1 - p_c) \right]$
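The two computation steps and the entropy formula can be sketched by brute force on a toy network. This is only an illustration under stated assumptions: the integrity constraints are reduced to a hypothetical set of pairwise conflicts, user input is not modeled, the logarithm is taken to base 2, and the exhaustive enumeration is exponential, which is exactly why the slides resort to sampling. All function names are made up for this example.

```python
from itertools import combinations
from math import log2

def matching_instances(C, consistent):
    """Step 1: enumerate all maximal consistent subsets of C (exponential brute force)."""
    ok = [frozenset(s)
          for r in range(len(C) + 1)
          for s in combinations(sorted(C), r)
          if consistent(frozenset(s))]
    return [s for s in ok if not any(s < t for t in ok)]

def probabilities(C, instances):
    """Step 2: p_c = (number of instances containing c) / (number of instances)."""
    return {c: sum(c in I for I in instances) / len(instances) for c in C}

def binary_entropy(p):
    """Entropy of one correspondence; 0 when p is exactly 0 or 1."""
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def network_uncertainty(probs):
    """H(C): sum of the per-correspondence entropies."""
    return sum(binary_entropy(p) for p in probs.values())

# Toy network: c2 and c3 each conflict with c4 and c5 (hypothetical constraints).
C = {"c1", "c2", "c3", "c4", "c5"}
conflicts = [{"c2", "c4"}, {"c2", "c5"}, {"c3", "c4"}, {"c3", "c5"}]
consistent = lambda s: not any(pair <= s for pair in conflicts)

instances = matching_instances(C, consistent)
probs = probabilities(C, instances)

print(sorted(sorted(I) for I in instances))  # [['c1', 'c2', 'c3'], ['c1', 'c4', 'c5']]
print(probs["c1"], probs["c2"])              # 1.0 0.5
print(network_uncertainty(probs))            # 4.0
```

With these conflicts the network has exactly two matching instances, so c1 is certain and each of the other four correspondences contributes one bit of uncertainty.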
Reduce Network Uncertainty
• Goal: guide the user to give feedback with minimal effort
• Problem (UNCERTAINTY MINIMIZATION WITH LIMITED EFFORT BUDGET). Given a probabilistic matching network $\langle S, G_S, \Gamma, C, P \rangle$ and a budget of user effort $k$, find a set of correspondences $C' \subseteq C$ with $|C'| \le k$, such that the network uncertainty $H(C, P)$ after feedback on $C'$ is minimal.
Approach – Use heuristic ordering
• Idea: show users the correspondences with the highest information gain first.
• Information gain: the reduction in network uncertainty achieved by validating $c$:
  $IG(c) = H(C) - H(C \mid c)$
  where $H(C \mid c)$ is the expected network uncertainty once the true value of $c$ is known.
• Example: two possible solutions, {c1, c2, c3} and {c1, c4, c5}.
  • Ask c1 first → the network is unchanged → no uncertainty reduction.
  • Ask c2 first → only one solution is left → the network becomes certain.
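[Figure: three views of the network over schemas SA, SB, SC: the initial candidate correspondences c1 to c5, the unchanged network after asking c1, and the remaining correspondences c1, c2, c3 after asking c2]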
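The c1-versus-c2 example can be checked numerically. The sketch below assumes that the expected uncertainty $H(C \mid c)$ is the probability-weighted average of the uncertainty over the instances that contain c and the instances that do not, with probabilities estimated from the two solutions and logarithms to base 2. That reading of "expected network uncertainty" and the function names are assumptions of this example.

```python
from math import log2

def entropy(instances, C):
    """H(C) with p_c estimated as the share of instances containing c."""
    total = 0.0
    for c in C:
        p = sum(c in I for I in instances) / len(instances)
        if 0.0 < p < 1.0:
            total -= p * log2(p) + (1 - p) * log2(1 - p)
    return total

def information_gain(c, instances, C):
    """IG(c) = H(C) - H(C | c), averaging over the two possible answers for c."""
    with_c = [I for I in instances if c in I]         # user approves c
    without_c = [I for I in instances if c not in I]  # user disapproves c
    p = len(with_c) / len(instances)
    expected = 0.0
    if with_c:
        expected += p * entropy(with_c, C)
    if without_c:
        expected += (1 - p) * entropy(without_c, C)
    return entropy(instances, C) - expected

C = ["c1", "c2", "c3", "c4", "c5"]
instances = [{"c1", "c2", "c3"}, {"c1", "c4", "c5"}]

print(information_gain("c1", instances, C))  # 0.0
print(information_gain("c2", instances, C))  # 4.0

# Heuristic ordering: ask about the most informative correspondences first.
ranked = sorted(C, key=lambda c: (-information_gain(c, instances, C), c))
print(ranked)  # ['c2', 'c3', 'c4', 'c5', 'c1']
```

Asking about c1 gains nothing because c1 appears in both solutions, while any of c2 to c5 separates the two solutions and removes all four bits of uncertainty.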
Instantiate a selective matching
• Goal: maintain a single trusted set of correspondences
• Goodness measures for a set of correspondences $I \subseteq C$:
  • Repair distance: the information lost by eliminating correspondences to guarantee the integrity constraints:
    $\Delta(I) = |C \setminus I|$
  • Likelihood: the collective correctness of the correspondences:
    $u(I) = \prod_{c \in I} p_c$
• Instantiation problem: given a schema matching network, identify a set of correspondences $I \subseteq C$ with minimal repair distance (w.r.t. $C$) and maximal likelihood.
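Both goodness measures are simple to compute once the probabilities are known. The sketch below is a minimal illustration: the probability values are invented for the example (they are not results from the slides), and the repair distance is read as the number of correspondences removed from C.

```python
from math import prod

def repair_distance(C, I):
    """Delta(I): number of correspondences of C that are missing from I."""
    return len(set(C) - set(I))

def likelihood(probs, I):
    """u(I): product of the probabilities of the correspondences in I."""
    return prod(probs[c] for c in I)

C = {"c1", "c2", "c3", "c4", "c5"}
# Invented probabilities, for illustration only.
probs = {"c1": 1.0, "c2": 0.75, "c3": 0.5, "c4": 0.25, "c5": 0.5}

I_a = {"c1", "c2", "c3"}
I_b = {"c1", "c4", "c5"}

print(repair_distance(C, I_a), likelihood(probs, I_a))  # 2 0.375
print(repair_distance(C, I_b), likelihood(probs, I_b))  # 2 0.125
```

Both candidate sets drop two correspondences, so the repair distance cannot separate them and the likelihood decides in favour of the first set.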
Approach
• The instantiation problem is NP-complete → use a heuristic approach
• Algorithm:
  • Step 1: Initialization - pick a sampled matching instance with minimal repair distance
  • Step 2: Optimization - randomized local search
[Figure: matching instances (sets that satisfy all constraints) plotted by repair distance and likelihood; legend: non-sampled instance, sampled instance, sampled instance with minimal repair distance; randomized local search moves from the initial instance I0 to Iopt, which has minimal repair distance and maximal likelihood]
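The slides do not spell out the neighbourhood or the acceptance rule of the randomized local search, so the following is only one possible sketch of the two steps. It assumes that constraints can be expressed as pairwise conflicts, that a move forces a random correspondence into the current instance and repairs the result, and that candidates are compared by repair distance first and likelihood second. The names `neighbour`, `is_better` and `instantiate`, and the probabilities, are hypothetical.

```python
import random
from math import prod

def repair_distance(C, I):
    return len(C - I)

def likelihood(probs, I):
    return prod(probs[c] for c in I)

def is_better(C, probs, a, b):
    """a beats b on smaller repair distance, then on higher likelihood."""
    return ((repair_distance(C, a), -likelihood(probs, a))
            < (repair_distance(C, b), -likelihood(probs, b)))

def neighbour(C, probs, conflicts, I, c):
    """Force c into I, drop what conflicts with it, then greedily refill to a maximal set."""
    result = {c} | {x for x in I if frozenset((c, x)) not in conflicts}
    for x in sorted(C - result, key=lambda x: (-probs[x], x)):
        if all(frozenset((x, y)) not in conflicts for y in result):
            result.add(x)
    return frozenset(result)

def instantiate(C, probs, conflicts, samples, steps=200, seed=0):
    rng = random.Random(seed)
    order = sorted(C)
    # Step 1: start from a sampled instance with minimal repair distance.
    current = best = min(samples, key=lambda I: repair_distance(C, I))
    # Step 2: random walk over neighbouring instances, remembering the best one seen.
    for _ in range(steps):
        c = rng.choice(order)
        if c in current:
            continue
        current = neighbour(C, probs, conflicts, current, c)
        if is_better(C, probs, current, best):
            best = current
    return best

C = {"c1", "c2", "c3", "c4", "c5"}
conflicts = {frozenset(p) for p in
             [("c2", "c4"), ("c2", "c5"), ("c3", "c4"), ("c3", "c5")]}
probs = {"c1": 1.0, "c2": 0.75, "c3": 0.5, "c4": 0.25, "c5": 0.5}  # invented values
samples = [frozenset({"c1", "c4", "c5"})]

best = instantiate(C, probs, conflicts, samples)
print(sorted(best), likelihood(probs, best))  # ['c1', 'c2', 'c3'] 0.375
```

Keeping the best instance separately from the current one lets the walk pass through worse instances without losing the best result found so far.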
Experiment – Dataset and Setting
• Datasets:
  • Business Partner: schemas from enterprise systems
  • Purchase Order: purchase order e-business schemas
  • University Application Form: schemas from the Web interfaces of American university application forms
  • WebForm: schemas from Web forms of different domains
  • Thalia: schemas describing university courses
• Metrics:
  • Precision: measures quality improvement at each user interaction step $i$, with $G$ being the exact match:
    $P_i = |D_i \cap G| / |D_i|$
  • User effort: the percentage of feedback steps relative to the size of the matcher output:
    $E_i = i / |C|$
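The two metrics translate directly into code. In the sketch below, $D_i$ is taken to be the set of correspondences retained after step i, which is an assumption because the slide does not define it; the sets themselves are made up for the example.

```python
def precision(D_i, G):
    """P_i = |D_i ∩ G| / |D_i|; returns 0.0 for an empty D_i (a choice made here)."""
    return len(D_i & G) / len(D_i) if D_i else 0.0

def user_effort(i, C):
    """E_i = i / |C|: feedback steps so far relative to the matcher output size."""
    return i / len(C)

C = {"c1", "c2", "c3", "c4"}   # matcher output (hypothetical)
G = {"c1", "c2", "c3"}         # exact match (hypothetical)

D_0 = set(C)                   # before any feedback
D_1 = C - {"c4"}               # after the user rejects c4 in step 1

print(precision(D_0, G), user_effort(0, C))  # 0.75 0.0
print(precision(D_1, G), user_effort(1, C))  # 1.0 0.25
```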
Efficiency of guiding strategy on uncertainty reduction
• Goal: compare the guiding and non-guiding strategies on uncertainty reduction
• Evaluation procedure:
  • Increase user effort
  • Upon each user input, measure the network uncertainty and precision
• Interesting finding: the heuristic ordering strategy saves up to 48% of user effort compared to random ordering.
Efficiency of guiding strategy on instantiation
• Goal: compare the guiding and non-guiding strategies on instantiation
• Evaluation procedure:
  • Increase user effort
  • Measure the precision and recall of the instantiated matching
• Interesting finding: the heuristic ordering strategy outperforms the baseline with an average difference of 15% (precision) and 14% (recall).
Conclusions
• We introduce the concepts of schema matching networks and probabilistic matching networks.
• We define a model for pay-as-you-go reconciliation on top of matching networks.
• We propose a guiding technique to reduce network uncertainty and a heuristic approach to instantiate a selective matching.
• In experiments with real-world schemas, our guiding strategy outperforms the baseline:
  • Saving up to 48% of user effort
  • Increasing precision (15%) and recall (14%)
Future Work
• Generalizing pay-as-you-go reconciliation for crowdsourced models:
  • Business process matching
  • Ontology alignment
THANK YOU 
Q&A
