Data Integration
Introduction to Data Integration
AUEB - MSc in Business Analytics – Spiros Safras (@ssafras)
Agenda
 Integration of Information
 The Heterogeneity Problem
 Data Integration Architectures
 Federated Database Schemas
 Data Warehouses
 Mediators
 Extractors and Wrappers
 Capability Based Optimization
 Adornments and Query Plan Selection
 Optimizing Mediator Queries
 Consult all sources VS Best effort approach
 Local as View Mediators
 Entity Resolution
Integration of Information
 Definition: The process of taking several databases or other information sources and making the data in
these sources work together as if they were a single database.
Why Data Integration?
 Databases are often created independently, even if they later
need to be combined.
 At that point, multiple different systems might have been
used with different schemas and maybe limited interfaces to
the data.
 The use of a database evolves, so we cannot design a database
to support every possible future use.
 Starting over is not always possible
 High development cost
 Incompatibility issues with legacy software
 Transition from old systems to new ones
 The goal of data integration: tie together different sources,
controlled by many people, under a common schema.
Why Data Integration?
 This can be solved through the use of an abstraction layer (
middleware ) on top of all the available data sources.
 The layer of abstraction could be a set of relational views.
 These views could be either virtual or materialized.
 A form of SQL could be used to access this abstraction layer.
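As a minimal sketch of this idea, the snippet below uses Python's built-in sqlite3 module to define one virtual relational view over two sources with different schemas. The table names, column names, and rows are illustrative assumptions, and two tables in a single in-memory database stand in for what would really be separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical sources that describe cars with different schemas.
    CREATE TABLE ds1_cars  (serial TEXT, model TEXT, color TEXT);
    CREATE TABLE ds2_autos (serial_no TEXT, model_name TEXT, colour TEXT);
    INSERT INTO ds1_cars  VALUES ('A1', 'Gobi', 'black'), ('A2', 'Gobi', 'red');
    INSERT INTO ds2_autos VALUES ('B1', 'Megane', 'black');

    -- The abstraction layer: a virtual view that presents a common schema.
    CREATE VIEW all_cars AS
        SELECT serial, model, color FROM ds1_cars
        UNION ALL
        SELECT serial_no, model_name, colour FROM ds2_autos;
""")

# Users query the view with ordinary SQL and never see the source schemas.
black_cars = conn.execute(
    "SELECT serial, model FROM all_cars WHERE color = 'black' ORDER BY serial"
).fetchall()
print(black_cars)
```

A materialized variant would copy the rows of the view into a stored table and refresh that table periodically, trading freshness for query speed.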
Why Data Integration? ( Scenario 1 )
 A company may have several different systems with totally independent databases describing different
entities.
 Management may need to pose queries that require the combination of these data sources (DS) in order to
provide an answer.
[Figure: data sources DS1 to DS5 connected to a common middleware layer]
Why Data Integration? ( Scenario 2 )
 Company A acquired Company B
 Both companies have data sources that describe similar entities through different schemas
 These datasets need to be integrated
[Figure: data sources DS1.a to DS1.e and DS2.a to DS2.e connected to a common middleware layer]
The Heterogeneity Problem
 Data sources differ in many ways, even if they are intended
to store the same kinds of data.
 Such sources are called heterogeneous, and the problem of
integrating them is referred to as the heterogeneity
problem.
Heterogeneity Problem Types
 Communication Heterogeneity
 HTTP
 Direct through VPN
 Remote Connection
 Query Language Heterogeneity
 Different SQL Dialects
 JS (e.g. MongoDB)
 URI (e.g. HTTP APIs)
 Excel
 Schema Heterogeneity
 Sample schema 1:
 Sample schema 2:
Heterogeneity Problem Types
 Data type differences
 Serial numbers might be represented by character strings of varying
length at one source and fixed length at another.
 Value Heterogeneity
 E.g. storing the color “Black”
 Data source A: “BLACK” { String }
 Data source B: “BL” { String }
 Data source C: 12 { Integer }
 Semantic Heterogeneity
 Do trucks belong to “Cars”?
 Do minivans == station wagons?
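Value heterogeneity is often handled with per-source translation tables. The sketch below assumes the three encodings of black listed above; the table layout and the function name `canonical_color` are hypothetical.

```python
# Per-source lookup tables that translate local encodings into one canonical value.
COLOR_CODES = {
    "A": {"BLACK": "black"},   # data source A stores a full upper-case string
    "B": {"BL": "black"},      # data source B stores a two-letter code
    "C": {12: "black"},        # data source C stores an integer code
}

def canonical_color(source, raw_value):
    """Return the canonical color for a raw value, or None if it is unknown."""
    return COLOR_CODES[source].get(raw_value)

print([canonical_color(s, v) for s, v in [("A", "BLACK"), ("B", "BL"), ("C", 12)]])
```

Semantic heterogeneity is harder to resolve with a lookup table, because the sources disagree on what a category means and not only on how a value is written.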
Data Integration Architectures
 There are several ways that databases or other distributed information
sources can be made to work together.
 Federation: everybody talks directly to everyone else.
 Warehouse: Sources are translated from their local schema to a global
schema and copied to a central DB.
 Mediator: Virtual warehouse --- turns a user query into a sequence of source
queries.
Federated Database Systems
 Perhaps the simplest architecture for integrating several
databases
 One source can call on others to supply information
 Each query works properly for the database to which it is
addressed
 If n databases each need to talk to the n - 1 other
databases, then we must write n(n - 1) pieces of code to
support queries between systems.
 Could be useful if the systems need to query the database
of a very limited number of other systems
[Figure: federated databases connected to one another through wrappers]
Data Warehouses
 Data from several sources is extracted and combined into a
global schema
 Queries may be issued by the user exactly as they would be
issued to any database
 Two approaches on updating the DW:
 The warehouse is periodically closed to queries and
reconstructed from the current data in the sources. (e.g.
once a night or at even longer intervals)
 The warehouse is updated periodically (e.g., each night),
based on the changes that have been made to the
sources since the last time the warehouse was modified.
( incremental update )
 It is generally too expensive to reflect immediately, at the
warehouse, every change to the underlying databases.
[Figure: Source 1 and Source 2 each feed an extractor, and the extractors load the warehouse]
Mediators
 A mediator supports a virtual view, or collection of
views, that integrates several sources.
 The mediator doesn’t store any data. The mechanics
of mediators and warehouses are rather different.
 To begin, the user or application program issues a
query to the mediator.
 Since the mediator has no data of its own, it must get
the relevant data from its sources and use that data to
form the answer to the user’s query.
[Figure: the user query goes to the mediator, which sends queries through two wrappers to Source 1 and Source 2; results flow back the same way]
Mediators
 The mediator sends a query to each of its wrappers,
which in turn send queries to their corresponding
sources.
 The mediator may send several queries to a wrapper,
and may not query all wrappers.
 The results come back and are combined at the
mediator.
Extractors and Wrappers
In a data warehouse system, the source extractors
consist of:
 One or more predefined queries that are executed at
the source to produce data for the warehouse.
 Suitable communication mechanisms, so the extractor
can:
 Pass ad-hoc queries to the source
 Receive responses from the source
 Pass information to the warehouse.
Extractors and Wrappers
The predefined queries to the source could be:
 SQL queries if the source is a SQL database
 Operations in whatever language is appropriate for
a source that is not a database system
Extractors and Wrappers
Example ETL tool chain:
 This is an example for e-commerce loading
 Note the multiple stages of filtering (using selection or join-like operations) and the logging of bad records
before we group and load
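[Figure: ETL flow. Invoice line items pass through a date-time split and a filter for invalid dates/times, are joined with item records with invalid items filtered out, and are matched against customer records with non-matching (invalid) customers filtered out, before being loaded into the data warehouse]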
Extractors and Wrappers
 Mediator systems require more complex wrappers
than do most warehouse systems
 The wrapper must be able to accept a variety of
queries from the mediator and translate any of them to
the terms of the source
 The wrapper must then communicate the result to the
mediator
Templates for Query Patterns
 A systematic way to design a wrapper that connects a
mediator to a source is to classify the possible queries
that the mediator can ask into templates.
 Templates are queries with parameters that represent
constants.
Templates for Query Patterns
 Suppose we want to build a wrapper for the source of Car Dealer 1, which has the schema:
 For use by a mediator with schema:
 A possible solution:
Templates for Query Patterns
 In this case, there are eight choices, if queries are
allowed to specify any of three attributes: model, color,
and autoTrans.
 In general, there would be 2^n templates if we have the
option of specifying n attributes (here, 2^3 = 8)
 The number of templates could grow unreasonably
large
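The count can be checked by enumerating every subset of attributes that a query might specify. This is only an illustration of the arithmetic; the function name `template_patterns` is hypothetical.

```python
from itertools import combinations

ATTRIBUTES = ["model", "color", "autoTrans"]

def template_patterns(attributes):
    """List every subset of attributes a query could specify (one template each)."""
    return [
        combo
        for size in range(len(attributes) + 1)
        for combo in combinations(attributes, size)
    ]

patterns = template_patterns(ATTRIBUTES)
print(len(patterns))  # 8, that is 2 ** 3
```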
Wrapper Generators
 The templates defining a wrapper must be turned into
code for the wrapper itself.
 The software that creates the wrapper is called a wrapper
generator
 The wrapper generator creates a table that holds
 the various query patterns contained in the templates
 the source queries that are associated with each pattern
[Figure: the wrapper generator turns templates into a table; the driver uses the table to translate queries from the mediator into queries for the source, and passes the results back]
Wrapper Generators
 A driver is used in each wrapper; in general the driver can
be the same for each generated wrapper.
 The task of the driver is to:
 Accept a query from the mediator
 Search the table for a template that matches the
query
 Send the query to the source
 Process the response and return it to the mediator
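The driver's four tasks can be sketched as a table lookup followed by a parameterized source query. The table contents, the Cars schema, and the function name `driver` are assumptions made for illustration; an in-memory SQLite database stands in for the dealer's source.

```python
import sqlite3

# Table a wrapper generator might emit: the set of attributes the mediator
# binds (the query pattern) -> the parameterized source query to run.
TEMPLATE_TABLE = {
    frozenset({"color"}):
        "SELECT serialNo, model, color FROM Cars WHERE color = :color",
    frozenset({"model", "color"}):
        "SELECT serialNo, model, color FROM Cars "
        "WHERE model = :model AND color = :color",
}

def driver(conn, bindings):
    """Match the query pattern, run the source query, return the rows."""
    sql = TEMPLATE_TABLE.get(frozenset(bindings))
    if sql is None:
        raise LookupError("no template matches this query pattern")
    return conn.execute(sql, bindings).fetchall()

# Hypothetical source data.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE Cars (serialNo TEXT, model TEXT, color TEXT);
    INSERT INTO Cars VALUES ('1', 'Gobi', 'red'), ('2', 'Gobi', 'blue'),
                            ('3', 'Megane', 'red');
""")
red_cars = driver(source, {"color": "red"})
red_gobis = driver(source, {"model": "Gobi", "color": "red"})
print(red_cars, red_gobis)
```

Because the table is data and not code, the same driver can serve every generated wrapper; only the table differs from source to source.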
Filters
 It is not always realistic to write a template for every possible form of query
 Suppose that a wrapper on a car dealer’s database has the template displayed below for finding cars by color
 The mediator is asked to find cars of a particular model and color
 As long as the wrapper has a template that (after proper substitution for the parameters) returns a superset of
what the query wants, then it is possible to filter the dataset at the wrapper and pass only the desired tuples to
the mediator.
 In practice, the tuples could be produced one-at-a-time and filtered one-at-a-time, in a pipelined fashion, rather
than having the entire relation stored at the wrapper and then filtered.
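A pipelined filter maps naturally onto Python generators. In this sketch the by-color template is simulated by a generator over hypothetical in-memory data, and the model-and-color query is answered by filtering that superset one tuple at a time.

```python
CARS = [  # hypothetical contents of the dealer's database
    {"serialNo": "1", "model": "Gobi", "color": "red"},
    {"serialNo": "2", "model": "Megane", "color": "red"},
    {"serialNo": "3", "model": "Gobi", "color": "blue"},
]

def cars_by_color(color):
    """Stands in for the wrapper's only template: find cars by color."""
    for car in CARS:
        if car["color"] == color:
            yield car

def cars_by_model_and_color(model, color):
    """Answer a query that has no template by filtering a superset lazily."""
    return (car for car in cars_by_color(color) if car["model"] == model)

result = list(cars_by_model_and_color("Gobi", "red"))
print(result)
```

Nothing is buffered at the wrapper: each tuple produced by the template is either passed on to the mediator or dropped immediately.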
Capability – Based Optimization
 A typical DBMS estimates the cost of each query plan and
picks what it believes to be the best.
 Optimization by a mediator usually follows a strategy
known as capability-based optimization.
 The central issue is not what a query plan costs, but
whether the plan can be executed at all.
 Only among plans found to be executable (“feasible”) do
we try to estimate costs.
 There are several reasons why a source may be
incapable of providing parts of the information needed for
a query to be executed successfully.
The Problem of Limited Source Capabilities
 Reasons why a source may limit the ways in which queries can
be asked:
 Protection against a rival exploiting its database (e.g.
Amazon.com would never allow the equivalent of a
“SELECT * FROM books”)
 Privacy Protection (e.g. a medical database may answer
queries about averages, but won’t disclose the details of a
particular patient’s medical history)
 Lack of proper Indexes may make certain kinds of queries
too expensive to execute.
 The interface provided is not designed to support such
queries (e.g. an API that provides a specific number of predefined
HTTP requests)
Adornments
 We need a way to describe the capabilities of a data
source.
 In order to describe the legal forms of queries, we may
use adornments.
 Adornments are sequences of codes that represent the
requirements for the attributes of the relation, in their
standard order.
Adornments
 The codes we shall use for adornments reflect the most common capabilities of sources. They are:
Symbol    Description
f         (free) The attribute can be specified or not, as we choose.
b         (bound) We must specify a value for the attribute, but any value is allowed.
u         (unspecified) We are not permitted to specify a value for the attribute.
c[S]      (choice from set S) A value must be specified, and that value must be one of the values in the finite set S (e.g. values from a dropdown menu).
o[S]      (optional, from set S) We either do not specify a value, or we specify one of the values in the finite set S.
 In addition, we place a prime (e.g., f') on a code to indicate that the attribute is not part of the output of the query.
Adornments - Example
 Suppose we have the following schema:
 Scenario 1. The user specifies a serial number. All the information about the car with that serial number
(i.e., the other four attributes) is produced as output.
 The adornment for this query form is:
b'uuuu
Adornments - Example
 Suppose we have the following schema:
 Scenario 2. The user specifies a model and color, and perhaps whether or not automatic transmission
and navigation system are wanted. All five attributes are printed for all matching cars.
 The adornment for this query form is:
ubbo[yes, no]o[yes, no]
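Checking whether a query is legal for a given adornment is mechanical. The sketch below encodes the two adornments above and tests queries against them; it assumes the five attributes are, in order, serial number, model, color, automatic transmission, and navigation system, and it ignores primes because they affect only the output. The name `is_legal` is hypothetical.

```python
def is_legal(adornment, values):
    """values holds one entry per attribute; None means the query gives no value."""
    for (code, allowed), value in zip(adornment, values):
        if code == "b" and value is None:
            return False            # bound: a value is required
        if code == "u" and value is not None:
            return False            # unspecified: a value is forbidden
        if code == "c" and value not in allowed:
            return False            # choice: a value from the set is required
        if code == "o" and value is not None and value not in allowed:
            return False            # optional: if given, must come from the set
    return True                     # code "f" places no restriction

YES_NO = {"yes", "no"}
SCENARIO_1 = [("b", None)] + [("u", None)] * 4                    # b'uuuu
SCENARIO_2 = [("u", None), ("b", None), ("b", None),
              ("o", YES_NO), ("o", YES_NO)]                       # ubbo[yes, no]o[yes, no]

print(is_legal(SCENARIO_1, ["12345", None, None, None, None]))    # True
print(is_legal(SCENARIO_2, [None, "Gobi", "red", None, "yes"]))   # True
print(is_legal(SCENARIO_2, ["12345", None, None, None, None]))    # False
```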
Query Plan Selection
 Given a query at the mediator, a capability-based query optimizer first considers what queries it can
ask at the sources to help answer the query.
 If we imagine those queries asked and answered, then we have bindings for some more attributes,
and these bindings may make some more queries at the sources possible.
 We repeat this process until either:
 We have asked enough queries at the sources to resolve all the conditions of the mediator query,
and therefore we may answer that query. Such a plan is called feasible.
 We can construct no more valid forms of source queries, yet we still cannot answer the mediator
query, in which case the mediator must give up; it has been given an impossible query.
Query Plan Selection - Example
 Suppose we have the following sources:
 The adornment for Autos is: ubf
 Options has two adornments: bu and uc[autoTrans, navi]
 Let the query be: “find the serial numbers and colors of ‘Megane’ models with a navigation system.”
Approach 1
Specifying that the model is “Megane”, query Autos and get the serial numbers and colors of all Meganes. Then,
using the bu adornment for Options, for each such serial number, find the options for that car and filter to make
sure it has a navigation system.
Approach 2
Specifying the navigation-system option, query Options using the uc[autoTrans, navi] adornment and get all the
serial numbers for cars with a navigation system. Then query Autos as in (1), to get all the serial numbers and
colors of Meganes, and intersect the two sets of serial numbers.
Approach 3
Query Options as in (2) to get the serial numbers for cars with a navigation system. Then use these serial
numbers to query Autos and see which of these cars are Meganes
This plan is infeasible. The adornment ubf for Autos does not allow a serial number to be specified,
so the system does not have the capability to execute this plan.
Notes on Cost-Based Optimization
 Having found the feasible plans, the Mediator must choose
among them.
 Since the sources are usually independent of the mediator, it is
difficult to estimate the cost. ( E.g.: A source may take less time
during periods when it is lightly loaded, but when are those
periods? )
 Long-term observation by the mediator is necessary for the
mediator even to guess what the response time might be.
 Consider the previous example. Approach (2) uses only two
source queries, while Approach (1) uses one plus the number of
Meganes found in the Autos relation. Thus, it appears that plan
(2) has lower cost.
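The difference in the number of source queries can be made concrete by simulating the two sources and counting calls. The rows, the serial numbers, and the function names below are assumptions; each function models one legal query form, so there is deliberately no way to query Autos by serial number.

```python
AUTOS = [("s1", "Megane", "red"), ("s2", "Megane", "blue"), ("s3", "Clio", "red")]
OPTIONS = [("s1", "navi"), ("s2", "autoTrans"), ("s3", "navi")]
calls = {"count": 0}                  # number of source queries issued

def autos_by_model(model):            # Autos, adornment ubf
    calls["count"] += 1
    return [(serial, color) for serial, m, color in AUTOS if m == model]

def options_by_serial(serial):        # Options, adornment bu
    calls["count"] += 1
    return [option for s, option in OPTIONS if s == serial]

def serials_by_option(option):        # Options, adornment uc[autoTrans, navi]
    calls["count"] += 1
    return [serial for serial, o in OPTIONS if o == option]

def approach_1():
    # One query to Autos, then one query to Options per Megane found.
    return [(s, c) for s, c in autos_by_model("Megane")
            if "navi" in options_by_serial(s)]

def approach_2():
    # One query to Options, one query to Autos, then intersect at the mediator.
    with_navi = set(serials_by_option("navi"))
    return [(s, c) for s, c in autos_by_model("Megane") if s in with_navi]

def run(plan):
    calls["count"] = 0
    return plan(), calls["count"]

print(run(approach_1))   # ([('s1', 'red')], 3)
print(run(approach_2))   # ([('s1', 'red')], 2)
```

Counting queries is only a proxy for cost: Approach (2) may transfer far more tuples if many cars have a navigation system.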
An Algorithm for Answering Queries
 The algorithm is called “chain”.
 It is not guaranteed to provide the most efficient solution, but it will
provide a solution whenever one exists, and in practice, it is very
likely to obtain the most efficient solution.
 Let's agree on the following notation for data sources:
 And on the following notation for expressing queries:
“find the serial numbers and colors of Gobi models with a navigation system”
Notice the simplified adornment notation used to represent the
arguments of the subgoals that are bound to a set of constants
An Algorithm for Answering Queries
 The algorithm maintains two kinds of information:
 An adornment is maintained for each subgoal.
 Initially, the adornment for a subgoal has b if and only if the
mediator query provides a constant binding for the corresponding
argument of that subgoal.
 In all other places, the adornment has f ’s.
 After every step of the algorithm, the adornment of each subgoal is
updated.
 A relation X that is (a projection of) the join of the relations for all the
subgoals that have been resolved.
 Initially, since no subgoals have been resolved, X is a relation over
no attributes, containing just the empty tuple.
 As the algorithm progresses, X will have attributes that are
variables of the rule — those variables that correspond to b’s in the
adornments of the subgoals in which they appear.
An Algorithm for Answering Queries
 The core of the Chain Algorithm is as follows
 Initialize a relation X and the adornments of the subgoals.
 Select a subgoal that can be resolved.
 Join X and the result of the subgoal.
 Project out of X all components that correspond to variables that do
not appear in the head or in any unresolved subgoal.
 Update the adornments of the unresolved subgoals.
 Repeatedly select a subgoal that can be resolved and update X
accordingly until no unresolved subgoal is left.
 If we succeed in resolving every subgoal, then relation X will be the
answer to the query. If at some point, there are unresolved subgoals,
yet none can be resolved, then the algorithm fails. In that case, there
can be no other sequence of resolution steps that answers the query.
Example of The Chain Algorithm
 Consider the following query:
 And the following sources:
Example of The Chain Algorithm
 Initially, the adornments on the subgoals are as shown in the query Q, and
the relation X that we construct initially contains only the empty tuple.
 Since subgoals S and T have ff adornments, but the adornments at the
corresponding sources each have a component with b or c, neither of these
subgoals can be resolved.
 Fortunately, the first subgoal, R(1,a), can be resolved, since the bf
adornment at the corresponding source is matched by the adornment of the
subgoal. Thus, we send the source for R(w,x) a query with w = 1, and the
response is the set of three tuples shown in the first column.
Example of The Chain Algorithm
 We next project the subgoal’s relation onto its second component, since
only the second component of R(1,a) is a variable.
 That gives us the relation on the bottom.
 This relation is joined with X, which currently has no attributes and only the
empty tuple. The result is that X becomes the relation below.
 Since a is now bound, we change the adornment on the S subgoal from ff to
bf
Example of The Chain Algorithm
 At this point, the second subgoal, S^bf(a,b), can be resolved.
 We obtain bindings for the first component by projecting X onto a; the result
is X itself. That is, we can go to the source for S(x,y) with bindings 2, 3, and
4 for x.
 We do not need bindings for y, since the second component of the
adornment for the source is f.
 The c'[2,3,5] code for x says that we can give the source the value 2, 3, or 5
for the first argument.
 Since there is a prime on the c, we know that only the corresponding y
value(s) will be returned, not the value of x that we supplied in the request.
 We care about values 2, 3, and 4, but 4 is not a possible value at the source
for S, so we never ask about it.
Example of The Chain Algorithm
 When we ask about x = 2, we get one response: y = 4.
 We pad this response with the value 2 we supplied to conclude that (2,4) is
a tuple in the relation for the S subgoal.
 Similarly, when we ask about x = 3, we get y = 5 as the only response and
we add (3,5) to the set of tuples constructed for the S subgoal.
 There are no more requests to ask at the source for S, so we conclude that
the relation for the S subgoal is
 When we join this relation with the previous value of X, the result is just the
relation above. However, variable a now appears neither in the head nor in
any unresolved subgoal. Thus, we project it out.
Example of The Chain Algorithm
 Since b is now bound, we change the adornment on the T subgoal, so it
becomes T^bf(b,c).
 Now this last subgoal can be resolved, which we do by sending requests to
the source for T(y, z) with y = 4 and y = 5.
 The responses we get back give us the following relation for the T subgoal:
 We join it with the relation for X above, and then project onto the c attribute
to get the relation for the head. That is, the answer to the query at the
mediator is {(6), (7), (8)}.
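The walkthrough can be traced with a simplified sketch of the Chain Algorithm. It supports only the codes b, f, u, and c[S], ignores primes, and simulates each source as an in-memory list. The R and S tuples and the c[2,3,5] set follow the text; the T tuples and the adornment bu for T are assumptions chosen to reproduce the stated answer. All function names are hypothetical.

```python
def is_var(arg):
    return isinstance(arg, str)       # variables are strings; anything else is a constant

def can_resolve(args, adornment, bound):
    """A subgoal is resolvable if every b or c position has a known value."""
    for arg, code in zip(args, adornment):
        known = not is_var(arg) or arg in bound
        if code[0] in ("b", "c") and not known:
            return False
    return True

def resolve(args, adornment, rows, X):
    """Query the simulated source once per tuple of X and join the answers."""
    joined = []
    for x in X:
        wanted = [x.get(a) if is_var(a) else a for a in args]
        # A c[S] position accepts only values from S; other values are never asked.
        if any(c[0] == "c" and w not in c[1] for w, c in zip(wanted, adornment)):
            continue
        for row in rows:
            if all(w is None or w == v for w, v in zip(wanted, row)):
                joined.append({**x, **{a: v for a, v in zip(args, row) if is_var(a)}})
    return joined

def chain(head, subgoals, sources):
    X, bound, todo = [{}], set(), list(subgoals)   # X starts as the empty tuple
    while todo:
        goal = next((g for g in todo
                     if can_resolve(g[1], sources[g[0]][0], bound)), None)
        if goal is None:
            return None                            # no feasible plan exists
        name, args = goal
        adornment, rows = sources[name]
        X = resolve(args, adornment, rows, X)
        todo.remove(goal)
        bound |= {a for a in args if is_var(a)}
        # Keep only variables still needed by the head or an unresolved subgoal.
        keep = set(head) | {a for _, rest in todo for a in rest if is_var(a)}
        X = [dict(t) for t in
             {tuple(sorted((k, v) for k, v in x.items() if k in keep)) for x in X}]
        bound &= keep
    return {tuple(x[v] for v in head) for x in X}

SOURCES = {  # name -> (adornment, tuples at the source)
    "R": ([("b",), ("f",)], [(1, 2), (1, 3), (1, 4)]),
    "S": ([("c", {2, 3, 5}), ("f",)], [(2, 4), (3, 5)]),
    "T": ([("b",), ("u",)], [(4, 6), (5, 7), (5, 8)]),   # assumed data
}
QUERY = [("R", (1, "a")), ("S", ("a", "b")), ("T", ("b", "c"))]
answer = chain(("c",), QUERY, SOURCES)
print(answer)
```

The sketch picks the first resolvable subgoal it finds, so it demonstrates feasibility and makes no attempt to choose the cheapest order.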
Incorporating Union Views
 In our description of the Chain Algorithm we assumed that each query subgoal was a
“view” of data at one particular source.
 It is common for there to be several sources that can contribute tuples.
 If more than one source is available:
 The sources may contain replicated information. In that case, we can turn to any
one of the sources. However, there may be several adornments that allow us to
query that source.
 Sources each contribute some tuples that the other sources may not contribute.
In that case, we should consult all the sources for the predicate.
 There is a policy choice to be made:
 Either we can refuse to answer the query unless we can consult all the
sources
 Or we can make best efforts to return all the answers to the query that we can
obtain by combinations of sources.
Incorporating Union Views – A Best Effort Example
 Suppose we have the following query:
 We have the following sources for R: R1 (adornment ff) and R2 (adornment fb)
 We have the following sources for S: S1 (adornment ff) and S2 (adornment bf)
 Suppose we start with R1, the source for R with adornment ff. We query this source and get some tuples for R.
 Now, we have some bindings, but perhaps not all, for the variable b.
 We can now use both sources for S to obtain tuples and the relation for S can be set to their union.
 At this point, we can project the relation for S onto variable b and get some b-values. These can be used to query the second
source for R, the one with adornment fb.
 In this manner, we can get some additional R-tuples. It is only at this point that we can join the relations for R and S, and
project onto a and c to get the best-effort answer to the query.
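The best-effort steps can be sketched with four simulated sources. All tuples are made-up assumptions, the query is taken to join R(a, b) with S(b, c) on b, and the helper names are hypothetical. A source whose adornment has a b can be asked only about values the mediator already knows.

```python
R1 = [(1, 10), (2, 20)]        # source for R, adornment ff
R2 = [(3, 30), (4, 99)]        # source for R, adornment fb
S1 = [(10, "x"), (30, "y")]    # source for S, adornment ff
S2 = [(20, "z"), (77, "w")]    # source for S, adornment bf

def ask_with_bound(rows, position, known_values):
    """Simulate a source that answers only for supplied values of one argument."""
    return [row for row in rows if row[position] in known_values]

def best_effort():
    r = set(R1)                                                # query R1 freely
    s = set(S1) | set(ask_with_bound(S2, 0, {b for _, b in r}))  # both S sources
    r |= set(ask_with_bound(R2, 1, {b for b, _ in s}))         # now R2 is usable
    return {(a, c) for a, b in r for b2, c in s if b == b2}    # join, project a, c

result = best_effort()
print(sorted(result))   # [(1, 'x'), (2, 'z'), (3, 'y')]
```

The tuple (4, 99) in R2 and the tuple (77, "w") in S2 are never retrieved because no known value reaches them, which is exactly why the answer is best effort and not guaranteed complete.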
GAV & LAV Mediators
 The mediators discussed so far are called global-as-view (GAV) mediators.
 GAV mediators are easy to construct. You decide on the global predicates or relations that
the mediator will support, and for each source, you consider which predicates it can support,
and how it can be queried.
 In a local-as-view (LAV) mediator, we define global predicates at the mediator, but we do not
define these predicates as views of the source data.
 Rather, we define, for each source, one or more expressions involving the global
predicates that describe the tuples that the source is able to produce.
 Queries are answered at the mediator by discovering all possible ways to construct the query
using the views provided by the sources.
LAV Example
 We shall look at an example where the mediator is intended to provide a single predicate Par(c,p), meaning
that p is a parent of c.
 Suppose we have a database maintained by the Association of Grandparents (DS1) that doesn’t provide any
child-parent facts at all, but provides child-grandparent facts and we have another database (DS2) that provides
child-parent relationships.
 GAV mediators do not allow us to use a grandparents source at all, if our goal is to produce a Par relation.
However, LAV mediators allow us to say that a certain source provides grandparent facts.
 The DS1 can be described as:
 The DS2 can be described as:
LAV Example
 Our query at the mediator will ask for great-grandparent facts that can be obtained from the sources. That is,
the mediator query is
 The possible solutions are:
 Using only V1:
 Using a combination of V1 and V2:
 Alternative 1:
 Alternative 2:
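The view definitions and the solutions themselves are not reproduced in the text above, so the following is only a sketch under stated assumptions: one source exposes child-parent facts, the other exposes child-grandparent facts, and great-grandparent facts are assembled in three plausible ways (three parent steps, a parent step followed by a grandparent step, and the reverse). All names and facts are hypothetical.

```python
# Hypothetical view contents. In LAV terms, the first view holds facts (c, p)
# with Par(c, p); the second holds facts (c, g) with Par(c, p) AND Par(p, g).
parent_view = {("ann", "bob"), ("bob", "cat"), ("cat", "dan")}
grandparent_view = {("ann", "cat"), ("bob", "dan"), ("dan", "gus")}

def compose(first, second):
    """Join two binary relations on first.column2 = second.column1."""
    return {(a, c) for a, b in first for b2, c in second if b == b2}

solutions = [
    compose(compose(parent_view, parent_view), parent_view),  # parent facts only
    compose(parent_view, grandparent_view),                   # parent, then grandparent
    compose(grandparent_view, parent_view),                   # grandparent, then parent
]
great_grandparents = set().union(*solutions)
print(sorted(great_grandparents))   # [('ann', 'dan'), ('cat', 'gus')]
```

Each rewriting can find facts that the others miss, so the mediator returns the union of all of them.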
Entity Resolution
 Sometimes it is unclear whether records at two sources represent the same entity.
 Reasons why discrepancies can occur:
 Misspellings (e.g. Jones & Jomes)
 Variants (e.g. Susan Williams & Sue Williams)
 Misunderstanding of Names (e.g. Asian Names)
 Evolution of values (A person that moved)
 Abbreviations (Sesame St. & Sesame Street)
 When deciding whether two records represent the same entity, we need to look carefully at the kinds of
discrepancies that occur and devise a scoring system or other test that measures the similarity of records
Measuring the Similarity of Records
 Two useful approaches to measuring the similarity of records:
 Edit Distance
 Values that are strings can be compared by counting the number of insertions and/or deletions of
characters it takes to turn one string into another. (e.g. Smythe and Smith are at distance 3).
 We may devise a specialized distance that takes into account the way the data was constructed.
 Once we have decided on the appropriate edit distance for each field, we can define a similarity measure
for records.
 Normalization
 Before applying an edit distance, we might wish to “normalize” records by replacing certain substrings by
others. (e.g. St. would be replaced by Street in street addresses and by Saint in town names.)
 One could even use the Soundex encoding of names, so names that sound the same are represented by
the same string.
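Both ideas fit in a few lines. The sketch below computes the insertion/deletion distance through the longest common subsequence (the characters outside it must be deleted or inserted) and applies one normalization rule for street addresses. The function names are hypothetical, and the single rule is far from a complete normalizer.

```python
import re

def indel_distance(a, b):
    """Number of single-character insertions and deletions that turn a into b."""
    prev = [0] * (len(b) + 1)              # LCS lengths for the previous row
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return len(a) + len(b) - 2 * prev[-1]  # characters outside the LCS

def normalize_address(text):
    """Expand the abbreviation 'St.' to 'Street' (for street addresses only)."""
    return re.sub(r"\bSt\.", "Street", text)

print(indel_distance("Smythe", "Smith"))                                   # 3
print(indel_distance(normalize_address("Sesame St."), "Sesame Street"))    # 0
```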
Merging Similar Records
 In many applications, when we find two records that are similar enough to merge, we would like to replace
them by a single record
 We might take the union of all the values in each field.
 Or we might somehow combine the values in corresponding fields to make a single value.
 A problem that arises if we use certain combinations of a similarity test and a merging rule is that our decision
to merge one pair of records may preclude our merging another pair.
 Suppose our similarity rule is “must agree exactly in at least two out of the three fields”
and our merge rule is “set the field in which the records disagree to the empty string”,
and suppose we have the following tuples:
Merging Similar Records
 Another choice for similarity and merge rules is:
 Merge by taking the union of the values in each field
 Declare two records similar if at least two of the three fields have a nonempty intersection.
 So, step 1 would be:
 And then, step 2:
Merging Similar Records
 Any choice of similarity and merge functions allows us to test pairs of records for similarity and to merge them
if they are similar.
 There are several properties that we would expect any merge function to satisfy. If Λ is the operation that
produces the merge of two records, it is reasonable to expect:
 r Λ r = r (Idempotence). That is, the merge of a record with itself should surely be that record.
 r Λ s = s Λ r (Commutativity). If we merge two records, the order in which we list them should not matter.
 (r Λ s) Λ t = r Λ (s Λ t) (Associativity). The order in which we group records for a merger should not matter.
 We assume that:
 If r and s are similar, then r Λ s is defined.
 There are also some properties that we expect the similarity relationship to have, and ways that we expect
similarity and merging to interact.
 We shall use r ≈ s to say that records r and s are similar.
 r ≈ r (Idempotence for similarity). A record is always similar to itself.
 r ≈ s if and only if s ≈ r (Commutativity of similarity). That is, in deciding whether two records are similar, it
does not matter in which order we list them.
 If r ≈ s, then r ≈ (s Λ t) (Representability). This rule requires that if r is similar to some other record s (and
thus could be merged with s), but s is instead merged with some other record t, then r remains similar to
the merger of s and t and can be merged with that record.
 The properties above are collectively called the ICAR properties: Idempotence, Commutativity, Associativity,
and Representability.
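One way to build intuition for the ICAR properties is to brute-force them over a small sample of records. The sketch below does this for two rule sets: the "blank the disagreeing field" rules and the union-based rules. The records, rule implementations, and the `check_icar` helper are all assumptions for illustration. Passing such a check on a sample is evidence, not a proof, that the properties hold in general.

```python
from itertools import product


def check_icar(records, similar, merge):
    """Test each ICAR property on every pair or triple drawn from records."""
    pairs = list(product(records, repeat=2))
    triples = list(product(records, repeat=3))
    return {
        "idempotence": all(merge(r, r) == r and similar(r, r) for r in records),
        "commutativity": all(
            merge(r, s) == merge(s, r) and similar(r, s) == similar(s, r)
            for r, s in pairs
        ),
        "associativity": all(
            merge(merge(r, s), t) == merge(r, merge(s, t)) for r, s, t in triples
        ),
        "representability": all(
            similar(r, merge(s, t)) for r, s, t in triples if similar(r, s)
        ),
    }


# Rule set A: exact agreement in two fields, disagreeing fields are blanked.
def similar_exact(r, s):
    return sum(a == b for a, b in zip(r, s)) >= 2


def merge_blank(r, s):
    return tuple(a if a == b else "" for a, b in zip(r, s))


# Rule set B: fields are sets, similarity is overlap in two fields, merge is union.
def similar_overlap(r, s):
    return sum(bool(a & b) for a, b in zip(r, s)) >= 2


def merge_union(r, s):
    return tuple(a | b for a, b in zip(r, s))


# Hypothetical (name, address, phone) records, for illustration only.
plain = [
    ("A. Jones", "12 Elm Road", "555-0101"),
    ("A. Jones", "40 Pine Lane", "555-0101"),
    ("A. Jones", "40 Pine Lane", "555-0199"),
]
as_sets = [tuple(frozenset([value]) for value in rec) for rec in plain]

print(check_icar(plain, similar_exact, merge_blank))
print(check_icar(as_sets, similar_overlap, merge_union))
```

On this sample the blanking rules fail only representability, while the union rules pass all four checks.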
The R-Swoosh Algorithm for ICAR Records
 When the similarity and merge functions satisfy the ICAR properties, there is a simple algorithm that merges all
possible records.
 INPUT: A set of records I, a similarity function ≈, and a merge function Λ. We assume that ≈ and Λ satisfy the
ICAR properties.
 If they do not, then the algorithm will still merge some records, but the result may not be the maximum or best
possible merging.
 OUTPUT: A set of merged records O.
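A minimal Python sketch of R-Swoosh as it is usually formulated: repeatedly take a record from I, look for a similar record in O, and either move the record to O or replace the pair by its merger in I. The function name `r_swoosh`, the union-based rules, and the sample records are assumptions made for this illustration.

```python
def r_swoosh(records, similar, merge):
    """Merge records until no two records in the output are similar."""
    pending = list(records)   # the input set I
    output = []               # the output set O
    while pending:
        r = pending.pop()
        # Look for some record already in O that is similar to r.
        match = next((s for s in output if similar(r, s)), None)
        if match is None:
            output.append(r)                  # move r from I to O
        else:
            output.remove(match)              # delete s from O
            pending.append(merge(r, match))   # put the merger back into I
    return output


# Union-based rules over records whose fields are sets of values.
def similar(r, s):
    return sum(bool(a & b) for a, b in zip(r, s)) >= 2


def merge(r, s):
    return tuple(a | b for a, b in zip(r, s))


def to_record(*values):
    return tuple(frozenset([v]) for v in values)


# Hypothetical records, for illustration only.
records = [
    to_record("A. Jones", "12 Elm Road", "555-0101"),
    to_record("A. Jones", "40 Pine Lane", "555-0101"),
    to_record("A. Jones", "40 Pine Lane", "555-0199"),
    to_record("B. Ortiz", "7 Lake View", "555-0142"),
]

merged = r_swoosh(records, similar, merge)
print(len(merged))  # 2
```

Putting the merger back into I rather than directly into O matters: the merged record may now be similar to records in O that neither of its parts matched on its own.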
Other Approaches to Entity Resolution
 Clustering
 In some entity-resolution applications, we do not want to merge at all, but will instead group records into
clusters such that members of a cluster are in some sense similar to each other and members of different
clusters are not similar.
 For example, if we are looking for similar products sold on eBay, we might want the result to be not a
single record for each kind of product, but rather a list of the records that represent a common product for
sale. Clustering of large-scale data involves a complex set of options.
 Partitioning
 Since any algorithm for doing a complete merger of similar records may be forced to examine each pair of
records, it may be infeasible to get an exact answer to a large entity-resolution problem.
 One solution is to group the records, perhaps several times, into groups that are likely to contain similar
records, and look only within each group for pairs of similar records.
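The partitioning idea can be sketched as follows: group the records once per grouping key and generate candidate pairs only within each group. The keys used here (first letter of the name, postal code), the sample records, and the `candidate_pairs` helper are hypothetical choices for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_functions):
    """Return index pairs (i, j), i < j, that share a group under some key."""
    pairs = set()
    for key_fn in key_functions:
        groups = defaultdict(list)
        for index, record in enumerate(records):
            groups[key_fn(record)].append(index)
        for members in groups.values():
            pairs.update(combinations(members, 2))
    return pairs


def by_initial(record):
    return record[0][0]


def by_postal_code(record):
    return record[1]


# Hypothetical (name, postal code) records, for illustration only.
records = [
    ("Smith", "10001"),
    ("Smyth", "10001"),
    ("Jones", "10001"),
    ("Smith", "94105"),
    ("Brown", "60601"),
]

pairs = candidate_pairs(records, [by_initial, by_postal_code])
total = len(records) * (len(records) - 1) // 2
print(len(pairs), "of", total, "pairs need a full similarity test")
```

Grouping several times with different keys trades extra comparisons for a lower chance of missing a similar pair that one key alone would have separated.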
Thank you!

Data Integration Lecture Notes

  • 1. Data Integration Introduction to Data Integration AUEB - MSc in Business Analytics – Spiros Safras (@ssafras)
  • 2. Agenda AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #1  Integration of Information  The Heterogeneity Problem  Data Integration Architectures  Federated Database Schemas  Data Warehouses  Mediators  Extractors and Wrappers  Capability Based Optimization  Adornments and Query Plan Selection  Optimizing Mediator Queries  Consult all sources VS Best effort approach  Local as View Mediators  Entity Resolution
  • 3. Integration of Information AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #2  Definition: The process of taking several databases or other information sources and making the data in these sources work together as if they were a single database.
  • 4. Why Data Integration? AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #3  Databases are often created independently, even if they later need to be combined.  At that point, multiple different systems might have been used with different schemas and maybe limited interfaces to the data.  The use of a database involves, so we cannot design a database to support every possible future use.  Starting over is not always possible  High development cost  Incompatibility issues with legacy software  Transition from old systems to new ones  The goal of data integration: tie together different sources, controlled by many people, under a common schema.
  • 5. Why Data Integration? AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #4  This can be solved through the use of an abstraction layer ( middleware ) on top of all the available data sources.  The layer of abstraction could be a set of relational views.  These views could be either virtual or materialized.  A form of SQL could be used to access this abstraction layer.
  • 6. Why Data Integration? ( Scenario 1 ) AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #5  A company may have several different systems with totally independent databases describing different entities.  Management may need to pose queries that require the combination of these data sources (DS) in order to provide an answer. DS1 DS2 DS3 DS4 Middleware DS5
  • 7. Why Data Integration? ( Scenario 2 ) AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #6  Company A acquired Company B  Both companies have data sources that describe similar entities through different schemas  There datasets need to be integrated DS1.a DS2.a DS1.b DS2.b DS1.c DS2.c DS1.d DS2.d Middleware DS1.e DS2.e
  • 8. The Heterogeneity Problem AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #7  Data sources differ in many ways, even if they are intended to store the same kinds of data.  Such sources are called heterogeneous, and the problem of integrating them is referred to as the heterogeneity problem.
  • 9. Heterogeneity Problem Types AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #8  Communication Heterogeneity  HTTP  Direct through VPN  Remote Connection  Query Language Heterogeneity  Different SQL Dialects  JS (eg MongoDB)  URI (eg HTTP APIs)  Excel  Schema Heterogeneity  Sample schema 1:  Sample schema 2:
  • 10. Heterogeneity Problem Types AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #9  Data type differences  Serial numbers might be represented by character strings of varying length at one source and fixed length at another.  Value Heterogeneity  Eg: Storing the color ”Black”  Data source A: ”BLACK” { String }  Data source B: ”BL” { String }  Data source C: 12 { Integer }  Semantic Heterogeneity  Do trucks belong to ”Cars”?  Do minivans == station wagons?
  • 11. Data Integration Architectures AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #10  There are several ways that databases or other distributed information sources can be made to work together.  Federation: everybody talks directly to everyone else.  Warehouse: Sources are translated from their local schema to a global schema and copied to a central DB.  Mediator: Virtual warehouse --- turns a user query into a sequence of source queries.
  • 12. Federated Database Systems AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #11  Perhaps the simplest architecture for integrating several databases  One source can call on others to supply information  Each query works properly for the database to which it is addressed  If n databases each need to talk to the n — 1 other databases, then we must write n(n — 1) pieces of code to support queries between systems.  Could be useful if the systems need to query the database of a very limited number of other systems Wrapper Wrapper Wrapper Wrapper Wrapper Wrapper
  • 13. Data Warehouses AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #12  Data from several sources is extracted and combined into a global schema  Queries may be issued by the user exactly as they would be issued to any database  Two approaches on updating the DW:  The warehouse is periodically closed to queries and reconstructed from the current data in the sources. (e.g. once a night or at even longer intervals)  The warehouse is updated periodically (e.g., each night), based on the changes that have been made to the sources since the last time the warehouse was modified. ( incremental update )  It is generally too expensive to reflect immediately, at the warehouse, every change to the underlying databases. Warehouse Extractor Extractor Source 1 Source 2
  • 14. Mediators AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #13  A mediator supports a virtual view, or collection of views, that integrates several sources.  The mediator doesn’t store any data. The mechanics of mediators and warehouses are rather different.  To begin, the user or application program issues a query to the mediator.  Since the mediator has no data of its own, it must get the relevant data from its sources and use that data to form the answer to the user’s query. Mediator Wrapper Wrapper Source 1 Source 2 User query Query Query QueryQuery Result Result Result Result Result
  • 15. Mediators AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #14  The mediator sends a query to each of its wrappers, which in turn send queries to their corresponding sources.  The mediator may send several queries to a wrapper, and may not query all wrappers.  The results come back and are combined at the mediator. Mediator Wrapper Wrapper Source 1 Source 2 User query Query Query QueryQuery Result Result Result Result Result
  • 16. Extractors and Wrappers AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #15 In a data warehouse system, the source extractors consist of:  One or more predefined queries that are executed at the source to produce data for the warehouse.  Suitable communication mechanisms, so the extractor can:  Pass ad-hoc queries to the source  Receive responses from the source  Pass information to the warehouse. Warehouse Extractor Extractor Source 1 Source 2
  • 17. Extractors and Wrappers AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #16 The predefined queries to the source could be:  SQL queries if the source is a SQL database  Operations in whatever language was appropriate for a source that was not a database system Warehouse Extractor Extractor Source 1 Source 2
  • 18. Extractors and Wrappers AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #17 Example ETL tool chain:  This is an example for e-commerce loading  Note multiple stages of filtering (using selection or join-like operations), logging bad records, before we group and load Invoice line items Split Date - time Filter invalid Join Filter invalid Invalid dates/times Invalid items Item records Filter non - match Invalid customers Data Warehouse Customer records
  • 19. Extractors and Wrappers AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #18  Mediator systems require more complex wrappers than do most warehouse systems  The wrapper must be able to accept a variety of queries from the mediator and translate any of them to the terms of the source  The wrapper must then communicate the result to the mediator Mediator Wrapper Wrapper Source 1 Source 2 User query Query Query QueryQuery Result Result Result Result Result
  • 20. Templates for Query Patterns AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #19  A systematic way to design a wrapper that connects a mediator to a source is to classify the possible queries that the mediator can ask into templates.  Templates are queries with parameters that represent constants. Mediator Wrapper Wrapper Source 1 Source 2 User query Query Query QueryQuery Result Result Result Result Result
  • 21. Templates for Query Patterns AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #20  Suppose we want to build a wrapper for the source of Car Dealer 1, which has the schema:  For use by a mediator with schema:
  • 22. Templates for Query Patterns AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #21  Suppose we want to build a wrapper for the source of Dealer 1, which has the schema:  For use by a mediator with schema:  A possible solution:
  • 23. Templates for Query Patterns AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #22  In this case, there are eight choices, if queries are allowed to specify any of three attributes: model, color, and autoTrans.  In general, there would be 2n templates if we have the option of specifying n attributes  The number of templates could grow unreasonably large
  • 24. Wrapper Generators AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #23  The templates defining a wrapper must be turned into code for the wrapper itself.  The software that creates the wrapper is called a wrapper generator  The wrapper generator creates a table that holds  the various query patterns contained in the templates  the source queries that are associated with each Wrapper Generator Driver Table Source Templates Queries from mediator Results ResultsQueries
  • 25. Wrapper Generators AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #24  A driver is used in each wrapper; in general the driver can be the same for each generated wrapper.  The task of the driver is to:  Accept a query from the mediator  Search the table for a template that matches the query  Send the query to the source  Process the response and return it to the mediator Wrapper Generator Driver Table Source Templates Queries from mediator Results ResultsQueries
  • 26. Filters AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #25  It is not always realistic to write a template for every possible form of query  Suppose that a wrapper on a car dealer’s database has the template displayed below for finding cars by color  The mediator is asked to find cars of a particular model and color  As long as the wrapper has a template that (after proper substitution for the parameters) returns a superset of what the query wants, then it is possible to filter the dataset at the wrapper and pass only the desired tuples to the mediator.  In practice, the tuples could be produced one-at-a-time and filtered one-at-a-time, in a pipelined fashion, rather than having the entire stored at the wrapper and then filtered
  • 27. Capability – Based Optimization AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #26  A typical DBMS estimates the cost of each query plan and picks what it believes to be the best.  Optimization by a mediator usually follows a strategy known as capability-based optimization.  The central issue is not what a query plan costs, but whether the plan can be executed at all.  Only among plans found to be executable (“feasible”) do we try to estimate costs.  There are several reasons why a source may be incapable of providing parts of the information needed for a query to be executed successfully.
  • 28. The Problem of Limited Source Capabilities AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #27  Reasons why a source may limit the ways in which queries can be asked:  Protection against a rival exploiting its database (e.g. Amazon.com would never allow the equivalent of a “SELECT * FROM books”)  Privacy Protection (e.g. a medical database may answer queries about averages, but won’t disclose the details of a particular patient’s medical history)  Lack of proper Indexes may make certain kinds of queries too expensive to execute.  The interface provide is not designed to support such queries (e.g. an API that provides a specific number of pre- defined HTTP requests)
  • 29. Adornments AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #28  We need a way to describe the capabilities of a data source.  In order to describe the legal forms of queries, we may use adornments.  Adornments are sequences of codes that represent the requirements for the attributes of the relation, in their standard order.
  • 30. Adornments AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #29  The codes we shall use for adornments reflect the most common capabilities of sources. They are:  In addition, we place a prime (e.g., f ') on a code to indicate that the attribute is not part of the output of the query. Symbol Description f (free) means that the attribute can be specified or not, as we choose. b (bound) means that we must specify a value for the attribute, but any value is allowed. u (unspecified) means that we are not permitted to specify a value for the attribute. c[S] (choice from set S) means that a value must be specified, and that value must be one of the values in the finite set S. (e.g. values from a dropdown menu) o[S] (optional, from set S) means that we either do not specify a value, or we specify one of the values in the finite set S.
  • 31. Adornments - Example AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #30  Suppose we have the following schema:  Scenario 1. The user specifies a serial number. All the information about the car with that serial number (i.e., the other four attributes) is produced as output.
  • 32. Adornments - Example AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #31  Suppose we have the following schema:  Scenario 1. The user specifies a serial number. All the information about the car with that serial number (i.e., the other four attributes) is produced as output.  The adornment for this query form is: b'uuuu
  • 33. Adornments - Example AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #32  Suppose we have the following schema:  Scenario 2. The user specifies a model and color, and perhaps whether or not automatic transmission and navigation system are wanted. All five attributes are printed for all matching cars.
  • 34. Adornments - Example AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #33  Suppose we have the following schema:  Scenario 2. The user specifies a model and color, and perhaps whether or not automatic transmission and navigation system are wanted. All five attributes are printed for all matching cars.  The adornment for this query form is: ubbo[yes, no]o[yes, no]
  • 35. Query Plan Selection AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #34  Given a query at the mediator, a capability-based query optimizer first considers what queries it can ask at the sources to help answer the query.  If we imagine those queries asked and answered, then we have bindings for some more attributes, and these bindings may make some more queries at the sources possible.  We repeat this process until either:  We have asked enough queries at the sources to resolve all the conditions of the mediator query, and therefore we may answer that query. Such a plan is called feasible.  We can construct no more valid forms of source queries, yet we still cannot answer the mediator query, in which case the mediator must give up; it has been given an impossible query.
  • 36. Query Plan Selection - Example AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #35  Suppose we have the following sources:  The adornment for Autos is: ubf  Options has two adornments: bu and uc[autoTrans, navi]  Let the query be: “find the serial numbers and colors of “Megane” models with a navigation system.”
  • 37. Query Plan Selection - Example AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #36  Suppose we have the following sources:  The adornment for Autos is: ubf  Options has two adornments: bu and uc[autoTrans, navi]  Let the query be: “find the serial numbers and colors of “Megane” models with a navigation system.” Approach 1 Specifying that the model is “Megane”, query Autos and get the serial numbers and colors of all Meganes. Then, using the bu adornment for Options, for each such serial number, find the options for that car and filter to make sure it has a navigation system.
  • 38. Query Plan Selection - Example AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #37  Suppose we have the following sources:  The adornment for Autos is: ubf  Options has two adornments: bu and uc[autoTrans, navi]  Let the query be: “find the serial numbers and colors of “Megane” models with a navigation system.” Approach 2 Specifying the navigation-system option, query Options using the uc[autoTrans, navi] adornment and get all the serial numbers for cars with a navigation system. Then query Autos as in (1), to get all the serial numbers and colors of Meganes, and intersect the two sets of serial numbers.
  • 39. Query Plan Selection - Example AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #38  Suppose we have the following sources:  The adornment for Autos is: ubf  Options has two adornments: bu and uc[autoTrans, navi]  Let the query be: “find the serial numbers and colors of “Megane” models with a navigation system.” Approach 3 Query Options as in (2) to get the serial numbers for cars with a navigation system. Then use these serial numbers to query Autos and see which of these cars are Meganes This is one of the approaches that would fail. The system does not have the capability to execute this plan.
  • 40. Notes on Cost-Based Optimization AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #39  Having found the feasible plans, the Mediator must choose among them.  Since the sources are usually independent of the mediator, it is difficult to estimate the cost. ( E.g.: A source may take less time during periods when it is lightly loaded, but when are those periods? )  Long-term observation by the mediator is necessary for the mediator even to guess what the response time might be.  Consider the previous example. Approach (2) uses only two source queries, while Approach (1) uses one plus the number of Meganes found in the Autos relation. Thus, it appears that plan (2) has lower cost.
  • 41. An Algorithm for Answering Queries AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #40  The algorithm is called “chain”.  It is not guaranteed to provide the most efficient solution, but it will provide a solution whenever one exists, and in practice, it is very likely to obtain the most efficient solution.  Lets agree on the following notation for data sources:  And on the following notation for expressing queries: “find the serial numbers and colors of Gobi models with a navigation system” Notice the simplified adornment notation used to represent the arguments of the subgoals that are bound to a set of constants
  • 42. An Algorithm for Answering Queries AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #41  The algorithm maintains two kinds of information:  An adornment is maintained for each subgoal.  Initially, the adornment for a subgoal has b if and only if the mediator query provides a constant binding for the corresponding argument of that subgoal.  In all other places, the adornment has f ’s.  After every step of the algorithm, the adornment of each subgoal is updated.  A relation X that is (a projection of) the join of the relations for all the subgoals that have been resolved.  Initially, since no subgoals have been resolved, X is a relation over no attributes, containing just the empty tuple.  As the algorithm progresses, X will have attributes that are variables of the rule — those variables that correspond to b’s in the adornments of the subgoals in which they appear.
  • 43. An Algorithm for Answering Queries AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #42  The core of the Chain Algorithm is as follows  Initialize a relation X and the adornments of the subgoals.  Select a subgoal that can be resolved.  Join X and the result of the subgoal.  Project out of X all components that correspond to variables that do not appear in the head or in any unresolved subgoal.  Update the adornments of the unresolved subgoals.  Repeatedly select a subgoal that can be resolved and update X accordingly until no unresolved subgoal has been left.  If we succeed in resolving every subgoal, then relation X will be the answer to the query. If at some point, there are unresolved subgoals, yet none can be resolved, then the algorithm fails. In that case, there can be no other sequence of resolution steps that answers the query.
  • 44. Example of The Chain Algorithm AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #43  Consider the following query:  And the following sources:
  • 45. Example of The Chain Algorithm AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #44  Initially, the adornments on the subgoals are as shown in the query Q, and the relation X that we construct initially contains only the empty tuple.  Since subgoals S and T have f f adornments, but the adornments at the corresponding sources each have a component with b or c, neither of these subgoals can be resolved.  Fortunately, the first subgoal, R(1,a), can be resolved, since the bf adornment at the corresponding source is matched by the adornment of the subgoal. Thus, we send the source for R(w,x) a query with w = 1, and the response is the set of three tuples shown in the first column.
  • 46. Example of The Chain Algorithm AUEB - MSc in Business Analytics – Spiros Safras (@ssafras) #45  We next project the subgoal’s relation onto its second component, since only the second component of R(1,a) is a variable.  That gives us the relation on the bottom.  This relation is joined with X, which currently has no attributes and only the empty tuple. The result is that X becomes the relation below.  Since a is now bound, we change the adornment on the S subgoal from ff to bf
  • 47. Example of The Chain Algorithm  At this point, the second subgoal, Sbf(a,b), can be resolved.  We obtain bindings for the first component by projecting X onto a; the result is X itself. That is, we can go to the source for S(x,y) with bindings 2, 3, and 4 for x.  We do not need bindings for y, since the second component of the adornment for the source is f.  The c'[2,3,5] code for x says that we can give the source the value 2, 3, or 5 for the first argument.  Since there is a prime on the c, we know that only the corresponding y value(s) will be returned, not the value of x that we supplied in the request.  We care about values 2, 3, and 4, but 4 is not a possible value at the source for S, so we never ask about it.
  • 48. Example of The Chain Algorithm  When we ask about x = 2, we get one response: y = 4.  We pad this response with the value 2 we supplied to conclude that (2,4) is a tuple in the relation for the S subgoal.  Similarly, when we ask about x = 3, we get y = 5 as the only response and we add (3,5) to the set of tuples constructed for the S subgoal.  There are no more requests to ask at the source for S, so we conclude that the relation for the S subgoal is {(2,4), (3,5)}.  When we join this relation with the previous value of X, the result is just the relation above. However, variable a now appears neither in the head nor in any unresolved subgoal. Thus, we project it out.
  • 49. Example of The Chain Algorithm  Since b is now bound, we change the adornment on the T subgoal, so it becomes Tbf(b,c).  Now this last subgoal can be resolved, which we do by sending requests to the source for T(y, z) with y = 4 and y = 5.  The responses we get back give us the following relation for the T subgoal:  We join it with the relation for X above, and then project onto the c attribute to get the relation for the head. That is, the answer to the query at the mediator is {(6), (7), (8)}.
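The walkthrough above can be sketched in Python. This is a minimal illustration under stated assumptions, not the textbook's implementation: variables are strings and constants are integers, a set-valued adornment code stands in for c[...], the prime on c' (which only affects what the source echoes back) is ignored, and the adornment bf and the tuples assumed for T are hypothetical values chosen to be consistent with the answer {(6), (7), (8)}.

```python
def ask(source, known):
    """Send one request to a source; `known` holds a value or None per argument."""
    for code, value in zip(source["adorn"], known):
        # A set-valued code plays the role of c[...]: only listed values may be sent.
        if isinstance(code, set) and value not in code:
            return []
    return [t for t in source["data"]
            if all(v is None or v == x for v, x in zip(known, t))]


def resolvable(goal, bound, sources):
    """A subgoal is resolvable if every b or c position holds a constant or a bound variable."""
    name, args = goal
    for code, arg in zip(sources[name]["adorn"], args):
        is_bound = not isinstance(arg, str) or arg in bound
        if code != "f" and not is_bound:
            return False
    return True


def chain(head, subgoals, sources):
    X = [{}]                                   # X starts as the single empty tuple
    bound, pending = set(), list(subgoals)
    while pending:
        goal = next((g for g in pending if resolvable(g, bound, sources)), None)
        if goal is None:
            return None                        # subgoals remain but none is resolvable
        pending.remove(goal)
        name, args = goal
        joined = []
        for row in X:                          # join X with the subgoal's answers
            known = [row.get(a) if isinstance(a, str) else a for a in args]
            for t in ask(sources[name], known):
                new = dict(row)
                if all(new.setdefault(a, v) == v
                       for a, v in zip(args, t) if isinstance(a, str)):
                    joined.append(new)
        bound |= {a for a in args if isinstance(a, str)}
        # Keep only variables in the head or in some unresolved subgoal.
        keep = set(head) | {a for _, rest in pending for a in rest if isinstance(a, str)}
        projected = {tuple(sorted((k, v) for k, v in r.items() if k in keep))
                     for r in joined}
        X = [dict(items) for items in projected]
    return {tuple(row[v] for v in head) for row in X}


# Hypothetical source contents, consistent with the values quoted in the slides.
SOURCES = {
    "R": {"adorn": ["b", "f"], "data": {(1, 2), (1, 3), (1, 4)}},
    "S": {"adorn": [{2, 3, 5}, "f"], "data": {(2, 4), (3, 5)}},
    "T": {"adorn": ["b", "f"], "data": {(4, 6), (5, 7), (5, 8)}},   # assumed
}
answer = chain(["c"], [("R", (1, "a")), ("S", ("a", "b")), ("T", ("b", "c"))], SOURCES)
```

With these assumed sources, the subgoals are resolved in the order R, S, T, and the request for x = 4 is never sent to S because 4 is not among the allowed values.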
  • 50. Incorporating Union Views  In our description of the Chain Algorithm we assumed that each query subgoal was a “view” of data at one particular source.  It is common, however, for several sources to contribute tuples to the same predicate.  If more than one source is available:  The sources may contain replicated information. In that case, we can turn to any one of the sources. However, there may be several adornments that allow us to query that source.  Each source may contribute some tuples that the other sources do not. In that case, we should consult all the sources for the predicate.  There is a policy choice to be made:  Either we refuse to answer the query unless we can consult all the sources  Or we make a best effort to return all the answers to the query that we can obtain by combinations of sources.
  • 52. Incorporating Union Views – A Best Effort Example  Suppose we have the following query:  We have the following sources for R: R1 (adornment ff) and R2 (adornment fb)  We have the following sources for S: S1 (adornment ff) and S2 (adornment bf)  Suppose we start with the first source for R, namely R1 with adornment ff. We query this source and get some tuples for R.  Now, we have some bindings, but perhaps not all, for the variable b.  We can now use both sources for S to obtain tuples, and the relation for S can be set to their union.  At this point, we can project the relation for S onto variable b and get some b-values. These can be used to query the second source for R, the one with adornment fb.  In this manner, we can get some additional R-tuples. It is only at this point that we can join the relations for R and S, and project onto a and c to get the best-effort answer to the query.
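The best-effort steps above can be sketched as follows. The query shape (a join of R(a,b) with S(b,c), projected onto a and c) is inferred from the text, and all tuples and the helper names `scan`, `lookup`, and `best_effort` are hypothetical, chosen only to show why the result is a best effort rather than a complete answer.

```python
# Hypothetical source contents.
R1 = {(1, 10), (2, 20)}        # adornment ff: can be scanned freely
R2 = {(3, 30), (4, 40)}        # adornment fb: b must be supplied
S1 = {(30, "x")}               # adornment ff: can be scanned freely
S2 = {(10, "y"), (50, "z")}    # adornment bf: b must be supplied


def scan(source):
    """Query a source that needs no bindings."""
    return set(source)


def lookup(source, position, values):
    """Query a source once per known value of the bound argument."""
    return {t for t in source if t[position] in values}


def best_effort():
    r = scan(R1)                                   # start with R's ff source
    b_from_r = {b for _, b in r}                   # some bindings for b
    s = scan(S1) | lookup(S2, 0, b_from_r)         # union of both S sources
    b_from_s = {b for b, _ in s}                   # b-values seen in S
    r |= lookup(R2, 1, b_from_s)                   # now the fb source for R is usable
    return {(a, c) for a, b in r for b2, c in s if b == b2}


result = best_effort()
```

In this made-up data the tuple (4, 40) in R2 and the tuple (50, "z") in S2 are never retrieved, because no source supplies the binding needed to ask for them.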
  • 53. GAV & LAV Mediators  The mediators discussed so far are called global-as-view (GAV) mediators.  GAV mediators are easy to construct. You decide on the global predicates or relations that the mediator will support, and for each source, you consider which predicates it can support, and how it can be queried.  In a local-as-view (LAV) mediator, we define global predicates at the mediator, but we do not define these predicates as views of the source data.  Rather, we define, for each source, one or more expressions involving the global predicates that describe the tuples that the source is able to produce.  Queries are answered at the mediator by discovering all possible ways to construct the query using the views provided by the sources.
  • 54. LAV Example  We shall look at an example where the mediator is intended to provide a single predicate Par(c,p), meaning that p is a parent of c.  Suppose we have a database maintained by the Association of Grandparents (DS1) that does not provide any child-parent facts at all, but provides child-grandparent facts, and another database (DS2) that provides child-parent relationships.  GAV mediators do not allow us to use a grandparents source at all if our goal is to produce a Par relation. However, LAV mediators allow us to say that a certain source provides grandparent facts.  DS1 can be described as:  DS2 can be described as:
  • 56. LAV Example  Our query at the mediator will ask for great-grandparent facts that can be obtained from the sources. That is, the mediator query is  The possible solutions are:  Using only V1:  Using a combination of V1 and V2:  Alternative 1:  Alternative 2:
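To make the idea of combining views concrete, here is a small sketch. It assumes one view that exposes child-parent pairs and one that exposes child-grandparent pairs; the names `parent_view`, `grand_view`, and `compose`, and all the tuples, are hypothetical and are not tied to the V1/V2 labels on the slide. A great-grandparent is three Par steps away, so it can be reached by three parent steps, by a parent step followed by a grandparent step, or by a grandparent step followed by a parent step.

```python
# Hypothetical view contents: (child, ancestor) pairs.
parent_view = {("ann", "bob"), ("bob", "cat"), ("cat", "dan"),
               ("fay", "gus"), ("zed", "eve")}
grand_view = {("eve", "fay"), ("ann", "cat")}


def compose(first, second):
    """Chain two ancestor relations: (x, y) in first and (y, z) in second give (x, z)."""
    return {(x, z) for x, y in first for y2, z in second if y == y2}


# Three ways to build great-grandparent facts from the two views.
parent_only = compose(compose(parent_view, parent_view), parent_view)
parent_then_grand = compose(parent_view, grand_view)
grand_then_parent = compose(grand_view, parent_view)

great_grandparents = parent_only | parent_then_grand | grand_then_parent
```

The mediator's answer is the union of all such solutions, since each one may contribute facts that the others cannot reach.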
  • 57. Entity Resolution  Sometimes it is unclear whether records at two sources represent the same entity.  Reasons why discrepancies can occur:  Misspellings (e.g. Jones & Jomes)  Variants (e.g. Susan Williams & Sue Williams)  Misunderstanding of names (e.g. Asian names)  Evolution of values (e.g. a person who moved)  Abbreviations (e.g. Sesame St. & Sesame Street)  When deciding whether two records represent the same entity, we need to look carefully at the kinds of discrepancies that occur and devise a scoring system or other test that measures the similarity of records.
  • 58. Measuring the Similarity of Records  Two useful approaches to measuring the similarity of records:  Edit Distance  Values that are strings can be compared by counting the number of insertions and/or deletions of characters it takes to turn one string into another. (e.g. Smythe and Smith are at distance 3: delete y, insert i, delete e).  We may devise a specialized distance that takes into account the way the data was constructed.  Once we have decided on the appropriate edit distance for each field, we can define a similarity measure for records.  Normalization  Before applying an edit distance, we might wish to “normalize” records by replacing certain substrings by others. (e.g. St. would be replaced by Street in street addresses and by Saint in town names.)  One could even use the Soundex encoding of names, so names that sound the same are represented by the same string.
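A minimal sketch of both ideas, assuming the insert/delete-only edit distance described above. That distance equals the sum of the two lengths minus twice the length of the longest common subsequence. The function names and the single "St." rule are illustrative choices, not a complete normalizer.

```python
import re


def indel_distance(a, b):
    """Number of single-character insertions and deletions needed to turn a into b."""
    prev = [0] * (len(b) + 1)          # LCS lengths for the previous prefix of a
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return len(a) + len(b) - 2 * prev[-1]


def normalize_street(address):
    """Expand a trailing or word-final 'St.' to 'Street' and lowercase the result."""
    return re.sub(r"\bSt\.(?=\s|$)", "Street", address).strip().lower()


distance = indel_distance("Smythe", "Smith")
same_street = normalize_street("Sesame St.") == normalize_street("Sesame Street")
```

Normalizing first means the abbreviation no longer counts against the edit distance, so the distance measures only genuine differences such as misspellings.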
  • 59. Merging Similar Records  In many applications, when we find two records that are similar enough to merge, we would like to replace them by a single record.  We might take the union of all the values in each field.  Or we might somehow combine the values in corresponding fields to make a single value.  A problem that arises with certain combinations of a similarity test and a merging rule is that our decision to merge one pair of records may preclude our merging another pair.  Suppose we have the following tuples and our similarity rule is: “must agree exactly in at least two out of the three fields.” Suppose also that our merge rule is: “set the field in which the records disagree to the empty string.”:
  • 60. Merging Similar Records  Another choice for similarity and merge rules is:  Merge by taking the union of the values in each field  Declare two records similar if at least two of the three fields have a nonempty intersection.  So, step 1 would be:  And then, step 2:
  • 61. Merging Similar Records  Any choice of similarity and merge functions allows us to test pairs of records for similarity and merge them if they are similar.  There are several properties that we would expect any merge function to satisfy. If Λ is the operation that produces the merge of two records, it is reasonable to expect:  r Λ r = r (Idempotence). That is, the merge of a record with itself should surely be that record.  r Λ s = s Λ r (Commutativity). If we merge two records, the order in which we list them should not matter.  (r Λ s) Λ t = r Λ (s Λ t) (Associativity). The order in which we group records for a merger should not matter.  We assume that:  If r and s are similar, then r Λ s is defined.
  • 62. Merging Similar Records  There are also some properties that we expect the similarity relationship to have, and ways that we expect similarity and merging to interact.  We shall use r ≈ s to say that records r and s are similar.  r ≈ r (Idempotence for similarity). A record is always similar to itself.  r ≈ s if and only if s ≈ r (Commutativity of similarity). That is, in deciding whether two records are similar, it does not matter in which order we list them.  If r ≈ s, then r ≈ (s Λ t) (Representability). This rule requires that if r is similar to some other record s (and thus could be merged with s), but s is instead merged with some other record t, then r remains similar to the merger of s and t and can be merged with that record.  Together, the Idempotence, Commutativity, Associativity, and Representability properties above are called the ICAR properties.
  • 63. The R-Swoosh Algorithm for ICAR Records  When the similarity and merge functions satisfy the ICAR properties, there is a simple algorithm that merges all possible records.  INPUT: A set of records I, a similarity function ≈, and a merge function Λ. We assume that ≈ and Λ satisfy the ICAR properties.  If they do not, then the algorithm will still merge some records, but the result may not be the maximum or best possible merging.  OUTPUT: A set of merged records O.
  • 64. The R-Swoosh Algorithm for ICAR Records
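The algorithm itself is not reproduced in this transcript. The following is a sketch of R-Swoosh as it is commonly stated: repeatedly take a record r from I, look for a similar record s in O, and either move r to O (no match) or remove s from O and put the merge of r and s back into I. The example uses the union merge and the "at least two of three fields intersect" similarity rule mentioned earlier; the records and the helper names `rec`, `similar`, and `merge` are hypothetical.

```python
def r_swoosh(records, similar, merge):
    """Merge records until no two records in the output are similar."""
    I, O = list(records), []
    while I:
        r = I.pop()
        match = next((s for s in O if similar(r, s)), None)
        if match is None:
            O.append(r)                 # r is similar to nothing seen so far
        else:
            O.remove(match)             # the merged record replaces both r and match
            I.append(merge(r, match))
    return O


def rec(*fields):
    """Represent a record as a tuple of value sets, one set per field."""
    return tuple(frozenset([f]) for f in fields)


def similar(r, s):
    """Similar if at least two fields share a value."""
    return sum(1 for a, b in zip(r, s) if a & b) >= 2


def merge(r, s):
    """Merge by taking the union of the values in each field."""
    return tuple(a | b for a, b in zip(r, s))


# Hypothetical (name, address, phone) records.
records = [
    rec("Susan", "123 Oak St.", "818-555-1234"),
    rec("Susan", "456 Maple St.", "818-555-1234"),
    rec("Susan", "456 Maple St.", "213-555-5678"),
    rec("Bob", "9 Elm Rd.", "310-555-0000"),
]
merged = r_swoosh(records, similar, merge)
```

Note that the first and third Susan records are not similar to each other directly, yet all three end up in one merged record, because the union merge keeps every value and so preserves similarity with later records.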
  • 65. Other Approaches to Entity Resolution  Clustering  In some entity-resolution applications, we do not want to merge at all, but will instead group records into clusters such that members of a cluster are in some sense similar to each other and members of different clusters are not similar.  For example, if we are looking for similar products sold on eBay, we might want the result to be not a single record for each kind of product, but rather a list of the records that represent a common product for sale. Clustering of large-scale data involves a complex set of options.  Partitioning  Since any algorithm for doing a complete merger of similar records may be forced to examine each pair of records, it may be infeasible to get an exact answer to a large entity-resolution problem.  One solution is to group the records, perhaps several times, into groups that are likely to contain similar records, and look only within each group for pairs of similar records.
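The partitioning idea can be sketched as follows, assuming each grouping pass is defined by a key function and only records that share a key in some pass are compared. The records, the two key functions, and the name `candidate_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_funcs):
    """Return index pairs of records that fall in the same group for at least one key."""
    pairs = set()
    for key in key_funcs:                       # one grouping pass per key function
        groups = defaultdict(list)
        for index, record in enumerate(records):
            groups[key(record)].append(index)
        for members in groups.values():
            pairs.update(combinations(members, 2))
    return pairs


# Hypothetical (name, city) records.
people = [("Jones", "Athens"), ("Jomes", "Athens"),
          ("Smith", "Patras"), ("Smythe", "Athens")]

pairs = candidate_pairs(people, [
    lambda r: r[0][:2].lower(),    # pass 1: first two letters of the name
    lambda r: r[1],                # pass 2: city
])
```

Here only 4 of the 6 possible pairs are kept for the expensive similarity test. The saving grows with the number of records, at the cost of missing similar pairs that never share a group.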