Data Integration Lecture Notes

Data Integration
Introduction to Data Integration
AUEB - MSc in Business Analytics – Spiros Safras (@ssafras)

Agenda
 Integration of Information
 The Heterogeneity Problem
 Data Integration Architectures
 Federated Database Schemas
 Data Warehouses
 Mediators
 Extractors and Wrappers
 Capability Based Optimization
 Adornments and Query Plan Selection
 Optimizing Mediator Queries
 Consult all sources VS Best effort approach
 Local as View Mediators
 Entity Resolution

Integration of Information
 Definition: The process of taking several databases or other information sources and making the data in
these sources work together as if they were a single database.

Why Data Integration?
 Databases are often created independently, even if they later
need to be combined.
 At that point, multiple different systems might have been
used with different schemas and maybe limited interfaces to
the data.
 The use of a database evolves, so we cannot design a database
to support every possible future use.
 Starting over is not always possible
 High development cost
 Incompatibility issues with legacy software
 Transition from old systems to new ones
 The goal of data integration: tie together different sources,
controlled by many people, under a common schema.

Why Data Integration?
 This can be solved through the use of an abstraction layer
(middleware) on top of all the available data sources.
 The layer of abstraction could be a set of relational views.
 These views could be either virtual or materialized.
 A form of SQL could be used to access this abstraction layer.
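As a rough illustration of such a layer, the sketch below assumes two sources that happen to be reachable from a single SQLite connection and exposes them through one virtual view. The table, column, and view names are hypothetical and not part of the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical sources with different schemas.
    CREATE TABLE ds1_customers (id INTEGER, full_name TEXT);
    CREATE TABLE ds2_clients (client_no INTEGER, name TEXT);
    INSERT INTO ds1_customers VALUES (1, 'Alice');
    INSERT INTO ds2_clients VALUES (7, 'Bob');

    -- Virtual view: nothing is copied, queries run against the sources.
    CREATE VIEW all_customers AS
        SELECT 'ds1' AS source, id AS customer_id, full_name AS name
        FROM ds1_customers
        UNION ALL
        SELECT 'ds2', client_no, name FROM ds2_clients;
""")

# The user queries the common schema with plain SQL.
rows = conn.execute(
    "SELECT source, customer_id, name FROM all_customers ORDER BY name"
).fetchall()
print(rows)
```

A materialized variant would copy the rows of the view into a table instead, which is closer to the warehouse approach.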

Why Data Integration? ( Scenario 1 )
 A company may have several different systems with totally independent databases describing different
entities.
 Management may need to pose queries that require the combination of these data sources (DS) in order to
provide an answer.
[Figure: data sources DS1 to DS5 connected through a middleware layer]

Why Data Integration? ( Scenario 2 )
 Company A acquired Company B
 Both companies have data sources that describe similar entities through different schemas
 Their datasets need to be integrated
[Figure: data sources DS1.a to DS1.e and DS2.a to DS2.e connected through a middleware layer]

The Heterogeneity Problem
 Data sources differ in many ways, even if they are intended
to store the same kinds of data.
 Such sources are called heterogeneous, and the problem of
integrating them is referred to as the heterogeneity
problem.

Heterogeneity Problem Types
 Communication Heterogeneity
 HTTP
 Direct through VPN
 Remote Connection
 Query Language Heterogeneity
 Different SQL Dialects
 JS (eg MongoDB)
 URI (eg HTTP APIs)
 Excel
 Schema Heterogeneity
 Sample schema 1:
 Sample schema 2:

Heterogeneity Problem Types
 Data type differences
 Serial numbers might be represented by character strings of varying
length at one source and fixed length at another.
 Value Heterogeneity
 Eg: Storing the color ”Black”
 Data source A: ”BLACK” { String }
 Data source B: ”BL” { String }
 Data source C: 12 { Integer }
 Semantic Heterogeneity
 Do trucks belong to ”Cars”?
 Do minivans == station wagons?
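Value heterogeneity is usually handled with per-source translation tables. The sketch below uses the three encodings of "Black" listed above; the mapping structure and the function name are assumptions made for illustration.

```python
# Hypothetical per-source translation tables for the "color" attribute.
COLOR_MAPS = {
    "A": {"BLACK": "black"},
    "B": {"BL": "black"},
    "C": {12: "black"},
}

def normalize_color(source, raw_value):
    """Translate a source-specific color code into the global vocabulary."""
    try:
        return COLOR_MAPS[source][raw_value]
    except KeyError:
        raise ValueError(f"no mapping for {raw_value!r} at source {source!r}")

normalized = [normalize_color(s, v) for s, v in [("A", "BLACK"), ("B", "BL"), ("C", 12)]]
print(normalized)
```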

Data Integration Architectures
 There are several ways that databases or other distributed information
sources can be made to work together.
 Federation: everybody talks directly to everyone else.
 Warehouse: Sources are translated from their local schema to a global
schema and copied to a central DB.
 Mediator: Virtual warehouse --- turns a user query into a sequence of source
queries.

Federated Database Systems
 Perhaps the simplest architecture for integrating several
databases
 One source can call on others to supply information
 Each query works properly for the database to which it is
addressed
 If n databases each need to talk to the n - 1 other
databases, then we must write n(n - 1) pieces of code to
support queries between systems.
 Could be useful if the systems need to query the database
of a very limited number of other systems
[Figure: federated sources connected to one another through wrappers]

Data Warehouses
 Data from several sources is extracted and combined into a
global schema
 Queries may be issued by the user exactly as they would be
issued to any database
 Two approaches on updating the DW:
 The warehouse is periodically closed to queries and
reconstructed from the current data in the sources. (e.g.
once a night or at even longer intervals)
 The warehouse is updated periodically (e.g., each night),
based on the changes that have been made to the
sources since the last time the warehouse was modified.
( incremental update )
 It is generally too expensive to reflect immediately, at the
warehouse, every change to the underlying databases.
[Figure: a warehouse fed by two extractors, one for Source 1 and one for Source 2]

Mediators
 A mediator supports a virtual view, or collection of
views, that integrates several sources.
 The mediator doesn’t store any data. The mechanics
of mediators and warehouses are rather different.
 To begin, the user or application program issues a
query to the mediator.
 Since the mediator has no data of its own, it must get
the relevant data from its sources and use that data to
form the answer to the user’s query.
[Figure: the user query goes to the mediator, which sends queries to two wrappers (one per source); results flow back through the wrappers to the mediator and on to the user]

Mediators
 The mediator sends a query to each of its wrappers,
which in turn send queries to their corresponding
sources.
 The mediator may send several queries to a wrapper,
and may not query all wrappers.
 The results come back and are combined at the
mediator.

Extractors and Wrappers
In a data warehouse system, the source extractors
consist of:
 One or more predefined queries that are executed at
the source to produce data for the warehouse.
 Suitable communication mechanisms, so the extractor
can:
 Pass ad-hoc queries to the source
 Receive responses from the source
 Pass information to the warehouse.

The predefined queries to the source could be:
 SQL queries if the source is a SQL database
 Operations in whatever language is appropriate for
a source that is not a database system

Example ETL tool chain:
 This is an example for e-commerce loading
 Note multiple stages of filtering (using selection or join-like operations), logging bad records, before we group
and load
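[Figure: ETL tool chain. Invoice line items, item records, and customer records pass through split date-time, filter invalid, join, and filter non-match stages. Invalid dates/times, invalid items, and invalid customers are logged, and the remaining records are loaded into the data warehouse.]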

 Mediator systems require more complex wrappers
than do most warehouse systems
 The wrapper must be able to accept a variety of
queries from the mediator and translate any of them to
the terms of the source
 The wrapper must then communicate the result to the
mediator

Templates for Query Patterns
 A systematic way to design a wrapper that connects a
mediator to a source is to classify the possible queries
that the mediator can ask into templates.
 Templates are queries with parameters that represent
constants.

 Suppose we want to build a wrapper for the source of Dealer 1, which has the schema:
 For use by a mediator with schema:
 A possible solution:

 In this case, there are eight choices, if queries are
allowed to specify any of three attributes: model, color,
and autoTrans.
 In general, there would be 2^n templates if we have the
option of specifying n attributes
 The number of templates could grow unreasonably
large
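The count can be checked by enumerating every subset of attributes that a query might specify. This is only a counting sketch; the function name is an assumption.

```python
from itertools import combinations

def template_signatures(attributes):
    """Return every subset of attributes that a query might specify."""
    subsets = []
    for k in range(len(attributes) + 1):
        subsets.extend(combinations(attributes, k))
    return subsets

sigs = template_signatures(["model", "color", "autoTrans"])
print(len(sigs))  # 8, i.e. 2^3
```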

Wrapper Generators
 The templates defining a wrapper must be turned into
code for the wrapper itself.
 The software that creates the wrapper is called a wrapper
generator
 The wrapper generator creates a table that holds
 the various query patterns contained in the templates
 the source queries that are associated with each
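[Figure: the wrapper generator reads the templates and produces a table. Inside the wrapper, the driver receives queries from the mediator, consults the table, sends queries to the source, and passes the results back to the mediator.]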

Wrapper Generators
 A driver is used in each wrapper; in general the driver can
be the same for each generated wrapper.
 The task of the driver is to:
 Accept a query from the mediator
 Search the table for a template that matches the
query
 Send the query to the source
 Process the response and return it to the mediator
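A minimal sketch of a table-driven driver follows. It assumes a SQLite source with a hypothetical Cars table, and a table keyed by the set of attributes that the mediator query binds. None of these names come from the lecture.

```python
import sqlite3

# Hypothetical source: a dealer database with a Cars table.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE Cars (serialNo TEXT, model TEXT, color TEXT, autoTrans TEXT);
    INSERT INTO Cars VALUES ('s1', 'Megane', 'blue', 'yes');
    INSERT INTO Cars VALUES ('s2', 'Megane', 'red', 'no');
    INSERT INTO Cars VALUES ('s3', 'Clio', 'blue', 'no');
""")

# The table a wrapper generator might emit: query pattern -> source query.
# A pattern is the set of attributes bound by the mediator query.
TEMPLATE_TABLE = {
    frozenset({"color"}):
        "SELECT serialNo, model, color, autoTrans FROM Cars "
        "WHERE color = :color ORDER BY serialNo",
    frozenset({"model", "color"}):
        "SELECT serialNo, model, color, autoTrans FROM Cars "
        "WHERE model = :model AND color = :color ORDER BY serialNo",
}

def driver(bindings):
    """Match a mediator query against the table, run it, return the rows."""
    sql = TEMPLATE_TABLE.get(frozenset(bindings))
    if sql is None:
        raise LookupError(f"no template for attributes {sorted(bindings)}")
    # Named placeholders keep the constants out of the SQL text.
    return source.execute(sql, bindings).fetchall()

print(driver({"color": "blue"}))
```

The same driver code works for any generated wrapper; only the table changes from source to source.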

Filters
 It is not always realistic to write a template for every possible form of query
 Suppose that a wrapper on a car dealer’s database has the template displayed below for finding cars by color
 The mediator is asked to find cars of a particular model and color
 As long as the wrapper has a template that (after proper substitution for the parameters) returns a superset of
what the query wants, then it is possible to filter the dataset at the wrapper and pass only the desired tuples to
the mediator.
 In practice, the tuples could be produced one-at-a-time and filtered one-at-a-time, in a pipelined fashion, rather
than having the entire result stored at the wrapper and then filtered
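The pipelined filter can be sketched with Python generators. The car data and function names below are hypothetical; the first function stands in for the wrapper's color template.

```python
def source_rows_by_color(color):
    """Stand-in for the color template; yields one tuple at a time."""
    cars = [  # hypothetical source contents: (serialNo, model, color)
        ("s1", "Megane", "blue"),
        ("s2", "Clio", "blue"),
        ("s3", "Megane", "red"),
    ]
    for car in cars:
        if car[2] == color:
            yield car

def cars_by_model_and_color(model, color):
    """Answer a model+color query by filtering the color template's superset."""
    for car in source_rows_by_color(color):
        if car[1] == model:  # the filter runs tuple by tuple at the wrapper
            yield car

matches = list(cars_by_model_and_color("Megane", "blue"))
print(matches)
```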

Capability-Based Optimization
 A typical DBMS estimates the cost of each query plan and
picks what it believes to be the best.
 Optimization by a mediator usually follows a strategy
known as capability-based optimization.
 The central issue is not what a query plan costs, but
whether the plan can be executed at all.
 Only among plans found to be executable (“feasible”) do
we try to estimate costs.
 There are several reasons why a source may be
incapable of providing parts of the information needed for
a query to be executed successfully.

The Problem of Limited Source Capabilities
 Reasons why a source may limit the ways in which queries can
be asked:
 Protection against a rival exploiting its database (e.g.
Amazon.com would never allow the equivalent of a
“SELECT * FROM books”)
 Privacy Protection (e.g. a medical database may answer
queries about averages, but won’t disclose the details of a
particular patient’s medical history)
 Lack of proper Indexes may make certain kinds of queries
too expensive to execute.
 The interface provided is not designed to support such
queries (e.g. an API that provides a specific number of
predefined HTTP requests)

Adornments
 We need a way to describe the capabilities of a data
source.
 In order to describe the legal forms of queries, we may
use adornments.
 Adornments are sequences of codes that represent the
requirements for the attributes of the relation, in their
standard order.

Adornments
 The codes we shall use for adornments reflect the most common capabilities of sources. They are:
 f (free): the attribute can be specified or not, as we choose.
 b (bound): we must specify a value for the attribute, but any value is allowed.
 u (unspecified): we are not permitted to specify a value for the attribute.
 c[S] (choice from set S): a value must be specified, and that value must be one of the values in the finite set S (e.g. values from a dropdown menu).
 o[S] (optional, from set S): we either do not specify a value, or we specify one of the values in the finite set S.
 In addition, we place a prime (e.g., f') on a code to indicate that the attribute is not part of the output of the query.

Adornments - Example
 Suppose we have the following schema:
 Scenario 1. The user specifies a serial number. All the information about the car with that serial number
(i.e., the other four attributes) is produced as output.
 The adornment for this query form is:
b'uuuu

 Scenario 2. The user specifies a model and color, and perhaps whether or not automatic transmission
and navigation system are wanted. All five attributes are printed for all matching cars.
 The adornment for this query form is:
ubbo[yes, no]o[yes, no]
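Whether a query form is legal for an adornment can be checked mechanically. The sketch below assumes an adornment is encoded as a list of (code, allowed set) pairs, one per attribute in the order serial number, model, color, automatic transmission, navigation system. That encoding is an assumption of this example.

```python
def matches(adornment, query):
    """Check whether a query is a legal form for a source adornment.

    adornment: list of (code, allowed_set_or_None), one per attribute.
    query: list of values, one per attribute; None means "not specified".
    """
    if len(adornment) != len(query):
        return False
    for (code, allowed), value in zip(adornment, query):
        code = code.rstrip("'")  # a prime affects only the output, not legality
        specified = value is not None
        if code == "b" and not specified:
            return False
        if code == "u" and specified:
            return False
        if code == "c" and (not specified or value not in allowed):
            return False
        if code == "o" and specified and value not in allowed:
            return False
        # code "f" accepts both specified and unspecified attributes
    return True

YES_NO = {"yes", "no"}
scenario1 = [("b'", None), ("u", None), ("u", None), ("u", None), ("u", None)]
scenario2 = [("u", None), ("b", None), ("b", None), ("o", YES_NO), ("o", YES_NO)]

print(matches(scenario1, ["s123", None, None, None, None]))     # serial number only
print(matches(scenario2, [None, "Megane", "blue", "yes", None]))
print(matches(scenario2, [None, "Megane", None, None, None]))   # color is missing
```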

Query Plan Selection
 Given a query at the mediator, a capability-based query optimizer first considers what queries it can
ask at the sources to help answer the query.
 If we imagine those queries asked and answered, then we have bindings for some more attributes,
and these bindings may make some more queries at the sources possible.
 We repeat this process until either:
 We have asked enough queries at the sources to resolve all the conditions of the mediator query,
and therefore we may answer that query. Such a plan is called feasible.
 We can construct no more valid forms of source queries, yet we still cannot answer the mediator
query, in which case the mediator must give up; it has been given an impossible query.

Query Plan Selection - Example
 Suppose we have the following sources:
 The adornment for Autos is: ubf
 Options has two adornments: bu and uc[autoTrans, navi]
 Let the query be: “find the serial numbers and colors of “Megane” models with a navigation system.”

Approach 1
Specifying that the model is “Megane”, query Autos and get the serial numbers and colors of all Meganes. Then,
using the bu adornment for Options, for each such serial number, find the options for that car and filter to make
sure it has a navigation system.

Approach 2
Specifying the navigation-system option, query Options using the uc[autoTrans, navi] adornment and get all the
serial numbers for cars with a navigation system. Then query Autos as in (1), to get all the serial numbers and
colors of Meganes, and intersect the two sets of serial numbers.

Approach 3
Query Options as in (2) to get the serial numbers for cars with a navigation system. Then use these serial
numbers to query Autos and see which of these cars are Meganes.
This plan is not feasible: the adornment of Autos is ubf, so the source does not let us specify a serial number,
and the system therefore cannot execute the second step.

Notes on Cost-Based Optimization
 Having found the feasible plans, the Mediator must choose
among them.
 Since the sources are usually independent of the mediator, it is
difficult to estimate the cost. ( E.g.: A source may take less time
during periods when it is lightly loaded, but when are those
periods? )
 Long-term observation by the mediator is necessary for the
mediator even to guess what the response time might be.
 Consider the previous example. Approach (2) uses only two
source queries, while Approach (1) uses one plus the number of
Meganes found in the Autos relation. Thus, it appears that plan
(2) has lower cost.

An Algorithm for Answering Queries
 The algorithm is called “chain”.
 It is not guaranteed to provide the most efficient solution, but it will
provide a solution whenever one exists, and in practice, it is very
likely to obtain the most efficient solution.
 Let's agree on the following notation for data sources:
 And on the following notation for expressing queries:
“find the serial numbers and colors of Gobi models with a navigation system”
Notice the simplified adornment notation used to represent the
arguments of the subgoals that are bound to a set of constants

 The algorithm maintains two kinds of information:
 An adornment is maintained for each subgoal.
 Initially, the adornment for a subgoal has b if and only if the
mediator query provides a constant binding for the corresponding
argument of that subgoal.
 In all other places, the adornment has f ’s.
 After every step of the algorithm, the adornment of each subgoal is
updated.
 A relation X that is (a projection of) the join of the relations for all the
subgoals that have been resolved.
 Initially, since no subgoals have been resolved, X is a relation over
no attributes, containing just the empty tuple.
 As the algorithm progresses, X will have attributes that are
variables of the rule — those variables that correspond to b’s in the
adornments of the subgoals in which they appear.

 The core of the Chain Algorithm is as follows
 Initialize a relation X and the adornments of the subgoals.
 Select a subgoal that can be resolved.
 Join X and the result of the subgoal.
 Project out of X all components that correspond to variables that do
not appear in the head or in any unresolved subgoal.
 Update the adornments of the unresolved subgoals.
 Repeatedly select a subgoal that can be resolved and update X
accordingly until no unresolved subgoal has been left.
 If we succeed in resolving every subgoal, then relation X will be the
answer to the query. If at some point, there are unresolved subgoals,
yet none can be resolved, then the algorithm fails. In that case, there
can be no other sequence of resolution steps that answers the query.

Example of The Chain Algorithm
 Consider the following query:
 And the following sources:

 Initially, the adornments on the subgoals are as shown in the query Q, and
the relation X that we construct initially contains only the empty tuple.
 Since subgoals S and T have ff adornments, but the adornments at the
corresponding sources each have a component with b or c, neither of these
subgoals can be resolved.
 Fortunately, the first subgoal, R(1,a), can be resolved, since the bf
adornment at the corresponding source is matched by the adornment of the
subgoal. Thus, we send the source for R(w,x) a query with w = 1, and the
response is the set of three tuples shown in the first column.

 We next project the subgoal’s relation onto its second component, since
only the second component of R(1,a) is a variable.
 That gives us the relation on the bottom.
 This relation is joined with X, which currently has no attributes and only the
empty tuple. The result is that X becomes the relation below.
 Since a is now bound, we change the adornment on the S subgoal from ff to
bf

 At this point, the second subgoal, S^bf(a,b), can be resolved.
 We obtain bindings for the first component by projecting X onto a; the result
is X itself. That is, we can go to the source for S(x,y) with bindings 2, 3, and
4 for x.
 We do not need bindings for y, since the second component of the
adornment for the source is f.
 The c'[2,3,5] code for x says that we can give the source the value 2, 3, or 5
for the first argument.
 Since there is a prime on the c, we know that only the corresponding y
value(s) will be returned, not the value of x that we supplied in the request.
 We care about values 2, 3, and 4, but 4 is not a possible value at the source
for S, so we never ask about it.

 When we ask about x = 2, we get one response: y = 4.
 We pad this response with the value 2 we supplied to conclude that (2,4) is
a tuple in the relation for the S subgoal.
 Similarly, when we ask about x = 3, we get y = 5 as the only response and
we add (3,5) to the set of tuples constructed for the S subgoal.
 There are no more requests to ask at the source for S, so we conclude that
the relation for the S subgoal is
 When we join this relation with the previous value of X, the result is just the
relation above. However, variable a now appears neither in the head nor in
any unresolved subgoal. Thus, we project it out.

 Since b is now bound, we change the adornment on the T subgoal, so it
becomes T^bf(b,c).
 Now this last subgoal can be resolved, which we do by sending requests to
the source for T(y, z) with y = 4 and y = 5.
 The responses we get back give us the following relation for the T subgoal:
 We join it with the relation for X above, and then project onto the c attribute
to get the relation for the head. That is, the answer to the query at the
mediator is {(6), (7), (8)}.
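The walk-through above can be reproduced with a simplified sketch of the Chain Algorithm. The source contents are reconstructed from the narrative (three R tuples for w = 1, S tuples (2,4) and (3,5), T tuples yielding 6, 7, 8). The adornment chosen for T and the data encoding are assumptions, and primes are ignored because the sources are simulated in memory.

```python
def chain(head_vars, subgoals, sources):
    """Simplified Chain Algorithm.

    subgoals: list of (source_name, args); an arg is a constant (int)
              or a variable name (str).
    sources:  name -> (adornment, tuples); codes are 'b', 'f', 'u'
              or ('c', allowed_set).
    Returns the set of head tuples, or None if some subgoal cannot be resolved.
    """
    X = [{}]            # relation over no attributes: just the empty tuple
    bound = set()       # variables that are currently attributes of X
    pending = list(subgoals)

    def is_bound(arg):
        return not isinstance(arg, str) or arg in bound

    def resolvable(name, args):
        adornment = sources[name][0]
        return all(is_bound(arg) for code, arg in zip(adornment, args)
                   if code == "b" or isinstance(code, tuple))

    while pending:
        goal = next((g for g in pending if resolvable(*g)), None)
        if goal is None:
            return None  # unresolved subgoals remain, yet none can be resolved
        pending.remove(goal)
        name, args = goal
        adornment, tuples = sources[name]
        joined = []
        for row in X:
            supplied, askable = {}, True
            for i, (code, arg) in enumerate(zip(adornment, args)):
                if code == "u" or not is_bound(arg):
                    continue
                value = row[arg] if isinstance(arg, str) else arg
                if isinstance(code, tuple) and value not in code[1]:
                    askable = False  # value not in c[S], so we never ask
                    break
                supplied[i] = value
            if not askable:
                continue
            for t in tuples:  # simulate the source answering the request
                if any(t[i] != v for i, v in supplied.items()):
                    continue
                candidate = dict(row)
                if all(candidate.setdefault(arg, v) == v if isinstance(arg, str)
                       else arg == v for arg, v in zip(args, t)):
                    joined.append(candidate)
        # Keep only variables in the head or in an unresolved subgoal.
        keep = set(head_vars) | {a for _, goal_args in pending
                                 for a in goal_args if isinstance(a, str)}
        X = [dict(items) for items in
             {tuple(sorted((k, v) for k, v in r.items() if k in keep))
              for r in joined}]
        bound = (bound | {a for a in args if isinstance(a, str)}) & keep
    return {tuple(r[v] for v in head_vars) for r in X}

sources = {
    "R": (["b", "f"], [(1, 2), (1, 3), (1, 4)]),
    "S": ([("c", {2, 3, 5}), "f"], [(2, 4), (3, 5)]),
    "T": (["b", "u"], [(4, 6), (5, 7), (5, 8)]),  # adornment assumed
}
query = [("R", (1, "a")), ("S", ("a", "b")), ("T", ("b", "c"))]
answer = chain(["c"], query, sources)
print(sorted(answer))
```

The sketch always picks the first resolvable subgoal, so it makes no attempt to choose the cheapest order.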

Incorporating Union Views
 In our description of the Chain Algorithm we assumed that each query subgoal was a
“view” of data at one particular source.
 It is common for there to be several sources that can contribute tuples.
 If more than one source is available:
 The sources may contain replicated information. In that case, we can turn to any
one of the sources. However, there may be several adornments that allow us to
query that source.
 Sources each contribute some tuples that the other sources may not contribute.
In that case, we should consult all the sources for the predicate.
 There is a policy choice to be made:
 Either we can refuse to answer the query unless we can consult all the
sources
 Or we can make best efforts to return all the answers to the query that we can
obtain by combinations of sources.

Incorporating Union Views – A Best Effort Example
 Suppose we have the following query:
 We have the following sources for R: R1^ff and R2^fb
 We have the following sources for S: S1^ff and S2^bf
 Suppose we start with R's sources. We query the first one, R1^ff, and get some tuples for R.
 Now, we have some bindings, but perhaps not all, for the variable b.
 We can now use both sources for S to obtain tuples and the relation for S can be set to their union.
 At this point, we can project the relation for S onto variable b and get some b-values. These can be used to query the second
source for R, the one with adornment fb.
 In this manner, we can get some additional R-tuples. It is only at this point that we can join the relations for R and S, and
project onto a and c to get the best-effort answer to the query.

GAV & LAV Mediators
 The mediators discussed so far are called global-as-view (GAV) mediators.
 GAV mediators are easy to construct. You decide on the global predicates or relations that
the mediator will support, and for each source, you consider which predicates it can support,
and how it can be queried.
 In a local-as-view (LAV) mediator, we define global predicates at the mediator, but we do not
define these predicates as views of the source data.
 Rather, we define, for each source, one or more expressions involving the global
predicates that describe the tuples that the source is able to produce.
 Queries are answered at the mediator by discovering all possible ways to construct the query
using the views provided by the sources.

LAV Example
 We shall look at an example where the mediator is intended to provide a single predicate Par(c,p), meaning
that p is a parent of c.
 Suppose we have a database maintained by the Association of Grandparents (DS1) that doesn’t provide any
child-parent facts at all, but provides child-grandparent facts and we have another database (DS2) that provides
child-parent relationships.
 GAV mediators do not allow us to use a grandparents source at all, if our goal is to produce a Par relation.
However, LAV mediators allow us to say that a certain source provides grandparent facts.
 The DS1 can be described as:
 The DS2 can be described as:

LAV Example
 Our query at the mediator will ask for great-grandparent facts that can be obtained from the sources. That is,
the mediator query is

LAV Example
 Our query at the mediator will ask for great-grandparent facts that can be obtained from the sources. That is,
the mediator query is
 The possible solutions are:
 Using only V1:
 Using a combination of V1 and V2:
 Alternative 1:
 Alternative 2:

Entity Resolution
 Sometimes it is unclear whether records at two sources represent the same entity.
 Reasons why discrepancies can occur:
 Misspellings (e.g. Jones & Jomes)
 Variants (e.g. Susan Williams & Sue Williams)
 Misunderstanding of Names (e.g. Asian Names)
 Evolution of values (A person that moved)
 Abbreviations (Sesame St. & Sesame Street)
 When deciding whether two records represent the same entity, we need to look carefully at the kinds of
discrepancies that occur and devise a scoring system or other test that measures the similarity of records

Measuring the Similarity of Records
 Two useful approaches to measuring the similarity of records:
 Edit Distance
 Values that are strings can be compared by counting the number of insertions and/or deletions of
characters it takes to turn one string into another. (e.g. Smythe and Smith are at distance 3).
 We may devise a specialized distance that takes into account the way the data was constructed.
 Once we have decided on the appropriate edit distance for each field, we can define a similarity measure
for records.
 Normalization
 Before applying an edit distance, we might wish to “normalize” records by replacing certain substrings by
others. (e.g. St. would be replaced by Street in street addresses and by Saint in town names.)
 One could even use the Soundex encoding of names, so names that sound the same are represented by
the same string.
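Both ideas can be sketched in a few lines. The distance below counts only insertions and deletions, computed from the longest common subsequence, and the normalization handles just the street-address case of "St.". The function names are assumptions.

```python
def indel_distance(s, t):
    """Number of single-character insertions and deletions turning s into t."""
    # lcs[i][j] = length of the longest common subsequence of s[:i] and t[:j]
    lcs = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i, a in enumerate(s, 1):
        for j, b in enumerate(t, 1):
            if a == b:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    # Every character outside the common subsequence is deleted or inserted.
    return len(s) + len(t) - 2 * lcs[-1][-1]

def normalize_street(address):
    """Expand a trailing 'St.' to 'Street' (street-address rule only)."""
    if address.endswith(" St."):
        return address[: -len("St.")] + "Street"
    return address

print(indel_distance("Smythe", "Smith"))  # 3
print(indel_distance(normalize_street("Sesame St."), "Sesame Street"))  # 0
```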

Merging Similar Records
 In many applications, when we find two records that are similar enough to merge, we would like to replace
them by a single record
 We might take the union of all the values in each field.
 Or we might somehow combine the values in corresponding fields to make a single value.
 A problem that arises if we use certain combinations of a similarity test and a merging rule is that our decision
to merge one pair of records may preclude our merging another pair.
 Suppose we have the following tuples and our similarity rule is: “must agree exactly in at least two out of the
three fields.”. Suppose also that our merge rule is: “set the field in which the records disagree to the empty
string.”:

 Another choice for similarity and merge rules is:
 Merge by taking the union of the values in each field
 Declare two records similar if at least two of the three fields have a nonempty intersection.
 So, step 1 would be:
 And then, step 2:

 Any choice of similarity and merge functions allows us to test pairs of records for similarity and merge them if
so.
 There are several properties that we would expect any merge function to satisfy. If Λ is the operation that
produces the merge of two records, it is reasonable to expect:
 r Λ r = r (Idempotence). That is, the merge of a record with itself should surely be that record.
 r Λ s = s Λ r (Commutativity). If we merge two records, the order in which we list them should not matter.
 (r Λ s) Λ t = r Λ (s Λ t) (Associativity). The order in which we group records for a merger should not matter.
 We assume that:
 If r and s are similar, then r Λ s is defined.

 There are also some properties that we expect the similarity relationship to have, and ways that we expect
similarity and merging to interact.
 We shall use r ≈ s to say that records r and s are similar.
 r ≈ r (Idempotence for similarity). A record is always similar to itself.
 r ≈ s if and only if s ≈ r (Commutativity of similarity). That is, in deciding whether two records are similar, it
does not matter in which order we list them.
 If r ≈ s, then r ≈ (s Λ t) (Representability). This rule requires that if r is similar to some other record s (and
thus could be merged with s), but s is instead merged with some other record t, then r remains similar to
the merger of s and t and can be merged with that record.
 The collection of properties above is called the ICAR properties.

The R-Swoosh Algorithm for ICAR Records
 When the similarity and merge functions satisfy the ICAR properties, there is a simple algorithm that merges all
possible records.
 INPUT: A set of records I, a similarity function ≈, and a merge function Λ. We assume that ≈ and Λ satisfy the
ICAR properties.
 If they do not, then the algorithm will still merge some records, but the result may not be the maximum or best
possible merging.
 OUTPUT: A set of merged records O.

The R-Swoosh Algorithm for ICAR Records
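A minimal sketch of the R-Swoosh loop as it is commonly described: take a record r from I; if some record s in O is similar to r, remove s from O and put the merger of r and s back into I; otherwise move r to O. The similarity and merge rules are the union-based ones discussed earlier. The sample records and the helper names are hypothetical.

```python
def similar(r, s):
    """At least two of the three fields have a nonempty intersection."""
    return sum(1 for a, b in zip(r, s) if a & b) >= 2

def merge(r, s):
    """Field-wise union of the values."""
    return tuple(a | b for a, b in zip(r, s))

def r_swoosh(records):
    I = list(records)   # records still to be processed
    O = []              # records known to be pairwise dissimilar
    while I:
        r = I.pop()
        match = next((s for s in O if similar(r, s)), None)
        if match is None:
            O.append(r)
        else:
            O.remove(match)
            I.append(merge(r, match))  # the merger is reconsidered later
    return O

def rec(name, address, phone):
    """Build a record whose fields are sets of values."""
    return (frozenset({name}), frozenset({address}), frozenset({phone}))

records = [  # hypothetical (name, address, phone) records
    rec("Susan", "123 Oak St.", "818-555-1234"),
    rec("Susan", "456 Maple St.", "818-555-1234"),
    rec("Susan", "456 Maple St.", "213-555-5678"),
    rec("Bob", "9 Elm St.", "310-555-0000"),
]
result = r_swoosh(records)
print(len(result))
```

With these records the three Susan entries collapse into one record whose address and phone fields each hold two values, while the Bob record is left alone.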

Other Approaches to Entity Resolution
 Clustering
 In some entity-resolution applications, we do not want to merge at all, but will instead group records into
clusters such that members of a cluster are in some sense similar to each other and members of different
clusters are not similar.
 For example, if we are looking for similar products sold on eBay, we might want the result to be not a
single record for each kind of product, but rather a list of the records that represent a common product for
sale. Clustering of large-scale data involves a complex set of options.
 Partitioning
 Since any algorithm for doing a complete merger of similar records may be forced to examine each pair of
records, it may be infeasible to get an exact answer to a large entity-resolution problem.
 One solution is to group the records, perhaps several times, into groups that are likely to contain similar
records, and look only within each group for pairs of similar records.
