Entity Matching for Semistructured Data in the Cloud

Marcus Paradies
ACM SAC 2012 - CC Track
March 27, 2012


Outline

1 Motivation

2 ChuQL

3 Entity Matching

4 MAXIM: Entity Matching in the Cloud

5 Summary


Motivation

Enriching/Improving Wikipedia

References from Wikipedia article Hash join

Lookup in the CiteSeer database

Lookup in Google

Wikipedia in a nutshell

Characteristics
3.7 million articles (English Wikipedia database)
Dataset size about 30 GB of XML (without history)
3.6 million references
References are categorized into books, journals, websites, etc.

Challenges
Articles in Wikipedia are incomplete
Articles in Wikipedia are inaccurate
Articles in Wikipedia are subjective

Problem Statement

Definition
Given two datasets of records, R and S, a set of attributes
a_1, ..., a_n, a set of similarity functions sim_{a_1}, ..., sim_{a_n}, and a
similarity threshold τ, the entity matching task between R and S is
defined as finding and combining all pairs of records from R and S where

\sum_{i=1}^{n} sim_{a_i}(R.a_i, S.a_i) \geq \tau

Wikipedia data set

{{Cite book
| last = Mumford
| first = David
| authorlink = David Mumford
| title = The Red Book of Varieties and Schemes
| publisher = [[Springer]]
| location = Berlin
| date = 1999
| page = 198
| doi = 10.1007/b62130
| isbn = 354063293X
}}

CiteSeer data set

<record id="6627383">
<author>David Mumford</author>
<title>The red book of Varieties and Schemes</title>
<publisher>Springer</publisher>
<year>1999</year>
<doi>10.1007/b62130</doi>
</record>
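As a rough illustration of the definition, the following Python sketch scores the two example records above. It assumes that the per-attribute similarities are combined by summation. The attribute selection, the use of `difflib.SequenceMatcher` as string similarity, the exact-match similarity for year and DOI, and the threshold value are assumptions made for this example and are not taken from the talk.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (assumed choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_sim(a: str, b: str) -> float:
    """1.0 if both values are equal, else 0.0."""
    return 1.0 if a == b else 0.0


# One similarity function per attribute a_1, ..., a_n (hypothetical selection).
SIMILARITY = {
    "author": string_sim,
    "title": string_sim,
    "year": exact_sim,
    "doi": exact_sim,
}


def score(r: dict, s: dict) -> float:
    """Sum of the per-attribute similarities of records r and s."""
    return sum(sim(r[attr], s[attr]) for attr, sim in SIMILARITY.items())


def is_match(r: dict, s: dict, tau: float) -> bool:
    """Records match if their combined similarity reaches the threshold tau."""
    return score(r, s) >= tau


# Attribute values taken from the two example records; the Wikipedia
# author is assembled from the "first" and "last" fields.
wikipedia = {
    "author": "David Mumford",
    "title": "The Red Book of Varieties and Schemes",
    "year": "1999",
    "doi": "10.1007/b62130",
}
citeseer = {
    "author": "David Mumford",
    "title": "The red book of Varieties and Schemes",
    "year": "1999",
    "doi": "10.1007/b62130",
}

print(is_match(wikipedia, citeseer, tau=3.5))
```

Because the title comparison ignores case, the differing capitalization of "Red Book" does not lower the score in this sketch.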

ChuQL

ChuQL by example

Wordcount in ChuQL
mapreduce {
  input  { fn:collection("hdfs://wiki/") }
  (: record reader: one key/value pair per revision :)
  rr     { for $rev in $hxml:in//revision
           return {"key": fn:data($rev//username | $rev//ip),
                   "val": $rev//title} }
  map    { $hxml:in }
  (: count the values collected for each contributor :)
  reduce { {"key": $hxml:in=>"key", "val": fn:count($hxml:in=>"val")} }
  rw     { <author name="{$hxml:in=>"key"}" count="{$hxml:in=>"val"}"/> }
  output { fn:put("hdfs://outputdir/") }
}
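To make the data flow of the job easier to follow, here is a plain Python sketch of the same phases (record reader, map, reduce, record writer) running in memory. It uses neither ChuQL nor Hadoop; the sample revisions and all function names are made up for illustration.

```python
from collections import defaultdict
from xml.sax.saxutils import quoteattr

# Hypothetical input: (contributor, article title) per revision.
revisions = [
    ("alice", "Hash join"),
    ("bob", "Hash join"),
    ("alice", "Sort-merge join"),
    ("192.0.2.7", "Hash join"),
]


def record_reader(revs):
    """rr: turn every revision into a key/value pair."""
    for contributor, title in revs:
        yield {"key": contributor, "val": title}


def map_phase(pairs):
    """map: identity, pairs are passed through unchanged."""
    yield from pairs


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for pair in pairs:
        groups[pair["key"]].append(pair["val"])
    return groups


def reduce_phase(groups):
    """reduce: count the values collected for each key."""
    for key, values in sorted(groups.items()):
        yield {"key": key, "val": len(values)}


def record_writer(pairs):
    """rw: serialize each result as an <author/> element."""
    for pair in pairs:
        yield f'<author name={quoteattr(pair["key"])} count="{pair["val"]}"/>'


output = list(record_writer(reduce_phase(shuffle(map_phase(record_reader(revisions))))))
for line in output:
    print(line)
```

The explicit `shuffle` step stands in for the grouping that a MapReduce framework performs between the map and reduce phases.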

Entity Matching

What is Entity Matching?

Challenges
Entity Matching has quadratic runtime behavior
Entity Matching has high CPU and memory demands
The definition of “what is similar” is domain-dependent

Entity Matching Architecture

Figure: Data sources S1 and S2 feed a blocking step that produces blocks b1, b2, b3, ..., bn; a matching step turns the blocks into the match result R.

How can we improve the runtime of an EM task?

Distributed Blocking
Parallel Matching

MAXIM: Entity Matching in the Cloud

Requirements and Approach

Requirements
Efficient processing of semistructured data
Scalability to large datasets
Independence from specific similarity functions
Ability to easily add new similarity functions

Main Idea
Use MapReduce and ChuQL to process semistructured data
Use search-based blocking to generate candidate pairs
Apply similarity functions to candidate pairs within a block

Architecture

Figure: Search nodes 1 to N on top of HDFS. Each node combines a search engine with a full-text index, a Hadoop data node and task tracker, and a ChuQL engine.

Hadoop cluster with up to 40 nodes
Each node runs a search engine and an attached full-text index
Each node runs an in-memory XQuery processor
Semistructured data is partitioned and placed on HDFS

Processing Stages
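Figure: Processing pipeline between the search engines and HDFS, split into a preparation stage (extract Wikipedia references, index CiteSeerX records), a blocking stage (generate semantic blocks), and a matching stage (record pair generation).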

Three Stages
Preparation Stage
Blocking Stage
Matching Stage

Stage 1: Preparation Stage
Extracts references from Wikipedia
Reads and transforms records from CiteSeerX
Sends CiteSeerX data to local full-text index

Stage 2: Blocking Stage
Reads extracted references from HDFS
Probes full-text index to retrieve candidate publications
Assigns candidate publications to block(s)

Stage 3: Matching Stage
Reads blocks from HDFS
Generates candidate pairs and applies similarity functions
Stores matching pairs and their similarity
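A minimal sketch of the matching stage, assuming a block holds the references and the candidate publications assigned to it. The block layout, the title-only similarity, the identifiers, and the threshold are illustrative assumptions; reading from and writing to HDFS is replaced by in-memory lists.

```python
from difflib import SequenceMatcher
from itertools import product


def title_sim(a: str, b: str) -> float:
    """Assumed similarity function: case-insensitive string ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_block(block: dict, threshold: float) -> list:
    """Pair every reference with every candidate in the block and keep
    the pairs whose similarity reaches the threshold."""
    matches = []
    for ref, cand in product(block["references"], block["candidates"]):
        sim = title_sim(ref["title"], cand["title"])
        if sim >= threshold:
            matches.append((ref["id"], cand["id"], round(sim, 3)))
    return matches


# Hypothetical block, keyed by a Wikipedia category.
block = {
    "name": "Join algorithms",
    "references": [
        {"id": "ref-1",
         "title": "An Adaptive Hash Join Algorithm for Multiuser Environments"},
    ],
    "candidates": [
        {"id": "pub-1",
         "title": "An adaptive hash join algorithm for multiuser environments"},
        {"id": "pub-2",
         "title": "The Second Eigenvalue of the Google Matrix"},
    ],
}

matches = match_block(block, threshold=0.9)
print(matches)
```

Since blocks are independent of each other, a function like `match_block` could be applied to many blocks in parallel, which is the property the matching stage relies on.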

Extracting References

Extraction

{{cite journal
| author1 = Hansjörg Zeller
| author2 = Jim Gray
| title = An Adaptive Hash Join Algorithm for Multi-User Environments
| journal = Proceedings of the 16th VLDB conference
| year = 1990
| pages = 186–197
}}

Transformation

<reference type="journal">
<author1>Hansjörg Zeller</author1>
<author2>Jim Gray</author2>
<title>An Adaptive Hash Join Algorithm for Multi-User Environments</title>
<journal>Proceedings of the 16th VLDB conference</journal>
<year>1990</year>
<pages>186–197</pages>
</reference>

Indexing Publications

Read and Transformation

<doc>
<field name="id">10.1.1.49.2550</field>
<field name="title">Selecting Tense, Aspect, and Connecting Words In Language Generation</field>
<field name="author">Bonnie Dorr</field>
<field name="description">Generating language ...</field>
</doc>

Indexing

Figure: Transformed references are stored on HDFS; transformed CiteSeerX documents are added to a Lucene index.
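The conversion of a citation template into an XML reference, as shown in the extraction and transformation steps, can be sketched in Python as follows. This is simple string splitting rather than a full wikitext parser, and the function name `template_to_xml` is hypothetical; nested templates, wiki links, and values containing `|` are not handled.

```python
import xml.etree.ElementTree as ET

TEMPLATE = """{{cite journal
| author1 = Hansjörg Zeller
| author2 = Jim Gray
| title = An Adaptive Hash Join Algorithm for Multi-User Environments
| journal = Proceedings of the 16th VLDB conference
| year = 1990
| pages = 186–197
}}"""


def template_to_xml(template: str) -> ET.Element:
    """Convert a flat {{cite <type> | key = value ...}} template into a
    <reference> element."""
    body = template.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("not a template")
    head, *fields = body[2:-2].split("|")
    # "cite journal" -> reference type "journal"
    ref_type = head.strip().split(None, 1)[1].lower()
    reference = ET.Element("reference", {"type": ref_type})
    for field in fields:
        key, _, value = field.partition("=")
        ET.SubElement(reference, key.strip()).text = value.strip()
    return reference


reference = template_to_xml(TEMPLATE)
print(ET.tostring(reference, encoding="unicode"))
```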



Block generation
Each reference generates a set of candidate publications
Each candidate publication is inserted into all blocks that are listed in the reference

Example

<citation>
<id>26334893</id>
<cat>Hashing</cat>
<cat>Join algorithms</cat>
<ref>
<type>journal</type>
<author>Hansjörg Zeller</author>
<author>Jim Gray</author>
<year>1990</year>
<pages>186-197</pages>
<title>An Adaptive Hash Join Algorithm for Multiuser Environments</title>
<journal>Proceedings of the 16th VLDB conference</journal>
</ref>
</citation>

<citation>
<id>26334893</id>
<cat>Search engine optimization</cat>
<cat>Internet search algorithms</cat>
<cat>Link analysis</cat>
<ref>
<author>Taher Haveliwala</author>
<year>2003</year>
<pages>56-70</pages>
<title>The Second Eigenvalue of the Google Matrix</title>
<journal>Stanford University Technical Report</journal>
</ref>
</citation>

Figure: Each citation sends a query to the full-text index, which sends back a result. The returned candidate publications (10.0.1.1.124, 10.0.1.11.23, 10.0.7.23.14) are placed into blocks named after the citation's categories, such as Hashing and Join algorithms.
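The block generation rule can be sketched as follows. The `TinyIndex` class is a toy in-memory inverted index that only stands in for the full-text index; it is not the Lucene API, and the publication identifiers and the token-overlap ranking are assumptions made for this example.

```python
from collections import defaultdict


class TinyIndex:
    """Toy in-memory inverted index over publication titles."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, pub_id: str, title: str) -> None:
        for token in title.lower().split():
            self.postings[token].add(pub_id)

    def search(self, query: str, limit: int = 10) -> list:
        """Rank publications by the number of query tokens they share."""
        hits = defaultdict(int)
        for token in set(query.lower().split()):
            for pub_id in self.postings.get(token, ()):
                hits[pub_id] += 1
        ranked = sorted(hits, key=lambda p: (-hits[p], p))
        return ranked[:limit]


def build_blocks(citations: list, index: TinyIndex) -> dict:
    """Insert each candidate publication into every block (category)
    listed in the citation that retrieved it."""
    blocks = defaultdict(set)
    for citation in citations:
        candidates = index.search(citation["title"])
        for category in citation["categories"]:
            blocks[category].update(candidates)
    return blocks


index = TinyIndex()
index.add("pub-1", "An adaptive hash join algorithm for multiuser environments")
index.add("pub-2", "The second eigenvalue of the Google matrix")

citations = [
    {"title": "An Adaptive Hash Join Algorithm for Multiuser Environments",
     "categories": ["Hashing", "Join algorithms"]},
    {"title": "The Second Eigenvalue of the Google Matrix",
     "categories": ["Link analysis"]},
]

blocks = build_blocks(citations, index)
for name, members in sorted(blocks.items()):
    print(name, sorted(members))
```

A citation with several categories places its candidates into several blocks, so the same publication can appear in more than one block.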



Distributed Search in MAXIM

(a) Send HTTP request (query)
(b) HTTP response (partial result)
(c) Collect partial results

Figure: Search node 1 sends the query (a) to search nodes 2 to 5, receives their partial results (b), and collects them (c). Every node runs a search engine with a full-text index, a Hadoop data node and task tracker, and a ChuQL engine.
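The scatter-gather pattern behind steps (a) to (c) can be sketched in Python. The HTTP round trip is replaced by a local function call, and the node names, scores, and publication identifiers are invented for illustration; a real deployment would query each node's search engine over the network.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical partial indexes: node name -> list of (score, publication id).
NODES = {
    "node-2": [(0.91, "pub-a"), (0.40, "pub-b")],
    "node-3": [(0.75, "pub-c")],
    "node-4": [],
    "node-5": [(0.88, "pub-d"), (0.10, "pub-e")],
}


def query_node(node: str, query: str, limit: int) -> list:
    """Stand-in for steps (a) and (b): a real node would receive the query
    over HTTP and answer with its local top results. The query is ignored
    here because the scores are canned."""
    return heapq.nlargest(limit, NODES[node])


def distributed_search(query: str, limit: int) -> list:
    """Step (c): query all nodes in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        partials = pool.map(lambda node: query_node(node, query, limit), NODES)
        merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(limit, merged)


top = distributed_search("adaptive hash join", limit=3)
print(top)
```

Each node only needs to return its own top `limit` hits, because no hit outside a node's local top list can appear in the global top list of the same size.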


Applies user-defined similarity functions to candidate pairs
Each attribute can be evaluated by a specific similarity function

Number of candidate pairs

CP = \sum_{i=1}^{n} C_i \cdot R_i    (1)

n: number of blocks B_1, ..., B_n
R_i: number of references in block B_i
C_i: number of candidate publications in block B_i
CP: number of candidate pairs to verify
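A short worked example of equation (1) in Python. The block sizes are invented numbers, chosen only to show how the per-block products add up.

```python
def candidate_pairs(blocks: list) -> int:
    """CP = sum over all blocks of C_i * R_i."""
    return sum(c * r for c, r in blocks)


# Hypothetical blocks as (C_i, R_i): candidate publications, references.
blocks = [(50, 4), (120, 2), (10, 30)]
cp = candidate_pairs(blocks)
print(cp)  # 50*4 + 120*2 + 10*30 = 740
```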

Summary

Wikipedia provides many opportunities for research
Need for efficiently processing semistructured data is increasing
Entity Matching is critical for data integration and data cleaning
Entity Matching is difficult to parallelize due to unbalanced data partitions
MAXIM parallelizes EM by building blocks of similar records in a classification fashion
MAXIM lets users define their own similarity functions and computation functions without changing the algorithm

“Everything that can be invented has been invented.”
(Charles H. Duell, Commissioner, U.S. Office of Patents, 1899)

Experiments

Scaleup and Speedup

Figure: (a) Speedup for all stages and (b) scaleup for the preparation stage, each plotted against the number of nodes (5, 10, 20, 40) next to an ideal curve. Speedup and scaleup are both computed as Base Time / New Time. Series in (a): INDEXING-2000, EXTRACTING-2000, BLOCKING, MATCHING; series in (b): EXTRACTING-2000, INDEXING-2000.

Query Performance

Figure: Query performance for different result set sizes and cluster sizes. Average query response time (ms) against the number of nodes (5, 10, 20, 40) for RESULTCOUNT-50, RESULTCOUNT-100, RESULTCOUNT-150, and RESULTCOUNT-200.

Blocking Accuracy

Figure: Blocking accuracy for different typographical error classes. Accuracy against variance (0 to 1.0) for WRONG-ORDER, MISPLACED-END, MISPLACED-ANY, and MISSING, next to an ideal curve.

Number of Candidate Pairs

Figure: Number of candidate pair verifications in the matching stage. Number of candidate pairs against variance (0.0 to 1.0) for RSCOUNT-50, RSCOUNT-100, RSCOUNT-150, and RSCOUNT-200.
