International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.5, September 2015
DOI : 10.5121/ijdkp.2015.5501
LITERATURE REVIEW OF ATTRIBUTE LEVEL AND
STRUCTURE LEVEL DATA LINKAGE TECHNIQUES
Mohammed Gollapalli
College of Computer Science & Information Technology,
University of Dammam, Dammam, Kingdom of Saudi Arabia
ABSTRACT
Data Linkage is an important step that can provide valuable insights for evidence-based decision making,
especially for crucial events. Performing sensible queries across heterogeneous databases containing
millions of records is a complex task that requires a complete understanding of each contributing
database’s schema to define the structure of its information. The key aim is to approximate the structure
and content of the induced data into a concise synopsis in order to extract and link meaningful data-driven
facts. We identify such problems as four major research issues in Data Linkage: associated costs in pair-
wise matching, record matching overheads, semantic flow of information restrictions, and single order
classification limitations. In this paper, we give a literature review of research in Data Linkage. The
purpose for this review is to establish a basic understanding of Data Linkage, and to discuss the
background in the Data Linkage research domain. Particularly, we focus on the literature related to the
recent advancements in Approximate Matching algorithms at Attribute Level and Structure Level. Their
efficiency, functionality and limitations are critically analysed and open-ended problems have been
exposed.
KEYWORDS
Data Linkage, Probabilistic Matching, Structure Matching, Knowledge Discovery, Data Mining
1. INTRODUCTION
Organizations worldwide have been collecting data for decades. The World Bank [24], the National Climatic Data Centre [49], and countless other private and public organizations have been collecting, storing, processing and analysing massive amounts of data that has the potential to be linked to discover the underlying factors behind critical problems.
Sharing of large databases between organizations is also of growing importance in many data
mining projects, as data from various sources often has to be linked and aggregated in order to
improve data quality, or to enrich existing data with additional information [7]. When integrating
data from different sources to implement a data warehouse, organizations become aware of
potential systematic differences, limitations, restrictions or conflicts which fall under the
umbrella-term data heterogeneity [34]. Poor-quality data is also prevalent in databases for a variety of reasons, including typographical errors and a lack of standards. To be able to query and integrate data in the presence of such uncertainties, as depicted in Fig. 1, a central
problem is the ability to identify whether heterogeneous database tables, attributes and tuples can
be linked with the primary aim to understand the past and predict the future.
In response to the aforementioned challenges, significant advances have been made in recent
years in mining structures of databases with the aim to acquire crucial fact finding information
that is not otherwise available, or that would require time-consuming and expensive manual
procedures. Schemas are definitions that identify the structure of the induced data and are a product of database design. Relational database schemas that are invariant in time hold
valuable information in their tables, attributes and tuples which can aid in identifying
semantically similar objects. The process of identifying these schema structures has been one of
the essential elements of data mining process [21-26]. Accurate integration of heterogeneous
database schema can provide valuable insights that are useful for evidence-based decision
making, especially for crucial events. In the schema integration process, each individual database
can be analysed to provide and extract local schema definitions of the data. These local schema
definitions can be used for the development of a global schema which integrates and subsumes
the local schema in such a way that (global) users are provided with a uniform and correct view
of the global database [19]. With the help of global schema structures, we can derive hierarchical
relationships up to the instance level across datasets. However, without having this global
schema, extracting meaningful data into a usable form can become a tedious process [5, 8, 14, 18,
21, and 26]. Traditional local-to-global schema-based techniques also lack the ability to allow
computational linkage and are not suitable when dealing with heterogeneous databases [2, 5, 8,
18, 57, 61 and 66]. To make things worse, the data could be “dirty” and differences might exist in
the structure and semantics maintained across different databases.
Figure 1. Data linkage across heterogeneous databases
Data linkage (also known as data matching, probabilistic matching, and instance identification) is
the process of identifying records that represent the same real-world entity despite typographical and formatting differences [18, 25, 32, 34, and 37]. In conducting our research, we
observed four prime areas where data linkage is a persistent, yet heavily researched problem.
1. Medical science for DNA sequence matching and biological sequence alignment [12, 18,
21, 47, 56, and 80-84].
2. Government departments for taxation and pay-out tracking [5, 24, 30, 48, and 79].
3. Businesses integrating the data of acquired companies into their centralized systems [2,
36, and 42].
4. Law enforcement for data matching across domains, such as banking and the electoral
commission [24, 30, 33, 49, and 50].
Traditional data linkage approaches use similarity scores that compare tuple values from different attributes, and declare a pair a match if the score is above a certain threshold [2, 10, 18, 61, 67, and
79]. These approaches perform quite well when comparing similar databases with clean data.
However, when dealing with a large amount of variable data, comparison of tuple values alone is
not enough [1, 2]. It is necessary to apply domain knowledge when attempting to perform data
linkage where there are inconsistencies in the data. The same problem applies to database
migrations, and to other data intensive tasks that involve disparate databases without common
schemas. Furthermore, the creation of data linkage between heterogeneous databases requires the
discovery of all possible primary and foreign key relationships that may exist between different
attribute pairs, on a global spectrum [1, 3, 8, 11, and 14-16].
2. TAXONOMY OF DATA LINKAGE APPROACHES
Different techniques have been presented by researchers [18, 32, 34, 35, 43, and 77] in multiple
areas which argue that the need, task, and type of linkage to be performed will define the
involved steps. Other approaches, such as that of Statistics New Zealand [48], lean toward the idea that
data linkage will always require manual preliminary steps such as data classification, sampling
and missing observation detection. However, the fundamental problem that arises each time in
performing data linkage on large volumes of heterogeneous databases is to discover all possible
relationships based on matching similar tuple values that might exist between different table
attributes [1].
In this paper, we review techniques for performing approximate data linkage based
on their approach rationale. We compare the advantages and disadvantages of current approaches
for solving the data linkage problem in multiple ways. Our analysis of existing techniques as
depicted in Fig. 2 will show that there is room for substantial improvement within the current
state-of-the-art and we recommend techniques where further improvements can be made.
2.1. SQL Matching Strategies
SQL Matching techniques [14, 21, 22, 23, 25 and 26] perform data linkage using simple SQL-
LIKE commands and SQL Extensions. The advantage of SQL matching techniques is that they
help in performing quick data linkage across databases. However, they do not perform well in
cases where comparison and identification of data structures need to be performed on large
databases containing noisy data without proper unique keys, foreign key relationships, indexes,
constraints, triggers, or statistics. Another drawback of the SQL matching process is that it performs an |m| x |n| column match, where m and n are the total tuple counts in the two databases, resulting in a very slow, expensive and tedious process.
Figure 2. Data linkage approaches
A variation of SQL Matching includes extending query syntax functionalities to perform data
linkage. The proposed SQL-LIKE Command languages [22, 23 and 26] handle data
transformation, duplicate elimination and cleaning processes supported by regular SQL Query
and a proposed execution engine. However, these techniques demand that users have significantly advanced SQL scripting skills, familiarity with the proposed extended functionalities, and sound domain knowledge. Thus, syntax-based SQL matching techniques have proven to be less attractive in real-world scenarios [22].
Research communities have also stressed Schema Pattern Matching [21 to 26] and SQL Querying
[27, 28]. Schema Pattern Matching uses database schema to devise clues as to the semantic
meaning of the data. Constraints are used to define requirements, generated by hand or through a
variety of tools. However, the main problems with Schema Pattern Matching are insufficiency
and redundancy. SQL Querying, on the other hand, uses a SQL query language such as the
Resource Description Framework (RDF) [27, 28] to define matching criteria. Difficulties arise
when restrictions eliminate the discovery of possible matches. More relaxed queries use a
structure-free mechanism by applying a tree pattern query; however, tree-pattern queries are
highly inaccurate due to a high incidence of incorrect manual identification of relationships [29].
2.2. Exact Matching Strategies
Unlike SQL Matching, Exact Matching techniques give more insight into the content and
meaning of schema elements [25]. Exact matching uses a unique identifier present in both
datasets being compared. The unique identifier can only be linked to one individual item, or an
event (for example, a driver’s license number). The Exact Matching technique is helpful in
situations where the data linkage to be performed belongs to one data source. For example,
consider a company with a recent system crash willing to perform data linkage between the
production data source file and the most recent tape backup file to trace transactions. In such
situations, Exact Matching would likely suffice in performing data linkage. A specific variation
of exact matching identified in this research is the Squirrel System [31], which uses a declarative specification language, ISL, to specify matching criteria that match one record in a given table with one record in another table. However, the exact matching approach leaves no room for
uncertainty; records are either classified as a match or as a non-match. Problems often arise when
the quality of the variables does not sufficiently guarantee the unique identifier is valid [16].
Exact matching comparison does not suffice for matching records when the data contains errors,
for example typographical mistakes, or when the data have multiple representations, such as
through the use of abbreviations or synonyms [10].
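A minimal sketch of exact matching, assuming hypothetical driver's licence numbers as the shared unique identifier, is shown below; each record is either a match or a non-match, with no room for uncertainty.

# Exact matching on a shared unique identifier (values are invented).
production = {"DL1001": "John Smith", "DL1002": "Mary Jones"}
backup     = {"DL1001": "John Smith", "DL1003": "Peter Brown"}

matches     = {k for k in production if k in backup}
non_matches = set(production) ^ set(backup)      # identifiers found on one side only
print(matches)       # {'DL1001'}  - no notion of a "possible" match
print(non_matches)   # {'DL1002', 'DL1003'}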
2.3. Approximate Matching Strategies
Approximate matching is also known as the probabilistic approach [34 to 36] within the research
community and is a highly recommended, state-of-the-art alternative to exact matching. In approximate matching techniques, data linkage is performed on a likelihood basis
(i.e. performing matching based on the success threshold ratio). Output results can vary in
different formats, such as "match, possible match, non-match" decisions, Boolean true/false matches, nearest/outermost distance matches, or discrete and continuous match scores. Variations of the approximate matching technique include statistical and probabilistic solutions
for similarity matching. Attention has also been drawn to approximate matching techniques from
different research arenas, including statistical mathematics and bio-medical sciences.
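The likelihood-based decision can be illustrated in its simplest form as below; the two cut-off values are illustrative assumptions rather than values prescribed by any of the reviewed techniques.

def classify(score: float, upper: float = 0.85, lower: float = 0.60) -> str:
    """Three-way approximate-matching decision based on a similarity score."""
    if score >= upper:
        return "match"
    if score >= lower:
        return "possible match"
    return "non-match"

for s in (0.92, 0.70, 0.30):
    print(s, classify(s))   # match, possible match, non-match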
Due to the variety of proposed approaches and the level of attributes match, we have focused our
research and classified most common approximate matching techniques into attribute level
matching and structure level matching groups discussed in the next two sections. It is important
to note that the purpose of this paper is not to list every data linkage technique, but rather to discuss the multitude of approximate matching techniques available in the areas of attribute and structure level matching. At the end of this paper, we present our conclusions and recommendations for future work.
3. ATTRIBUTE LEVEL MATCHING
Attribute Matching, also known as Field Matching [35] and Static String Similarity [36], deals with one-to-one matches across different data sources. A challenging task of attribute matching is
to perform data linkage across data sources by comparing similar matching records with the
assumption that the user is aware of the database structure. Individual record fields are often
stored as strings, meaning that functions which accurately measure the similarity of two strings
are important for deduplication [36]. In the following subsections, we describe the most commonly used attribute matching methodologies and discuss their efficiency.
3.1. Linguistic similarity
Linguistic techniques focus on phonetic similarities between strings. The rationale behind this
approach is that strings which sound alike may be spelled with different characters, so phonetic codes are used to locate potential matches. Soundex [34] is the most widely known technique in this area; it assigns digit codes to letters, while the remaining non-coded letters (A, E, I, O, U and Y) act as separators, and identical codes that are not separated by such letters are collapsed. Through the Soundex rules, a possible
match is determined or denied. Advantages of linguistic techniques include the exposure of about
2/3 of spelling variations [25, 32, and 34]. However, linguistic methods are not equally effective
from one ethnicity to the next. Linguistic-based techniques were designed for Caucasian names; they work on most other ethnicities but largely fail on East Asian names due to phonetic differences. NYSIIS [34] improved upon this by maintaining vowel placement and converting all
vowels to the letter A. Nonetheless, it is still not perfectly accurate and performs best on
surnames and not on other types of data [34].
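A simplified Soundex sketch is given below; it keeps the first letter, codes the remaining consonants, drops vowels and collapses adjacent identical codes, but omits some finer rules of the full algorithm (for example the special treatment of H and W).

def soundex(name: str) -> str:
    """Simplified Soundex: first letter kept, consonants coded, vowels dropped."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    result, prev = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")        # vowels, H, W and Y map to ""
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # R163 R163 -> possible phonetic match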
3.2. Rule/Regular expression
The Rule / Regular expression [40] approach uses rules or a set of predefined regular expressions to perform matching on tuples. The Regular Expression Pattern proposed in [40] is more flexible
than regular expression alone, which is built from alphabetical elements. This is also because the
Regular Expression Pattern is built from patterns over a data element, allowing the use of
constructs such as “wildcards” or pattern variables. Regular Expression Pattern is quite useful
when manipulating strings, and can be used in conjunction with basic pattern matching.
However, the problem with this approach lies in the fact that it is relatively domain specific and
tends to only work well on strings.
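As a small illustration, the sketch below uses a hypothetical regular-expression pattern to normalise phone-number tuples before they are compared; the pattern and the field are invented for illustration only.

import re

# Hypothetical pattern for landline numbers written as "(07) 3365-1111".
PHONE = re.compile(r"^\(?(\d{2})\)?[\s-]?(\d{4})[\s-]?(\d{4})$")

def normalise_phone(value: str):
    m = PHONE.match(value.strip())
    return "".join(m.groups()) if m else None

# Differently formatted tuples normalise to the same value and therefore match.
print(normalise_phone("(07) 3365-1111") == normalise_phone("07 3365 1111"))   # True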
3.3. Ranking
Ranking [15, 41] methods determine preferential relationships and have been more recently
recognized by researchers as a necessary addition to structure based matching techniques. Search
engines have used ranking methods, such as Google's PageRank, for some time, although such algorithms are not suited to matching noisy data due to its poor connectivity and lack of referrals
[15]. Therefore, ranking extensions which simultaneously calculate meaning and relevance are
researched. Thus far, only a few ranking methods have been proposed, including inductive logic programming, probabilistic relational kernels, and complex object ranking [15, 41].
3.4. String distance
String distance methods, also known as character-based similarity metrics [34], perform data linkage based on the cost of transforming one string into another. The cost is estimated from the number of characters that need to be inserted, replaced or deleted for a possible string match. For example, Fig. 3 shows the cost associated with editing the string "Aussie" into
“Australian” (the “+” sign shows addition, the “-“ sign shows deletion, and the “x” sign shows
replacement).
Figure 3. An example of string distance technique
Experimental results in [34] have shown that the different distance based methodologies
discovered so far are efficient under different circumstances. Some of the commonly
recommended distance-based metrics include Levenshtein distance, Needleman-Wunsch distance, Smith-Waterman distance, affine-gap distance, the Jaro and Jaro-Winkler metrics, Q-gram distance, and positional Q-gram distance. Through the various methods, costs are assigned
to compensate for pitfalls in the system. Yet, overall, string distance pattern is most effective for
typographical errors, but is hardly useful outside of this area [34].
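A minimal sketch of the classic Levenshtein computation is shown below; it reproduces the kind of edit cost illustrated in Fig. 3.

def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance with unit insert/delete/replace costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # replacement
        prev = curr
    return prev[-1]

print(levenshtein("Aussie", "Australian"))   # 6: two replacements and four insertions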
3.5. Term frequency
The term frequency [43] approach determines the frequency of strings in a relation in order to favour matches of less common strings and penalize more common ones. Term frequency methods allow for more commonly used strings to be left out of the similarity equation. TF-IDF
[43] (Term Frequency-Inverse Document Frequency) is a method using the commonality of the
term (TF) along with the overall importance of the term (IDF). TF-IDF is commonly used in
conjunction with cosine similarity in the vector space model. Soft TF-IDF [44] adds similar
token pairs to the cosine similarity computation. According to the researchers in [44], TF-IDF
can be useful for similarity computations due to its ability to give proportionate token weights.
However, this approach fails to make distinctions between the similarity level of two records with
the same token or weight, and is essentially unable to determine which record is more relevant.
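The sketch below builds plain TF-IDF vectors over tokenised records and compares them with cosine similarity; the example records and the log-based IDF weighting are illustrative assumptions rather than the exact formulation of [43, 44].

import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple TF-IDF vectors for a list of tokenised records."""
    df = Counter(t for doc in docs for t in set(doc))        # document frequency
    n = len(docs)
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

recs = [["john", "smith", "sydney"], ["jon", "smith", "sydney"], ["mary", "jones", "perth"]]
vecs = tfidf_vectors(recs)
print(cosine(vecs[0], vecs[1]))   # nonzero similarity from the shared, weighted tokens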
3.6. Range pattern
Range pattern matching returns a Boolean style true or false result if the specified tuples fall
within the specified range. Similarity or dissimilarity is determined when the elements of the data
are compared against the predetermined range. Range matching will return a 0 or 1, with 0 being
false and 1 being true. Range pattern matching is often used as an expansion of an algorithm to
filter results. For example, TeenyLIME [45] expands upon LIME by adding range pattern
capabilities, giving TeenyLIME the ability to define the range of its results. A drawback of the
range pattern approach is that it is often not powerful enough to perform matching without a high
level of query knowledge. For example, if a query is made to search for nearby locations, an
optimal range is often not given or is defined by words having various meanings, causing range
pattern matching to produce inaccurate results.
3.7. Numeric distance
Numeric distance methods are used to quickly perform data linkage on tuples that contain numerical values and do not require complex character-level string comparison. Hamming distance
[46], for example, is used for numeric values such as zip codes, and counts the variations between
two records. Due to the limitations of numeric data type constraints, it has not received much
attention. Numeric distance methods are best used in combination with other techniques.
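A minimal Hamming-distance sketch for fixed-length numeric codes such as zip codes:

def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length values")
    return sum(x != y for x, y in zip(a, b))

print(hamming("90210", "90219"))   # 1 differing position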
3.8. Token matching
Token-based matching compares fields while ignoring the ordering of the tokens (words) within these fields. The token-based approach uses tokenization, the separation of strings into a series of tokens, to perform matching. It assigns a token to each word in the string and attempts a match by ignoring token order and looking for similar tokens. The token-based approach
attempts to compensate for the inadequacies of character-based metrics, specifically the inability
to detect word order arrangement. A tokenizer performs the operation, taking into account
characters, punctuation marks, blank spaces, numbers, and capitalisation. Token based methods
treat a string as a word set and accommodate duplicates. For example, Cosine Similarity [38]
is used to perform data linkage based on record strings, irrespective of word ordering within the
string. The Cosine Similarity methods are effective over a range of entry types, and also have the
advantage of considering word location to allow for swapping of word positions. For data
containing a large amount of text, the token based matching works quite well, as it can handle
repeating words. Optimisation of the token-based approach has typically included aggregation of different sources. A potential drawback is that token-based matching does not preserve sub-string order and can produce false matches.
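A minimal sketch of order-insensitive token matching; the tokenizer (lower-casing and punctuation stripping) and the overlap measure are illustrative choices.

import re

def token_set(s: str) -> set:
    """Tokenise: lower-case, strip punctuation, split into word tokens."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def token_overlap(a: str, b: str) -> float:
    ta, tb = token_set(a), token_set(b)
    return len(ta & tb) / max(len(ta | tb), 1)

# Word order is ignored, so re-ordered names still score highly.
print(token_overlap("Smith, John A.", "john a smith"))   # 1.0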
3.9. Weight pattern
Weight pattern, also referred to as Scoring [47], is applied to matching strings to return a
numerical weight; a positive weight for agreeing values and a negative weight for disagreeing
values. As two records are compared, the system assigns a weight value for similarity
comparison. Composite weight [48] is a summation of all the field weights for a record pair, which is equivalent to multiplying the probability ratios of the individual values. Reliability of the information, commonality of the
values, and similarity between the values are considered in determining weight. Determinations
are made by calculating the “m” probability (reliability of data) and the “u” probability (the
commonness of the data). For example, IDF weights consider how often a particular value is
used. After weights are determined for all the data, cut-off thresholds are set to determine the
comparison range. Unfortunately, weight pattern techniques do not perform well when there are
data inconsistencies. True matches may have low weights, and non-matches may have high
weights as a result of simple data errors [48].
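One common way of turning the m and u probabilities into agreement/disagreement log-weights and a composite weight is sketched below; the field names and probability values are invented for illustration.

import math

def field_weight(agree: bool, m: float, u: float) -> float:
    """Positive log-weight when a field agrees, negative when it disagrees."""
    return math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))

# Hypothetical m/u estimates for two fields of one candidate record pair.
fields = [("surname", True, 0.95, 0.01), ("postcode", False, 0.90, 0.05)]
composite = sum(field_weight(agree, m, u) for _, agree, m, u in fields)
print(composite)   # compared against upper and lower cut-off thresholds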
3.10. Gram sequence
Gram sequence based techniques compare the sequence of grams of one string with the sequence
of grams of another string. n-grams is a gram based comparison function which calculates the
common characters in a sequence, but is only effective for strings that have a small number of
missing characters [46]. For example, the strings “Uni” and “University” share the 2-grams {un, ni}. The q-gram method [85] involves generating short substrings of length q using a sliding window, typically with padding at
the beginning and end of a string [85]. The q-gram method can be used in corporate databases
without making any significant changes to the database itself [85]. Theoretically, two similar
strings will share multiple q-grams. Positional q-grams record the position of q-grams within the
string [14]. Danish and Ahy in [85] proposed to generate q-grams along with various processing
methods such as substrings, joins, and distance. Unfortunately, the gram sequence approach is
only efficient for short string comparison and becomes complex, expensive and unfeasible for
large strings [85].
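A minimal q-gram sketch with a padded sliding window; the '#' padding character and the overlap measure are illustrative assumptions rather than the exact formulation of [85].

def qgrams(s: str, q: int = 2) -> set:
    """Generate q-grams with start/end padding so boundary characters are kept."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / max(len(ga | gb), 1)

print(qgrams("Uni"))                           # {'#u', 'un', 'ni', 'i#'}
print(qgram_similarity("Uni", "University"))   # 0.25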
3.11. Blocking
Blocking [46] techniques separate tuple values into sets of blocks/groups. Within each of these
blocks, comparisons are made. Sorted Neighborhood is a blocking method which first sorts and
then slides a “window” over the data to make comparisons [46]. BigMatch [51], used by the U.S.
Census Bureau, is another blocking technique. BigMatch identifies pairs for further processing
through a more sophisticated means. The blocking function assigns a category for each record
and identical records are given the same category. The disadvantage of the blocking method is
that it will not work for records which have not been given the same category [18, 25, and 34].
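A minimal sorted-neighbourhood sketch is shown below: records are sorted on a blocking key and only records that fall inside a small sliding window become candidate pairs; the key and window size are illustrative.

def sorted_neighbourhood_pairs(records, key, window=3):
    """Sort on a blocking key; only records within the sliding window are paired."""
    ordered = sorted(records, key=key)
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            yield rec, other          # candidate pair for detailed comparison

rows = [{"surname": "smith"}, {"surname": "smyth"},
        {"surname": "jones"}, {"surname": "johns"}]
for a, b in sorted_neighbourhood_pairs(rows, key=lambda r: r["surname"]):
    print(a["surname"], b["surname"])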
3.12. Hashing
Hashing methods convert attributes into a sequence of hash values which are compared for
similarity matching between different sets of strings. Hashing methods require conversion of all
the data to find the smallest hash value, which could be a costly approach. Set-of-sets [8] is a
hashing based data matching technique which works reasonably well in smaller string matching
scenarios. The set-of-sets technique proposed in [8] divides strings into 3-grams and assigns a
hash value to each tri-gram. Once hash values are assigned and placed in a hash bag, only the
lowest matching hash values are considered for matching. Unfortunately, this technique doesn’t
yield accurate results when dealing with variable length strings and uses traditional hashing
which results in completely different hash values for even a small variation [79]. Furthermore,
the Set-of-sets requires conversion of all the data prior to comparison in order to find the smallest
hash value, which could be a costly approach. To overcome this disadvantage, the h-gram (hash
gram) method was proposed in [79] to address the deficits of the set-of-sets technique, by
extending the n-gram technique; utilizing scale based hashing; increasing matching probability;
and by reducing the cost associated in storage of hash codes.
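A loose sketch in the spirit of the set-of-sets idea [8] is given below: strings are split into 3-grams, each 3-gram is hashed, and only the smallest hash value per seed is kept, so strings sharing many 3-grams tend to share signature entries. The seeds and the use of Python's built-in hash are illustrative simplifications.

def trigrams(s: str) -> set:
    s = s.lower()
    return {s[i:i + 3] for i in range(max(len(s) - 2, 1))}

def min_hash_signature(s: str, seeds=range(4)) -> tuple:
    """Keep only the smallest hash value per seed over the string's 3-grams."""
    grams = trigrams(s)
    # Python's hash is randomised per process, but is consistent within one run.
    return tuple(min(hash((seed, g)) for g in grams) for seed in seeds)

sig_a = min_hash_signature("Mohammed")
sig_b = min_hash_signature("Mohamed")
print(sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a))   # fraction of shared minima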
3.13. Path sequence
The path sequence approach such as in [37] examines the label sequences, and compares them to
the labelled data. The distance is measured by determining the similarity between the last
elements of a path. The prefix can be considered, but this only affects the result to a certain
degree, and becomes less relevant with increasing distance between the prefix and the end of the
sequence.
3.14. Conditional substrings
Substring matching such as in [53] expands upon string-based techniques by adding substring
conditions to string algorithms. Distance measurements are calculated for the specified substring,
in which all substring elements must satisfy the distance threshold. A frequent complication
related to conditional substring based matching involves the estimation of the size of intersection
among related substrings. Clusters and q-grams [2, 4, and 53], which are commonly used in string
estimation, are not applicable in substring based techniques, because substring elements are often
dissimilar. As a result, substring matching is hindered by an abundance of possibilities, which
must all be considered.
3.15. Fuzzy Matrix
Fuzzy Matrix [32, 60] places records in the form of matrices and applies fuzzy matching techniques to perform record matching. Commonly used by social scientists to analyse
behavioural data, the fuzzy matrix technique is also applicable to many other data types. When
considering a fuzzy set, a match is not directly identified as positive or negative. Instead, the
match is considered on its degree level of agreement with the relevant data. As a result, a
spectrum is created which identifies all levels of agreement or truth.
3.16. Thesauri Matching
Thesauri based matching attempts to integrate two or more thesauri. A thesaurus is a kind of
lexicon to which some relational information has been added, containing hyponyms which give
more specific conceptual meaning. WordNet [27, 32, and 52] is a public domain lexical database,
or thesaurus, which makes its distinctions by grouping words into sets of synonyms; it is often
used in thesauri matching techniques. Falcon and DSSim [52] are thesauri based matching tools
which incorporate lexicons, edit-distance and data structures. LOM [32] is a lexicon-based
mapping technique using four methods (whole term, word constituent, synset, and type matching)
in an attempt to reduce the required amount of human labour, but does not guarantee any level of
accuracy. While thesauri-based approaches can be extremely useful in merging conceptual, highly descriptive information, they can be incredibly complex and difficult to automate to a significant degree, and human experts are typically required to quality-assure the relationships [27]. Thesauri matching algorithms also need to consider the best balance between precision and recall.
4. STRUCTURE LEVEL MATCHING
Structure level matching is used when the records being matched need to be fetched from a
combination of records (i.e. when attempting to match noisy tuples across different domains, and
requiring more than one match). These techniques perform data matching, with the main intuition
that the grouping of attributes into clusters followed by performing matching provides a deeper
analysis of related content and semantic structure. This process was initially considered for
discovering candidate keys and dependent keys. However, one of the biggest challenges involved
in this process has been the large number of combinations required for grouping attributes and
performing data matching between these groups, which can be costly and time consuming [25,
32, 34 and 37]. Large scale organisations such as Microsoft and IBM have introduced
Performance Tuner tools for indexing combined attributes on which queries are frequently
executed. Unfortunately, these tools are suited to database developers / DBAs who have sound knowledge of SQL querying and are not ideal for novice users. As such, research has
taken new directions by classifying multiple structure level techniques that require matching
across multiple attributes. We have classified principal techniques in the following subsections.
4.1. Iterative pattern
Iterative pattern is the process of repeating a step multiple times (or making “passes”) until a
match is found based on similarity scores and blocking variables (variables set to be ignored for
similarity comparison). The Iterative approach uses attribute similarity, while considering the
similarity between currently linked objects. For example, the Iterative pattern method will
consider a match of “John Doe” and “Jonathan Doe” as a higher probability if there is additional
matching information between the two records (such as spouse’s name and children’s names).
The first part of the process is to measure string distance, followed by a clustering process.
Iterative pattern methods have proven to detect duplicates that would have likely been missed by
other methods [54]. The gains are greater when the mean size of the group is larger, and smaller
when the mean size is smaller. Disadvantages surface when distinctive cliques do not exist for the
entities or if references for each group appear randomly. Additionally, there is also the
disadvantage of cost, as the Iterative pattern method is computationally quite expensive [54].
4.2. Tree pattern
Tree pattern is based on decision trees with ordered branches and leaves. The nodes are compared
based on the extracted tree information. CART and C4.5 are two widely-known decision tree
methods which create trees through an extensive search of the available variables and splitting
values [55]. A Tree pattern starts at the root node and recursively partitions the records into each
node of the tree and creates a child to represent each partition. The process of splitting into
partitions is determined by the values of some attributes, known as splitting attributes, which are
chosen based on various criteria. The algorithm stops when there are no further splits to be made.
Hierarchical verification through trees examines the parent once a matching leaf is identified. If
no match is found within the parent, the process stops; otherwise the algorithm continues to
examine the grandparent and further up the tree [37]. Suffix trees such as DAWG [37] build the
tree structure over the suffixes of S, with each leaf representing one suffix and each internal node
representing one unique substring of S. DAWG has the additional feature of failure links added for those letters which are not in the tree. The disadvantage of the tree pattern lies in a lengthy and time-consuming process, with manual criteria often needed for splitting.
4.3. Sequence pattern
Sequence pattern methods perform data linkage based on sequence alignment. This technique
attempts to simulate a sequential alignment algorithm, such as the BLAST (Basic Local
Alignment Search Tool) [12] technique used in Biology. The researchers compared the data
linkage problem with the gene sequence alignment problem for pattern matching, with the main
motivation of reusing already invented BLAST tools and techniques. The algorithm translates record
string data into DNA sequences, while considering the relative importance of tokens in the string
data [12].
Further research in the sequence pattern area has exposed variations based on the type of
translation used to translate strings into DNA Sequence (i.e. weighted, hybrid, and multi-bit
BLASTed linkage) [12]. BLASTed linkage has advantages through the careful selection of one of
its four variations, as each variation performs well on specific types of data. Unfortunately,
sequence pattern tends to perform poorly on particular data strings, depending upon the error rate,
importance weight, and number of common tokens [12].
4.4. Neighbourhood pattern
The neighbourhood approach [7, 59] attempts to understand and measure data distributions according to their pattern matches, and is a primary component in identifying statistical patterns. By using the nearest neighbour approach, related data can be clustered even if it is specifically separated.
The logic behind this approach is based on the assumption that, if clustered objects are similar,
then the neighbours of clustered objects have a higher likelihood of also being similar.
A key downfall of the neighbourhood pattern is that a number of factors need to be carefully considered in order to determine pattern matches.
4.5. Relational hierarchy
Relational Hierarchy techniques use primary and foreign key relationships to understand related
table content in order to perform data linkage. Relational hierarchy forms relation links which
connect concepts within various categories. It breaks down the hierarchical structure and the top-
level structure contains children sets. The relational hierarchy technique compares and calculates
the co-occurrence between tuples by measuring the overlap of the children sets. A high degree of
overlap will indicate a possible relationship between the two top level categories [57]. Relational
Hierarchy techniques are only effective when primary and foreign key relationships have been
established. Raw data, without predefined relationships, cannot be linked using this approach.
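A minimal sketch of the children-set overlap idea, using two invented top-level categories:

# Child sets derived from (hypothetical) foreign-key relationships.
children = {
    "Customer": {"order_id", "invoice_id", "address_id"},
    "Client":   {"order_id", "invoice_id", "contact_id"},
}

a, b = children["Customer"], children["Client"]
overlap = len(a & b) / min(len(a), len(b))
print(overlap)   # ~0.67 -> high overlap suggests the two categories are related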
4.6. Clustering/Feature extraction
Clustering, also known as the feature extraction method, performs data linkage based on common matching criteria, grouping objects into clusters so that objects within a cluster are similar. Soft clustering [61], or
probabilistic clustering, is a relaxed version of clustering which uses partial assignment of a
cluster centre. The SWOOSH [62] algorithms apply ICAR properties (idempotence,
commutativity, associativity, representativity) to the match and merge function. With these
properties and several assumptions, researchers introduced the brute force algorithm (BFA),
including the G, R and F SWOOSH algorithms [44]. SIMCLUST is another similarity based
clustering algorithm which places each table in its own cluster as a starting point and then works
its way through all of the tables by consecutively choosing two tables (clusters) with the highest
level of similarity. The iDisc system proposed in [5] creates database representations through a
multi-process learning technique. Base clusters are used to uncover topical clusters which are
then aggregated through meta-clustering. Clustering in general can get extremely complex (such
as forming clusters using semantics) and needs to be handled carefully while discovering
relationships between matching clusters.
4.7. Graphical statistic
Graphical statistic is a semi-automated analysis based technique where data linkage is performed
based on the results obtained on the graph. Such representations illustrate the topical database
structure through tables. The referential relationship indicates an important linkage between two
separate tables. Foreign keys within one table may refer to keys within the second table.
However, problems with this technique often arise due to the fact that information on foreign
keys is often missing [5].
4.8. Training based
The training-based technique is a manual approach in which users are constantly involved in providing statistical data based on previous and future predictions. In [7], researchers presented a two-step
training approach using automatically selected, high quality examples which are then used to
train a support vector machine classifier. The approach proposed in [7] outperforms k-means
clustering, as well as other unsupervised methods. The Hidden Markov training model, or HMM,
standardises name and address data as an alternative method to rule-based matching. Through use
of lexicon-based tokenization and probabilistic hidden Markov models, the approach attempts to
cut down on the heavy computing investment required by rule programming [64]. Once trained,
the HMM can determine which sequence of hidden states is most likely to have emitted the
observed sequence of symbols. When this is identified, the hidden states can be associated with
words from the original input string. This approach seems advantageous in that it cuts down on
time costs when compared to rule-based systems. However, this approach remains a lengthy
process, and has shown to run into significant problems in various areas. For instance, HMM
confuses given, middle, and surnames, especially when applied to homogeneous data.
Furthermore, outcomes proved to be less accurate than those of rule-based systems [64].
DATAMOLD [65] is a training-based method which enhances HMM. The program is seeded
with a set of training examples which allows the system to extract data matches. A common
problem with training techniques is that they require many examples to be effective; the system will not perform without an adequate training set [55].
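A rough sketch, loosely in the spirit of the SVM-based classification in [7], is given below: candidate record pairs are described by similarity features and a support vector machine (here scikit-learn's SVC, an assumed external dependency) is trained on labelled examples; all feature values and labels are invented.

from sklearn.svm import SVC

# Each row: [name_similarity, address_similarity, date_of_birth_agreement].
X_train = [[0.95, 0.90, 1.0], [0.90, 0.70, 1.0], [0.30, 0.20, 0.0], [0.10, 0.40, 0.0]]
y_train = [1, 1, 0, 0]                      # 1 = match, 0 = non-match

clf = SVC(kernel="rbf").fit(X_train, y_train)
print(clf.predict([[0.85, 0.60, 1.0]]))     # classify an unseen candidate pair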
4.9. Pruning/Filtering statistic
The pruning statistic performs data linkage by trimming similar records in a top-down approach. In
[16], the data cleaning process of “deduplication” involves detecting and eliminating duplicate
records to reduce confusion in the matching process. For data which accepts a large number of
duplicates, pruning, before data matching, simplifies the process and makes it more effective. A
pruning technique proposed by Verykios [34] recommends pruning on derived decision trees
used for classification of matched or mismatched pairs. The pruning function reduces the size of
the trees, improving accuracy and speed [34]. The pruning phase of CORDS [16] (which is
further discussed in the statistical analysis section) prunes non-candidates on the basis of data
type, properties, pairing rules, and workload; such tasks are done to reduce the search space and
make the process faster for large datasets. Pruning techniques [37] are based on the idea that it is
much faster to determine non-matching records than matching records, and therefore aim to
eliminate all non-matching records which do not contain errors. However, the disadvantage of
such techniques is that they are not suitable in identifying matches of any type, and must be
combined with another matching technique.
4.10. Enrichment pattern
Enrichment patterns are a continuous improvement based technique which performs data linkage
by enriching the similarity tasks on a case by case basis. An example of the enrichment method is
ALIAS [34], a learning-based system, designed to reduce the required amount of training
material through the use of a “reject region”. Only pairs with a high level of uncertainty require
labels. A method similar to ALIAS is created using decision trees to teach rule matching in [34].
OMEN [32] enriches data quality through the use of a Bayesian Net, which uses a rule set to
show related mappings. Semantic Enrichment [66] is the annotation of text within a document by
semantic metadata, essentially allowing free text to be converted into a knowledge database
through data extraction and data linking. Conversion to a knowledge database can be through
exact matching or by building hierarchical classifications of terms; text mining techniques allow
annotation of concepts within documents which are subsequently linked to additional databases.
Thesauri alignment [32, 52] based techniques are also considered as part of enrichment
techniques because they combine concepts and better define the data. The problems associated with the enrichment approach include a substantial investment of time and the requirement for
extensive domain knowledge.
4.11. Multi pattern
The multi (multiple) pattern approach performs data linkage through the simultaneous usage of
different matching techniques. This approach best fits when one does not know which technique
performs better. The researchers in [31] use a multi approach which combines sequence
matching, merging, and then exact matching. Febrl [67] is open-source software containing
comparison functions and record pair classification methods. Febrl results are conveniently presented in a
graphical user interface which allows the user to experiment with numerous other methods [67].
TAILOR [46] is another example which uses three different methods to classify records: decision
tree induction, unsupervised k-means clustering, and a hybrid approach. GLUE [68] is yet
another matching technique allowing for multiple matching methods. GLUE performs matching
by first identifying the most similar concepts. Once these concepts are identified, a multi-strategy
learning approach allows the user to choose from several similarity measures to perform the
measurement.
4.12. Data constraints
Data constraints, also known as internal structure based techniques, apply a data constraint filter
to identify possible matches [43]. The constraint typically uses specific criteria of the data
properties. This technique is not suited when used on its own, and performs best for the
elimination of non-matches, as a pre-processing method before a secondary method, such as
clustering. Furthermore, data constraints do not handle the large number of uncertainties present within the data. Hence, adding constraints for each uncertainty is computationally infeasible.
4.13. Taxonomy
Taxonomy based methods use taxonomies, a core aspect of structural concepts which are largely
used in file systems and in knowledge repositories [69]. This approach uses the nodes of
taxonomy to define a parent/child relationship within the conceptual information and create
classifications. Using specified data constraints, the taxonomies of multiple data sources are evaluated using a technique known as the structural similarity measure. For example, in [70]
researchers used a taxonomy mapping strategy to enrich WordNet with a large number of
instances from Wikipedia, essentially merging the conceptual information from the two sources.
As with similar methods, taxonomy based matching requires a significant degree of domain
knowledge and performs with limited precision and inadequate recall.
4.14. Hybrid match
Hybrid techniques use a combination of several mapping methods to perform data matching. A
prime example of the hybrid method is described in [71], which uses a combination of syntactic
and semantic comparisons. The rationale behind hybrid matching is that the semantics alone is
not sufficient to perform accurate matching and could be inconsistent. The hybrid solution
consists of a hybrid of semantic and syntactic matching algorithms which considers individual
components. The syntactic match uses a similarity score based on class, prefix and substring, and
the semantic match uses a similarity score based on cognitive measures such as LSA, Gloss
Vector, and WordNet Vector. The information is aggregated and entered into a matrix and
experts are used to determine domains within the selected threshold.
4.15. Data extraction
Data extraction primarily involves extracting semantic data. Data extraction can be performed manually or through induction and automatic extraction [72]. In [73], researchers used data
recognisers to perform data extraction on the semantics of data. The recogniser method is aimed
at reducing alignment after extraction, speeding up the extraction process, reusing existing
knowledge, and cutting down on manual structure creation. This approach is found to be effective
for simple unified domains, but not for complicated, loosely unified domains. Another benefit of
the data extraction technique is that, after the data is extracted, it can be handled as instances in a
traditional database. However, it generally requires a carefully constructed extraction plan by an
expert in that specific knowledge domain [74].
4.16. Knowledge integration
Knowledge integration techniques are used to enhance the functioning of structure level matching
by integrating knowledge between data relationships to form a stronger concept base for
performing data linkage [75]. Knowledge integration enhances query formulation when the
information structure and data sources are not known, as highlighted in [76], and is becoming
increasingly important in data matching processes as various data structures conceptualise the
same concept in different ways, with resulting inconsistencies and overlapping material.
Integration can be based on extensions or concepts, and is aimed at identifying inconsistencies
and mismatches in the concepts. For example, the COIN technique [77] addresses data-level
heterogeneities among data sources expressed in terms of context axioms and provides a
comprehensive approach to knowledge integration. An extension of COIN is ECOIN, which
improves upon COIN through its ability to handle both data-level and ontological heterogeneities
in a single framework [77]. Knowledge integration is highly useful in medicine, to integrate
concepts and information within various medical data sources. Knowledge integration involves
the introduction of a dictionary to fill knowledge gaps, such as using distance-based weight
measurement through Google [68]. For example, the Foundational Model of Anatomy is used as
a concept roadmap to better integrate various medical data sources into unique anatomy concepts
[68].
4.17. Data structures
Data structures use structural information to identify matches and reflect relationships. Information
properties are often considered and compared with concepts to make a similarity determination,
while other variations of the data structure approach uses graphical information to create
similarities [68]. A drawback of the data structure based approach results from its consumption
rate of resources; the process builds an “in-memory” graph containing paired concepts which can
lead to memory overflow.
4.18. Statistical analysis
Statistical analysis techniques examine statistical measurements for determining term and
concept relationships. The Jaccard Similarity Coefficient [38] is a widely used statistical measurement for comparing terms, which considers the extent of overlap between two vectors. The
measurement is the size of the intersection, divided by the size of the union of the vector
dimension sets. Considering the corpus, the Jaccard Similarity approach determines a match to be
present if there is a high probability for both concepts to be present within the same section. For
attribute matching, a match is determined if there is a large amount of overlap between values
[38]. For example, CORDS [16] is a statistical matching tool, built upon B-HUNT, which locates
statistical correlations and soft functional dependencies. CORDS searches for correlated column
pairs through enumerating potentially correlating pairs and pruning unqualified pairs. A chi-
squared analysis is performed in order to locate numerical and categorical correlations.
Unfortunately, statistical analysis methods are generally restricted to column pairs, and may not
detect correlations where not all subsets have been correlated [1, 18].
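A minimal sketch of the Jaccard Similarity Coefficient applied to the value sets of two hypothetical attributes:

def jaccard(a: set, b: set) -> float:
    """Intersection size divided by union size."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

col_a = {"red", "green", "blue"}
col_b = {"green", "blue", "yellow"}
print(jaccard(col_a, col_b))   # 0.5 -> substantial overlap between the two attributes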
5. CONCLUSIONS & FUTURE WORK
The data linkage approaches reviewed in this paper represent a variety of linkage techniques using different aspects of the data. We discussed practical approaches from two different angles: attribute-level and structure-level. We showed that classification of data into a
single order does not provide the necessary flexibility for accurately defining data relationships.
Furthermore, we found that the flow of data and their relationships need not be in a fixed
direction. This is because, when dealing with variable data sources, the same sets of data can be
ordered in multiple ways based on the semantics of tables, attributes and tuples. This is critical
when performing data linkage. Through our analysis of the status quo, we also argued that research should take a new direction and discover possible data matches based on their inherent hierarchical semantic similarities, as proposed in [109]. This approach is ideal for knowledge
based data matching and query answering. We recommend faceted classification to classify data
in multiple ways, to source semantic information for accurate data linkage and other data intrinsic
tasks. We also recommend, in response to the intricacy of this background research, that the data
linkage research community collaborate to benchmark existing data linkage techniques, as it is becoming increasingly difficult to compare new techniques with existing ones convincingly and in a timely manner.
REFERENCES
[1] M. Gollapalli, X. Li, I. Wood, G. Governatori, Ontology Guided Data Linkage Framework for Discovering
Meaningful Data Facts, in Proceeding of the 7th International Conference on Advanced Data Mining and
Applications (ADMA), Beijing, China, 2011, pp. 252-265.
[2] J. Euzenat, P. Shvaiko, Ontology Matching, 1st ed., Berlin Heidelberg: Springer, New York, 2007.
[3] M. Franklin, A. Halevy, D. Maier, From Databases to Dataspaces: a new abstraction for information
management, in J. ACM Special Interest Group on Management of Data (SIGMOD), Maryland, Record 34 Issue
4, 2005, pp. 27-33.
[4] S. Fenz, An ontology-based approach for constructing Bayesian networks, in J. Data & Knowledge Engineering
(DKE), vol. 73, Elsevier, March 2012.
[5] W. Wu, B. Reinwald, Y. Sismanis, R. Manjrekar, Discovering Topical Structures of Databases, in Proceedings
of the 2008 ACM International Conference on Special Interest Group on Management of Data (SIGMOD),
Vancouver, Canada, 2008, pp. 1019-1030.
[6] E Simperl, Reusing ontologies on the Semantic Web: A feasibility study, in J. Data & Knowledge Engineering
(DKE), vol. 68, no. 10, pp. 905-925, Oct 2009.
[7] P. Christen, Automatic Record Linkage using Seeded Nearest Neighbour and Support Vector Machine
Classification, in Proceedings of the 14th International Conference on Knowledge Discovery and Data Mining
(SIGKDD), Nevada, USA, 2008, pp. 151-159.
[8] H. Koehler, X. Zhou, S. Sadiq, Y. Shu, K. Taylor, Sampling Dirty Data for Matching Attributes, in Proceedings
of the 2010 ACM Special Interest Group on Management of Data (SIGMOD), Indianapolis, USA, 2010, pp. 63-
74.
[9] C. Lee, Automated ontology construction for unstructured text documents, in J. Data and Knowledge
Engineering (DKE), vol. 60, no. 3, pp. 547-566, March 2007.
[10] I. Bhattacharya, L. Getoor, Iterative record linkage for cleaning and integration, in Proceedings of the ACM
Special Interest Group on Management of Data (SIGMOD) Workshop on Research issues in data mining and
knowledge discovery, Paris, France, 2004, pp. 11-18.
[11] H. Kim, D. Lee, Parallel Linkage, in Proceedings of the ACM 16th International Conference on Information and
Knowledge Management (CIKM), Lisbon, Portugal, 2007, pp. 283-292.
[12] Y. Hong, T. Yang, J. Kang, D. Lee, Record Linkage as DNA Sequence Alignment Problem, in Proceedings of
the 6th International Conference on Very Large Data Base (VLDB), Auckland, New Zealand, 2008, pp. 13-22.
[13] M. Gagnon, Ontology-based Integration of Data Sources, in Proceedings of the IEEE 10th International
Conference on Information Fusion, Que, Canada, 2007, pp. 1-8.
[14] A. Bonifati, G. Mecca, A. Pappalardo, S. Raunich, G. Summa, Schema Mapping Verification: The Spicy Way, in
Proceedings of the 11th International Conference on Extending Database Technology (EDBT), Nantes, France,
2008, pp. 1289-1293.
[15] A. Radwan, L. Popa, I.R. Stanoi, A. Younis, Top-K Generation of Integrated Schemas Based on Directed and
Weighted Correspondences, in Proceedings of the 35th International Conference on Management of Data
(SIGMOD), Rhode Island, USA, 2009, pp. 641-654.
[16] I.F. Ilyas, V. Markl, P. Haas, P. Brown, A. Aboulnaga, CORDS: Automatic Discovery of Correlations and Soft
Functional Dependencies, in Proceedings of the 2004 ACM International Conference on Management of Data
(SIGMOD), France, 2004, pp. 647-658.
[17] ARFF, University of Waikato, Extensible attribute-Relation File Format, Available online from:
http://weka.wikispaces.com/XML
[18] V. Pudi, P. Krishna, Data Mining, 1st ed., New Delhi, India: OXFORD University Press, 2009.
[19] F. Hakimpour, A. Geppert, Resolving Semantic Heterogeneity in Schema Integration: an Ontology Based
Approach, ACM, Proc. of the Int. Conf. On Formal Ontologies in Information Systems FOIS, 2001, pp. 297-308.
[20] MSDN Hashing, Available Online:
http://msdn.microsoft.com/en-us/library/system.string.gethashcode.aspx
[21] Y. Karasneh, H. Ibrahim, M. Othman, R. Yaakob, A model for matching and integrating heterogeneous
relational biomedical databases schemas, in Proc. of Int. Database Engineering & Applications Symposium,
Rende (CS), Italy, 2009, pp. 242 - 250.
[22] A. Fuxman, M. A. Hernandez, H. Ho, R. J. Miller, P. Papotti, L. Popa, Nested mappings: schema mapping
reloaded, in Proc. of the 32nd Int. Conf. on Very large data bases, Seoul, Korea, 2006, pp. 67 - 78.
[23] R. Pottinger, P. A. Bernstein, Schema merging and mapping creation for relational sources, in Proc. of the 11th
Int. Conf. on Extending Database Technology, Nantes, France, 2008, pp. 73-84.
[24] The World Bank, Data Catalog, Available Online: http://data.worldbank.org/topic
[25] E. Rahm, P. A. Bernstein, A survey of approaches to automatic schema matching, in the Very Large data bases
Journal, Vol 10 Issue 4, Dec 2001, pp. 334 - 350.
[26] A. Fuxman, E. Fazil, and R. Miller, ConQuer: Efficient Management of Inconsistent Databases, in Proc. of the
2005 ACM SIGMOD Int. Conf. on Management of data, Baltimore, Maryland, 2005, pp. 155 - 166.
[27] Z. Li, S. Li, Z. Peng, Ontology matching based on Probabilistic Description Logic, in Proc. of the 7th Int. Conf.
on Applied Computer & Applied Computational Science, Hangzhou, China, April 2008.
[28] C. Yu, H. V. Jagadish, Querying complex structured databases, in Proc. of the 33rd Int. Conf. on Very large data
bases, Vienna, Austria, 2007, pp. 1010-1021.
[29] T. Poulain, N. Cullot, K. Yetongnon, Ontology Mapping Specification in Description Logics for Cooperative
Systems, in Proc. of the 1st Int. Conf. on Signal-Image Technology and Internet-Based Systems, 2005, pp. 240-
246.
[30] The US Federal Government, Data Catalog, Available Online: http://www.data.gov/catalog
[31] H. Galhardas, D. Florescu, An Extensible Framework for Data Cleaning, in Proc. of the 16th Int. Conf. on
Data Engineering, California, USA, 2000, p. 312.
[32] N. Choi, Y. Song, and H. Han, A Survey on Ontology Mapping, in ACM SIGMOD Record, Volume 35, Issue
3, 2006, pp. 34-41.
[33] The World Wildlife Fund, Data Catalog, Available Online:
http://www.worldwildlife.org/science/data/item1872.html
[34] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate Record Detection: A Survey, in IEEE Transactions
on Knowledge and Data Engineering, Volume 19, Issue 1, Jan 2007, pp. 1-16.
[35] K. Goiser, P. Christen, Quality and complexity measures for data linkage and deduplication, in Quality Measures
in Data Mining Book, Volume 43, Springer, 2007.
[36] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg, Adaptive Name Matching in Information
Integration, in IEEE IS Journal, Volume 18 Issue 5, Sept 2003, pp. 16 - 23.
[37] G. Navarro, A Guided Tour to Approximate String Matching, in ACM Computing Surveys Journal, Volume 33,
Issue 1, March 2001, pp. 31-88.
[38] J. Pan, C. Cheng, G. Lau, and H. K. Law, Utilizing Statistical Semantic Similarity Techniques for Ontology
Mapping with Applications to AEC Standard Models, in the Journal of Tsinghua Science & Technology,
Volume 13, pp. 217-222.
[39] N. Koudas, A. Marathe, D. Srivastava, Flexible string matching against large databases in practice, in the
Proceedings of the 30th Int. Conf. on Very Large Data Bases, Toronto, Canada, 2004, pp. 1078-1086.
[40] N. Broberg, A. Farre, and J. Svenningsson, Regular Expression Patterns, in the Int. Conf. on Functional
Programming, Utah, USA, 2004, pp. 67-68.
[41] M. Ceci, A. Appice, C. Loglisci, D. Malerba, Complex objects ranking: a relational data mining approach, in
Proc. of the 2010 ACM Symposium on Applied Computing, Switzerland, 2010, pp. 1071-1077.
[42] The Adventure Works Database, Data Catalog, Available Online: http://sqlserversamples.codeplex.com/
[43] A. Poggi, D. Lembo, D. Calvanese, G. D. Giacomo, M. Lenzerini, and R. Rosati, Linking data to ontologies, in
the Journal on Data Semantics, Heidelberg, 2008, pp. 133-173.
[44] M. Bilenko, and R. Mooney, Adaptive Duplicate Detection Using Learnable String Similarity Measures, in the
Proc. of the 9th ACM SIGKDD Int. Conf. on Knowledge discovery and data mining, Washington DC, USA,
2003, pp. 39 - 48.
[45] P. Costa, L. Mottola, A. L. Murphy, G. P. Picco, TeenyLIME: transiently shared tuple space middleware for
wireless sensor networks, in the Proc. of the Int. Workshop on Middleware for sensor networks, Melbourne,
Australia, 2006, pp. 43 - 48.
[46] M. G. Elfeky, V. S. Verykios, A. K. Elmagarmid, TAILOR: A Record Linkage Tool Box, in the Proc. of the 18th
Int. Conf. on Data Engineering, California, USA, 2002, pp. 17.
[47] G. Noren, R. Orre, and A. Bate, A hit-miss model for duplicate detection in the WHO drug safety database, in
Proc. of the 11th ACM SIGKDD Int. Conf. on Knowledge discovery in data mining, Chicago, USA, 2005, pp.
459 - 468.
[48] Statistics New Zealand, Data Integration Manual, Wellington,
http://www.stats.govt.nz/~/media/Statistics/about-us/policies-protocols-guidelines/data-integration-further-
technical-info/DataIntegrationManual.pdf, 2006.
[49] National Climatic Data Center, Data Catalog, Available Online: http://www.ncdc.noaa.gov/oa/ncdc.html
[50] Queensland Govt. Wildlife & Ecosystems, Data Catalog, Available Online:
http://www.derm.qld.gov.au/wildlife-ecosystems/index.html
[51] W. E. Yancey, BigMatch: A Program for Extracting Probable Matches from a Large File, in U.S. Census Bureau
Research Report, 2002.
[52] A. Isaac, S. Wang, C. Zinn, H. Matthezing, L. van der Meij, S. Schlobach, Evaluating Thesaurus Alignments for
Semantic Interoperability in the Library Domain, in IEEE Intelligent Systems Journal, Volume 24 Issue 2, Mar
2009, pp. 76-86.
[53] H. Lee, R. T. Ng, K. Shim, Approximate substring selectivity estimation, in the Proc. of the 12th Int. Conf. on
Extending Database Technology, St Petersburg, Russia, 2009, pp. 827-838.
[54] D. Calvanese, G. Giacomo, D. Lembo, M. Lenzerini, A. Poggi, R. Rosati, Linking Data to Ontologies: The
Description Logic DL-Lite, in the OWL: Experience and Direction Workshop, Athens, Georgia, 2006,
[55] B. B. Hariri, H. Sayyadi, H. Abolhassani and K. Sheykh Esmaili, Combining Ontology Alignment Metrics Using
the Data Mining Techniques, in Proc. of the 2006 Int. Workshop on Context and Ontologies, Trento, Italy, 2006.
[56] Medical Science, Data Catalog, http://www.medicare.gov/download/downloaddb.asp
[57] C. Batini, M. Scannapieco, Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and
Applications). Springer-Verlag New York, 2006, Book ISBN: 3540331727
[58] S. C. Gupta, V. K. Kapoor, Fundamentals of Mathematical Statistics, Sultan Chand & Sons, 2002.
[59] C. Ding, X. He, K-nearest-neighbor consistency in data clustering: incorporating local information into global
optimization, in Proc. of the 2004 ACM symposium on Applied Computing, Nicosia, Cyprus, 2004, pp. 584 -
589.
[60] D. Balzarotti, P. Costa, G. P. Picco, The LighTS Tuple Space Framework and its Customization for Context-
Aware Applications, in the Web Intelligence and Agent Systems Journal, The Netherlands, Volume 5 Issue 2,
Apr 2007, pp. 215-231.
[61] G. Cormode, A. McGregor, Approximation algorithms for clustering uncertain data, in Proc. of the Int. Conf. on
Principles of Database Systems, Vancouver, Canada, 2008, pp. 191-200.
[62] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, J. Widom, Swoosh: a generic approach to
entity resolution, in The VLDB Journal, Volume 18, Issue 1, Jan 2009, pp. 255-276.
[63] M. A. Hernandez and S. J. Stolfo, The Merge/Purge Problem for Large Databases, in Proc. of the ACM Int. Conf.
on Management of Data (SIGMOD), San Jose, California, May 1995.
[64] T. Churches, P. Christen, K. Lim and J. X. Zhu, Preparation of name and address data for record linkage using
hidden Markov models, in the BioMed Central Medical Informatics and Decision Making journal,
http://www.biomedcentral.com/1472-6947/2/9/, 2002.
[65] V. Borkar, K. Deshmukh, and S. Sarawagi, Automatic Segmentation of Text into Structured Records, in Proc. of
the 2001 ACM SIGMOD Int. Conf. on Management of Data, Santa Barbara, California, 2001, pp. 175-186.
[66] Semantic Enrichment: The Key to Successful Knowledge Extraction, in the Scope eKnowledge Center
Literature, Chennai, India, Available Online at: http://www.scopeknowledge.com/Semantic_Processingnew.pdf,
Oct 2008.
[67] P. Christen, Febrl: a freely available record linkage system with a graphical user interface, in Proc. of the
Australasian Workshop on Health Data and Knowledge Management, Canberra, Australia, 2008, pp. 17-25.
[68] A. Doan, J. Madhavan, P. Domingos, and A. Halevy, Ontology Matching: A Machine Learning Approach, in the
Handbook on Ontologies in Information Systems, Springer, 2003, pp. 397-416.
[69] P. Avesani, F. Giunchiglia, and M. Yatskevich, A Large Scale Taxonomy Mapping Evaluation, in Proc. of the
4th Int. Semantic Web Conference, Galway, Ireland, 2005, pp. 67-81.
[70] S. Ponzetto, and R. Navigli, Large-scale Taxonomy Mapping for Restructuring and Integrating Wikipedia, in
Proc. of the 21st Int. Joint Conf. on Artificial Intelligence, Pasadena, USA, 2009, pp. 2083–2088.
[71] S. Muthaiyah, M. Barbulescu and L. Kerschberg, A hybrid similarity matching algorithm for mapping and grading
ontologies via a multi-agent system, in Proc. of the 12th WSEAS Int. Conf. on Computers, Crete Island, Greece,
2008, pp. 653-661.
[72] Y. Zhai, and B. Liu, Web Data Extraction Based on Partial Tree Alignment, in Proc. of the Int. World Wide Web
Conference, Chiba, Japan, 2005, pp. 76-85.
[73] Y. Ding, and D. Embley, Using Data-Extraction Ontologies to Foster Automating Semantic Annotation, in the
22nd Int. Conf. on Data Engineering Workshops, Atlanta, USA, 2006, pp. 138.
[74] A. Laender, B. Ribeiro-Neto, A. Silva, and J. Teixeira, A Brief Survey of Web Data Extraction Tools, in
ACM SIGMOD Record, Volume 31, 2002, pp. 84-93.
[75] A. Shareha, M. Rajeswari and D. Ramachandram, Multimodal Integration (Image and Text) Using Ontology
Alignment, in the American Journal of Applied Sciences, 2009, pp. 1217-1224.
[76] K. Munir, M. Odeh, R. McClatchey, Ontology Assisted Query Reformulation Using the Semantic and Assertion
Capabilities of OWL-DL Ontologies, in Proc. of the 2008 Int. Symposium on Database Engineering &
Applications, Coimbra, Portugal, 2008, pp. 81-90.
[77] A. Firat, S. Madnick, and B. Grosof, Knowledge Integration to Overcome Ontological Heterogeneity: Challenges
from Financial Information Systems, in Int. Conf. on Information Systems, Barcelona, 2002, pp. 17.
[78] G. Járosa, Teleonics of health and healthcare: Focus on health promotion, In World Futures: The Journal of
Global Ed, 54(3), Jun. 2010, pp. 259 – 284.
[79] M. Gollapalli, X. Li, I. Wood, and G. Governatori, Approximate Record Matching using Hash Grams, in the
IEEE International Conference on Data Mining Workshop, 2011, Vancouver, Canada, pp. 504-511.
[80] D. McD Taylor, B. Bell, A. Holdgate, C. MacBean, T. Huynh, O. Thom, M. Augello, R. Millar, R. Day, A.
Williams, P. Ritchie, and J. Pasco, Risk factors for sedation-related events during procedural sedation in the
emergency department, Emerg. Med. Australasia. 23(4), Aug. 2011, pp. 466 – 473.
[81] F. Azam, Biologically Inspired Modular Neural Networks, Virginia Polytechnic Institute and State University,
Blacksburg, Virginia, PhD Thesis, 2000.
[82] R. Rojas, Neural Networks - A Systematic Introduction, 4th ed., New York, U.S.A.: Springer-Verlag, 2004.
[83] R. Bose, Knowledge management-enabled health care management systems: capabilities, infrastructure, and
decision-support, Expert Syst. Appl. 24(1), Jan. 2003, pp. 59–71.
[84] Queensland Health. 2012, Protocol for Clinical Data Standardization, Document Number # QH-PTL-279-
1:2012.
[85] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, L. Pietarinen, D. Srivastava, Using q-
grams in a DBMS for Approximate String Processing, in IEEE Data Engineering Bulletin, Vol. 24, No. 4, Dec
2001.
[86] T. Turner, Developing evidence-based clinical practice guidelines in hospitals in Australia, Indonesia, Malaysia,
the Philippines and Thailand: values, requirements and barriers. BMC Health Serv. Res. 9(1), Dec. 2009, pp.
235.
[87] E. E. Roughead, and S. J. Semple, Medication safety in acute care in Australia 2008, Aust. New Zealand Health
Policy 6, Aug. 2009, pp. 18.
[88] A. Fuxman, E. Fazil, and R. Miller, ConQuer: Efficient Management of Inconsistent Databases, in ACM SIGMOD
Int. Conf. on Management of Data 2005, June 14-16, Maryland, pp. 155-166.
[89] Australian Government Department of Health and Ageing. Summary of the national E-Health Strategy: 2,
National Vision for E-Health, Dec. 2009, pp. 1.
[90] M. G. Elfeky, V. S. Verykios, A. K. Elmagarmid, TAILOR: A Record Linkage Toolbox, in IEEE Int. Conf. on
Data Engineering (ICDE), 2002, pp. 17-28.
[91] Australian Commission on Safety and Quality in Health Care. Safety and Quality Improvement Guide Standard
4: Medication Safety, Oct. 2012, pp. 14.
[92] A. Burls, AGREE II—improving the quality of clinical care, The Lancet, 376(9747), Oct. 2010, pp. 1128 – 1129.
[93] J. Robertson, J. B. Moxey, D. A. Newby, M. B. Gillies, M. Williamson, and E. Pearson, Electronic information
and clinical decision support for prescribing: state of play in Australian general practice, Fam Pract. 28(1), Feb.
2011, pp. 93-101.
[94] V. N. Stroetmann, D. Kalra, P. Lewalle, A. Rector, J. M. Rodrigues, K. A. Stroetmann, G. Surjan, B. Ustun, M.
Virtanen, and P. E. Zanstra, Semantic Interoperability for Better Health and Safer Healthcare: Deployment and
Research Roadmap for Europe, SemanticHEALTH project, a Specific Support Action funded by the European
Union 6th R&D Framework Programme (FP6). DOI: 10.2759/38514.
[95] W. Wenjun, D. Lei, D. Cunxiang, D. Shan, and Z. Xiankun, Emergency plan process ontology and its
application, in Proc. of the 2nd Int. Conf. on Advanced Computer Control (ICACC), Shenyang, Liaoning, China,
2010, pp. 513-516.
[96] M. Sotoodeh, Ontology-Based Semantic Interoperability in Emergency Management, Doctoral Thesis,
University of British Columbia, 2007.
[97] Y. Peng, Application of Emergency Case Ontology Model in Earthquake, in Proc. of the Int. Conf. on Management
and Service Science (MASS), 2009, Tianjin, China, pp. 1 – 5.
[98] A. Bellenger, An information fusion semantic and service enablement platform: The FusionLab approach, in
Proceedings of the 14th International Conference on Information Fusion (FUSION 2011), Chicago, Illinois, USA,
July 5-8, 2011, pp. 1 – 8.
[99] J. Hunter, P. Becker, A. Alabri, C. Van Ingen, E. Abal, Using Ontologies to Relate Resource Management
Actions to Environmental Monitoring Data in South East Queensland, IJAEIS, 2(1) (2011), pp. 1-19.
[100] V. Mascardi, V. Cordi, P. Rosso, A Comparison of Upper Ontologies, Technical Report DISI-TR-06-2, The
University of Genoa, The Technical University of Valencia, Genova, Italy, 2007.
[101] Suggested Upper Merged Ontology (SUMO), Available online at www.ontologyportal.org.
[102] Z. Xianmin, Z. Daozhi, F. Wen, and W. Wenjun, Research on SUMO-based Emergency Response Management
Team Model, in the Proc. of the Int. Conf. on Wireless Communications, Networking and Mobile Computing,
Shanghai, China, September 21 - 25, 2007, pp. 4606 - 4609.
[103] International Organization for Standardization, International Standard 18629, Available online at
http://www.iso.org.
[104] M. Grüninger, T. Hahmann, A. Hashemi, D. Ong, and A. Özgövde, Modular first-order ontologies via
repositories, Applied Ontology 7(2), April 2012, pp. 169 – 209.
[105] C. Lange, O. Kutz, T. Mossakowski, and M. Grüninger, The Distributed Ontology Language (DOL): Ontology
Integration and Interoperability Applied to Mathematical Formalization, In Proc. of the Conf. on Intelligent
Computer Mathematics (CICM), Bremen, Germany, Springer, Heidelberg, 2012, pp. 463-467.
[106] M. Gollapalli, A Framework of Ontology Guided Data Linkage for Evidence based Knowledge Extraction and
Information Sharing, in the 29th IEEE International Conference on Data Engineering (ICDE) Workshop,
Brisbane, Australia, Apr 2013, pp. 294 - 297.
[107] B. Bell, D. McD Taylor, A. Holdgate, C. MacBean, T. Huynh, O. Thom, M. Augello, R. Millar, R. Day, A.
Williams, P. Ritchie, and J. Pasco, Procedural sedation practices in Australian Emergency Departments, Emerg.
Med. Australasia. 23(4), May 2011, pp. 458 – 465.
[108] A. Holdgate, D. McD Taylor, B. Bell, C. MacBean, T. Huynh, O. Thom, M. Augello, R. Millar, R. Day, A.
Williams, P. Ritchie, and J. Pasco, Factors associated with failure to successfully complete a procedure during
emergency department sedation, Emerg. Med. Australasia. 23(4), Aug. 2011, pp. 474 – 478.
[109] M. Gollapalli, X. Li, I. Wood, Automated discovery of multi-faceted ontologies for accurate query answering
and future semantic reasoning, in the Data & Knowledge Engineering Journal, Volume 87, 2013, pp. 405-424.
AUTHOR
Dr. Mohammed Gollapalli is an Assistant Professor in the College of Computer Science and
Information Technology (CCSIT) at the University of Dammam (UD), Dammam, Kingdom of
Saudi Arabia. He received his PhD in Information Technology from the University of
Queensland (UQ), Brisbane, Australia, in 2013 and obtained a Master's in Information
Technology from Griffith University, Gold Coast, Australia, in 2005. His major areas of research
interest and expertise include Data Mining, Knowledge Management, and Quality Control.
He is a member of MCP and IEEE.
  • 1. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.5, September 2015 DOI : 10.5121/ijdkp.2015.5501 1 LITERATURE REVIEW OF ATTRIBUTE LEVEL AND STRUCTURE LEVEL DATA LINKAGE TECHNIQUES Mohammed Gollapalli College of Computer Science & Information Technology, University of Dammam, Dammam, Kingdom of Saudi Arabia ABSTRACT Data Linkage is an important step that can provide valuable insights for evidence-based decision making, especially for crucial events. Performing sensible queries across heterogeneous databases containing millions of records is a complex task that requires a complete understanding of each contributing database’s schema to define the structure of its information. The key aim is to approximate the structure and content of the induced data into a concise synopsis in order to extract and link meaningful data-driven facts. We identify such problems as four major research issues in Data Linkage: associated costs in pair- wise matching, record matching overheads, semantic flow of information restrictions, and single order classification limitations. In this paper, we give a literature review of research in Data Linkage. The purpose for this review is to establish a basic understanding of Data Linkage, and to discuss the background in the Data Linkage research domain. Particularly, we focus on the literature related to the recent advancements in Approximate Matching algorithms at Attribute Level and Structure Level. Their efficiency, functionality and limitations are critically analysed and open-ended problems have been exposed. KEYWORDS Data Linkage, Probabilistic Matching, Structure Matching, Knowledge Discovery, Data Mining 1. INTRODUCTION Organizations worldwide have been collecting data for decades. Data collected from The World Bank [24], The National Climatic Data Centre [49], and countless other private and public organizations have been collecting, storing, processing and analysing massive amounts of data which has the potential to be linked for the discovery of underlying factors to critical problems. Sharing of large databases between organizations is also of growing importance in many data mining projects, as data from various sources often has to be linked and aggregated in order to improve data quality, or to enrich existing data with additional information [7]. When integrating data from different sources to implement a data warehouse, organizations become aware of potential systematic differences, limitations, restrictions or conflicts which fall under the umbrella-term data heterogeneity [34]. Poor quality data has also been prevalent in databases due to a variety of reasons, including typographical errors, lack of standards etc. To be able to query and integrate data in the presence of such data uncertainties as depicted in Fig. 1, a central
  • 2. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.5, September 2015 2 problem is the ability to identify whether heterogeneous database tables, attributes and tuples can be linked with the primary aim to understand the past and predict the future. In response to the aforementioned challenges, significant advances have been made in recent years in mining structures of databases with the aim to acquire crucial fact finding information that is not otherwise available, or that would require time-consuming and expensive manual procedures. Schemas are definitions that identify the structure of induced data and are the result of a database design segments. The relational database schemas that are invariant in time hold valuable information in their tables, attributes and tuples which can aid in identifying semantically similar objects. The process of identifying these schema structures has been one of the essential elements of data mining process [21-26]. Accurate integration of heterogeneous database schema can provide valuable insights that are useful for evidence-based decision making, especially for crucial events. In the schema integration process, each individual database can be analysed to provide and extract local schema definitions of the data. These local schema definitions can be used for the development of a global schema which integrates and subsumes the local schema in such a way that (global) users are provided with a uniform and correct view of the global database [19]. With the help of global schema structures, we can derive hierarchical relationships up to the instance level across datasets. However, without having this global schema, extracting meaningful data into a usable form can become a tedious process [5, 8, 14, 18, 21, and 26]. Traditional local-to-global schema-based techniques also lack the ability to allow computational linkage and are not suitable when dealing with heterogeneous databases [2, 5, 8, 18, 57, 61 and 66]. To make things worse, the data could be “dirty” and differences might exist in the structure and semantics maintained across different databases. Research communities have also stressed Schema Pattern Matching [21 to 26] and SQL Querying [27, 28]. Schema Pattern Matching uses database schema to devise clues as to the semantic meaning of the data. Constraints are used to define requirements, generated by hand or through a variety of tools. However, the main problems with Schema Pattern Matching are insufficiency and redundancy. Figure 1. Data linkage across heterogeneous databases Data linkage (also known as data matching, probabilistic matching, and instance identification) is the process of identifying records which represent the same real world entity despite typographical and formatting constraints [18, 25, 32, 34, and 37]. In conducting our research, we observed four prime areas where data linkage is a persistent, yet heavily researched problem. 1. Medical science for DNA sequence matching and biological sequence alignment [12, 18, 21, 47, 56, and 80-84]. 2. Government departments for taxation and pay-out tracking [5, 24, 30, 48, and 79]. 3. Businesses integrating the data of acquired companies into their centralized systems [2, 36, and 42].
  • 3. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.5, September 2015 3 4. Law enforcement for data matching across domains, such as banking and the electoral commission [24, 30, 33, 49, and 50]. Traditional data linkage approaches use similarity scores that compare tuple values from different attributes, and declare it as matches if the score is above a certain threshold [2, 10, 18, 61, 67, and 79]. These approaches perform quite well when comparing similar databases with clean data. However, when dealing with a large amount of variable data, comparison of tuple values alone is not enough [1, 2]. It is necessary to apply domain knowledge when attempting to perform data linkage where there are inconsistencies in the data. The same problem applies to database migrations, and to other data intensive tasks that involve disparate databases without common schemas. Furthermore, the creation of data linkage between heterogeneous databases requires the discovery of all possible primary and foreign key relationships that may exist between different attribute pairs, on a global spectrum [1, 3, 8, 11, and14-16]. 2. TAXONOMY OF DATA LINKAGE APPROACHES Different techniques have been presented by researchers [18, 32, 34, 35, 43, and 77] in multiple areas which argue that the need, task, and type of linkage to be performed will define the involved steps. Other techniques such as the Statistic New Zealand [48] lean toward the idea that data linkage will always require manual preliminary steps such as data classification, sampling and missing observation detection. However, the fundamental problem that arises each time in performing data linkage on large volumes of heterogeneous databases is to discover all possible relationships based on matching similar tuple values that might exist between different table attributes [1]. In this paper, we review on techniques that exist in performing approximate data linkage based on their approach rationale. We compare the advantages and disadvantages of current approaches for solving data linkage problem in multiple ways. Our analysis of existing techniques as depicted in Fig. 2 will show that there is room for substantial improvement within the current state-of-the-art and we recommend techniques where further improvements can be made. 2.1. SQL Matching Strategies SQL Matching techniques [14, 21, 22, 23, 25 and 26] perform data linkage using simple SQL- LIKE commands and SQL Extensions. The advantage of SQL matching techniques is that they help in performing quick data linkage across databases. However, they do not perform well in cases where comparison and identification of data structures need to be performed on large databases containing noisy data without proper unique keys, foreign key relationships, indexes, constraints, triggers, or statistics. Another drawback of the SQL matching process is that it performs |m| x |n| time’s column match where m and n are the total tuple counts in two different databases, resulting in a very slow, expensive and tedious process.
  • 4. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.5, September 2015 4 Figure 2. Data linkage approaches A variation of SQL Matching includes extending query syntax functionalities to perform data linkage. The proposed SQL-LIKE Command languages [22, 23 and 26] handle data transformation, duplicate elimination and cleaning processes supported by regular SQL Query and a proposed execution engine. However, these techniques demand users to have significantly advanced SQL scripting skills and proposed extended functionalities along with sound domain knowledge. Thus, syntax based SQL matching techniques are proven to be less attractive in real world scenarios [22]. Research communities have also stressed Schema Pattern Matching [21 to 26] and SQL Querying [27, 28]. Schema Pattern Matching uses database schema to devise clues as to the semantic meaning of the data. Constraints are used to define requirements, generated by hand or through a variety of tools. However, the main problems with Schema Pattern Matching are insufficiency and redundancy. SQL Querying, on the other hand, uses a SQL query language such as the Resource Description Framework (RDF) [27, 28] to define matching criteria. Difficulties arise when restrictions eliminate the discovery of possible matches. More relaxed queries use a structure-free mechanism by applying a tree pattern query; however, tree-pattern queries are highly inaccurate due to a high incidence incorrect manual identification of relationships [29].
  • 5. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.5, September 2015 5 2.2. Exact Matching Strategies Unlike SQL Matching, Exact Matching techniques give more insight into the content and meaning of schema elements [25]. Exact matching uses a unique identifier present in both datasets being compared. The unique identifier can only be linked to one individual item, or an event (for example, a driver’s license number). The Exact Matching technique is helpful in situations where the data linkage to be performed belongs to one data source. For example, consider a company with a recent system crash willing to perform data linkage between the production data source file and the most recent tape backup file to trace transactions. In such situations, Exact Matching would likely suffice in performing data linkage. A specific variation of exact matching discovered In this research is the Squirrel System [31], using a declarative specification language, ISL, to specific matching criteria which will match one record in a given table, with one record in another table. However, exact matching approach leaves no room for uncertainty; records are either classified as a match or as a non-match. Problems often arise when the quality of the variables does not sufficiently guarantee the unique identifier is valid [16]. Exact matching comparison does not suffice for matching records when the data contains errors, for example typographical mistakes, or when the data have multiple representations, such as through the use of abbreviations or synonyms [10]. 2.3. Approximate Matching Strategies Approximate matching is also known as the probabilistic approach [34 to 36] within the research community and is highly recommended, state-of-the art, alternative approach compared to exact matching. In approximate matching techniques, data linkage is performed on a likelihood basis (i.e. performing matching based on the success threshold ratio). Output results can vary in different formats such as “match, possible match, and non-match” basis, Boolean type true or false match basis, nearest and outermost distance match basis, discrete or continuous match basis etc. Variations in approximate matching technique include statistical and probabilistic solutions for similarity matching. Attention has also been drawn to approximate matching techniques from different research arenas, including statistical mathematics and bio-medical sciences. Due to the variety of proposed approaches and the level of attributes match, we have focused our research and classified most common approximate matching techniques into attribute level matching and structure level matching groups discussed in the next two sections. It is important to note that, the purpose of this paper is not to list every data linkage technique rather to discuss the multitude of approximate matching techniques available in the areas of attributes and structure level matching. At the end of this paper, we discussed our conclusions and recommendations for future work. 3. ATTRIBUTE LEVEL MATCHING Attribute Matching, also known as Field Matching [35] and Static String Similarity [36] deals with one-to-one match across different data sources. A challenging task of attribute matching is to perform data linkage across data sources by comparing similar matching records with the assumption that the user is aware of the database structure. 
Individual record fields are often stored as strings, meaning that functions which accurately measure the similarity of two strings are important for deduplication [36]. In the following subsections, we describe most commonly used attribute matching methodologies and discuss their efficiency.
  • 6. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.5, September 2015 6 3.1. Linguistic similarity Linguistic techniques focus on phonetic similarities between strings. The rationale behind this approach is that while strings may be similar phonetically, they may have different characters to locate potential matches. Soundex [34] is the most widely known in this area, and uses codes to define letters, remaining non-coded letters are used as separators. In addition, Soundex checks for identical codes (A, E, I, O, U and Y) without separators. Through the Soundex rules, a possible match is determined or denied. Advantages of linguistic techniques include the exposure of about 2/3 of spelling variations [25, 32, and 34]. However, linguistic methods are not equally effective from one ethnicity to the next. Linguistic based techniques are designed for Caucasians, and works on most other ethnicities, but largely fails on East Asian names due to the phonetic differences. NYSIIS [34] improved upon this by maintaining vowel placement and converting all vowels to the letter A. Nonetheless, it is still not perfectly accurate and performs best on surnames and not on other types of data [34]. 3.2. Rule/Regular expression The Rule / Regular expression [40] approach uses rules or set of predefined regular expressions and perform matching on tuples. Regular Expression Pattern as proposed in [40] is more flexible than regular expression alone, which is built from alphabetical elements. This is also because the Regular Expression Pattern is built from patterns over a data element, allowing the use of constructs such as “wildcards” or pattern variables. Regular Expression Pattern is quite useful when manipulating strings, and can be used in conjunction with basic pattern matching. However, the problem with this approach lies in the fact that it is relatively domain specific and tends to only work well on strings. 3.3. Ranking Ranking [15, 41] methods determine preferential relationships and have been more recently recognized by researchers as a necessary addition to structure based matching techniques. Search engines have used ranking methods for some time, such as Google’s PageRank, despite such algorithms not suited for matching noisy data due to their poor connectivity and lack of referrals [15]. Therefore, ranking extensions which simultaneously calculate meaning and relevance are researched. Thus far, only a few ranking methods have been proposed including induction logic programming, probabilistic relational kernel, and complex objects ranking [15, 41]. 3.4. String distance String distance methods, also known as character-based similarity metrics [34] are used to perform data linkage based on the cost associated within the comparing strings. The cost is estimated on the number of characters which needs to be inserted, replaced or deleted for a possible string match. For example, Fig. 3 shows the cost associated in editing string “Aussie” to “Australian” (the “+” sign shows addition, the “-“ sign shows deletion, and the “x” sign shows replacement). Figure 3. An example of string distance technique
  • 7. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.5, September 2015 7 Experimental results in [34] have shown that the different distance based methodologies discovered so far are efficient under different circumstances. Some of the commonly recommended distance based metrics include Levenstein distance, Needleman-Wunsch distance, Smith-Waterman distance, Affine-gap distance, Jaro metric, Jaro and Jaro-Winkler metric, Q- gram distance, and positional Q-grams distance. Through the various methods, costs are assigned to compensate for pitfalls in the system. Yet, overall, string distance pattern is most effective for typographical errors, but is hardly useful outside of this area [34]. 3.5. Term frequency Term frequency [43] approach determines the frequency of strings in relation and to favour matches of less common strings, and penalizes more common strings. The Term frequency methods allow for more commonly used strings to be left out of the similarity equation. TF-IDF [43] (Term Frequency-Inverse Document Frequency) is a method using the commonality of the term (TF) along with the overall importance of the term (IDF). TF-IDF is commonly used in conjunction with cosine similarity in the vector space model. Soft TF-IDG [44] adds similar token pairs to the cosine similarity computation. According to the researchers in [44], TF-IDF can be useful for similarity computations due to its ability to give proportionate token weights. However, this approach fails to make distinctions between the similarity level of two records with the same token or weight, and is essentially unable to determine which record is more relevant. 3.6. Range pattern Range pattern matching returns a Boolean style true or false result if the specified tuples fall within the specified range. Similarity or dissimilarity is determined when the elements of the data are compared against the predetermined range. Range matching will return a 0 or 1, with 0 being false and 1 being true. Range pattern matching is often used as an expansion of an algorithm to filter results. For example, TeenyLIME [45] expands upon LIME by adding range pattern capabilities, giving TeenyLIME the ability to define the range of its results. A drawback of the range pattern approach is that it is often not powerful enough to perform matching without a high level of query knowledge. For example, if a query is made to search for nearby locations, an optimal range is often not given or is defined by words having various meanings, causing range pattern matching to produce inaccurate results. 3.7. Numeric distance Numeric distance methods are used to quickly perform data linkage on tuples that contains numerical values but don’t require complex string character-style comparison. Hamming distance [46], for example, is used for numeric values such as zip codes, and counts the variations between two records. Due to the limitations of numeric data type constraints, it has not received much attention. Numeric distance methods can be best used in combination of other techniques. 3.8. Token matching Token based matching compare fields by ignoring the ordering of the tokens (words) within these fields. Token based approach use tokenization to perform matching, which is the separation of strings into a series of tokens. It assigns a token to each word in the string and tries to perform matching by ignoring token order and by performing similar match. The token based approach
  • 8. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.5, September 2015 8 attempts to compensate for the inadequacies of character-based metrics, specifically the inability to detect word order arrangement. A tokenizer performs the operation, taking into account characters, punctuation marks, blank spaces, numbers, and capitalisation. Token based methods count a string as a word set, and accommodates duplicates. For example, Cosine Similarity [38] is used to perform data linkage based on record strings, irrespective of word ordering within the string. The Cosine Similarity methods are effective over a range of entry types, and also have the advantage of considering word location to allow for swapping of word positions. For data containing a large amount of text, the token based matching works quite well, as it can handle repeating words. The optimising token based approach has typically included aggregation of different sources. A potential drawback is that token based matching does not store sub-string order and can predict false matches. 3.9. Weight pattern Weight pattern also referred to as Scoring [47], is applied on matching strings to return a numerical weight; a positive weight for agreeing values and a negative weight for disagreeing values. As two records are compared, the system assigns a weight value for similarity comparison. Composite weight [48] is a summation of all the field weights for a record, which multiplies the probabilities of each value. Reliability of the information, commonality of the values, and similarity between the values are considered in determining weight. Determinations are made by calculating the “m” probability (reliability of data) and the “u” probability (the commonness of the data). For example, IDF weights consider how often a particular value is used. After weights are determined for all the data, cut-off thresholds are set to determine the comparison range. Unfortunately, weight pattern techniques do not perform well when there are data inconsistencies. True matches may have low weights, and non-matches may have high weights as a result of simple data errors [48]. 3.10. Gram sequence Gram sequence based techniques compare the sequence of grams of one string with the sequence of grams of another string. n-grams is a gram based comparison function which calculates the common characters in a sequence, but is only effective for strings that have a small number of missing characters [46]. For example, the strings “Uni” and “University” have the same 2-gram {un, ni}. q-gram [85] involves generating short substrings of length q using a sliding window at the beginning and end of a string [85]. The q-gram method can be used in corporate databases without making any significant changes to the database itself [85]. Theoretically, two similar strings will share multiple q-grams. Positional q-grams record the position of q-grams within the string [14]. Danish and Ahy in [85] proposed to generate q-grams along with various processing methods such as substrings, joins, and distance. Unfortunately, the gram sequence approach is only efficient for short string comparison and becomes complex, expensive and unfeasible for large strings [85]. 3.11. Blocking Blocking [46] techniques separate tuple values into set of blocks/groups. Within each of these blocks, comparisons are made. Sorted Neighborhood is a blocking method which first sorts and then slides a “window” over the data to make comparisons [46]. 
BigMatch [51] used by the U.S. Census Bureau, is another blocking technique. BigMatch identifies pairs for further processing
  • 9. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.5, September 2015 9 through a more sophisticated means. The blocking function assigns a category for each record and identical records are given the same category. The disadvantage of the blocking method is that it will not work for records which have not been given the same category [18, 25, and 34]. 3.12. Hashing Hashing methods convert attributes into a sequence of hash values which are compared for similarity matching between different sets of strings. Hashing methods require conversion of all the data to find the smallest hash value, which could be a costly approach. Set-of-sets [8] is a hashing based data matching technique which works reasonably well in smaller string matching scenarios. The set-of-sets technique proposed in [8] divides strings into 3-grams and assigns a hash value to each tri-gram. Once hash values are assigned and placed in a hash bag, only the lowest matching hash values are considered for matching. Unfortunately, this technique doesn’t yield accurate results when dealing with variable length strings and uses traditional hashing which results in completely different hash values for even a small variation [79]. Furthermore, the Set-of-sets requires conversion of all the data prior to comparison in order to find the smallest hash value, which could be a costly approach. To overcome this disadvantage, the h-gram (hash gram) method was proposed in [79] to address the deficits of the set-of-sets technique, by extending the n-gram technique; utilizing scale based hashing; increasing matching probability; and by reducing the cost associated in storage of hash codes. 3.13. Path sequence The path sequence approach such as in [37] examines the label sequences, and compares them to the labelled data. The distance is measured by determining the similarity between the last elements of a path. The prefix can be considered, but this only affects the result to a certain degree, and becomes less relevant with increasing distance between the prefix and the end of the sequence. 3.14. Conditional substrings Substring matching such as in [53] expands upon string-based techniques by adding substring conditions to string algorithms. Distance measurements are calculated for the specified substring, in which all substring elements must satisfy the distance threshold. A frequent complication related to conditional substring based matching involves the estimation of the size of intersection among related substrings. Clusters and q-grams [2, 4, and 53], which are commonly used in string estimation, are not applicable in substring based techniques, because substring elements are often dissimilar. As a result, substring matching is hindered by an abundance of possibilities, which must all be considered. 3.15. Fuzzy Matrix Fuzzy Matrix [32, 60] places records in the form of matrices and apply fuzzy matching techniques to perform record matching. Commonly used by social scientists to analyse behavioural data, the fuzzy matrix technique is also applicable to many other data types. When considering a fuzzy set, a match is not directly identified as positive or negative. Instead, the match is considered on its degree level of agreement with the relevant data. As a result, a spectrum is created which identifies all levels of agreement or truth.
3.16. Thesauri Matching

Thesauri-based matching attempts to integrate two or more thesauri. A thesaurus is a kind of lexicon to which relational information has been added, containing hyponyms which give more specific conceptual meaning. WordNet [27, 32, 52] is a public domain lexical database, or thesaurus, which makes its distinctions by grouping words into sets of synonyms; it is often used in thesauri matching techniques. Falcon and DSSim [52] are thesauri-based matching tools which incorporate lexicons, edit distance and data structures. LOM [32] is a lexicon-based mapping technique using four methods (whole term, word constituent, synset, and type matching) in an attempt to reduce the required amount of human labour, but it does not guarantee any level of accuracy. While thesauri-based approaches can be extremely useful in merging conceptual, highly descriptive information, they can be incredibly complex and difficult to automate to a significant degree, and human experts are typically required to quality assure the relationships [27]. Thesauri matching algorithms also need to consider the best balance between precision and recall.

4. STRUCTURE LEVEL MATCHING

Structure level matching is used when the records being matched need to be fetched from a combination of records (i.e. when attempting to match noisy tuples across different domains, requiring more than one match). These techniques perform data matching with the main intuition that grouping attributes into clusters before performing matching provides a deeper analysis of related content and semantic structure. This process was initially considered for discovering candidate keys and dependent keys. However, one of the biggest challenges in this process has been the large number of combinations required for grouping attributes and performing data matching between these groups, which can be costly and time consuming [25, 32, 34, 37]. Large-scale organisations such as Microsoft and IBM have introduced performance tuning tools for indexing combined attributes on which queries are frequently executed. Unfortunately, these tools are suited to database developers and DBAs who have sound knowledge of writing SQL queries, and they are not ideal for novice users. As such, research has taken new directions by classifying multiple structure level techniques that require matching across multiple attributes. We have classified the principal techniques in the following subsections.

4.1. Iterative pattern

The iterative pattern is the process of repeating a step multiple times (or making "passes") until a match is found, based on similarity scores and blocking variables (variables set to be ignored for similarity comparison). The iterative approach uses attribute similarity while also considering the similarity between currently linked objects. For example, the iterative pattern method will assign a higher match probability to "John Doe" and "Jonathan Doe" if there is additional matching information between the two records (such as the spouse's name and children's names). The first part of the process is to measure string distance, followed by a clustering process. Iterative pattern methods have proven to detect duplicates that would likely have been missed by other methods [54]. The gains are greater when the mean size of the group is larger, and smaller when the mean size is smaller. Disadvantages surface when distinctive cliques do not exist for the entities, or if references for each group appear randomly. Additionally, there is the disadvantage of cost, as the iterative pattern method is computationally quite expensive [54]. A simplified sketch of this linked-evidence idea follows.
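The following Python sketch captures the intuition behind iterative linkage in a highly simplified form; it is not the algorithm evaluated in [54]. A base name similarity is boosted on each pass by evidence from related attributes (such as spouse and child names) that have already been compared, and the passes stop once the score stabilises. The record fields, weights and convergence threshold are assumptions made for this example.

    from difflib import SequenceMatcher

    def sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def iterative_link_score(rec_a, rec_b, related_fields, passes=5):
        # Start from the similarity of the primary attribute (here, the name).
        score = sim(rec_a["name"], rec_b["name"])
        for _ in range(passes):
            # Evidence from related, already-compared attributes.
            boost = sum(sim(rec_a[f], rec_b[f]) for f in related_fields) / len(related_fields)
            new_score = 0.7 * score + 0.3 * boost   # assumed weighting
            if abs(new_score - score) < 1e-3:       # stop once the score stabilises
                break
            score = new_score
        return score

    a = {"name": "John Doe", "spouse": "Jane Doe", "child": "Jimmy Doe"}
    b = {"name": "Jonathan Doe", "spouse": "Jane Doe", "child": "Jimmy Doe"}
    print(iterative_link_score(a, b, ["spouse", "child"]))   # climbs well above the initial 0.8

The strongly matching related attributes pull "John Doe" and "Jonathan Doe" towards a link, mirroring the behaviour described above.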
4.2. Tree pattern

The tree pattern is based on decision trees with ordered branches and leaves. Nodes are compared based on the extracted tree information. CART and C4.5 are two widely known decision tree methods which create trees through an extensive search of the available variables and splitting values [55]. A tree pattern starts at the root node, recursively partitions the records into each node of the tree, and creates a child to represent each partition. The splitting into partitions is determined by the values of some attributes, known as splitting attributes, which are chosen based on various criteria. The algorithm stops when there are no further splits to be made. Hierarchical verification through trees examines the parent once a matching leaf is identified; if no match is found within the parent, the process stops, otherwise the algorithm continues to examine the grandparent and further up the tree [37]. Suffix trees such as DAWG [37] build the tree structure over the suffixes of S, with each leaf representing one suffix and each internal node representing one unique substring of S. DAWG has the additional feature of failure links for those letters which are not in the tree. The disadvantage of the tree pattern lies in its lengthy and time-consuming process, with manual criteria often needed for splitting.

4.3. Sequence pattern

Sequence pattern methods perform data linkage based on sequence alignment. This technique attempts to simulate a sequential alignment algorithm, such as the BLAST (Basic Local Alignment Search Tool) [12] technique used in biology. The researchers compared the data linkage problem with the gene sequence alignment problem for pattern matching, with the main motivation of reusing existing BLAST tools and techniques. The algorithm translates record string data into DNA sequences while considering the relative importance of tokens in the string data [12]. Further research in the sequence pattern area has exposed variations based on the type of translation used to convert strings into DNA sequences (i.e. weighted, hybrid, and multi-bit BLASTed linkage) [12]. BLASTed linkage has advantages through the careful selection of one of its four variations, as each variation performs well on specific types of data. Unfortunately, sequence pattern matching tends to perform poorly on particular data strings, depending upon the error rate, importance weight, and number of common tokens [12].

4.4. Neighbourhood pattern

The neighbourhood approach [7, 59] attempts to understand and measure distributions according to their pattern match, and is a primary component in identifying statistical patterns. By using the nearest neighbour approach, related data can be clustered even if it is specifically separated. The logic behind this approach is based on the assumption that, if clustered objects are similar, then the neighbours of clustered objects have a higher likelihood of also being similar. The neighbourhood pattern requires a number of factors to be carefully considered in order to determine pattern matches, which is regarded as a key downfall.

4.5. Relational hierarchy

Relational hierarchy techniques use primary and foreign key relationships to understand related table content in order to perform data linkage. A relational hierarchy forms relation links which
connect concepts within various categories. It breaks down the hierarchical structure, and the top-level structure contains children sets. The relational hierarchy technique compares and calculates the co-occurrence between tuples by measuring the overlap of the children sets; a high degree of overlap indicates a possible relationship between the two top-level categories [57]. Relational hierarchy techniques are only effective when primary and foreign key relationships have been established. Raw data, without predefined relationships, cannot be linked using this approach.

4.6. Clustering/Feature extraction

Clustering, also known as the feature extraction method, performs data linkage based on common matching criteria in clusters, so that objects within a cluster are similar. Soft clustering [61], or probabilistic clustering, is a relaxed version of clustering which uses partial assignment to cluster centres. The SWOOSH [62] algorithms apply the ICAR properties (idempotence, commutativity, associativity, representativity) to the match and merge functions. With these properties and several assumptions, researchers introduced the brute force algorithm (BFA), as well as the G, R and F SWOOSH algorithms [44]. SIMCLUST is another similarity-based clustering algorithm which places each table in its own cluster as a starting point and then works its way through all of the tables by consecutively choosing the two tables (clusters) with the highest level of similarity. The iDisc system proposed in [5] creates database representations through a multi-process learning technique; base clusters are used to uncover topical clusters, which are then aggregated through meta-clustering. Clustering in general can become extremely complex (for example, when forming clusters using semantics) and needs to be handled carefully while discovering relationships between matching clusters.

4.7. Graphical statistic

Graphical statistic is a semi-automated, analysis-based technique where data linkage is performed based on the results obtained from a graph. Such representations illustrate the topical database structure through tables. A referential relationship indicates an important linkage between two separate tables: foreign keys within one table may refer to keys within the second table. However, problems with this technique often arise because information on foreign keys is frequently missing [5].

4.8. Training based

Training-based techniques are manual approaches where users are constantly involved in providing statistical data based on previous/future predictions. In [7], researchers presented a two-step training approach using automatically selected, high quality examples which are then used to train a support vector machine classifier. The approach proposed in [7] outperforms k-means clustering, as well as other unsupervised methods. The hidden Markov model (HMM) training approach standardises name and address data as an alternative to rule-based matching. Through lexicon-based tokenization and probabilistic hidden Markov models, the approach attempts to cut down on the heavy computing investment required by rule programming [64]. Once trained, the HMM can determine which sequence of hidden states is most likely to have emitted the observed sequence of symbols. When this is identified, the hidden states can be associated with words from the original input string, as sketched below.
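As a toy illustration of that decoding step (not the trained models of [64]), the Viterbi sketch below assigns hidden states such as given name, middle initial and surname to the tokens of a name string. Every state, probability and lexicon entry is an assumption invented for this example.

    def viterbi(obs, states, start_p, trans_p, emit_p):
        # V[t][s] = (probability of the best path ending in state s at step t, previous state)
        V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-6), None) for s in states}]
        for t in range(1, len(obs)):
            V.append({})
            for s in states:
                prev, p = max(((ps, V[t - 1][ps][0] * trans_p[ps].get(s, 1e-6))
                               for ps in states), key=lambda x: x[1])
                V[t][s] = (p * emit_p[s].get(obs[t], 1e-6), prev)
        best = max(V[-1], key=lambda s: V[-1][s][0])   # backtrack from the best final state
        path = [best]
        for t in range(len(obs) - 1, 0, -1):
            path.append(V[t][path[-1]][1])
        return list(reversed(path))

    states = ["given", "middle", "surname"]
    start_p = {"given": 0.8, "middle": 0.1, "surname": 0.1}
    trans_p = {"given": {"middle": 0.5, "surname": 0.5},
               "middle": {"surname": 1.0},
               "surname": {"surname": 1.0}}
    emit_p = {"given": {"john": 0.6, "mary": 0.4},
              "middle": {"a": 0.5, "k": 0.5},
              "surname": {"smith": 0.5, "doe": 0.5}}

    print(viterbi(["john", "a", "smith"], states, start_p, trans_p, emit_p))
    # -> ['given', 'middle', 'surname']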
This approach is advantageous in that it cuts down on time costs compared to rule-based systems. However, it remains a lengthy process and has been shown to run into significant problems in various areas; for instance, an HMM confuses given names, middle names and surnames, especially when applied to homogeneous data. Furthermore, outcomes proved to be less accurate than those of rule-based systems [64]. DATAMOLD [65] is a training-based method which enhances the HMM approach. The program is seeded with a set of training examples which allows the system to extract data matches. A common problem with training techniques is that they require many examples to be effective, and the system will not perform without an adequate training set [55].

4.9. Pruning/Filtering statistic

Pruning statistics perform data linkage by trimming similar records in a top-down approach. In [16], the data cleaning process of "deduplication" involves detecting and eliminating duplicate records to reduce confusion in the matching process. For data containing a large number of duplicates, pruning before data matching simplifies the process and makes it more effective. A pruning technique proposed by Verykios [34] recommends pruning on derived decision trees used for classification of matched or mismatched pairs; the pruning function reduces the size of the trees, improving accuracy and speed [34]. The pruning phase of CORDS [16] (discussed further in the statistical analysis section) prunes non-candidates on the basis of data type, properties, pairing rules, and workload; these tasks reduce the search space and make the process faster for large datasets. Pruning techniques [37] are based on the idea that it is much faster to determine non-matching records than matching records, and therefore aim to eliminate all non-matching records which do not contain errors. However, the disadvantage of such techniques is that they are not suitable for identifying matches of any type and must be combined with another matching technique.

4.10. Enrichment pattern

Enrichment patterns are a continuous-improvement based technique which performs data linkage by enriching the similarity tasks on a case by case basis. An example of the enrichment method is ALIAS [34], a learning-based system designed to reduce the required amount of training material through the use of a "reject region"; only pairs with a high level of uncertainty require labels. A method similar to ALIAS, also described in [34], uses decision trees to learn matching rules. OMEN [32] enriches data quality through the use of a Bayesian net, which uses a rule set to show related mappings. Semantic enrichment [66] is the annotation of text within a document with semantic metadata, essentially allowing free text to be converted into a knowledge database through data extraction and data linking. Conversion to a knowledge database can be through exact matching or by building hierarchical classifications of terms; text mining techniques allow annotation of concepts within documents, which are subsequently linked to additional databases. Thesauri alignment [32, 52] based techniques are also considered part of the enrichment family because they combine concepts and better define the data. The problems associated with the enrichment approach include a substantial investment of time and the requirement for extensive domain knowledge.

4.11. Multi pattern

The multi (multiple) pattern approach performs data linkage through the simultaneous use of different matching techniques. This approach fits best when one does not know in advance which technique performs better.
The researchers in [31] use a multi-pattern approach which combines sequence matching, merging, and then exact matching. Febrl [67] is an open-source software package containing record pair comparison and record pair classification functions. Febrl results are conveniently presented in a graphical user interface which allows the user to experiment with numerous other methods [67]. TAILOR [46] is another example which uses three different methods to classify records: decision tree induction, unsupervised k-means clustering, and a hybrid approach. GLUE [68] is yet another matching technique allowing for multiple matching methods. GLUE performs matching by first identifying the most similar concepts; once these concepts are identified, a multi-strategy learning approach allows the user to choose from several similarity measures to perform the measurement.

4.12. Data constraints

Data constraints, also known as internal structure based techniques, apply a data constraint filter to identify possible matches [43]. The constraint typically uses specific criteria of the data properties. This technique is not suited to being used on its own, and performs best for the elimination of non-matches as a pre-processing step before a secondary method such as clustering. Furthermore, data constraints do not handle the large number of uncertainties present within the data, and adding a constraint for each uncertainty is computationally infeasible.

4.13. Taxonomy

Taxonomy based methods use taxonomies, a core aspect of structural concepts which are widely used in file systems and in knowledge repositories [69]. This approach uses the nodes of a taxonomy to define parent/child relationships within the conceptual information and create a classification. Using specified data constraints, the taxonomies of multiple data sources are evaluated with a technique known as a structural similarity measure. For example, in [70] researchers used a taxonomy mapping strategy to enrich WordNet with a large number of instances from Wikipedia, essentially merging the conceptual information from the two sources. As with similar methods, taxonomy based matching requires a significant degree of domain knowledge and performs with limited precision and inadequate recall.

4.14. Hybrid match

Hybrid techniques use a combination of several mapping methods to perform data matching. A prime example of the hybrid method is described in [71], which uses a combination of syntactic and semantic comparisons. The rationale behind hybrid matching is that semantics alone are not sufficient to perform accurate matching and can be inconsistent. The hybrid solution consists of semantic and syntactic matching algorithms which consider individual components. The syntactic match uses a similarity score based on class, prefix and substring, and the semantic match uses a similarity score based on cognitive measures such as LSA, Gloss Vector, and WordNet Vector. The information is aggregated and entered into a matrix, and experts are used to determine domains within the selected threshold.

4.15. Data extraction

Data extraction primarily involves extracting semantic data. Data extraction can be performed manually or with induction and automatic extraction [72]. In [73], researchers used data recognisers to perform data extraction on the semantics of data. The recogniser method aims to reduce alignment after extraction, speed up the extraction process, reuse existing knowledge, and cut down on manual structure creation. This approach is found to be effective for simple unified domains, but not for complicated, loosely unified domains. Another benefit of the data extraction technique is that, after the data is extracted, it can be handled as instances in a traditional database. However, it generally requires a carefully constructed extraction plan by an expert in the specific knowledge domain [74].

4.16. Knowledge integration

Knowledge integration techniques are used to enhance the functioning of structure level matching by integrating knowledge between data relationships to form a stronger concept base for performing data linkage [75]. Knowledge integration enhances query formulation when the information structure and data sources are not known, as highlighted in [76], and is becoming increasingly important in data matching processes as various data structures conceptualise the same concept in different ways, with resulting inconsistencies and overlapping material. Integration can be based on extensions or concepts, and is aimed at identifying inconsistencies and mismatches in the concepts. For example, the COIN technique [77] addresses data-level heterogeneities among data sources expressed in terms of context axioms and provides a comprehensive approach to knowledge integration. An extension of COIN is ECOIN, which improves upon COIN through its ability to handle both data-level and ontological heterogeneities in a single framework [77]. Knowledge integration is highly useful in medicine for integrating concepts and information from various medical data sources. It may also involve the introduction of a dictionary to fill knowledge gaps, such as using distance-based weight measurement through Google [68]. For example, the Foundational Model of Anatomy is used as a concept roadmap to better integrate various medical data sources into unique anatomy concepts [68].

4.17. Data structures

Data structure techniques use structural information to identify matches and reflect relationships. Information properties are often considered and compared with concepts to make a similarity determination, while other variations of the data structure approach use graphical information to create similarities [68]. A drawback of the data structure based approach is its consumption of resources: the process builds an in-memory graph containing paired concepts, which can lead to memory overflow.

4.18. Statistical analysis

Statistical analysis techniques examine statistical measurements to determine term and concept relationships. The Jaccard similarity coefficient [38] is a widely used statistical measurement for comparing terms, which considers the extent of overlap between two vectors. The measurement is the size of the intersection divided by the size of the union of the vector dimension sets. Considering the corpus, the Jaccard similarity approach determines a match to be present if there is a high probability for both concepts to be present within the same section. For attribute matching, a match is determined if there is a large amount of overlap between values [38]. For example, CORDS [16] is a statistical matching tool, built upon B-HUNT, which locates statistical correlations and soft functional dependencies. CORDS searches for correlated column pairs by enumerating potentially correlated pairs and pruning unqualified pairs.
A chi-squared analysis is then performed in order to locate numerical and categorical correlations.
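As a small worked example of the Jaccard coefficient used in such statistical comparisons (the attribute values and threshold below are invented for illustration, and this is not the CORDS or B-HUNT implementation), two sets of attribute values can be compared as follows.

    def jaccard(a, b):
        # |A intersection B| / |A union B|
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if (a or b) else 0.0

    # Distinct values of a "country" attribute from two source tables (assumed data).
    col_x = {"Australia", "New Zealand", "Fiji", "Samoa"}
    col_y = {"Australia", "New Zealand", "Fiji", "Tonga"}

    score = jaccard(col_x, col_y)
    print(score)            # 3 shared values out of 5 distinct values -> 0.6
    print(score >= 0.5)     # flag as a candidate attribute match (assumed threshold)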
Unfortunately, statistical analysis methods are generally restricted to column pairs, and may not detect correlations in which not all subsets are correlated [1, 18].

5. CONCLUSIONS & FUTURE WORK

The data linkage approaches reviewed in this paper represent a variety of linkage techniques using different aspects of data. We discussed practical approaches from two different angles: the attribute level and the structure level. We showed that classification of data into a single order does not provide the necessary flexibility for accurately defining data relationships. Furthermore, we found that the flow of data and their relationships need not be in a fixed direction. This is because, when dealing with variable data sources, the same sets of data can be ordered in multiple ways based on the semantics of tables, attributes and tuples, which is critical when performing data linkage. Through our analysis of the status quo, we also argued that research should take a new direction and discover possible data matches based on their inherent hierarchical semantic similarities, as proposed in [109]. This approach is ideal for knowledge based data matching and query answering. We recommend faceted classification to classify data in multiple ways, in order to source semantic information for accurate data linkage and other data intrinsic tasks. We also recommend, in response to the intricacy of this background research, that the data linkage research community collaborate to benchmark existing data linkage techniques, as it is becoming increasingly difficult to compare new techniques with existing ones convincingly and in a timely manner.

REFERENCES

[1] M. Gollapalli, X. Li, I. Wood, G. Governatori, Ontology Guided Data Linkage Framework for Discovering Meaningful Data Facts, in Proceedings of the 7th International Conference on Advanced Data Mining and Applications (ADMA), Beijing, China, 2011, pp. 252-265.
[2] J. Euzenat, P. Shvaiko, Ontology Matching, 1st ed., Springer, Berlin Heidelberg, 2007.
[3] M. Franklin, A. Halevy, D. Maier, From Databases to Dataspaces: a new abstraction for information management, in ACM SIGMOD Record, Vol. 34, Issue 4, 2005, pp. 27-33.
[4] S. Fenz, An ontology-based approach for constructing Bayesian networks, in Data & Knowledge Engineering (DKE), Vol. 73, Elsevier, March 2012.
[5] W. Wu, B. Reinwald, Y. Sismanis, R. Manjrekar, Discovering Topical Structures of Databases, in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, Canada, 2008, pp. 1019-1030.
[6] E. Simperl, Reusing ontologies on the Semantic Web: A feasibility study, in Data & Knowledge Engineering (DKE), Vol. 68, No. 10, Oct 2009, pp. 905-925.
[7] P. Christen, Automatic Record Linkage using Seeded Nearest Neighbour and Support Vector Machine Classification, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Nevada, USA, 2008, pp. 151-159.
[8] H. Koehler, X. Zhou, S. Sadiq, Y. Shu, K. Taylor, Sampling Dirty Data for Matching Attributes, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, USA, 2010, pp. 63-74.
[9] C. Lee, Automated ontology construction for unstructured text documents, in Data and Knowledge Engineering (DKE), Vol. 60, No. 3, March 2007, pp. 547-566.
[10] I. Bhattacharya, L. Getoor, Iterative record linkage for cleaning and integration, in Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Paris, France, 2004, pp. 11-18.
[11] H. Kim, D. Lee, Parallel Linkage, in Proceedings of the 16th ACM International Conference on Information and Knowledge Management (CIKM), Lisbon, Portugal, 2007, pp. 283-292.
[12] Y. Hong, T. Yang, J. Kang, D. Lee, Record Linkage as DNA Sequence Alignment Problem, in Proceedings of the 6th International Conference on Very Large Data Bases (VLDB), Auckland, New Zealand, 2008, pp. 13-22.
[13] M. Gagnon, Ontology-based Integration of Data Sources, in Proceedings of the IEEE 10th International Conference on Information Fusion, Quebec, Canada, 2007, pp. 1-8.
[14] A. Bonifati, G. Mecca, A. Pappalardo, S. Raunich, G. Summa, Schema Mapping Verification: The Spicy Way, in Proceedings of the 11th International Conference on Extending Database Technology (EDBT), Nantes, France, 2008, pp. 1289-1293.
[15] A. Radwan, L. Popa, I. R. Stanoi, A. Younis, Top-K Generation of Integrated Schemas Based on Directed and Weighted Correspondences, in Proceedings of the 35th International Conference on Management of Data (SIGMOD), Rhode Island, USA, 2009, pp. 641-654.
[16] I. F. Ilyas, V. Markl, P. Haas, P. Brown, A. Aboulnaga, CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies, in Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, France, 2004, pp. 647-658.
[17] ARFF, University of Waikato, Extensible Attribute-Relation File Format, available online: http://weka.wikispaces.com/XML
[18] V. Pudi, P. Krishna, Data Mining, 1st ed., Oxford University Press, New Delhi, India, 2009.
[19] F. Hakimpour, A. Geppert, Resolving Semantic Heterogeneity in Schema Integration: an Ontology Based Approach, in Proceedings of the International Conference on Formal Ontology in Information Systems (FOIS), 2001, pp. 297-308.
[20] MSDN Hashing, available online: http://msdn.microsoft.com/en-us/library/system.string.gethashcode.aspx
[21] Y. Karasneh, H. Ibrahim, M. Othman, R. Yaakob, A model for matching and integrating heterogeneous relational biomedical databases schemas, in Proceedings of the International Database Engineering & Applications Symposium, Rende (CS), Italy, 2009, pp. 242-250.
[22] A. Fuxman, M. A. Hernandez, H. Ho, R. J. Miller, P. Papotti, L. Popa, Nested mappings: schema mapping reloaded, in Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, 2006, pp. 67-78.
[23] R. Pottinger, P. A. Bernstein, Schema merging and mapping creation for relational sources, in Proceedings of the 11th International Conference on Extending Database Technology, Nantes, France, 2008, pp. 73-84.
[24] The World Bank, Data Catalog, available online: http://data.worldbank.org/topic
[25] E. Rahm, P. A. Bernstein, A survey of approaches to automatic schema matching, in The VLDB Journal, Vol. 10, Issue 4, Dec 2001, pp. 334-350.
[26] A. Fuxman, E. Fazil, R. Miller, ConQuer: Efficient Management of Inconsistent Databases, in Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, 2005, pp. 155-166.
[27] Z. Li, S. Li, Z. Peng, Ontology matching based on Probabilistic Description Logic, in Proceedings of the 7th International Conference on Applied Computer & Applied Computational Science, Hangzhou, China, April 2008.
[28] C. Yu, H. V. Jagadish, Querying complex structured databases, in Proceedings of the 33rd International Conference on Very Large Data Bases, Vienna, Austria, 2007, pp. 1010-1021.
[29] T. Poulain, N. Cullot, K. Yetongnon, Ontology Mapping Specification in Description Logics for Cooperative Systems, in Proceedings of the 1st International Conference on Signal-Image Technology and Internet-Based Systems, 2005, pp. 240-246.
[30] The US Federal Government, Data Catalog, available online: http://www.data.gov/catalog
[31] H. Galhardas, D. Florescu, An Extensible Framework for Data Cleaning, in Proceedings of the 16th International Conference on Data Engineering, California, USA, 2000, pp. 312.
[32] N. Choi, Y. Song, H. Han, A Survey on Ontology Mapping, in ACM SIGMOD Record, Vol. 35, Issue 3, 2006, pp. 34-41.
[33] The World Wildlife Fund, Data Catalog, available online: http://www.worldwildlife.org/science/data/item1872.html
[34] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate Record Detection: A Survey, in IEEE Transactions on Knowledge and Data Engineering, Vol. 19, Issue 1, Jan 2007, pp. 1-16.
[35] K. Goiser, P. Christen, Quality and complexity measures for data linkage and deduplication, in Quality Measures in Data Mining, Vol. 43, Springer, 2007.
[36] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, S. Fienberg, Adaptive Name Matching in Information Integration, in IEEE Intelligent Systems, Vol. 18, Issue 5, Sept 2003, pp. 16-23.
[37] G. Navarro, A Guided Tour to Approximate String Matching, in ACM Computing Surveys, Vol. 33, Issue 1, March 2001, pp. 31-88.
[38] J. Pan, C. Cheng, G. Lau, H. K. Law, Utilizing Statistical Semantic Similarity Techniques for Ontology Mapping with Applications to AEC Standard Models, in Tsinghua Science & Technology, Vol. 13, pp. 217-222.
[39] N. Koudas, A. Marathe, D. Srivastava, Flexible string matching against large databases in practice, in Proceedings of the 13th International Conference on Very Large Data Bases, Toronto, Canada, 2004, pp. 1078-1086.
[40] N. Broberg, A. Farre, J. Svenningsson, Regular Expression Patterns, in Proceedings of the International Conference on Functional Programming, Utah, USA, 2004, pp. 67-68.
[41] M. Ceci, A. Appice, C. Loglisci, D. Malerba, Complex objects ranking: a relational data mining approach, in Proceedings of the 2010 ACM Symposium on Applied Computing, Switzerland, 2010, pp. 1071-1077.
[42] The Adventure Works Database, Data Catalog, available online: http://sqlserversamples.codeplex.com/
[43] A. Poggi, D. Lembo, D. Calvanese, G. D. Giacomo, M. Lenzerini, R. Rosati, Linking data to ontologies, in Journal on Data Semantics, Heidelberg, 2008, pp. 133-173.
[44] M. Bilenko, R. Mooney, Adaptive Duplicate Detection Using Learnable String Similarity Measures, in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington DC, USA, 2003, pp. 39-48.
[45] P. Costa, L. Mottola, A. L. Murphy, G. P. Picco, TeenyLIME: transiently shared tuple space middleware for wireless sensor networks, in Proceedings of the International Workshop on Middleware for Sensor Networks, Melbourne, Australia, 2006, pp. 43-48.
[46] M. G. Elfeky, V. S. Verykios, A. K. Elmagarmid, TAILOR: A Record Linkage Tool Box, in Proceedings of the 18th International Conference on Data Engineering, California, USA, 2002, pp. 17.
[47] G. Noren, R. Orre, A. Bate, A hit-miss model for duplicate detection in the WHO drug safety database, in Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, USA, 2005, pp. 459-468.
[48] Statistics New Zealand, Data Integration Manual, Wellington, 2006, available online: http://www.stats.govt.nz/~/media/Statistics/about-us/policies-protocols-guidelines/data-integration-further-technical-info/DataIntegrationManual.pdf
[49] National Climatic Data Center, Data Catalog, available online: http://www.ncdc.noaa.gov/oa/ncdc.html
[50] Queensland Govt. Wildlife & Ecosystems, Data Catalog, available online: http://www.derm.qld.gov.au/wildlife-ecosystems/index.html
[51] W. E. Yancey, BigMatch: A Program for Extracting Probable Matches from a Large File, US Census Bureau Research Report, 2002.
[52] A. Isaac, S. Wang, C. Zinn, H. Matthezing, L. van der Meij, S. Schlobach, Evaluating Thesaurus Alignments for Semantic Interoperability in the Library Domain, in IEEE Intelligent Systems, Vol. 24, Issue 2, Mar 2009, pp. 76-86.
[53] H. Lee, R. T. Ng, K. Shim, Approximate substring selectivity estimation, in Proceedings of the 12th International Conference on Extending Database Technology, St Petersburg, Russia, 2009, pp. 827-838.
[54] D. Calvanese, G. Giacomo, D. Lembo, M. Lenzerini, A. Poggi, R. Rosati, Linking Data to Ontologies: The Description Logic DL-Lite, in the OWL: Experience and Direction Workshop, Athens, Georgia, 2006.
[55] B. B. Hariri, H. Sayyadi, H. Abolhassani, K. Sheykh Esmaili, Combining Ontology Alignment Metrics Using the Data Mining Techniques, in Proceedings of the 2006 International Workshop on Context and Ontologies, Trento, Italy, 2006.
[56] Medical Science, Data Catalog, available online: http://www.medicare.gov/download/downloaddb.asp
[57] C. Batini, M. Scannapieco, Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications), Springer-Verlag, New York, 2006, ISBN 3540331727.
[58] S. C. Gupta, V. K. Kapoor, Fundamentals of Mathematical Statistics, Sultan Chand & Sons, 2002.
[59] C. Ding, X. He, K-nearest-neighbor consistency in data clustering: incorporating local information into global optimization, in Proceedings of the 2004 ACM Symposium on Applied Computing, Nicosia, Cyprus, 2004, pp. 584-589.
[60] D. Balzarotti, P. Costa, G. P. Picco, The LighTS Tuple Space Framework and its Customization for Context-Aware Applications, in Web Intelligence and Agent Systems, The Netherlands, Vol. 5, Issue 2, Apr 2007, pp. 215-231.
[61] G. Cormode, A. McGregor, Approximation algorithms for clustering uncertain data, in Proceedings of the International Conference on Principles of Database Systems, Vancouver, Canada, 2008, pp. 191-200.
[62] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, J. Widom, Swoosh: a generic approach to entity resolution, in The VLDB Journal, Vol. 18, Issue 1, Jan 2009, pp. 255-276.
[63] M. A. Hernandez, S. J. Stolfo, The Merge/Purge Problem for Large Databases, in Proceedings of the ACM SIGMOD International Conference on Management of Data, San Jose, California, May 1995.
[64] T. Churches, P. Christen, K. Lim, J. X. Zhu, Preparation of name and address data for record linkage using hidden Markov models, in BMC Medical Informatics and Decision Making, http://www.biomedcentral.com/1472-6947/2/9/, 2002.
[65] V. Borkar, K. Deshmukh, S. Sarawagi, Automatic Segmentation of Text into Structured Records, in Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, California, 2001, pp. 175-186.
[66] Semantic Enrichment: The Key to Successful Knowledge Extraction, Scope eKnowledge Center, Chennai, India, Oct 2008, available online: http://www.scopeknowledge.com/Semantic_Processingnew.pdf
[67] P. Christen, Febrl: a freely available record linkage system with a graphical user interface, in Proceedings of the Australasian Workshop on Health Data and Knowledge Management, Canberra, Australia, 2008, pp. 17-25.
[68] A. Doan, J. Madhavan, P. Domingos, A. Halevy, Ontology Matching: A Machine Learning Approach, in Handbook on Ontologies in Information Systems, Springer, 2003, pp. 397-416.
[69] P. Avesani, F. Giunchiglia, M. Yatskevich, A Large Scale Taxonomy Mapping Evaluation, in Proceedings of the 4th International Semantic Web Conference, Galway, Ireland, 2005, pp. 67-81.
[70] S. Ponzetto, R. Navigli, Large-scale Taxonomy Mapping for Restructuring and Integrating Wikipedia, in Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, USA, 2009, pp. 2083-2088.
[71] S. Muthaiyah, M. Barbulescu, L. Kerschberg, A hybrid similarity matching algorithm for mapping and grading ontologies via a multi-agent system, in Proceedings of the 12th WSEAS International Conference on Computers, Crete Island, Greece, 2008, pp. 653-661.
[72] Y. Zhai, B. Liu, Web Data Extraction Based on Partial Tree Alignment, in Proceedings of the International World Wide Web Conference, Chiba, Japan, 2005, pp. 76-85.
[73] Y. Ding, D. Embley, Using Data-Extraction Ontologies to Foster Automating Semantic Annotation, in the 22nd International Conference on Data Engineering Workshops, Atlanta, USA, 2006, pp. 138.
[74] A. Laender, B. Ribeiro-Neto, A. Silva, J. Teixeira, A Brief Survey of Web Data Extraction Tools, in ACM SIGMOD Record, Vol. 31, 2002, pp. 84-93.
[75] A. Shareha, M. Rajeswari, D. Ramachandram, Multimodal Integration (Image and Text) Using Ontology Alignment, in American Journal of Applied Sciences, 2009, pp. 1217-1224.
[76] K. Munir, M. Odeh, R. McClatchey, Ontology Assisted Query Reformulation Using the Semantic and Assertion Capabilities of OWL-DL Ontologies, in Proceedings of the 2008 International Symposium on Database Engineering & Applications, Coimbra, Portugal, 2008, pp. 81-90.
[77] A. Firat, S. Madnick, B. Grosof, Knowledge Integration to Overcome Ontological Heterogeneity: Challenges from Financial Information Systems, in International Conference on Information Systems, Barcelona, 2002, pp. 17.
[78] G. Járosa, Teleonics of health and healthcare: Focus on health promotion, in World Futures: The Journal of Global Education, 54(3), Jun. 2010, pp. 259-284.
[79] M. Gollapalli, X. Li, I. Wood, G. Governatori, Approximate Record Matching using Hash Grams, in IEEE International Conference on Data Mining Workshops, Vancouver, Canada, 2011, pp. 504-511.
[80] D. McD Taylor, B. Bell, A. Holdgate, C. MacBean, T. Huynh, O. Thom, M. Augello, R. Millar, R. Day, A. Williams, P. Ritchie, J. Pasco, Risk factors for sedation-related events during procedural sedation in the emergency department, Emerg. Med. Australasia, 23(4), Aug. 2011, pp. 466-473.
[81] F. Azam, Biologically Inspired Modular Neural Networks, PhD Thesis, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, 2000.
[82] R. Rojas, Neural Networks - A Systematic Introduction, 4th ed., Springer-Verlag, New York, 2004.
[83] R. Bose, Knowledge management-enabled health care management systems: capabilities, infrastructure, and decision-support, Expert Systems with Applications, 24(1), Jan. 2003, pp. 59-71.
[84] Queensland Health, Protocol for Clinical Data Standardization, Document Number QH-PTL-279-1:2012, 2012.
[85] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, L. Pietarinen, D. Srivastava, Using q-grams in a DBMS for Approximate String Processing, in IEEE Data Engineering Bulletin, Vol. 24, No. 7, Dec 2001.
[86] T. Turner, Developing evidence-based clinical practice guidelines in hospitals in Australia, Indonesia, Malaysia, the Philippines and Thailand: values, requirements and barriers, BMC Health Services Research, 9(1), Dec. 2009, pp. 235.
[87] E. E. Roughead, S. J. Semple, Medication safety in acute care in Australia 2008, Australia and New Zealand Health Policy, 6, Aug. 2009, pp. 18.
[88] A. Fuxman, E. Fazil, R. Miller, ConQuer: Efficient Management of Inconsistent Databases, in Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, Maryland, June 2005, pp. 155-166.
[89] Australian Government Department of Health and Ageing, Summary of the national E-Health Strategy: 2, National Vision for E-Health, Dec. 2009, pp. 1.
[90] M. G. Elfeky, V. S. Verykios, A. K. Elmagarmid, TAILOR: A Record Linkage Toolbox, in IEEE International Conference on Data Engineering (ICDE), 2002, pp. 17-28.
[91] Australian Commission on Safety and Quality in Health Care, Safety and Quality Improvement Guide Standard 4: Medication Safety, Oct. 2012, pp. 14.
[92] A. Burls, AGREE II - improving the quality of clinical care, The Lancet, 376(9747), Oct. 2010, pp. 1128-1129.
[93] J. Robertson, J. B. Moxey, D. A. Newby, M. B. Gillies, M. Williamson, E. Pearson, Electronic information and clinical decision support for prescribing: state of play in Australian general practice, Family Practice, 28(1), Feb. 2011, pp. 93-101.
[94] V. N. Stroetmann, D. Kalra, P. Lewalle, A. Rector, J. M. Rodrigues, K. A. Stroetmann, G. Surjan, B. Ustun, M. Virtanen, P. E. Zanstra, Semantic Interoperability for Better Health and Safer Healthcare: Deployment and Research Roadmap for Europe, SemanticHEALTH project, a Specific Support Action funded by the European Union 6th R&D Framework Programme (FP6), DOI: 10.2759/38514.
[95] W. Wenjun, D. Lei, D. Cunxiang, D. Shan, Z. Xiankun, Emergency plan process ontology and its application, in Proceedings of the 2nd International Conference on Advanced Computer Control (ICACC), Shenyang, China, 2010, pp. 513-516.
[96] M. Sotoodeh, Ontology-Based Semantic Interoperability in Emergency Management, Doctoral Thesis, University of British Columbia, 2007.
[97] Y. Peng, Application of Emergency Case Ontology Model in Earthquake, in Proceedings of the International Conference on Management and Service Science (MASS), Tianjin, China, 2009, pp. 1-5.
[98] A. Bellenger, An information fusion semantic and service enablement platform: The FusionLab approach, in Proceedings of the 14th International Conference on Information Fusion (FUSION 2011), Chicago, Illinois, USA, July 2011, pp. 1-8.
[99] J. Hunter, P. Becker, A. Alabri, C. Van Ingen, E. Abal, Using Ontologies to Relate Resource Management Actions to Environmental Monitoring Data in South East Queensland, IJAEIS, 2(1), 2011, pp. 1-19.
[100] V. Mascardi, V. Cordi, P. Rosso, A Comparison of Upper Ontologies, Technical Report DISI-TR-06-2, University of Genoa and Technical University of Valencia, Genova, Italy, 2007.
[101] Suggested Upper Merged Ontology (SUMO), available online: www.ontologyportal.org
[102] Z. Xianmin, Z. Daozhi, F. Wen, W. Wenjun, Research on SUMO-based Emergency Response Management Team Model, in Proceedings of the International Conference on Wireless Communications, Networking and Mobile Computing, Shanghai, China, September 2007, pp. 4606-4609.
[103] International Organization for Standardization, International Standard 18629, available online: http://www.iso.org
[104] M. Grüninger, T. Hahmann, A. Hashemi, D. Ong, A. Özgövde, Modular first-order ontologies via repositories, Applied Ontology, 7(2), April 2012, pp. 169-209.
[105] C. Lange, O. Kutz, T. Mossakowski, M. Grüninger, The Distributed Ontology Language (DOL): Ontology Integration and Interoperability Applied to Mathematical Formalization, in Proceedings of the Conference on Intelligent Computer Mathematics (CICM), Bremen, Germany, Springer, Heidelberg, 2012, pp. 463-467.
[106] M. Gollapalli, A Framework of Ontology Guided Data Linkage for Evidence based Knowledge Extraction and Information Sharing, in the 29th IEEE International Conference on Data Engineering (ICDE) Workshops, Brisbane, Australia, Apr 2013, pp. 294-297.
[107] B. Bell, D. McD Taylor, A. Holdgate, C. MacBean, T. Huynh, O. Thom, M. Augello, R. Millar, R. Day, A. Williams, P. Ritchie, J. Pasco, Procedural sedation practices in Australian Emergency Departments, Emerg. Med. Australasia, 23(4), May 2011, pp. 458-465.
[108] A. Holdgate, D. McD Taylor, B. Bell, C. MacBean, T. Huynh, O. Thom, M. Augello, R. Millar, R. Day, A. Williams, P. Ritchie, J. Pasco, Factors associated with failure to successfully complete a procedure during emergency department sedation, Emerg. Med. Australasia, 23(4), Aug. 2011, pp. 474-478.
[109] M. Gollapalli, X. Li, I. Wood, Automated discovery of multi-faceted ontologies for accurate query answering and future semantic reasoning, in Data & Knowledge Engineering, Vol. 87, 2013, pp. 405-424.
AUTHOR

Dr. Mohammed Gollapalli is an Assistant Professor in the College of Computer Science and Information Technology (CCSIT) at the University of Dammam (UD), Dammam, Kingdom of Saudi Arabia. He received his PhD in Information Technology from the University of Queensland (UQ), Brisbane, Australia, in 2013, and a Master's in Information Technology from Griffith University, Gold Coast, Australia, in 2005. His major areas of research interest and expertise include Data Mining, Knowledge Management, and Quality Control. He is a Microsoft Certified Professional (MCP) and a member of the IEEE.