Computer Engineering and Intelligent Systems                                                    www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Vol 2, No.7, 2011


  On-Demand Quality of Web Services Using Ranking by Multiple Criteria
                                                Nagelli Rajanikath
                         Vivekananda Institute of Science & Technology, JNT University
                                            Karimnagar, Pin: 505001
                          Mobile No: 9866491911, E-mail: rajanikanth86@gmail.com


                                              Prof. P. Pradeep Kumar
                                    HOD, Dept of CSE(VITS), JNT University
                                            Karimnagar, Pin: 505001
                                          E-mail: pkpuram@yahoo.com


                                               Asst. Prof. B. Meena
                                       Dept of CSE(VITS), JNT University
                                            Karimnagar, Pin: 505001
                                         E-mail: vinaymeena@gmail.com
Received: 2011-10-13
Accepted: 2011-10-19
Published: 2011-11-04


Abstract
In the Web database scenario, the records to match are highly query-dependent, since they can only be
obtained through online queries. Moreover, they are only a partial and biased portion of all the data in the
source Web databases. Consequently, hand-coding or offline-learning approaches are not appropriate for
two reasons. First, the full data set is not available beforehand, and therefore, good representative data for
training are hard to obtain. Second, and most importantly, even if good representative data are found and
labeled for learning, the rules learned on the representatives of a full data set may not work well on a partial
and biased part of that data set.
Keywords: SOA, Web Services, Networks

1. Introduction
Today, more and more databases that dynamically generate Web pages in response to user queries are
available on the Web. These Web databases compose the deep or hidden Web, which is estimated to contain
a much larger amount of high quality, usually structured information and to have a faster growth rate than
the static Web. Most Web databases are only accessible via a query interface through which users can
submit queries. Once a query is received, the Web server will retrieve the corresponding results from the
back-end database and return them to the user.

To build a system that helps users integrate and, more importantly, compare the query results returned from
multiple Web databases, a crucial task is to match the different sources’ records that refer to the same
real-world entity. The problem of identifying duplicates, that is, two (or more) records describing the same
entity, has attracted much attention from many research fields, including Databases, Data Mining, Artificial


Intelligence, and Natural Language Processing. Most previous work is based on predefined matching rules
hand-coded by domain experts or matching rules learned offline by some learning method from a set of
training examples. Such approaches work well in a traditional database environment, where all instances of
the target databases can be readily accessed, as long as a set of high-quality representative records can be
examined by experts or selected for the user to label.

1.1 Previous Work
Data integration is the problem of combining information from multiple heterogeneous databases. One step
of data integration is relating the primitive objects that appear in the different databases: specifically,
determining which sets of identifiers refer to the same real-world entities. A number of recent research
papers have addressed this problem by exploiting similarities in the textual names used for objects in
different databases. (For example, one might suspect that two objects from different databases named
“USAMA FAYYAD” and “Usama M. Fayyad”, respectively, might refer to the same person.) Integration
techniques based on textual similarity are especially useful for databases found on the Web or obtained by
extracting information from text, where descriptive names generally exist but global object identifiers are
rare. Previous publications on using textual similarity for data integration have considered a number of
related tasks. Although the terminology is not completely standardized, in this paper we define entity-name
matching as the task of taking two lists of entity names from two different sources and determining which
pairs of names are co-referent (i.e., refer to the same real-world entity). We define entity-name clustering as
the task of taking a single list of entity names and assigning entity names to clusters such that all names in a
cluster are co-referent. Matching is important in attempting to join information across a pair of relations
from different databases, and clustering is important in removing duplicates from a relation that has been
drawn from the union of many different information sources. Previous work in this area includes work in
distance functions for matching and scalable matching and clustering algorithms. Work in record linkage is
similar but does not rely as heavily on textual similarities. [1]
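As a concrete illustration of entity-name matching, two names can be declared co-referent when their normalized token sets overlap strongly. The following sketch is purely hypothetical code (the token rule, names, and 0.5 cutoff are illustrative, not taken from the cited work):

```python
# Hypothetical entity-name matching via token-set overlap (Jaccard similarity).

def tokens(name):
    """Lowercase a name and split it into word tokens, dropping bare initials."""
    return {t.strip(".,") for t in name.lower().split() if len(t.strip(".,")) > 1}

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def match_names(list_a, list_b, threshold=0.5):
    """Return pairs (na, nb) whose token overlap meets the threshold."""
    return [(na, nb) for na in list_a for nb in list_b
            if jaccard(tokens(na), tokens(nb)) >= threshold]

# "USAMA FAYYAD" and "Usama M. Fayyad" normalize to the same token set,
# so they are matched; "Jim Gray" is not.
pairs = match_names(["USAMA FAYYAD"], ["Usama M. Fayyad", "Jim Gray"])
```

Dropping bare initials during normalization is what lets the middle-initial variant match; a production system would use a more robust distance function.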

Data warehouses are often used to support important business decisions; therefore, the accuracy of such
analysis is crucial. However, data received at the data warehouse from external sources usually contains
errors: spelling mistakes, inconsistent conventions, etc. Hence, significant amounts of time and money are
spent on data cleaning, the task of detecting and
correcting errors in data. The problem of detecting and eliminating duplicated data is one of the major
problems in the broad area of data cleaning and data quality. [2]

Many times, the same logical real world entity may have multiple representations in the data warehouse.
For example, when Lisa purchases products from SuperMart twice, she might be entered as two different
customers due to data entry errors. Such duplicated information can significantly increase direct mailing
costs because several customers like Lisa may be sent multiple catalogs. Moreover, such duplicates can
cause incorrect results in analysis queries (say, the number of SuperMart customers in Seattle), and
erroneous data mining models to be built. We refer to this problem of detecting and eliminating multiple
distinct records representing the same real world entity as the fuzzy duplicate elimination problem, which is
sometimes also called the merge/purge, dedup, or record linkage problem. This problem is different from the
standard duplicate elimination problem, say for answering “select distinct” queries, in relational database
systems, which considers two tuples to be duplicates only if they match exactly on all attributes. However, data
cleaning deals with fuzzy duplicate elimination, which is our focus in this paper. Henceforth, we use
duplicate elimination to mean fuzzy duplicate elimination. [2]
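The distinction can be made concrete with a small, purely illustrative sketch: exact elimination (the “select distinct” behavior) keeps the misspelled variant as a separate customer, while a simple fuzzy token-overlap rule catches it. The records and the 0.5 cutoff below are hypothetical, not the paper's method:

```python
# Hypothetical customer records; the third is a fuzzy duplicate of the first.
records = [
    ("Lisa Simpson", "12 Main St, Seattle"),
    ("Lisa Simpson", "12 Main St, Seattle"),      # exact duplicate
    ("L. Simpson", "12 Main Street, Seattle"),    # fuzzy duplicate
]

# Exact duplicate elimination: tuples must match on all attributes.
exact_distinct = set(records)                     # the fuzzy variant survives

def field_tokens(rec):
    """All word tokens across a record's fields, lowercased and unpunctuated."""
    return {t.strip(".,").lower() for field in rec for t in field.split()}

def is_fuzzy_dup(r1, r2, threshold=0.5):
    """Declare two records duplicates when their token sets overlap enough."""
    a, b = field_tokens(r1), field_tokens(r2)
    return len(a & b) / len(a | b) >= threshold

# Fuzzy duplicate elimination keeps one representative per real-world entity.
deduped = []
for rec in records:
    if not any(is_fuzzy_dup(rec, kept) for kept in deduped):
        deduped.append(rec)
```

Exact elimination leaves two entries here, while the fuzzy rule collapses all three records to one, which is exactly the behavior needed to avoid mailing Lisa two catalogs.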
2. Proposed System
2.1 Weighted Component Similarity Summing Classifier
In our algorithm, the classifier plays a vital role. At the beginning, it is used to identify some duplicate
vectors when there are no positive examples available. Then, after the iteration begins, it is used again to
cooperate with another classifier to identify new duplicate vectors. Because no duplicate vectors are
available initially, classifiers that need class information for training, such as a decision tree, cannot be
used. An intuitive method to identify duplicate vectors is to assume that two records are duplicates if most
of their fields under consideration are similar. On the other hand, if all corresponding fields of the two
records are dissimilar, it

is unlikely that the two records are duplicates. To evaluate the similarity between two records, we combine
the values of each component in the similarity vector for the two records. Different fields may have
different importance when we decide whether two records are duplicates. The importance is usually
data-dependent, which, in turn, depends on the query in the Web database scenario.
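In code, this weighted combination is a simple dot product of the similarity vector and the field weights. The weights and similarity vector below are illustrative values, not taken from the paper:

```python
def wcss_similarity(sim_vector, weights):
    """Weighted sum of per-field similarities; weights are assumed to sum to 1."""
    return sum(w * s for w, s in zip(weights, sim_vector))

# Example: three fields (say title, author, price) with data-dependent weights.
weights = [0.5, 0.3, 0.2]
sim_vector = [0.9, 0.8, 1.0]   # per-field similarity for one record pair
score = wcss_similarity(sim_vector, weights)
```

A field deemed more important for the current query simply contributes more of its similarity to the overall score.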
2.2 Component Weight Assignment
In this classifier, we assign a weight to a component to indicate the importance of its corresponding field.
The similarity between two duplicate records should be close to 1. For a duplicate vector that is formed by
a pair of duplicate records r1 and r2, we need to assign large weights to the components with large
similarity values and small weights to the components with small similarity values. The similarity for two
nonduplicate records should be close to 0. Hence, for a nonduplicate vector that is formed by a pair of
nonduplicate records r1 and r2, we need to assign small weights to the components with large similarity
values and large weights to the components with small similarity values. The component will be assigned
a small weight if it usually has a small similarity value in the duplicate vectors.
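One simple weighting scheme consistent with this description gives each component a weight proportional to its average similarity over the duplicate vectors identified so far, so that fields which are usually similar in duplicates receive large weights. This is a hedged sketch under that assumption, not the paper's exact formula:

```python
def component_weights(duplicate_vectors):
    """Normalize each component's average similarity over known duplicates."""
    n = len(duplicate_vectors[0])
    avg = [sum(v[i] for v in duplicate_vectors) / len(duplicate_vectors)
           for i in range(n)]
    total = sum(avg)
    return [a / total for a in avg]   # weights sum to 1

# Illustrative duplicate vectors: the second field is usually dissimilar
# even among duplicates, so it receives a small weight.
dups = [[0.9, 0.2, 0.9], [1.0, 0.1, 0.8]]
w = component_weights(dups)
```

A complementary scheme over nonduplicate vectors (large weight where similarity is usually small) can be combined with this one, as the text describes.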
2.3 Duplicate Identification
After we assign a weight for each component, the duplicate vector detection is rather intuitive. Two records
r1 and r2 are duplicates if they are similar, i.e., if their similarity value is equal to or greater than a
similarity threshold. In general, the similarity threshold should be close to 1 to ensure that the identified
duplicates are correct. Increasing the similarity threshold will reduce the number of duplicate vectors
identified while, at the same time, making the identified duplicates more precise.
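The decision rule itself is a one-line comparison against the threshold. The 0.85 cutoff and the vectors below are illustrative (in practice the threshold would be chosen close to 1):

```python
def is_duplicate(sim_vector, weights, threshold=0.85):
    """Two records are duplicates when their weighted similarity meets the threshold."""
    score = sum(w * s for w, s in zip(weights, sim_vector))
    return score >= threshold

# Raising the threshold trades recall (fewer pairs flagged) for precision.
```

For example, with weights [0.5, 0.3, 0.2], a uniformly high similarity vector passes the cutoff while a mediocre one does not.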
2.4 Similarity Calculation
The similarity calculation quantifies the similarity between a pair of record fields. As the query results to
match are extracted from HTML pages, namely, text files, we only consider string similarity. Given a pair
of strings Sa and Sb, a similarity function calculates a similarity score between them, which must be between 0
and 1. Since the similarity function is orthogonal to the iterative duplicate detection, any kind of similarity
calculation method can be employed. Domain knowledge or user preference can also be incorporated into
the similarity function. In particular, the similarity function can be learned if training data is available.
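One concrete example of a similarity function with the required [0, 1] range is edit distance normalized by the longer string's length; any function with the same signature could be plugged in instead. This is an illustrative choice, not the one prescribed by the paper:

```python
def edit_distance(sa, sb):
    """Classic dynamic-programming Levenshtein distance between two strings."""
    prev = list(range(len(sb) + 1))
    for i, ca in enumerate(sa, 1):
        cur = [i]
        for j, cb in enumerate(sb, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def string_similarity(sa, sb):
    """1 minus normalized edit distance: 1.0 for identical strings, 0.0 for fully disjoint ones."""
    if not sa and not sb:
        return 1.0
    return 1.0 - edit_distance(sa, sb) / max(len(sa), len(sb))
```

Domain knowledge can be layered on top, for example by normalizing abbreviations before computing the distance.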
3. Results

The concepts of this paper have been implemented, and sample results are shown below.








3.1 Performance Analysis
The proposed approach is implemented in Java on a Pentium III PC with a 20 GB hard disk and 256 MB of
RAM, running the Apache web server. The proposed concepts show efficient results and have been tested on
a variety of messages.

4. Conclusion
Duplicate detection is an important step in data integration and most state-of-the-art methods are based on
offline learning techniques, which require training data. In the Web database scenario, where records to
match are greatly query-dependent, a pretrained approach is not applicable as the set of records in each
query’s results is a biased subset of the full data set. To overcome this problem, we presented an
unsupervised, online approach, UDD, for detecting duplicates over the query results of multiple Web
databases. Two classifiers, WCSS (weighted component similarity summing) and SVM, are used
cooperatively in the convergence step of record matching to iteratively identify the duplicate pairs from all
potential duplicate pairs.
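The cooperation described above can be summarized as an iterative loop: the WCSS classifier seeds duplicate pairs from unlabeled similarity vectors, and the newly labeled pairs then inform further rounds until no new duplicates are found. The sketch below is a high-level, hypothetical simplification (it omits the SVM stage and weight re-estimation, noting where they would occur); all values are illustrative:

```python
def wcss(vec, weights, threshold=0.85):
    """Threshold the weighted component similarity sum."""
    return sum(w * s for w, s in zip(weights, vec)) >= threshold

def iterative_duplicate_detection(vectors, weights):
    """Repeatedly move vectors flagged by WCSS into the duplicate set."""
    duplicates, remaining = [], list(vectors)
    while True:
        found = [v for v in remaining if wcss(v, weights)]
        if not found:
            break                      # convergence: no new duplicates
        duplicates.extend(found)
        remaining = [v for v in remaining if v not in found]
        # In the full algorithm, the duplicates found so far would retrain
        # the second classifier (an SVM) and re-estimate the component
        # weights here, so later iterations can flag new pairs.
    return duplicates
```

With fixed weights the loop converges after one pass; it is the retraining step, elided here, that gives the full algorithm its iterative power.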


References
[1] W.W. Cohen and J. Richman, “Learning to Match and Cluster Large High-Dimensional Datasets
for Data Integration,” Proc. ACM SIGKDD, pp. 475-480, 2002.
[2] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, “Eliminating Fuzzy Duplicates in Data
Warehouses,” Proc. 28th Int’l Conf. Very Large Data Bases, pp. 586-597, 2002.

[3] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, “Robust and Efficient Fuzzy Match for
Online Data Cleaning,” Proc. ACM SIGMOD, pp. 313-324, 2003.
[4] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava,
“Approximate String Joins in a Database (Almost) for Free,” Proc. 27th Int’l Conf. Very Large Data
Bases, pp. 491-500, 2001.
[5] X. Dong, A. Halevy, and J. Madhavan, “Reference Reconciliation in Complex Information
Spaces,” Proc. ACM SIGMOD, pp. 85-96, 2005.
[6] W.W. Cohen, H. Kautz, and D. McAllester, “Hardening Soft Information Sources,” Proc. ACM
SIGKDD, pp. 255-259, 2000.
[7] P. Christen, T. Churches, and M. Hegland, “Febrl—A Parallel Open Source Data Linkage System,”
Advances in Knowledge Discovery and Data Mining, pp. 638-647, Springer, 2004.
[8] P. Christen and K. Goiser, “Quality and Complexity Measures for Data Linkage and
Deduplication,” Quality Measures in Data Mining, F. Guillet and H. Hamilton, eds., vol. 43, pp.
127-151, Springer, 2007.





More Related Content

PDF
Udd for multiple web databases
PDF
Cl4201593597
PDF
2016 BE Final year Projects in chennai - 1 Crore Projects
PDF
Bi4101343346
PDF
An efficeient privacy preserving ranked keyword search
PDF
USING GOOGLE’S KEYWORD RELATION IN MULTIDOMAIN DOCUMENT CLASSIFICATION
PDF
International Journal of Computational Engineering Research(IJCER)
PDF
The International Journal of Engineering and Science (The IJES)
Udd for multiple web databases
Cl4201593597
2016 BE Final year Projects in chennai - 1 Crore Projects
Bi4101343346
An efficeient privacy preserving ranked keyword search
USING GOOGLE’S KEYWORD RELATION IN MULTIDOMAIN DOCUMENT CLASSIFICATION
International Journal of Computational Engineering Research(IJCER)
The International Journal of Engineering and Science (The IJES)

What's hot (19)

PDF
DBPEDIA BASED FACTOID QUESTION ANSWERING SYSTEM
PDF
[IJET-V2I3P19] Authors: Priyanka Sharma
PDF
Spe165 t
PDF
A Comparison between Flooding and Bloom Filter Based Multikeyword Search in P...
PDF
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
PDF
Data Mining in Multi-Instance and Multi-Represented Objects
PDF
In3415791583
PDF
a-novel-web-attack-detection-system-for-internet-of-things-via-ensemble-class...
PDF
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : ...
PDF
Record matching over query results
PDF
A study and survey on various progressive duplicate detection mechanisms
PDF
PDF
Adaptive named entity recognition for social network analysis and domain onto...
PDF
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING
PDF
Modern association rule mining methods
PDF
A Codon Frequency Obfuscation Heuristic for Raw Genomic Data Privacy
PDF
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
PDF
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
PDF
Indexing based Genetic Programming Approach to Record Deduplication
DBPEDIA BASED FACTOID QUESTION ANSWERING SYSTEM
[IJET-V2I3P19] Authors: Priyanka Sharma
Spe165 t
A Comparison between Flooding and Bloom Filter Based Multikeyword Search in P...
The Statement of Conjunctive and Disjunctive Queries in Object Oriented Datab...
Data Mining in Multi-Instance and Multi-Represented Objects
In3415791583
a-novel-web-attack-detection-system-for-internet-of-things-via-ensemble-class...
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : ...
Record matching over query results
A study and survey on various progressive duplicate detection mechanisms
Adaptive named entity recognition for social network analysis and domain onto...
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING
Modern association rule mining methods
A Codon Frequency Obfuscation Heuristic for Raw Genomic Data Privacy
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Indexing based Genetic Programming Approach to Record Deduplication
Ad

Viewers also liked (20)

PPTX
MIS G6-ES
ODP
Slideshow
PDF
HERRAMIENTAS PARA APOYAR LOS PROCESOS GERENCIALES MAPA MENTAL OSCAR NUÑEZ
PDF
Defesa representacao gj_frente_popular1
PPTX
Planning business intelligence - Persian Presentation
PPT
Analisis Clase2
PDF
30 وقفة في فن الدعوة
PPT
Una Vita da Producer - GameCamp2013
PDF
Jvm的最小使用内存测试
PPTX
Mle to super loop
DOCX
Arancel pablo guio
PDF
L’atto medico tra il paradigma della malattia e il paradigma della salute
PPT
المشروع
PDF
تلخيصات التقرير الثانى لمرصد
DOCX
Proyecto Unal2009 2
PPT
الصخور
DOCX
الحجارة اليمنية
PDF
الانتقال الديمقراطي في تونس
PPT
sophiarojas
PDF
طلب مرصد بخصوص تعديلات القا نون
MIS G6-ES
Slideshow
HERRAMIENTAS PARA APOYAR LOS PROCESOS GERENCIALES MAPA MENTAL OSCAR NUÑEZ
Defesa representacao gj_frente_popular1
Planning business intelligence - Persian Presentation
Analisis Clase2
30 وقفة في فن الدعوة
Una Vita da Producer - GameCamp2013
Jvm的最小使用内存测试
Mle to super loop
Arancel pablo guio
L’atto medico tra il paradigma della malattia e il paradigma della salute
المشروع
تلخيصات التقرير الثانى لمرصد
Proyecto Unal2009 2
الصخور
الحجارة اليمنية
الانتقال الديمقراطي في تونس
sophiarojas
طلب مرصد بخصوص تعديلات القا نون
Ad

Similar to 4.on demand quality of web services using ranking by multi criteria 31-35 (20)

DOC
Record matching over multiple query result - Document
PDF
Matching data detection for the integration system
PDF
Duplicate Detection of Records in Queries using Clustering
PDF
A LINK-BASED APPROACH TO ENTITY RESOLUTION IN SOCIAL NETWORKS
PDF
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
PPTX
Record matching over query results from Web Databases
PPTX
DOC
Power Management in Micro grid Using Hybrid Energy Storage System
PDF
Elimination of data redundancy before persisting into dbms using svm classifi...
PDF
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
PDF
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
PDF
A Primer on Entity Resolution
PDF
Data warehousing and data mining
PPT
VNSISPL_DBMS_Concepts_ch19
PDF
03. Data Preprocessing
PDF
Anomalous symmetry succession for seek out
PDF
Dotnet datamining ieee projects 2012 @ Seabirds ( Chennai, Pondicherry, Vello...
DOCX
Toward a System Building Agenda for Data Integration(and Dat.docx
PPTX
IRT Unit_ 2.pptx
PDF
Engineering challenges in vertical search engines
Record matching over multiple query result - Document
Matching data detection for the integration system
Duplicate Detection of Records in Queries using Clustering
A LINK-BASED APPROACH TO ENTITY RESOLUTION IN SOCIAL NETWORKS
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
Record matching over query results from Web Databases
Power Management in Micro grid Using Hybrid Energy Storage System
Elimination of data redundancy before persisting into dbms using svm classifi...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
A Primer on Entity Resolution
Data warehousing and data mining
VNSISPL_DBMS_Concepts_ch19
03. Data Preprocessing
Anomalous symmetry succession for seek out
Dotnet datamining ieee projects 2012 @ Seabirds ( Chennai, Pondicherry, Vello...
Toward a System Building Agenda for Data Integration(and Dat.docx
IRT Unit_ 2.pptx
Engineering challenges in vertical search engines

More from Alexander Decker (20)

PDF
Abnormalities of hormones and inflammatory cytokines in women affected with p...
PDF
A validation of the adverse childhood experiences scale in
PDF
A usability evaluation framework for b2 c e commerce websites
PDF
A universal model for managing the marketing executives in nigerian banks
PDF
A unique common fixed point theorems in generalized d
PDF
A trends of salmonella and antibiotic resistance
PDF
A transformational generative approach towards understanding al-istifham
PDF
A time series analysis of the determinants of savings in namibia
PDF
A therapy for physical and mental fitness of school children
PDF
A theory of efficiency for managing the marketing executives in nigerian banks
PDF
A systematic evaluation of link budget for
PDF
A synthetic review of contraceptive supplies in punjab
PDF
A synthesis of taylor’s and fayol’s management approaches for managing market...
PDF
A survey paper on sequence pattern mining with incremental
PDF
A survey on live virtual machine migrations and its techniques
PDF
A survey on data mining and analysis in hadoop and mongo db
PDF
A survey on challenges to the media cloud
PDF
A survey of provenance leveraged
PDF
A survey of private equity investments in kenya
PDF
A study to measures the financial health of
Abnormalities of hormones and inflammatory cytokines in women affected with p...
A validation of the adverse childhood experiences scale in
A usability evaluation framework for b2 c e commerce websites
A universal model for managing the marketing executives in nigerian banks
A unique common fixed point theorems in generalized d
A trends of salmonella and antibiotic resistance
A transformational generative approach towards understanding al-istifham
A time series analysis of the determinants of savings in namibia
A therapy for physical and mental fitness of school children
A theory of efficiency for managing the marketing executives in nigerian banks
A systematic evaluation of link budget for
A synthetic review of contraceptive supplies in punjab
A synthesis of taylor’s and fayol’s management approaches for managing market...
A survey paper on sequence pattern mining with incremental
A survey on live virtual machine migrations and its techniques
A survey on data mining and analysis in hadoop and mongo db
A survey on challenges to the media cloud
A survey of provenance leveraged
A survey of private equity investments in kenya
A study to measures the financial health of

Recently uploaded (20)

PDF
August Patch Tuesday
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PPTX
SOPHOS-XG Firewall Administrator PPT.pptx
PPTX
Programs and apps: productivity, graphics, security and other tools
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Assigned Numbers - 2025 - Bluetooth® Document
PDF
Enhancing emotion recognition model for a student engagement use case through...
PDF
Heart disease approach using modified random forest and particle swarm optimi...
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PPTX
cloud_computing_Infrastucture_as_cloud_p
PDF
Transform Your ITIL® 4 & ITSM Strategy with AI in 2025.pdf
PPTX
A Presentation on Artificial Intelligence
PDF
1 - Historical Antecedents, Social Consideration.pdf
PDF
Web App vs Mobile App What Should You Build First.pdf
PDF
NewMind AI Weekly Chronicles - August'25-Week II
PDF
A comparative study of natural language inference in Swahili using monolingua...
PDF
Hindi spoken digit analysis for native and non-native speakers
PDF
DP Operators-handbook-extract for the Mautical Institute
PDF
A novel scalable deep ensemble learning framework for big data classification...
August Patch Tuesday
MIND Revenue Release Quarter 2 2025 Press Release
Building Integrated photovoltaic BIPV_UPV.pdf
SOPHOS-XG Firewall Administrator PPT.pptx
Programs and apps: productivity, graphics, security and other tools
Digital-Transformation-Roadmap-for-Companies.pptx
Assigned Numbers - 2025 - Bluetooth® Document
Enhancing emotion recognition model for a student engagement use case through...
Heart disease approach using modified random forest and particle swarm optimi...
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
cloud_computing_Infrastucture_as_cloud_p
Transform Your ITIL® 4 & ITSM Strategy with AI in 2025.pdf
A Presentation on Artificial Intelligence
1 - Historical Antecedents, Social Consideration.pdf
Web App vs Mobile App What Should You Build First.pdf
NewMind AI Weekly Chronicles - August'25-Week II
A comparative study of natural language inference in Swahili using monolingua...
Hindi spoken digit analysis for native and non-native speakers
DP Operators-handbook-extract for the Mautical Institute
A novel scalable deep ensemble learning framework for big data classification...

4.on demand quality of web services using ranking by multi criteria 31-35

  • 1. Computer Engineering and Intelligent Systems www.iiste.org ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online) Vol 2, No.7, 2011 On Demand Quality of web services using Ranking by multi criteria Nagelli Rajanikath Vivekananda Institute of Science & Technology, JNT University Karimanagar. Pin : 505001 Mobile No:9866491911, E-mail: rajanikanth86@gmail.com Prof. P. Pradeep Kumar HOD, Dept of CSE(VITS), JNT University Karimnagar , Pin : 505001 E-mail: pkpuram@yahoo.com Asst. Prof. B. Meena Dept of CSE(VITS), JNT University Karimnagar , Pin : 505001 E-mail: vinaymeena@gmail.com Received: 2011-10-13 Accepted: 2011-10-19 Published: 2011-11-04 Abstract In the Web database scenario, the records to match are highly query-dependent, since they can only be obtained through online queries. Moreover, they are only a partial and biased portion of all the data in the source Web databases. Consequently, hand-coding or offline-learning approaches are not appropriate for two reasons. First, the full data set is not available beforehand, and therefore, good representative data for training are hard to obtain. Second, and most importantly, even if good representative data are found and labeled for learning, the rules learned on the representatives of a full data set may not work well on a partial and biased part of that data set. Keywords: SOA, Web Services, Networks 1. Introduction Today, more and more databases that dynamically generate Web pages in response to user queries are available on the Web. These Web databases compose the deep or hidden Web, which is estimated to contain a much larger amount of high quality, usually structured information and to have a faster growth rate than the static Web. Most Web databases are only accessible via a query interface through which users can submit queries. Once a query is received, the Web server will retrieve the corresponding results from the back-end database and return them to the user. 
To build a system that helps users integrate and, more importantly, compare the query results returned from multiple Web databases, a crucial task is to match the different sources’ records that refer to the same real-world entity. The problem of identifying duplicates, that is, two (or more) records describing the same entity, has attracted much attention from many research fields, including Databases, Data Mining, Artificial 31 | P a g e www.iiste.org
  • 2. Computer Engineering and Intelligent Systems www.iiste.org ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online) Vol 2, No.7, 2011 Intelligence, and Natural Language Processing. Most previous work is based on predefined matching rules hand-coded by domain experts or matching rules learned offline by some learning method from a set of training examples. Such approaches work well in a traditional database environment, where all instances of the target databases can be readily accessed, as long as a set of high-quality representative records can be examined by experts or selected for the user to label. 1.1 Previous Work Data integration is the problem of combining information from multiple heterogeneous databases. One step of data integration is relating the primitive objects that appear in the different databases specifically, determining which sets of identifiers refer to the same real-world entities. A number of recent research papers have addressed this problem by exploiting similarities in the textual names used for objects in different databases. (For example one might suspect that two objects from different databases named “USAMA FAYYAD” and “Usama M. Fayyad” ” respectively might refer to the same person.) Integration techniques based on textual similarity are especially useful for databases found on the Web or obtained by extracting information from text, where descriptive names generally exist but global object identifiers are rare. Previous publications in using textual similarity for data integration have considered a number of related tasks. Although the terminology is not completely standardized, in this paper we define entity-name matching as the task of taking two lists of entity names from two different sources and determining which pairs of names are co-referent (i.e., refer to the same real-world entity). 
We define entity-name clustering as the task of taking a single list of entity names and assigning entity names to clusters such that all names in a cluster are co-referent. Matching is important in attempting to join information across of pair of relations from different databases, and clustering is important in removing duplicates from a relation that has been drawn from the union of many different information sources. Previous work in this area includes work in distance functions for matching and scalable matching and clustering algorithms. Work in record linkage is similar but does not rely as heavily on textual similarities. [1] Important business decisions; therefore, accuracy of such analysis is crucial. However, data received at the data warehouse from external sources usually contains errors: spelling mistakes, inconsistent conventions, etc. Hence, significant amount of time and money are spent on data cleaning, the task of detecting and correcting errors in data. The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality. [2] Many times, the same logical real world entity may have multiple representations in the data warehouse. For example, when Lisa purchases products from SuperMart twice, she might be entered as two different customers due to data entry errors. Such duplicated information can significantly increase direct mailing costs because several customers like Lisa may be sent multiple catalogs. Moreover, such duplicates can cause incorrect results in analysis queries (say, the number of SuperMart customers in Seattle), and erroneous data mining models to be built. We refer to this problem of detecting and eliminating multiple distinct records representing the same real world entity as the fuzzy duplicate elimination problem, which is sometimes also called merge/purge, dedup, record linkage problems. 
This problem is different from the standard duplicate elimination problem, say for answering “select distinct” queries, in relational database systems which considers two tuples to be duplicates if they match exactly on all attributes. However, data cleaning deals with fuzzy duplicate elimination, which is our focus in this paper. Henceforth, we use duplicate elimination to mean fuzzy duplicate elimination. [2] 2. Proposed System 2.1 Weighted Component Similarity Summing Classifier In our algorithm, classifier plays a vital role. At the beginning, it is used to identify some duplicate vectors when there are no positive examples available. Then, after iteration begins, it is used again to cooperate with other classifier to identify new duplicate vectors. Because no duplicate vectors are available initially, classifiers that need class information to train, such as decision tree, cannot be used. An intuitive method to identify duplicate vectors is to assume that two records are duplicates if most of their fields that are under consideration are similar. On the other hand, if all corresponding fields of the two records are dissimilar, it 32 | P a g e www.iiste.org
is unlikely that the two records are duplicates. To evaluate the similarity between two records, we combine the values of each component in the similarity vector for the two records. Different fields may have different importance when deciding whether two records are duplicates. This importance is usually data-dependent, which, in turn, depends on the query in the Web database scenario.

2.2 Component Weight Assignment
In this classifier, we assign a weight to each component to indicate the importance of its corresponding field. The similarity between two duplicate records should be close to 1, so for a duplicate vector formed by a pair of duplicate records r1 and r2, we need to assign large weights to the components with large similarity values and small weights to the components with small similarity values. Conversely, the similarity for two nonduplicate records should be close to 0, so for a nonduplicate vector formed by a pair of nonduplicate records, we assign small weights to the components with large similarity values and large weights to the components with small similarity values. A component is therefore assigned a small weight if it usually has a small similarity value in the duplicate vectors.

2.3 Duplicate Identification
After we assign a weight to each component, duplicate vector detection is rather intuitive: two records r1 and r2 are duplicates if they are similar, i.e., if their similarity value is equal to or greater than a similarity threshold. In general, the threshold should be close to 1 to ensure that the identified duplicates are correct. Increasing the similarity threshold reduces the number of duplicate vectors identified while, at the same time, making the identified duplicates more precise.
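The weighted component similarity summing idea can be sketched as follows. This is a hypothetical illustration, not the paper's exact formulas: the per-field similarity here is a normalized edit distance (any string similarity could be plugged in, as noted below), and the weights and threshold are invented example values:

```java
public class WcssSketch {
    // Normalized edit-distance similarity in [0,1]; 1 means identical strings.
    static double similarity(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) d[a.length()][b.length()] / max;
    }

    // Weighted sum over the components of the similarity vector for one record pair.
    static double wcss(String[] r1, String[] r2, double[] weights) {
        double score = 0;
        for (int k = 0; k < weights.length; k++)
            score += weights[k] * similarity(r1[k], r2[k]);
        return score; // lies in [0,1] when the weights sum to 1
    }

    public static void main(String[] args) {
        String[] r1 = {"Lisa Simpson", "12 Evergreen Terrace", "Seattle"};
        String[] r2 = {"Lisa Simspon", "12 Evergreen Ter.",   "Seattle"};
        double[] weights = {0.5, 0.3, 0.2}; // larger weight = more important field
        double threshold = 0.85;            // close to 1, so identified duplicates are precise

        double score = wcss(r1, r2, weights);
        System.out.println(score >= threshold ? "duplicate" : "non-duplicate"); // prints duplicate
    }
}
```

In the full algorithm the weights would be derived from the identified duplicate and nonduplicate vectors as described above, rather than fixed by hand as in this sketch.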
2.4 Similarity Calculation
The similarity calculation quantifies the similarity between a pair of record fields. Since the query results to match are extracted from HTML pages, i.e., text files, we consider only string similarity. Given a pair of strings (Sa, Sb), a similarity function calculates a similarity score between Sa and Sb, which must lie between 0 and 1. Since the similarity function is orthogonal to the iterative duplicate detection, any similarity calculation method can be employed. Domain knowledge or user preference can also be incorporated into the similarity function; in particular, the similarity function can be learned if training data is available.

3. Results
The concept of this paper is implemented, and the results are discussed below.
3.1 Performance Analysis
The proposed approach is implemented in Java on a Pentium-III PC with a 20 GB hard disk and 256 MB of RAM, running the Apache web server. The proposed approach shows efficient results and has been tested on different message sets.

4. Conclusion
Duplicate detection is an important step in data integration, and most state-of-the-art methods are based on offline learning techniques, which require training data. In the Web database scenario, where the records to match are greatly query-dependent, a pretrained approach is not applicable, as the set of records in each query's results is a biased subset of the full data set. To overcome this problem, we presented an unsupervised, online approach, UDD, for detecting duplicates over the query results of multiple Web databases. Two classifiers, WCSS and SVM, are used cooperatively in the convergence step of record matching to identify the duplicate pairs from all potential duplicate pairs iteratively.

References
[1] W.W. Cohen and J. Richman, "Learning to Match and Cluster Large High-Dimensional Datasets for Data Integration," Proc. ACM SIGKDD, pp. 475-480, 2002.
[2] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, "Eliminating Fuzzy Duplicates in Data Warehouses," Proc. 28th Int'l Conf. Very Large Data Bases, pp. 586-597, 2002.
[3] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning," Proc. ACM SIGMOD, pp. 313-324, 2003.
[4] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, "Approximate String Joins in a Database (Almost) for Free," Proc. 27th Int'l Conf. Very Large Data Bases, pp. 491-500, 2001.
[5] X. Dong, A. Halevy, and J. Madhavan, "Reference Reconciliation in Complex Information Spaces," Proc. ACM SIGMOD, pp. 85-96, 2005.
[6] W.W. Cohen, H. Kautz, and D. McAllester, "Hardening Soft Information Sources," Proc. ACM SIGKDD, pp. 255-259, 2000.
[7] P. Christen, T. Churches, and M. Hegland, "Febrl—A Parallel Open Source Data Linkage System," Advances in Knowledge Discovery and Data Mining, pp. 638-647, Springer, 2004.
[8] P. Christen and K. Goiser, "Quality and Complexity Measures for Data Linkage and Deduplication," Quality Measures in Data Mining, F. Guillet and H. Hamilton, eds., vol. 43, pp. 127-151, Springer, 2007.