International Journal of Electrical and Computer Engineering (IJECE)
Vol. 13, No. 1, February 2023, pp. 1008~1014
ISSN: 2088-8708, DOI: 10.11591/ijece.v13i1.pp1008-1014
Journal homepage: http://ijece.iaescore.com
Matching data detection for the integration system
Merieme El Abassi1, Mohamed Amnai1, Ali Choukri1, Youssef Fakhri1, Noreddine Gherabi2
1 Laboratory of Computer Sciences Research, Faculty of Sciences, Ibn Tofail University, Kenitra, Morocco
2 National School of Applied Sciences, Sultan Moulay Slimane University, Khouribga, Morocco
Article Info
Article history: Received Nov 9, 2021; Revised Sep 22, 2022; Accepted Oct 5, 2022

ABSTRACT
The purpose of data integration is to combine the multiple sources of heterogeneous data available on the internet, such as text, images, and video. After this stage, the data becomes large, so it must be analyzed in a form that supports efficient query execution. However, entity resolution problems arise, and different techniques are needed to analyze and verify data quality in order to achieve good data management. When only a single database is involved, this mechanism is called deduplication. To address these problems, this article proposes a method to compute the similarity between potentially duplicate records. The solution relies on graph technology to narrow the search space for similar features. A composite mechanism is then used to locate the most similar records in the database, improving data quality so that good decisions can be made from heterogeneous sources.
Keywords: Data integration; Data matching; Data quality; Entity resolution
This is an open access article under the CC BY-SA license.
Corresponding Author:
Merieme El Abassi
Laboratory of Computer Sciences Research, Faculty of Sciences, Ibn Tofail University Kenitra
Kenitra, Morocco
Email: merieme.elabassi@uit.ac.ma
1. INTRODUCTION
Big data resembles ordinary data but comes in much larger volumes and with a higher level of complexity, which makes it very difficult to manage with conventional database management tools [1]. Big data is characterized by a set of properties, including volume, veracity, variety, and velocity. Volume represents the size of the data, which can reach terabytes or more. Velocity denotes how fast the data arrives. Variety means that the data can come in structured, semi-structured, and unstructured formats. Today, data has become a key asset of companies and management departments, contributing to their development. Decisions based on low-quality data can be very costly, hurting businesses, partners, and customers. Furthermore, management departments and companies need to improve their relationships through data governance. Hence, good data quality is very important for companies, especially when they interact with other organizations or make big decisions.
Most suggested methods concentrate on the structure of the data to be cleaned or integrated, defining metrics and procedures to address data quality issues. Thus, to obtain useful data, we need to analyze it within the scope of its usage [2]. Integration projects often require additional support to improve data quality, because few companies apply data quality management procedures to the databases or data warehouses they have created.
Currently, entity resolution is an active research area within the field of data quality [3]–[5]. Just as online mining of relationships and entities has established extensive public knowledge bases, companies, governments, and researchers can exploit the true value of this data, which only becomes usable once multiple data sources are integrated. Entity resolution refers to the task of identifying records from the
same entity in one or more data sources [6]–[9]. When only one database is used, this strategy is called deduplication [10]. A second problem is comparing the records that might match within an entity collection. It is impractical to use traditional comparison methods when dealing with big data. Therefore, similar records are generally grouped into blocks, and only records that appear in the same block are compared, which improves the efficiency of the entity resolution algorithm. However, data spread across multiple sources usually follows different standards, so pattern matching must be performed between the data sources before analyzing the entities. Traditional pattern matching is no longer effective for large amounts of highly heterogeneous and noisy data on the network.
Several techniques are currently used to estimate the likelihood that records match. These include sorted neighborhood blocking (SN) [11], the duplicate count strategy (DCS) [12], and q-gram based indexing. These entity resolution techniques operate on all records in the dataset: each record is assigned to one or more blocks, and candidate pairs are then formed for comparison. These methods therefore take a long time to identify possible pairings, and the probability of error is also high because the search space grows with the amount of data. Consequently, to minimize these problems, we need a technique that reduces the search time while still achieving good results.
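As a point of reference for the blocking techniques mentioned above, the following is a minimal sketch (not the paper's code) of sorted neighborhood blocking in Scala; the Record fields and the window size are hypothetical.

```scala
// Sorted neighborhood blocking: sort records on a blocking key, then compare
// only records that fall inside the same sliding window of size w.
case class Record(id: Int, companyName: String, postcode: String)

object SortedNeighborhood {
  def candidatePairs(records: Seq[Record], w: Int): Seq[(Record, Record)] = {
    // Blocking key: postcode followed by company name.
    val sorted = records.sortBy(r => (r.postcode, r.companyName))
    sorted.sliding(w).flatMap { window =>
      for {
        i <- window.indices
        j <- (i + 1) until window.size
      } yield (window(i), window(j))
    }.toSeq.distinct
  }
}
```

Compared with an all-pairs comparison, the number of candidate pairs drops from quadratic in the dataset size to roughly linear for a fixed window size, which is the efficiency gain these methods trade against possible missed matches.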
The main focus of this study is on exploiting graph mechanisms to solve the problem of big data
diversity. Here, a technique is discussed to identify matching data resulting from the integration of different
types of sources such as databases, comma-separated values (CSV), spreadsheets, web services, and other
formats. These duplicate data cause many problems when decisions are made. For example, many records may represent the same entity with minor differences, such as names listed in a different order, misspelled names or erroneous information, or different abbreviations used in each source database, which leads each duplicate to be treated as a new record. So, the goal is to convert the data into a graph in which the search space for duplicate records is reduced, records that share data are identified, and algorithms are then applied to obtain a match ratio between records that share some information.
The contributions of this paper cover five aspects. First, solving the data integration problem based on a deduplication approach is proposed. Second, the problems of data integration are described. Third, the proposal is presented, while the fourth part focuses on the evaluation of the proposed method. In the last part, conclusions and ideas for future extensions are discussed.
2. RELATED WORK
Many studies have addressed the challenges of data integration that need to be overcome. Yeganeh et al. [13] discussed the problem of taking user preferences for data quality into account in several settings in order to improve user satisfaction with query results. Improving query results supports the users' decision-making process and should lead to higher user satisfaction, which makes data quality an interdisciplinary field of study. Different synergies have been proposed to provide
comprehensive data quality solutions [14]. Several researchers have worked on grouping the dimensions of
data quality into conceptual views of data, data values, and data formats. Similar to the above work,
Bovee et al. [15] and Jarke et al. [16] recommended classifying the data quality dimensions based on the user's role in the data warehouse environment. They also propose different dimensions and sub-dimensions,
based on the concept of data quality as applicability. The problem of describing the quality of data sources is
at the core of data integration and exchange [17], [18]. Talburt [19] addressed entity resolution and information quality, discussing the Fellegi-Sunter theory of record linkage, the Stanford entity resolution framework, and the algebraic model for entity resolution, which are the main theoretical models that support entity resolution. The elimination of redundant data and the support of master data management programs are also discussed through the concept of entity resolution and the use of the OYSTER open-source system [20].
In addressing the entity resolution problem, a method for distributing the workload between the
different computing nodes is proposed [21]. The approach is based on the use of MapReduce with standard
blocking. Benny et al. [22] proposed similar work for entity resolution on Hadoop, which works with
semi-structured data. It also contains the preprocessing tasks and the results of the comparison, indexing, and
classification. The use of different classification methods and their results is described to improve future
comparisons. Papadakis et al. [23] proposed different techniques. The first technique aims to remove the
superfluous comparisons from any redundancy-based blocking method, and the second solution is used for
reducing the space requirements which are mapped to the Cartesian space.
Yan et al. [24] developed a methodology called multi-singer that supports structured and unstructured data types, as well as tasks covering data preprocessing and comparison reduction. The objective of the work in [1] is to apply deduplication techniques to different merged data sources in order to compare formats and find duplicate elements of the same type. Various techniques for preprocessing and testing large amounts of data are proposed in [25]; this system matches entities assigned to the same
block. It uses dynamic blocking to achieve high performance, reducing the search space and covering the
same entities in blocking steps.
3. PROBLEM FORMULATION
Data inconsistency can be caused by heterogeneous data sources, which means more tools and
techniques to optimize unstructured data are needed. In addition, structured data allows us to run query
processes to filter, analyze, and use this data to make business decisions and build organizational capabilities
[26]. Organizations face a different challenge when they need to expand to accommodate a wide range of
data and create new domains. That is why a solution must be found. This involves creating high performance
computing environments with advanced data storage systems, which reduce latency while improving reliability and giving fast access to data. Big data entities pose many challenges when large data sets accumulate from different sources. Common ways of combining and executing queries and algorithms across these sources must therefore be found.
The wave of big data will soon spread to all areas of life, which will not only provide humans with
unprecedented opportunities, but also bring major challenges. Therefore, it is very necessary to effectively
solve diversity and heterogeneous data issues in big data integration [27]. Thus, the quality of information
has become the latest technical requirement for users. To evaluate and enhance data quality, it is essential to
implement continuous enhancement strategies. By combining data, duplicates can be identified and then actions can be taken, such as merging two similar or identical records into one. This also helps identify equally important non-duplicates, since two similar things are not necessarily the same. Overall, deduplication is a process used for removing duplicate data in one or more databases containing poor-quality information, and it is a very important technique for obtaining good data quality. It can also be a difficult task, because the same entity can be referred to by different names, and these names can also contain typos, so matching plain strings is not enough.
For example, if a dataset consists of many records and the task is to find the records that represent the same entity, the problem is difficult to solve in a simple way. Because each entity may use a different lexical representation, direct string matching will not find the duplicate records, as the following example shows: the records (1 MOBILE LIMITED, 30 CITY RD) and (1 MOBILE Ltd, 30 CITY ROAD) represent the same entity but use different vocabulary. Therefore, in this paper, a processing technique using Spark's rich API is proposed.
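To make the failure of direct string matching concrete, the snippet below is a small Scala illustration (the two strings are taken from the example above; the token-overlap measure is only an illustrative device, not part of the proposed method).

```scala
// Exact string comparison misses the duplicate pair.
val a = "1 MOBILE LIMITED, 30 CITY RD"
val b = "1 MOBILE Ltd, 30 CITY ROAD"

println(a == b)                // false: exact matching fails
println(a.equalsIgnoreCase(b)) // still false: LIMITED vs Ltd, RD vs ROAD

// A crude token-level overlap (Jaccard) already hints that the records match.
val ta = a.toUpperCase.split("[ ,]+").toSet
val tb = b.toUpperCase.split("[ ,]+").toSet
val jaccard = ta.intersect(tb).size.toDouble / ta.union(tb).size
println(f"token Jaccard = $jaccard%.2f") // 0.50 despite the exact-match failure
```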
4. PROPOSED METHOD
The focus of this article is on exploiting graphs to solve the problem of big data variety. In particular, an approach is described to represent the data from multiple sources as a graph of entities and edges, as shown in Figure 1. Consider the database R = {r1, r2, r3, ..., rn}, whose records carry the attributes A = {R.A1, R.A2, ..., R.Am}. Also consider the function G(ri), which maps a record ri ∈ R to the graph, and the function S(ri, rj), which measures the similarity between two records. If S(ra, rb) is close to 1, record ra is similar to record rb. Typically, the mapping G(ri) is a graph-construction method that converts relational data into a graph. In simple terms, such a graph represents the data with vertices and edges. Here, the relational data is rearranged in the graph so that entities share common nodes.
Figure 1. Schema of a graph of the dataset
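The following is a minimal sketch, assuming Spark with the GraphFrames package, of how records and their attribute values can be loaded as such a graph; the column names follow the GraphFrames convention ("id", "src", "dst"), and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("dedup-graph").getOrCreate()
import spark.implicits._

// Vertices: one node per record and one node per shared attribute value.
val vertices = Seq(
  ("r1",        "record",  "1 MOBILE LIMITED"),
  ("r1001",     "record",  "1 MOBILE Ltd."),
  ("a30cityrd", "address", "30 CITY RD")
).toDF("id", "kind", "label")

// Edges: each record points to the attribute nodes it contains, so records
// sharing a value end up connected through a common node.
val edges = Seq(
  ("r1",    "a30cityrd", "hasAddress"),
  ("r1001", "a30cityrd", "hasAddress")
).toDF("src", "dst", "relationship")

val g = GraphFrame(vertices, edges)
g.vertices.show(false)
```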
Like most technologies, there are various alternative approaches to forming the essential
components of a graph database. One of these methods is the property graph model, in which data is divided
into nodes, relations, and attributes (data stored in the nodes or relations). Thus, nodes represent instances or entities of the graph and can contain arbitrary information in the form of attributes. In addition, an edge represents the link
between two entities, and it always needs a direction, a start and an end node, and a type. The architecture of
the technology is shown in Figure 2, which is roughly divided into four steps.
a) Select the files for deduplication: The first phase consists of loading the source files to be processed by the deduplication technique.
b) Create the graph: In the entity graph creation stage, the first step is blocking, which creates two types of entities from the set of input records: the first type belongs to the master data, and the second type belongs to the other data to be matched.
c) Detect potential duplicate entities: The next step, perhaps the most important and also the most expensive, is to find possible matching records. The goal is to find records that look alike without requiring identical values in every field. The connection conditions are very specific: GraphFrames motif finding is used to reduce the search space and surface potential duplicates (see the sketch after this list). The conditions are loose enough to capture a varying number of candidates, yet selective enough to avoid Cartesian products, which should be avoided at all costs. Through the motif finding query, possible pairs of duplicates are retrieved.
d) Compute the similarity between potential duplicates: This step measures the similarity between potential duplicates after computing a vector representation of the graph entities. In this approach, term frequency-inverse document frequency (TF-IDF) weighting is used; the weight is a statistical measure that evaluates the importance of a word to a document in a collection or corpus. The resulting vectors are then used to calculate the similarity between two records by taking the cosine of the angle between their vectors.
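A hedged sketch of step c), continuing from the GraphFrame g built in the earlier snippet (the motif pattern and filters are illustrative, not the authors' exact query): two record nodes that point to the same attribute node form a candidate pair, which replaces the Cartesian product of all records.

```scala
// Motif finding: (a) and (b) are two vertices connected to the same vertex (v).
val candidates = g
  .find("(a)-[e1]->(v); (b)-[e2]->(v)")               // records sharing one attribute node
  .filter("a.kind = 'record' AND b.kind = 'record'")   // keep only record-to-record pairs
  .filter("a.id < b.id")                               // drop self-pairs and symmetric duplicates
  .selectExpr("a.id as aid", "a.label as alabel", "b.id as bid", "b.label as blabel")
  .dropDuplicates()

candidates.show(false)
```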
Figure 2. General architecture
5. EXPERIMENTS AND RESULTS
The method proposed in the previous section computes the similarity used to find records that refer to the same duplicate elements. In this section, we evaluate it on sample input data and show the results. Table 1 shows a partial view of the data set used; it holds information about the addresses of some companies stored in a CSV file.
Table 2 details records that contain the same address but in different formats; this type of duplicate is not resolved by the basic techniques (sorted neighborhood blocking, duplicate count strategy, or q-gram based indexing). Although the records contain the same address, the company name is represented in a different format in each tuple of the same company. In the other columns, some information is missing or appears in a different format.
Table 1. Input of sample data
ID | Company_Name            | Address_Line1          | Address_Line2     | Post_Town | County | PostCode
1  | 1 MOBILE LIMITED        | 30 CITY ROAD           | null              | LONDON    | null   | EC1Y 2AB
2  | 1 TECH LTD              | 57 CHARTERHOUSE STREET | null              | LONDON    | null   | EC1M 6HA
3  | 23 SNAPS LIMITED        | 16 BOWLING GREEN LANE  | null              | LONDON    | null   | EC1R 0BD
4  | 2E2 SERVICES LIMITED    | 200 200 ALDERSGATE     | ALDERSGATE STREET | LONDON    | null   | EC1A 4HD
5  | 2E2 UK LIMITED          | 200 ALDERSGATE         | ALDERSGATE STREET | LONDON    | null   | EC1A 4HD
6  | 40 50 MEDIA LTD         | 145-157 ST JOHN STREET | null              | LONDON    | null   | EC1V 4PW
7  | 4D DATA CENTRES LIMITED | 30 CITY ROAD           | LONDON            | null      | null   | EC1 2AB
8  | 4GETMOBILE LIMITED      | 152 KEMP HOUSE         | CITY ROAD         | LONDON    | null   | EC1V 2NX
Table 2. An example of duplicate data
ID   | Company_name          | Address_line1   | Address_line2 | Post_town | Country | Postcode
1001 | 1 MOBILE LIMITED      | 30 CITY RD.     | null          | null      | null    | null
1002 | 1 MOBILE Ltd.         | 30 CITY RD      | null          | null      | null    | EC1Y 2AB
1003 | 1 MOBILE              | null            | CITY ROAD     | London    | null    | null
1004 | GLOBAL TECHNOLOGY     | null            | null          | LONDON    | null    | null
1005 | GLOBAL TECHNOLOGY Ltd | 160 CITY RD     | LONDON        | null      | null    | EC1V 2NX
1006 | 1 TECH                | 57 CHARTERHOUSE | null          | LONDON    | null    | null
As shown in Figure 3, the data set is divided into two sets. The first, called the master data, contains the correct records, and a unique identifier named aid is assigned to it. The second dataset, the transaction data, holds the records that may identify the same company and is identified by bid. In the above example, a deterministic similarity is computed from feature vectors calculated with the term frequency-inverse document frequency (TF-IDF) method (afeatures and bfeatures, shown in Figure 4). The result shown in Figure 3 is then used to draw the chart of similarity between aid and bid shown in Figure 4, which detects the most similar tuples in the input dataset.
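As an illustrative sketch of this feature step, assuming Spark ML (the column names aid, bid, afeatures, and bfeatures mirror those mentioned above; the pipeline and the toy self-join are not the authors' exact code), TF-IDF vectors can be built per record and compared with a cosine similarity, keeping pairs at or above the 0.4 threshold used below.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.feature.{Tokenizer, HashingTF, IDF}
import org.apache.spark.ml.linalg.Vector

val spark = SparkSession.builder().appName("dedup-tfidf").getOrCreate()
import spark.implicits._

// One text "document" per record, concatenating its attribute values.
val docs = Seq(
  (1,    "1 MOBILE LIMITED 30 CITY ROAD LONDON EC1Y 2AB"),
  (1002, "1 MOBILE Ltd 30 CITY RD EC1Y 2AB")
).toDF("id", "text")

val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(docs)
val tf    = new HashingTF().setInputCol("words").setOutputCol("tf")
              .setNumFeatures(1 << 12).transform(words)
val tfidf = new IDF().setInputCol("tf").setOutputCol("features").fit(tf).transform(tf)

// Cosine similarity between two TF-IDF vectors.
val cosine = udf { (x: Vector, y: Vector) =>
  val dot = x.toArray.zip(y.toArray).map { case (a, b) => a * b }.sum
  val nx  = math.sqrt(x.toArray.map(v => v * v).sum)
  val ny  = math.sqrt(y.toArray.map(v => v * v).sum)
  if (nx == 0.0 || ny == 0.0) 0.0 else dot / (nx * ny)
}

// Pair the records (here a toy self-join; in the method the pairs come from
// the motif-finding step) and keep similarities >= 0.4.
val pairs = tfidf.select($"id".as("aid"), $"features".as("afeatures"))
  .crossJoin(tfidf.select($"id".as("bid"), $"features".as("bfeatures")))
  .filter($"aid" < $"bid")
  .withColumn("similarity", cosine($"afeatures", $"bfeatures"))
  .filter($"similarity" >= 0.4)

pairs.select("aid", "bid", "similarity").show()
```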
Figure 4 shows a comparison between the transaction data and the master data when the similarity is greater than or equal to 0.4, which allows the seven most compatible records to be combined. In Figure 4, each color identifies a record in the master data and the x-axis identifies a record of the transactional data. The advantages of this approach are its ease of implementation and its reduced computational complexity: because the data is structured as a graph, duplicates can be detected across any of the database sources.
Figure 3. A sample comparison of similar
Figure 4. Comparison between the transaction data and the basic data
6. CONCLUSION AND FURTHER RESEARCH
In this paper, a method for solving the deduplication problem is presented using the Apache Spark framework and the Scala language. The existing data integration process is analyzed: when data in different formats is combined into a single format, there is a chance of duplication in that format, as discussed. The proposed method computes a similarity measure to detect potentially duplicate data. The project can be expanded with further improvements that reduce the number of comparisons between different records and the computation time, such as using parallel computation. In future research, a method for big data integration based on Karma modeling will be explored.
ACKNOWLEDGEMENTS
The authors thank the anonymous reviewer for offering many suggestions to improve the quality of
this paper. The authors acknowledge financial support from the National Scientific Research and Technology
Center (CNRST) as part of the research grant program.
REFERENCES
[1] S. Garg and A. Bala, “Semantic analysis of big data by applying de-duplication techniques,” 2016 International Conference on
Inventive Computation Technologies (ICICT), Aug. 2016, doi: 10.1109/inventive.2016.7830171.
[2] A. Ben Salem, “Contextual data quality: detection and cleaning driven by data semantics,” (in French), PhD Thesis, Université
Sorbonne Paris Cité, 2015.
[3] I. N. Chengalur-Smith, D. P. Ballou, and H. L. Pazer, “The impact of data quality information on decision making: an exploratory
analysis,” IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 6, pp. 853–864, 1999, doi: 10.1109/69.824597.
[4] R. Y. Wang, H. B. Kon, and S. E. Madnick, “Data quality requirements analysis and modeling,” in Proceedings of IEEE 9th
International Conference on Data Engineering, 1993, pp. 670–677, doi: 10.1109/ICDE.1993.344012.
[5] T. Hongxun et al., “Data quality assessment for on-line monitoring and measuring system of power quality based on big data and
data provenance theory,” in 2018 IEEE 3rd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA),
Apr. 2018, pp. 248–252, doi: 10.1109/ICCCBDA.2018.8386521.
[6] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, “Duplicate record detection: a survey,” IEEE Transactions on Knowledge
and Data Engineering, vol. 19, no. 1, pp. 1–16, Jan. 2007, doi: 10.1109/TKDE.2007.250581.
[7] H. Köpcke and E. Rahm, “Frameworks for entity matching: A comparison,” Data and Knowledge Engineering, vol. 69, no. 2,
pp. 197–210, Feb. 2010, doi: 10.1016/j.datak.2009.10.003.
[8] P. Christen, “A survey of indexing techniques for scalable record linkage and deduplication,” IEEE Transactions on Knowledge
and Data Engineering, vol. 24, no. 9, pp. 1537–1555, Sep. 2012, doi: 10.1109/TKDE.2011.127.
[9] X. L. Dong and D. Srivastava, “Big data integration,” in 2013 IEEE 29th International Conference on Data Engineering (ICDE),
Apr. 2013, pp. 1245–1248, doi: 10.1109/ICDE.2013.6544914.
[10] "USENIX Association," Proceedings of FAST '11: 9th USENIX Conference on File and Storage Technologies, 2011, Accessed: Nov. 08, 2021. [Online]. Available: https://www.usenix.org/legacy/events/fast11/tech/full_papers/fast11_proceedings.pdf
[11] L. Kolb, A. Thor, and E. Rahm, “Parallel sorted neighborhood blocking with MapReduce,” arXiv:1010.3053, Oct. 2010.
[12] U. Draisbach, F. Naumann, S. Szott, and O. Wonneberg, “Adaptive windows for duplicate detection,” in 2012 IEEE 28th
International Conference on Data Engineering, Apr. 2012, pp. 1073–1083, doi: 10.1109/ICDE.2012.20.
[13] N. K. Yeganeh, S. Sadiq, and M. A. Sharaf, “A framework for data quality aware query systems,” Information Systems, vol. 46,
pp. 24–44, Dec. 2014, doi: 10.1016/j.is.2014.05.005.
[14] S. Sadiq, “Prologue: research and practice in data quality management,” in Handbook of Data Quality, Berlin, Heidelberg:
Springer Berlin Heidelberg, 2013, pp. 1–11.
[15] M. Bovee, R. P. Srivastava, and B. Mak, “A conceptual framework and belief-function approach to assessing overall information
quality,” International Journal of Intelligent Systems, vol. 18, no. 1, pp. 51–74, Jan. 2003, doi: 10.1002/int.10074.
[16] M. Jarke, M. Lenzerini, Y. Vassiliou, and P. Vassiliadis, “Multidimensional data models and aggregation,” in Fundamentals of
Data Warehouses, Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 87–105.
[17] A. Doan, A. Halevy, and Z. Ives, “Describing data sources,” in Principles of Data Integration, Elsevier, 2012, pp. 65–94.
[18] A. Doan, A. Halevy, and Z. Ives, “Incorporating uncertainty into data integration,” in Principles of Data Integration, Elsevier,
2012, pp. 345–357.
[19] J. Talburt, Entity resolution and information quality. Elsevier, 2010.
[20] J. R. Talburt and Y. Zhou, “A practical guide to entity resolution with OYSTER,” in Handbook of Data Quality, Berlin,
Heidelberg: Springer Berlin Heidelberg, 2013, pp. 235–270.
[21] D. Karapiperis and V. S. Verykios, “Load-balancing the distance computations in record linkage,” ACM SIGKDD Explorations
Newsletter, vol. 17, no. 1, pp. 1–7, Sep. 2015, doi: 10.1145/2830544.2830546.
[22] S. P. Benny, S. Vasavi, and P. Anupriya, “Hadoop framework for entity resolution within high velocity streams,” Procedia
Computer Science, vol. 85, pp. 550–557, 2016, doi: 10.1016/j.procs.2016.05.218.
[23] G. Papadakis, E. Ioannou, C. Niederée, T. Palpanas, and W. Nejdl, “Eliminating the redundancy in blocking-based entity
resolution methods,” Proceeding of the 11th annual international ACM/IEEE joint conference on Digital libraries-JCDL ’11,
2011, doi: 10.1145/1998076.1998093.
[24] C. Yan, Y. Song, J. Wang, and W. Guo, “Eliminating the redundancy in MapReduce-based entity resolution,” in 15th IEEE/ACM
International Symposium on Cluster, Cloud and Grid Computing, 2015, pp. 1233–1236, doi: 10.1109/CCGrid.2015.24.
[25] A. C. Mon and M. M. S. Thwin, “Efficient dynamic blocking scheme for entity resolution systems,” in International Conference
on Advances in Engineering and Technology (ICAET’2014) 2014, pp. 167–171, doi: 10.15242/IIE.E0314076.
[26] A. Kadadi, R. Agrawal, C. Nyamful, and R. Atiq, “Challenges of data integration and interoperability in big data,” in 2014 IEEE
International Conference on Big Data (Big Data), Oct. 2014, pp. 38–40, doi: 10.1109/BigData.2014.7004486.
[27] W. Xiao, L. Guoqi, and L. Bin, “Research on big data integration based on Karma modeling,” in 2017 8th IEEE International
Conference on Software Engineering and Service Science (ICSESS), 2017, pp. 245–248, doi: 10.1109/ICSESS.2017.8342906.
BIOGRAPHIES OF AUTHORS
Merieme El Abassi received her fundamental bachelor's degree in Mathematics and Computer Science in 2017 from Ibn Tofail University, Kenitra, Morocco. She then obtained her master's degree in Big Data and Cloud Computing in 2020 from Ibn Tofail University, Kenitra, Morocco. The author is currently a Ph.D. researcher in data integration.
Also, she is currently doing her research in the Computer Science Research Laboratory,
Kenitra, Morocco. She can be contacted at merieme.elabassi@uit.ac.ma.
Mohamed Amnai received his IEEA (Computer, Electronics, Electrical Engineering, and Automation) degree in 2000 from Moulay Ismail University, Errachidia, Morocco. In 2007, he received a master's degree from Ibn Tofail University in Kenitra. In 2011, he received his Ph.D. in Telecommunications and Computer Science from Ibn Tofail University in Kenitra, Morocco. Since March 2014, he has been an assistant professor at the Khouribga National School of Applied Sciences, University of Settat, Morocco. He joined the Kénitra Faculty of Science, Department of
Computer Science and Mathematics, Ibn Tofail University, Morocco, in 2018 as an
Associate Professor. He is also an associate member of the Research Laboratory of the
Faculty of Computer Science, Team Networks, and Telecommunications Sciences in
Kenitra, Morocco. He is also an associate member of the laboratory of the IPOSI National
Institute of Applied Sciences, Hassan 1 University, Khouribga, Morocco. He can be
contacted at mohamed.amnai@uit.ac.ma.
Ali Choukri is an assistant professor at the National Academy of Applied
Sciences. He received a master’s degree in computer science and telecommunications from
the University of Ibn Tofail, Kenitra, Morocco, in 2008. In 1992, he obtained a degree from
ENSET (Higher Normal School of Technical Teaching). He has a Ph.D. from the School of
Computer Science and Systems Analysis (ENSIAS). He works in the MIS team in the
SIME laboratory, researching mobile intelligent ad hoc communication systems and
wireless sensor networks. His research interests include ubiquitous computing, internet of
things, delay/fault-tolerant networks, wireless networks, QoS routing, mathematical
modeling and performance analysis of networks, control, and decision theory, game theory,
trust and reputation management, distributed algorithms, meta-heuristics, and optimization,
genetic algorithms. He can be contacted at choukriali@gmail.com.
Youssef Fakhri received a B.Sc. in Electrophysics in 2001 and an M.Sc. in Computing and Telecommunications (DESA) in 2003 from the Faculty of Science, Mohammed V University, Rabat, Morocco, where he carried out his M.Sc. project in collaboration with the ICI company in Morocco. He received a Ph.D. in 2007 from Mohammed V University, Agdal, Rabat, Morocco, in collaboration with the Polytechnic University
of Catalonia (UPC), Spain. He joined the Kénitra Faculty of Science, Department of
Computer Science and Mathematics, Ibn Tofail University, Morocco, in March 2009 as an
Associate Professor and as Laboratory Director of the Computer Science Research
Laboratory of the Kénitra Faculty and a member of the Pole of Competencies STIC
Morocco. He can be contacted at FAKHRI@uit.ac.ma.
Noreddine Gherabi is a professor of computer science with industrial and academic experience. He holds a doctorate degree in computer science. In 2013, he worked as a professor of computer science at Mohamed Ben Abdellah University and, since 2015, has worked as a research professor at Sultan Moulay Slimane University, Morocco. He is a member of the International Association of Engineers. Gherabi has made several contributions in information systems, namely big data, semantic web, pattern recognition, and intelligent systems. He has several publications in computer science (book chapters, international journals, and conferences/workshops), and edited books. He has convened and chaired more than 52 conferences and workshops. His latest books with Springer are: i)
Intelligent Systems in Big Data, Semantic Web and Machine Learning; ii) Advances in
Information, Communication and Cybersecurity; iii) Information Technology and
Communication Systems. He can be contacted at gherabi@gmail.com/n.gherabi@usms.ma.
More Related Content

PDF
Indexing based Genetic Programming Approach to Record Deduplication
PDF
Implementation of Matching Tree Technique for Online Record Linkage
DOC
Introduction abstract
PDF
Stacked Generalization of Random Forest and Decision Tree Techniques for Libr...
PDF
Big Data Analytics and E Learning in Higher Education. Tulasi.B & Suchithra.R
PDF
The web of data: how are we doing so far?
PDF
Sentimental classification analysis of polarity multi-view textual data using...
PDF
Simplifying Database Normalization within a Visual Interactive Simulation Model
Indexing based Genetic Programming Approach to Record Deduplication
Implementation of Matching Tree Technique for Online Record Linkage
Introduction abstract
Stacked Generalization of Random Forest and Decision Tree Techniques for Libr...
Big Data Analytics and E Learning in Higher Education. Tulasi.B & Suchithra.R
The web of data: how are we doing so far?
Sentimental classification analysis of polarity multi-view textual data using...
Simplifying Database Normalization within a Visual Interactive Simulation Model

Similar to Matching data detection for the integration system (20)

PDF
Study on potential capabilities of a nodb system
PDF
Data integration in a Hadoop-based data lake: A bioinformatics case
PDF
Data integration in a Hadoop-based data lake: A bioinformatics case
PDF
Data integration in a Hadoop-based data lake: A bioinformatics case
PDF
Data integration in a Hadoop-based data lake: A bioinformatics case
PDF
A Novel Integrated Framework to Ensure Better Data Quality in Big Data Analyt...
PDF
Enhancement techniques for data warehouse staging area
PDF
Evaluating the effectiveness of data quality framework in software engineering
PDF
Granularity analysis of classification and estimation for complex datasets wi...
PDF
Cross Domain Data Fusion
PDF
Novel holistic architecture for analytical operation on sensory data relayed...
PDF
Annotation Approach for Document with Recommendation
PDF
USING GOOGLE’S KEYWORD RELATION IN MULTIDOMAIN DOCUMENT CLASSIFICATION
PDF
Characterizing and Processing of Big Data Using Data Mining Techniques
PDF
Data modeling techniques used for big data in enterprise networks
PDF
Providing highly accurate service recommendation for semantic clustering over...
PDF
Actionable Pattern Discovery for Emotion Detection in BigData in Education an...
PDF
Big Data Emerging Technology: Insights into Innovative Environment for Online...
PDF
Elimination of data redundancy before persisting into dbms using svm classifi...
PDF
H04564550
Study on potential capabilities of a nodb system
Data integration in a Hadoop-based data lake: A bioinformatics case
Data integration in a Hadoop-based data lake: A bioinformatics case
Data integration in a Hadoop-based data lake: A bioinformatics case
Data integration in a Hadoop-based data lake: A bioinformatics case
A Novel Integrated Framework to Ensure Better Data Quality in Big Data Analyt...
Enhancement techniques for data warehouse staging area
Evaluating the effectiveness of data quality framework in software engineering
Granularity analysis of classification and estimation for complex datasets wi...
Cross Domain Data Fusion
Novel holistic architecture for analytical operation on sensory data relayed...
Annotation Approach for Document with Recommendation
USING GOOGLE’S KEYWORD RELATION IN MULTIDOMAIN DOCUMENT CLASSIFICATION
Characterizing and Processing of Big Data Using Data Mining Techniques
Data modeling techniques used for big data in enterprise networks
Providing highly accurate service recommendation for semantic clustering over...
Actionable Pattern Discovery for Emotion Detection in BigData in Education an...
Big Data Emerging Technology: Insights into Innovative Environment for Online...
Elimination of data redundancy before persisting into dbms using svm classifi...
H04564550
Ad

More from IJECEIAES (20)

PDF
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
PDF
Embedded machine learning-based road conditions and driving behavior monitoring
PDF
Advanced control scheme of doubly fed induction generator for wind turbine us...
PDF
Neural network optimizer of proportional-integral-differential controller par...
PDF
An improved modulation technique suitable for a three level flying capacitor ...
PDF
A review on features and methods of potential fishing zone
PDF
Electrical signal interference minimization using appropriate core material f...
PDF
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
PDF
Bibliometric analysis highlighting the role of women in addressing climate ch...
PDF
Voltage and frequency control of microgrid in presence of micro-turbine inter...
PDF
Enhancing battery system identification: nonlinear autoregressive modeling fo...
PDF
Smart grid deployment: from a bibliometric analysis to a survey
PDF
Use of analytical hierarchy process for selecting and prioritizing islanding ...
PDF
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
PDF
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
PDF
Adaptive synchronous sliding control for a robot manipulator based on neural ...
PDF
Remote field-programmable gate array laboratory for signal acquisition and de...
PDF
Detecting and resolving feature envy through automated machine learning and m...
PDF
Smart monitoring technique for solar cell systems using internet of things ba...
PDF
An efficient security framework for intrusion detection and prevention in int...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Embedded machine learning-based road conditions and driving behavior monitoring
Advanced control scheme of doubly fed induction generator for wind turbine us...
Neural network optimizer of proportional-integral-differential controller par...
An improved modulation technique suitable for a three level flying capacitor ...
A review on features and methods of potential fishing zone
Electrical signal interference minimization using appropriate core material f...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Bibliometric analysis highlighting the role of women in addressing climate ch...
Voltage and frequency control of microgrid in presence of micro-turbine inter...
Enhancing battery system identification: nonlinear autoregressive modeling fo...
Smart grid deployment: from a bibliometric analysis to a survey
Use of analytical hierarchy process for selecting and prioritizing islanding ...
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
Adaptive synchronous sliding control for a robot manipulator based on neural ...
Remote field-programmable gate array laboratory for signal acquisition and de...
Detecting and resolving feature envy through automated machine learning and m...
Smart monitoring technique for solar cell systems using internet of things ba...
An efficient security framework for intrusion detection and prevention in int...
Ad

Recently uploaded (20)

PPTX
Geodesy 1.pptx...............................................
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
Internet of Things (IOT) - A guide to understanding
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPT
Mechanical Engineering MATERIALS Selection
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPTX
Lecture Notes Electrical Wiring System Components
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPT
Project quality management in manufacturing
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PDF
PPT on Performance Review to get promotions
Geodesy 1.pptx...............................................
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
Internet of Things (IOT) - A guide to understanding
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
Foundation to blockchain - A guide to Blockchain Tech
Mechanical Engineering MATERIALS Selection
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Automation-in-Manufacturing-Chapter-Introduction.pdf
Lecture Notes Electrical Wiring System Components
Embodied AI: Ushering in the Next Era of Intelligent Systems
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Project quality management in manufacturing
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPT on Performance Review to get promotions

Matching data detection for the integration system

  • 1. International Journal of Electrical and Computer Engineering (IJECE) Vol. 13, No. 1, February 2023, pp. 1008~1014 ISSN: 2088-8708, DOI: 10.11591/ijece.v13i1.pp1008-1014  1008 Journal homepage: http://ijece.iaescore.com Matching data detection for the integration system Merieme El Abassi1 , Mohamed Amnai1 , Ali Choukri1 , Youssef Fakhri1 , Noreddine Gherabi2 1 Laboratory of Computer Sciences Research, Faculty of Sciences, Ibn Tofail University Kenitra, Kenitra, Morocco 2 National School of Applied Sciences of Sultan Moulay Slimane University, Khouribga, Morocco Article Info ABSTRACT Article history: Received Nov 9, 2021 Revised Sep 22, 2022 Accepted Oct 5, 2022 The purpose of data integration is to integrate the multiple sources of heterogeneous data available on the internet, such as text, image, and video. After this stage, the data becomes large. Therefore, it is necessary to analyze the data that can be used for the efficient execution of the query. However, we have problems with solving entities, so it is necessary to use different techniques to analyze and verify the data quality in order to obtain good data management. Then, when we have a single database, we call this mechanism deduplication. To solve the problems above, we propose in this article a method to calculate the similarity between the potential duplicate data. This solution is based on graphics technology to narrow the search field for similar features. Then, a composite mechanism is used to locate the most similar records in our database to improve the quality of the data to make good decisions from heterogeneous sources. Keywords: Data integration Data matching Data quality Entity resolution This is an open access article under the CC BY-SA license. Corresponding Author: Merieme El Abassi Laboratory of Computer Sciences Research, Faculty of Sciences, Ibn Tofail University Kenitra Kenitra, Morocco Email: merieme.elabassi@uit.ac.ma 1. INTRODUCTION Big data is like a small data but in a large amount of data with a higher complexity level, because it becomes very difficult to control it by any database management tool [1]. However, big data is characterized via a set of properties, including volume, veracity, variety, and velocity. Volume represents the size of data; it can be extended up to terabyte or more. Velocity denoted how fast the data came in. variety represents that data can be structured, semi-structured and unstructured format. Today, data has become the wealth of companies and management departments, contributing to its development. The decisions based on low-quality data can be very costly, hurting businesses, partners, and customers. Furthermore, the management departments and companies need to improve their relationships through data governance. Hence, having good data quality is very important for companies, especially when they interact with other organizations or make big decisions. The concentrate on the structure of the data to be cleaned or integrated, in order to make some metrics and ways to solve the issues of data quality. That is what the suggested methods depend on to solve these problems. Thus, to get helpful data, we need to analyze it within the range of its usage [2]. As we know, integration projects may require some support to improve the quality of data, because there are few companies who execute the procedures of data quality management in the database or data warehouse they have created. 
Currently, the problem of entity analysis is a field of research in the field of data quality [3]–[5]. Just as online mining for relationships and entities has established an extensive public knowledge base, companies, governments, and researchers can also use the true value of this data, which can only be used when multiple data sources are integrated. Entity resolution refers to the task of identifying records from the
  • 2. Int J Elec & Comp Eng ISSN: 2088-8708  Matching data detection for the integration system (Merieme El Abassi) 1009 same entity in one or more data sources [6]–[9]. When only one database is used, this strategy can be called deduplication [10]. Comparing records that might be matched in an entity collection is a secondary problem. It is impractical to use traditional methods of comparison when collecting big data. Therefore, you generally organize similar records and then compare only those records that appear in the same block to improve the efficiency of the entity resolution algorithm. However, for data across multiple data sources, there are usually different standards, so before analyzing the entity, pattern matching must be performed between the data sources. Traditional pattern matching is no longer effective for large amounts of highly heterogeneous and noisy data on the network. Several techniques are currently used to define the likelihood of matching features. These include sorted neighborhood blocking (SN) [11], duplicate count strategy (DCS) [12], and q-gram based indexing (G-gram). These are feature resolution techniques. Basically, they operate on all records in the dataset. Then we group each record into one or more blocks, and finally we create the pairs for comparison. Therefore, these methods take a long time to identify possible pairings. The probability of error is also high because the search area is larger for large amounts of data. Consequently, to minimize these problems, we need to look for a technique that reduces search time and achieves good results. The main focus of this study is on exploiting graph mechanisms to solve the problem of big data diversity. Here, a technique is discussed to identify matching data resulting from the integration of different types of sources such as databases, comma-separated values (CSV), spreadsheets, web services, and other formats. These duplicate data cause a lot of problems when we make decisions. For example, we might find many records representing the same record with minor differences, such as finding names in a different order, which makes us consider each record duplicate as a new one, or having some names or information errors, or using different abbreviations in each database sources. So, the goal is to convert the data into a graph where the search space for duplicate records can be reduced, the records that have shared data are identified, and then some algorithms are applied to obtain a match ratio between records that have some shared information. The contributions of this paper are five aspects. First, the data integration problem solving is propose based on the deduplication approach. In the second aspect, the problems of data integration are described. Third, the proposal is presented, while in the fourth, the principal focus is on the evaluation of that proposed method. In the last phase, conclusions and ideas for future extensions are discussed. 2. RELATED WORK A lot of studies have conducted many challenges of data integration which we need to overcome. Several propositions have been discussed by Yeganeh et al. [13], who discussed the problem of considering user preferences for data quality, within several settings and improving user satisfaction from query results. Improving query results helps the users’ decision-making process and should lead to higher user satisfaction, this makes the field of data quality study interdisciplinary. 
Different synergies are proposed to provide comprehensive data quality solutions [14]. Several researchers have worked on grouping the dimensions of data quality into conceptual views of data, data values, and data formats. Similar to the above work, Bovee et al. [15] and Jarke et al. [16] recommended to classify the data quality dimensions based on the user’s role in the data warehouse environment. Also, they propose different dimensions and sub-dimensions, based on the concept of data quality as applicability. The problem of describing the quality of data sources is at the core of data integration and exchange [17], [18]. Talburt [19] talked about entity resolution and information quality and discusses the Fellegi-Sunter theory of the relationship between records, the Stanford entity resolution framework and the algebraic model for entity resolution, which are the main theoretical models that support entity resolution. Also, the way of eliminating the redundant data and supporting the master data management programs is discussed through the concept of entity resolution and by using the Oyster open-source system [20]. In addressing the entity resolution problem, a method for distributing the workload between the different computing nodes is proposed [21]. The approach is based on the use of MapReduce with standard blocking. Benny et al. [22] proposed similar work for entity resolution on Hadoop, which works with semi-structured data. It also contains the preprocessing tasks and the results of the comparison, indexing, and classification. The use of different classification methods and their results is described to improve future comparisons. Papadakis et al. [23] proposed different techniques. The first technique aims to remove the superfluous comparisons from any redundancy-based blocking method, and the second solution is used for reducing the space requirements which are mapped to the Cartesian space. Yan et al. [24] developed a methodology called multi-singer that supports structured and unstructured data types, as well as tasks considering data preprocessing and comparing reduction. The objective of the work [1] is to apply deduplication techniques in different merged data sources, in order to format comparison or finding duplicate elements of the same type. Various techniques for preprocessing and testing large amounts of data are proposed [25], this system allows matching entities assigned to the same
  • 3.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 13, No. 1, February 2023: 1008-1014 1010 block. It uses dynamic blocking to achieve high performance, reducing the search space and covering the same entities in blocking steps. 3. PROBLEM FORMULATION Data inconsistency can be caused by heterogeneous data sources, which means more tools and techniques to optimize unstructured data are needed. In addition, structured data allows us to run query processes to filter, analyze, and use this data to make business decisions and build organizational capabilities [26]. Organizations face a different challenge when they need to expand to accommodate a wide range of data and create new domains. That is why a solution must be found. This involves creating high performance computing environments with advanced data storage systems. It also reduces latency while improving reliability and access to data quickly. Big data entities pose many challenges when faced with the accumulation of large data sets from different sources. Then ways of running common to the two fields to combine and execute the queries and algorithms must be found. The wave of big data will soon spread to all areas of life, which will not only provide humans with unprecedented opportunities, but also bring major challenges. Therefore, it is very necessary to effectively solve diversity and heterogeneous data issues in big data integration [27]. Thus, the quality of information has become the latest technical requirement for users. To evaluate and enhance data quality, it is essential to implement continuous enhancement strategies. By combining data, duplicate data can be identified and then actions can be taken, like merging two similar or identical records into one. This also helps identify equally important non-duplicates, as you should know that two similar things are not the same. Overall, deduplication is a process used for moving the duplicate data in one or more databases containing information of poor quality, it is a very important technique for having good data quality. This also can be a difficult question, because the same “entity” can be referred to by different names, and these names can also contain typos, so matching ordinary strings is not enough. For example, if you have a dataset consisting of many records, and you are trying to find records that represent the same entity, this problem is difficult to solve in a simple way. Although each entity uses a different lexical representation, direct string matching will not find duplicate records, as we can see in this example: records (1 MOBILE LIMITED, 30 CITY RD) and (1 MOBILE Ltd, 30 CITY ROAD) represent the same record entity but use a different dictionary. Therefore, in this paper, a processing technique using Sparks Rich API is proposed. 4. PROPOSED METHOD The focus of this article is on exploiting graphs to solve the problem of big data variety. In particular, an approach is described to represent the data from multiple sources as a graph (entity, edge) showing in Figure 1. Consider the database R={r1, r2, r3... rn}. These registers are composed of registers with attributes A={R.A1, R.A2, R.Am}. Also consider the G (ri) function which maps the ri∈R record to the graph and the S (ri, rj) function used to measure similarity. Let us say that if S (ra, rb) is close to 1, record a is similar to record b. Typically, G (ri) mapping is a graphical method that can convert related data into graphs. In simple terms, an appropriate chart represents data with vertices and edges. 
Here, the relational data in a chart is rearranged so that the entities share common nodes. Figure 1. Schema of a graph of the dataset
  • 4. Int J Elec & Comp Eng ISSN: 2088-8708  Matching data detection for the integration system (Merieme El Abassi) 1011 Like most technologies, there are various alternative approaches to forming the essential components of a graph database. One of these methods is the property graph model, in which data is divided into nodes, relations, and attributes (data stored in the nodes or relations). Thus, nodes present instances or entities of graphs, and can contain any information called attributes. In addition, an edge represents the link between two entities, and it always needs a direction, a start and an end node, and a type. The architecture of the technology is shown in Figure 2, which is roughly divided into four steps. a) Select the file for deduplication: The first phase summarizes the download the files sources for the deduplication technique. b) Create graph: In the entity graph creation stage, the first step is blocking, creating two types of entities with a set of records as input, the first type belongs to the main data, and the second type belongs to other data that are matched. c) Detect the potential duplicate’s entity: The next step, perhaps the most important and also the most expensive, is to find a possible matching record. The goal is to find a record that looks like a record without having to use the same record in every field. The connection conditions are very specific: we choose to use GraphFrameMotives; we will remove the search space and find any duplicates. The range is wide enough to vary the amount of capture, but the selectivity is sufficient to avoid the use of cartesian products (the use of cartesian products should be avoided at all costs). Through the query of motif finding, possible pairs of duplicates are searched. d) Compute the similarity between potential duplicates: It involves detecting the similarity between potential duplicates after computing the vector representation of the graph entities. In this approach, the term frequency-inverse document frequency (TFIDF) is chosen to be used. The weight is a statistical measure that evaluates the importance of a word to a document in a collection or corpus. Then the output variable is used to calculate the similarity between two documents represented as a vector by finding the cosine of the angle between them. Figure 2. General architecture 5. EXPERIMENTS AND RESULTS The method proposed in the previous section is mostly used to compute the similarity for finding the same record of duplicate elements. In this section, we evaluate it on sample input data and show the result. Table 1 indicates a partial view data set used; it keeps the information about the addresses of some companies in a file CSV. As shown in Table 2, there is the detail of the three datasets contain the same address but it in a different format in which this type of data duplicate is not solved by the basic technique (as sorted neighborhood blocking, duplicate count strategy, or Q-Gram based indexing). All datasets contain the same address, as shown the company name is represented in a different format in each tuple of the same company. In the other colons, we have some information missing or in a different format. Table 1. 
Input of sample data ID Company_Name Address_Line1 Address_Line2 Post_Town County PostCode 1 1 MOBILE LIMITED 30 CITY ROAD null LONDON null EC1Y 2AB 2 1 TECH LTD 57 CHARTERHOUSE STREET null LONDON null EC1M 6HA 3 23 SNAPS LIMITED 16 BOWLING GREEN LANE null LONDON null EC1R 0BD 4 2E2 SERVICES LIMITED 200 200 ALDERSGATE ALDERSGATE STREET LONDON null EC1A 4HD 5 2E2 UK LIMITED 200 ALDERSGATE ALDERSGATE STREET LONDON null EC1A 4HD 6 40 50 MEDIA LTD 145-157 ST JOHN STREET null LONDON null EC1V 4PW 7 4D DATA CENTRES LIMITED 30 CITY ROAD LONDON Null null EC1 2AB 8 4GETMOBILE LIMITED 152 KEMP HOUSE CITY ROAD LONDON null EC1V 2NX
Table 2. An example of duplicate data
ID   | Company_name          | Address_line1   | Address_line2 | Post_town | County | Postcode
1001 | 1 MOBILE LIMITED      | 30 CITY RD.     | null          | null      | null   | null
1002 | 1 MOBILE Ltd.         | 30 CITY RD      | null          | null      | null   | EC1Y 2AB
1003 | 1 MOBILE              | null            | CITY ROAD     | London    | null   | null
1004 | GLOBAL TECHNOLOGY     | null            | null          | LONDON    | null   | null
1005 | GLOBAL TECHNOLOGY Ltd | 160 CITY RD     | LONDON        | null      | null   | EC1V 2NX
1006 | 1 TECH                | 57 CHARTERHOUSE | null          | LONDON    | null   | null

As shown in Figure 3, the data set is divided into two sets. The first, called the master data, contains the correct data and is assigned a unique identifier named aid. The second, called the transaction data, holds the records that may refer to the same companies and is identified by bid. In the example above, a deterministic similarity is computed from feature vectors obtained with the term frequency-inverse document frequency (TF-IDF) method (afeatures and bfeatures shown in Figure 4); a minimal code sketch of this computation is given after Figure 4. The result shown in Figure 3 is then used to draw the chart in Figure 4 of the similarity between aid and bid, which detects the most similar tuples in our input dataset. Figure 4 presents a comparison between the transaction data and the basic (master) data when the similarity is greater than or equal to 0.4, which allows us to merge the seven most compatible records. As shown in Figure 4, each color identifies a record in the master data and the x-axis identifies a record in the transaction data. The advantages of this approach are its ease of implementation and its reduced computational complexity: since the data is structured as a graph, duplicate detection can be applied to data coming from any database.

Figure 3. A sample comparison of similar records
Figure 4. Comparison between the transaction data and the basic data
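As a complement to Figure 4, the following is a minimal sketch, again in Scala with Spark ML, of how the TF-IDF vectors (afeatures, bfeatures) and the cosine similarity could be computed for the candidate pairs produced by the motif query, keeping only the pairs whose similarity is at least 0.4. The candidatePairs DataFrame and its column names (aid, bid, aname, bname) are assumptions carried over from the previous sketch; the hashing dimension and the choice of vectorizing only the company-name column are also illustrative.

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, IDFModel, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Cosine of the angle between two Spark ML vectors
val cosineSim = udf { (a: Vector, b: Vector) =>
  val xs = a.toArray
  val ys = b.toArray
  val dot   = xs.zip(ys).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(xs.map(x => x * x).sum)
  val normB = math.sqrt(ys.map(y => y * y).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Fit one TF-IDF model on the names from both sides so the vectors share the same space
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("tokens")
val hashingTF = new HashingTF().setInputCol("tokens").setOutputCol("tf").setNumFeatures(1 << 12)
val corpus = candidatePairs.select(col("aname").as("text"))
  .union(candidatePairs.select(col("bname").as("text")))
  .na.fill("", Seq("text"))
val idfModel: IDFModel = new IDF().setInputCol("tf").setOutputCol("features")
  .fit(hashingTF.transform(tokenizer.transform(corpus)))

// Map one text column of a DataFrame to a TF-IDF vector column
def vectorize(df: DataFrame, inputCol: String, outputCol: String): DataFrame =
  idfModel.transform(hashingTF.transform(tokenizer.transform(
      df.na.fill("", Seq(inputCol)).withColumnRenamed(inputCol, "text"))))
    .withColumnRenamed("features", outputCol)
    .withColumnRenamed("text", inputCol)
    .drop("tokens", "tf")

// afeatures/bfeatures as in Figure 4, then keep the pairs with similarity >= 0.4
val matches = vectorize(vectorize(candidatePairs, "aname", "afeatures"), "bname", "bfeatures")
  .withColumn("similarity", cosineSim(col("afeatures"), col("bfeatures")))
  .filter(col("similarity") >= 0.4)
  .select(col("aid"), col("bid"), col("similarity"))

matches.orderBy(desc("similarity")).show(false)
```

In a fuller implementation the vectorized text would likely combine several fields (company name, address lines, town, postcode) rather than the name alone; the sketch keeps a single column only for brevity.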
6. CONCLUSION AND FURTHER RESEARCH
In this paper, a method for solving the deduplication problem is presented using the Apache Spark framework and the Scala language. Existing data integration is analyzed: when data in different formats is combined into a single format, there is a risk of duplication, as discussed above. The proposed method computes a similarity score to detect potentially duplicate data. The project can be extended with further improvements that reduce the number of comparisons between records and the computation time, for example by using parallel computation. In future research, a big data integration method based on Karma modeling will be explored.

ACKNOWLEDGEMENTS
The authors thank the anonymous reviewers for offering many suggestions to improve the quality of this paper. The authors acknowledge financial support from the National Scientific Research and Technology Center (CNRST) as part of the research grant program.

REFERENCES
[1] S. Garg and A. Bala, “Semantic analysis of big data by applying de-duplication techniques,” in 2016 International Conference on Inventive Computation Technologies (ICICT), Aug. 2016, doi: 10.1109/inventive.2016.7830171.
[2] A. Ben Salem, “Contextual data quality: detection and cleaning driven by data semantics,” (in French), PhD Thesis, Université Sorbonne Paris Cité, 2015.
[3] I. N. Chengalur-Smith, D. P. Ballou, and H. L. Pazer, “The impact of data quality information on decision making: an exploratory analysis,” IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 6, pp. 853–864, 1999, doi: 10.1109/69.824597.
[4] R. Y. Wang, H. B. Kon, and S. E. Madnick, “Data quality requirements analysis and modeling,” in Proceedings of IEEE 9th International Conference on Data Engineering, 1993, pp. 670–677, doi: 10.1109/ICDE.1993.344012.
[5] T. Hongxun et al., “Data quality assessment for on-line monitoring and measuring system of power quality based on big data and data provenance theory,” in 2018 IEEE 3rd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), Apr. 2018, pp. 248–252, doi: 10.1109/ICCCBDA.2018.8386521.
[6] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, “Duplicate record detection: a survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 1–16, Jan. 2007, doi: 10.1109/TKDE.2007.250581.
[7] H. Köpcke and E. Rahm, “Frameworks for entity matching: a comparison,” Data and Knowledge Engineering, vol. 69, no. 2, pp. 197–210, Feb. 2010, doi: 10.1016/j.datak.2009.10.003.
[8] P. Christen, “A survey of indexing techniques for scalable record linkage and deduplication,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 9, pp. 1537–1555, Sep. 2012, doi: 10.1109/TKDE.2011.127.
[9] X. L. Dong and D. Srivastava, “Big data integration,” in 2013 IEEE 29th International Conference on Data Engineering (ICDE), Apr. 2013, pp. 1245–1248, doi: 10.1109/ICDE.2013.6544914.
[10] USENIX Association, Proceedings of FAST ’11: 9th USENIX Conference on File and Storage Technologies, 2011. Accessed: Nov. 08, 2021. [Online]. Available: https://www.usenix.org/legacy/events/fast11/tech/full_papers/fast11_proceedings.pdf
[11] L. Kolb, A. Thor, and E. Rahm, “Parallel sorted neighborhood blocking with MapReduce,” arXiv:1010.3053, Oct. 2010.
[12] U. Draisbach, F. Naumann, S. Szott, and O. Wonneberg, “Adaptive windows for duplicate detection,” in 2012 IEEE 28th International Conference on Data Engineering, Apr. 2012, pp. 1073–1083, doi: 10.1109/ICDE.2012.20.
[13] N. K. Yeganeh, S. Sadiq, and M. A. Sharaf, “A framework for data quality aware query systems,” Information Systems, vol. 46, pp. 24–44, Dec. 2014, doi: 10.1016/j.is.2014.05.005.
[14] S. Sadiq, “Prologue: research and practice in data quality management,” in Handbook of Data Quality, Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 1–11.
[15] M. Bovee, R. P. Srivastava, and B. Mak, “A conceptual framework and belief-function approach to assessing overall information quality,” International Journal of Intelligent Systems, vol. 18, no. 1, pp. 51–74, Jan. 2003, doi: 10.1002/int.10074.
[16] M. Jarke, M. Lenzerini, Y. Vassiliou, and P. Vassiliadis, “Multidimensional data models and aggregation,” in Fundamentals of Data Warehouses, Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 87–105.
[17] A. Doan, A. Halevy, and Z. Ives, “Describing data sources,” in Principles of Data Integration, Elsevier, 2012, pp. 65–94.
[18] A. Doan, A. Halevy, and Z. Ives, “Incorporating uncertainty into data integration,” in Principles of Data Integration, Elsevier, 2012, pp. 345–357.
[19] J. Talburt, Entity Resolution and Information Quality. Elsevier, 2010.
[20] J. R. Talburt and Y. Zhou, “A practical guide to entity resolution with OYSTER,” in Handbook of Data Quality, Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 235–270.
[21] D. Karapiperis and V. S. Verykios, “Load-balancing the distance computations in record linkage,” ACM SIGKDD Explorations Newsletter, vol. 17, no. 1, pp. 1–7, Sep. 2015, doi: 10.1145/2830544.2830546.
[22] S. P. Benny, S. Vasavi, and P. Anupriya, “Hadoop framework for entity resolution within high velocity streams,” Procedia Computer Science, vol. 85, pp. 550–557, 2016, doi: 10.1016/j.procs.2016.05.218.
[23] G. Papadakis, E. Ioannou, C. Niederée, T. Palpanas, and W. Nejdl, “Eliminating the redundancy in blocking-based entity resolution methods,” in Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL ’11), 2011, doi: 10.1145/1998076.1998093.
[24] C. Yan, Y. Song, J. Wang, and W. Guo, “Eliminating the redundancy in MapReduce-based entity resolution,” in 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2015, pp. 1233–1236, doi: 10.1109/CCGrid.2015.24.
[25] A. C. Mon and M. M. S. Thwin, “Efficient dynamic blocking scheme for entity resolution systems,” in International Conference on Advances in Engineering and Technology (ICAET’2014), 2014, pp. 167–171, doi: 10.15242/IIE.E0314076.
[26] A. Kadadi, R. Agrawal, C. Nyamful, and R. Atiq, “Challenges of data integration and interoperability in big data,” in 2014 IEEE International Conference on Big Data (Big Data), Oct. 2014, pp. 38–40, doi: 10.1109/BigData.2014.7004486.
[27] W. Xiao, L. Guoqi, and L. Bin, “Research on big data integration based on Karma modeling,” in 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), 2017, pp. 245–248, doi: 10.1109/ICSESS.2017.8342906.
BIOGRAPHIES OF AUTHORS

Merieme El Abassi received her fundamental bachelor’s degree in Mathematics and Computer Science in 2017 from Ibn Tofail University, Kenitra, Morocco. She then obtained her master’s degree in Big Data and Cloud Computing in 2020 from Ibn Tofail University, Kenitra, Morocco. She is currently a Ph.D. researcher working on data integration and carries out her research in the Computer Science Research Laboratory, Kenitra, Morocco. She can be contacted at merieme.elabassi@uit.ac.ma.

Mohamed Amnai received his IEEA (Computer Science, Electronics, Electrical Engineering, and Automation) degree in 2000 from Moulay Ismail University, Errachidia, Morocco. In 2007, he received a master’s degree from Ibn Tofail University in Kenitra, and in 2011 he received his Ph.D. in Telecommunications and Computer Science from Ibn Tofail University in Kenitra, Morocco. Since March 2014, he has been an assistant professor at the Khouribga National School of Applied Sciences, University of Settat, Morocco. He joined the Faculty of Sciences of Kenitra, Department of Computer Science and Mathematics, Ibn Tofail University, Morocco, in 2018 as an associate professor. He is also an associate member of the research laboratory of the Faculty of Sciences in Kenitra, Morocco (Networks and Telecommunications Sciences team), and an associate member of the IPOSI laboratory of the National Institute of Applied Sciences, Hassan 1st University, Khouribga, Morocco. He can be contacted at mohamed.amnai@uit.ac.ma.

Ali Choukri is an assistant professor at the National School of Applied Sciences. He received a master’s degree in computer science and telecommunications from Ibn Tofail University, Kenitra, Morocco, in 2008, and in 1992 he obtained a degree from ENSET (Higher Normal School of Technical Teaching). He holds a Ph.D. from the School of Computer Science and Systems Analysis (ENSIAS). He works in the MIS team of the SIME laboratory, researching mobile intelligent ad hoc communication systems and wireless sensor networks. His research interests include ubiquitous computing, internet of things, delay/fault-tolerant networks, wireless networks, QoS routing, mathematical modeling and performance analysis of networks, control and decision theory, game theory, trust and reputation management, distributed algorithms, meta-heuristics and optimization, and genetic algorithms. He can be contacted at choukriali@gmail.com.

Youssef Fakhri received a B.Sc. in Electrophysics in 2001 and an M.Sc. in Computing and Telecommunications (DESA) in 2003 from the Faculty of Sciences, Mohammed V University, Rabat, Morocco, where he carried out his M.Sc. project with the ICI company in Morocco. He received his Ph.D. in 2007 from Mohammed V University-Agdal, Rabat, Morocco, in collaboration with the Polytechnic University of Catalonia (UPC), Spain. He joined the Faculty of Sciences of Kenitra, Department of Computer Science and Mathematics, Ibn Tofail University, Morocco, in March 2009 as an associate professor; he is the director of the Computer Science Research Laboratory of the Kenitra faculty and a member of the STIC competence pole of Morocco. He can be contacted at FAKHRI@uit.ac.ma.

Noreddine Gherabi is a professor of computer science with industrial and academic experience. He holds a doctorate degree in computer science.
In 2013, he worked as a professor of computer science at Mohamed Ben Abdellah University, and since 2015 he has worked as a research professor at Sultan Moulay Slimane University, Morocco. He is a member of the International Association of Engineers. He has made several contributions in information systems, namely in big data, the semantic web, pattern recognition, and intelligent systems. He has several publications in computer science (book chapters, international journals, and conference/workshop papers) as well as edited books, and he has convened and chaired more than 52 conferences and workshops. His latest books with Springer are: i) Intelligent Systems in Big Data, Semantic Web and Machine Learning; ii) Advances in Information, Communication and Cybersecurity; and iii) Information Technology and Communication Systems. He can be contacted at gherabi@gmail.com or n.gherabi@usms.ma.