TELKOMNIKA, Vol. 16, No. 2, April 2018, pp. 834 ∼ 842
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/telkomnika.v16.i2.7669
Data Cleaning Service for Data Warehouse: An
Experimental Comparative Study on Local Data
Arif Bramantoro
Faculty of Computing and Information Technology in Rabigh
King Abdulaziz University, Jeddah, Saudi Arabia
e-mail: asoegihad@kau.edu.sa
Abstract
A data warehouse is a collective entity of data drawn from various data sources. In a data warehouse, data are prone to several complications and irregularities. Data cleaning is a non-trivial service for ensuring data quality: it involves identifying errors, removing them and improving the quality of the data. One of the common methods is duplicate elimination. This research focuses on duplicate elimination services applied to local data. It first surveys data quality, focusing on quality problems, cleaning methodology, the stages involved and the services within the data warehouse environment. It then provides a comparison through experiments on local data covering different cases, such as different spelling and pronunciation, misspellings, name abbreviations, honorific prefixes, common nicknames, split names and exact matches. All services are evaluated against the proposed quality of service metrics, such as performance, capability to process a number of records, platform support, data heterogeneity, and price, so that these services can reliably handle big data in a data warehouse in the future.
Keyword: Data Cleaning Service, Data Warehouse, Data Quality, Local Data
Copyright © 2018 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
A data warehouse is a relational database intended for querying and analysis through further processing. Its content is obtained from several transactions from other sources. An integrated data warehouse is an integration of files, sources and other records. Several services, such as data cleaning and data integration services, are used within the enterprise to ensure good data. A subject-oriented data warehouse is a subject-centric model involving several subjects, such as vendor, product, sales and customer [1].
A good data warehouse must focus on proper analysis and cleaning of data rather than on daily service transactions and operations. This kind of model is required by most enterprises. The model must be simple and related to the data cleaning objective. It should also exclude data that are not required for transaction and decision-making operations. The nonvolatile nature of the data is important for these operations, and the data should be physically separated from the application. This separation helps with recovery and other time-consuming mechanisms, such as loading and retrieval of data [2]. Time variance refers to the period of time involved in storing data in the data warehouse. Because the decisions taken during this period are very important, the generated trend reports are significant elements in data warehousing, and a proper decision support system works successfully with such trend reports. Several commercial applications, such as customer relationship and business applications, utilize data cleaning services.
A proper scheme is involved during the development of a data warehouse. The questions and analysis are completed in the design stage. Meaningful access to relevant data is required together with the generated values. The extraction of the source is important; therefore, it must be very clean, without any unrelated sources. Data cleaning is the service provided for this purpose in any large enterprise. The input data are rechecked and tested before they are allocated to a specific data warehouse. The loaded data are separated from the technical specification and process [3].
Received October 25, 2017; Revised February 13, 2018; Accepted March 14, 2018
Several automatic executions are also performed to eliminate errors in the data. Incomplete data hinder data processing and make corrections more complicated. A service called back flushing is used to recheck the data cleaning frequently. Data from other sources are first installed into the model, and a monitoring service is used for the recovery of data at different levels, from huge to small quantities. The amount of load is also related to the processing in the warehouse; hence, cautious steps are needed to process the data smoothly.
ETL is the process of Extract, Transform, and Load of data: the extraction of relevant data is followed by its transformation and, consequently, its loading into the warehouse. Extract is the step in which data are pulled into the data warehouse from the allocated resources. The consolidation of these resources also takes place in a separate system that is allocated for each level of processing. This extraction step takes the data to the next stage, called transformation [4].
Transform is the mechanism by which the extracted data are converted from their previous form and placed in the data warehouse without introducing errors. The data source needs proper manipulation by all methods; a set of functions is followed to move data into the warehouse without modifying the existing data. The technical and other requirements are validated. Several transformations are involved in the process, such as selecting only particular information and assigning it to specific functions. Coding the data with values is a concrete data cleaning service that occurs automatically. Another form of data cleaning service is to encode the result into a new value and to combine data values from two different methods. The form of the data can be simple or complex, and the data path may fail or succeed; both outcomes must be handled by a specific program. For example, the model can be a translated code in the extracted data.
Load is the data handling process that is important together with the targeted range of information. Some data may be overwritten with other updated or non-updated data. The selection of the design is also important, together with a proper understanding of the available choices related to time and business requirements. A complex system model allows several changes to be updated and uploaded. The overall quality of data in the data warehouse environment is validated by utilizing the ETL mechanism [5].
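To make the ETL flow described above concrete, the following minimal sketch illustrates an extract-transform-load pipeline in Python. The file name, field names and the use of SQLite as a staging store are illustrative assumptions only; the paper does not prescribe a specific implementation.

# Minimal ETL sketch (illustrative assumptions): extract customer rows from a
# CSV source, apply simple cleaning transformations, and load them into a
# SQLite table acting as the warehouse staging area.
import csv
import sqlite3

def extract(csv_path):
    # Extract: read raw records from the source file.
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: trim whitespace, normalize case, drop incomplete records.
    cleaned = []
    for row in rows:
        name = row.get("name", "").strip().title()
        city = row.get("city", "").strip().title()
        if name:  # discard records with a missing key field
            cleaned.append((name, city))
    return cleaned

def load(records, db_path="warehouse.db"):
    # Load: write the transformed records into the warehouse table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS customer (name TEXT, city TEXT)")
    con.executemany("INSERT INTO customer (name, city) VALUES (?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("customers.csv")))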
The objective of this paper is to identify the causes of data quality problems. A particular methodology and experiments are adopted to address the problem, which is expected to contribute to better data quality in the data warehouse. Data cleaning and duplicate elimination services are appropriate methods to improve data quality. Experiments are conducted to provide results and comparisons between de-duplication services. The research approach presented here is novel due to its level of implementation, utilizing a service-oriented approach as evident support for the conducted survey.
In this research, only the data cleansing service is considered as the main task. The remaining data quality services, such as completeness and historical reputation services, are left as future work for composing more services in a service-oriented system.
2. Data Quality in Data Warehouse
Data warehousing is a promising industry for several government organizations and private institutions, as it involves the storage of confidential data with regard to internal security. With the enormous amount of data, the responsibility of the organization becomes critical when it comes to security concerns [6]. The assurance of data quality is a primary objective at all management levels, and there is an increased potential for data quality irregularities. A data warehouse is adopted by the organization to improve the relationships between customers, clients and management. Thus, improving efficiency for the entire organization is required.
Data quality is defined as the measured performance or the loss of data in an organization [2]. The purpose of data quality measurement is to identify data missing from the system. The quality of data attained for the data warehouse model assures the inputs on the client side. However, one user is different from another, and the data must be simple, consistent and easy to understand. The abundance of data increases the burden on the system side.
The quality of data is critical, together with the identification of irregularities. The key data quality dimensions and their metrics are important for understanding effective quality improvement.
Data quality matters because of the way the data warehouse system is used. Data quality is measured in each phase of operations, and metrics are selected to ensure the measurement and analysis of data quality. The selection of metrics is critical to the final result, which directly affects the customer relationship. Quantifying data is important to save cost and improve market standards in a competitive economy [7]. In this paper, data quality and quality of service metrics are combined to improve confidence in the data quality process.
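As an illustration of how such data quality metrics can be quantified, the short sketch below computes two simple indicators, completeness and duplicate rate, over a list of records. The field names, sample records and the choice of these two indicators are assumptions made for illustration; they are not taken from the paper.

# Illustrative data quality metrics (assumed field names and sample data):
# completeness = share of non-empty values, duplicate rate = share of repeated keys.
def completeness(records, fields):
    total = len(records) * len(fields)
    filled = sum(1 for r in records for f in fields if r.get(f, "").strip())
    return filled / total if total else 1.0

def duplicate_rate(records, key="name"):
    keys = [r.get(key, "").strip().lower() for r in records]
    return 1 - len(set(keys)) / len(keys) if keys else 0.0

data = [
    {"name": "Mohammed Ali", "city": "Jeddah"},
    {"name": "Mohammed Ali", "city": ""},
    {"name": "Sara Ahmed", "city": "Riyadh"},
]
print(completeness(data, ["name", "city"]))  # 0.83: one empty city value
print(duplicate_rate(data))                  # 0.33: one of three names repeated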
With increasing technology and enormous data inputs in industry, authorities need to improve the quality of data in the enterprise. Enterprises face several problems in maintaining and sustaining their quality of service when delivering a project. Data quality types are classified into intrinsic, contextual, representative and accessible. Performance is a factor of the quality standard in any enterprise trademark. The addition of data should be static rather than dynamic in order to efficiently avoid irregularities while monitoring the quality standard improvement process. The consumer must be carefully considered when data are sent by the client to the enterprise [8]. A common data quality framework includes a loop of activities weighing cost and benefit, as illustrated in Figure 1.
Figure 1. Data Quality Process Loop
3. Classification of Data Quality
The modern data quality improvement approach requires a real-time scenario, with a preference to avoid operational and analytical models [9]. The correctness and assurance of data quality are measured during and after the improvement. Data quality issues need to be handled when designing services for a data warehouse free of quality problems. The identification of problems caused by poor data is examined to derive a proper procedure. Inaccurate information supplied by the customer is another cause of a decline in quality. Unlike the conventional approach, the modern approach must also consider several proximity and time-variant issues. The source of data in a modern data warehouse is related to data quality improvement in data warehouses, and fields are often filled with values in unstructured forms. These issues represent improvements and advancements of modern data warehouse research compared to the conventional methods of Inmon [10] and Kimball [11].
According to the Data Warehouse Institute [12], data quality improvement includes the correction of defective data to ensure that a minimum level of the data quality standard is achieved. It is also mentioned that data are required to be flawless, without any irregularities, and have to meet the standard requirements of the compatible application. The quality of data required by a user is different from that required by the organization. Strict rules are used to avoid improper data processing. Validation is made at a particular level where data are protected with PIN numbers or passwords. Frequent data errors are considered a common phenomenon; however, the model developed for data quality in the data warehouse is regularly adaptive to all changes.
Hence, high-quality data can be used in operations, the decision-making process and modeling. In addition, the quality information indicates which data model is needed by the data warehouse.
The probability of errors that can lead to a decline in data quality must be accounted for in records and protocol distribution in a network. The calculation of technical information and the requirement protocol proposed by the enterprise have to be fulfilled to achieve data quality. An assured mechanism for the development of these protocols can benefit the enterprise by providing data quality management on a large scale. The goal is to meet market standards rather than to adopt low-cost protocols that may lead to a failure of the suggested model. Many organizations and government agencies manage huge database collections [13]. The importance of data quality becomes a big concern for achieving the results and experiments required by a client. If this is not taken seriously, several complications may arise due to a failure in data quality, which affects the customer relationship model at any level of the enterprise's processes.
Effective risk management is needed for a system to learn from its deficiencies. The designed protocols must be able to cope with the risk and deliver results of the required standard. Policy makers must decide on the risk strategies needed to reach the desired data quality standards. Further management of the risk mitigation protocols for data quality improvement and the formulation of the desired policies play a major role based on data quality requirements. The agent for the risk mitigation approach is assigned after several levels of testing, since it is going to play a major role at the enterprise working level. The decisions are derived from the risk mitigation policy for data quality approaches. The management of any enterprise should pay attention to the Lloyd's approach [14] to the data quality model and risk mitigation standards.
4. Duplicate Elimination Test Bed
Duplicate elimination is one of the important concrete services in data cleaning service composition. The main objective of the data cleaning service is to maintain data quality. Duplicate elimination is a service-oriented method to remove duplicated data, which may be entered by the user more than once. The general idea is a matching process that enables duplicated data to be identified.
One important aspect during the search for duplicates of the same records is the ambiguity of the data. Several experiments were conducted to validate duplicate elimination, and several services are used during the matching process in those experiments. However, only a few of them give the desired results.
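The matching idea can be sketched as an approximate string comparison between record pairs, where any pair whose similarity exceeds a threshold is flagged as a duplicate candidate. The 0.85 threshold and the use of Python's difflib are illustrative assumptions; the commercial services compared in this paper use their own, undisclosed matching engines.

# Duplicate-candidate detection sketch using fuzzy string similarity.
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    # Similarity ratio in [0, 1] between two normalized name strings.
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def find_duplicate_candidates(names, threshold=0.85):
    # Return pairs of record indices whose names look like duplicates.
    return [(i, j) for (i, a), (j, b) in combinations(enumerate(names), 2)
            if similarity(a, b) >= threshold]

names = ["Mohammed Alghamdi", "Mohammad Alghamdi", "Sara Ahmed"]
print(find_duplicate_candidates(names))  # [(0, 1)]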
The duplicate information is displayed and recorded in the form of a table, together with an indication of its percentage. Effective services are generally chosen to obtain successful improvements in data quality standards. The aim of this research is to compare the duplicate elimination services and find out which ones perform better. The comparison is generally based on two parameters. The first parameter is the time needed to detect the errors in the data that alter the system and environment; additional time is required to improve the quality of data in the processing system of any functionality. The second parameter is the memory consumption, which determines the effectiveness of the data quality process.
The services required for the experiments are available from the following service providers: WinPure Clean and Match (referred to as WinPure), DoubleTake3 Dedupe & Merge (referred to as DoubleTake), WizSame (referred to by the same name), and Dedupe Express (referred to as DQGlobal).
Before the comparison between services is conducted, the experimental test bed needs to be developed on real local data from Saudi Arabia. For the first experiment, eight duplicates were manually selected from the data warehouse data set and further examined by the duplicate detection services, as shown in Figure 2. It is important to note that the privacy of sensitive data is preserved. Due to page limitations, only the duplicate data detected by the DoubleTake service are presented in this paper, as illustrated in Figure 3, which shows seven duplicate records. In this figure, the DoubleTake service provides some information that might differ from other services, such as the number of suppressed records and the rate of records per hour. Hence, this research standardizes the service output as the number of duplicate records to ease the comparison. The quality of service is included in the performance analysis as well.
Figure 2. First Experiment Dataset
Figure 3. Duplicate Data Detected By DoubleTake
Due to page limitations, only one experimental test bed is presented in this paper. A summary of all five experiments is presented in Table 1.
Table 1. Summary of Five Experiments
WinPure DoubleTake WizSame DQGlobal
Experiment 1 50% 88% 75% 88%
Experiment 2 25% 75% 67% 33%
Experiment 3 50% 90% 90% 80%
Experiment 4 88% 50% 75% 63%
Experiment 5 17% 100% 92% 83%
5. Comparisons Between Services in Duplicate Detection
This comparison uses a finer granularity based on the previous experimental test bed. Each service processes the same set of records so that the detection capabilities of all services can be fairly assessed. All the records for the comparison are grouped by predefined duplication types. There are seven duplication types, as follows (an illustrative normalization sketch covering several of these cases is given after the list):
1. Different spelling and pronunciation comparison.
The duplicated records examined in this comparison and the results of running the four services are illustrated in Figure 4. Due to the existence of different languages in Saudi Arabia, names transliterated inconsistently from another language are not uncommon. It is interesting to note that the service provided by WinPure is unable to detect any records with different spelling and pronunciation.
Figure 4. Different Spelling and Pronunciation Duplicated Records and Examination Results
2. Comparison based on misspellings.
The duplicated records examined in this comparison and the results of running the four services are illustrated in Figure 5. This comparison yields a lower percentage of detected records than the previous one; it can be inferred that the misspelling cases have more variants in the records. The remaining comparisons are not shown as figures due to page limitations.
3. Comparison based on name abbreviation.
The duplicated records are examined in this comparison by running the four services. In this comparison, the records with name abbreviations are handled more accurately by the services, except for the WinPure service.
4. Comparison based on honorific prefixes.
The duplicated records are examined in this comparison by running the four services. It is interesting to note that the DQGlobal service underperforms in this experiment.
5. Comparison based on common nicknames.
The duplicated records are examined in this comparison by running the four services. In this comparison, the DoubleTake service is unable to detect any duplicates.
Figure 5. Misspellings Duplicated Records and Examination Results
6. Comparison based on split names.
The duplicated records are examined in this comparison by running the four services. In this comparison, the WinPure service underperforms again.
7. Comparison based on exact match.
The duplicated records are examined in this comparison by running the four services. The exact match feature is important for cases that need specific handling, such as investigating internal mistakes in data warehousing.
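The sketch below illustrates how several of the duplicate types above (honorific prefixes, common nicknames, name abbreviations and split names) could be neutralized by normalizing names before matching. The honorific list, nickname map and sample names are small assumed examples; they do not reflect the rules implemented by the compared services.

# Illustrative name normalization before duplicate matching (assumed rules).
HONORIFICS = {"mr", "mrs", "ms", "dr", "prof", "eng"}
NICKNAMES = {"md": "mohammed", "moh": "mohammed"}

def normalize_name(*parts):
    # Join split name parts, strip honorifics, expand nicknames/abbreviations.
    tokens = " ".join(parts).lower().replace(".", "").split()
    tokens = [t for t in tokens if t not in HONORIFICS]    # honorific prefixes
    tokens = [NICKNAMES.get(t, t) for t in tokens]         # common nicknames
    return " ".join(tokens)

# A prefixed, abbreviated, split name and a plain full name compare equal after cleanup.
print(normalize_name("Dr.", "Md", "Al", "Harbi"))  # 'mohammed al harbi'
print(normalize_name("Mohammed Al Harbi"))         # 'mohammed al harbi'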
A complete comparison summary for the seven duplicate types is illustrated in Figure 6. It can be inferred that the WizSame service has the highest reliability across all comparison criteria among the services, although it does not have the peak performance in terms of the number of detected records.
Figure 6. Seven Type Duplicates Examination Results
6. Quality of Data Cleaning Services
In addition to evaluating the quality of data, there is a requirement to assess the data cleaning service based on quality of service. Several quality of service metrics are taken into account in this paper, such as performance, the capability to process a number of records, data heterogeneity, and price.
The performance of the service is broken down into two metrics: processing time and memory. Time is an important factor in most algorithm comparisons and is calculated here over the processed records. A set of 1000 records is considered sufficient for this comparison, and the time spent by each service on processing these 1000 records is calculated. The results of this record manipulation depend on the system environment; therefore, the comparison between all of these services is conducted in the same environment, which is kept consistent across all four services. Modifying the environment may affect the overall performance of these services.
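A minimal sketch of how processing time and peak memory could be measured uniformly around each service invocation is given below, using Python's perf_counter and tracemalloc. The run_dedup argument is a hypothetical placeholder for calling one wrapped deduplication service; it is not an actual API of the compared products.

# Sketch of a uniform measurement harness for processing time and peak memory.
import time
import tracemalloc

def measure(run_dedup, records):
    # Return elapsed seconds and peak memory (bytes) for one service run.
    tracemalloc.start()
    start = time.perf_counter()
    result = run_dedup(records)              # process the same 1000-record set
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak, result

if __name__ == "__main__":
    sample = [{"name": f"Customer {i % 900}"} for i in range(1000)]
    dummy = lambda recs: {r["name"] for r in recs}   # stand-in for a real service
    t, mem, _ = measure(dummy, sample)
    print(f"time: {t:.4f} s, peak memory: {mem} bytes")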
The result of the processing time evaluation is presented in Figure 7 (a). It shows that WizSame and WinPure utilized less CPU time for processing 1000 records, whereas DQGlobal took the longest time. Figure 7 (b) presents the comparison of the memory utilization of the examined services. In this evaluation, both DoubleTake and WizSame had optimal performance, while DQGlobal and WinPure consumed more memory for processing the 1000 records.
Figure 7. Time Spent and Memory Utilization on The Processing of 1000 Records
The capability of each service to process records is an important metric for a data cleaning service. The comparison describes how many records each service can process for duplicate removal. WinPure was able to process at most 250,000 records, DoubleTake at most twenty million records, and WizSame and DQGlobal at most one million records each.
Different services have different capabilities to process particular data formats; this is considered a data heterogeneity metric. WinPure is able to process plain text files, MS Excel, MS Access, dBase and MS SQL Server. DoubleTake is able to process MS Excel, MS Access, dBase, plain text files, ODBC, FoxPro, MS SQL Server, DB2 and Oracle. WizSame is able to process MS SQL, MS Access, Oracle, dBase, plain text files, ODBC and OLE DB. DQGlobal is able to process MS Access, Paradox, MS Excel, DBF, Lotus, FoxPro, and plain text files. This comparison shows that DoubleTake supports more data formats than the other services, with WizSame in second place for the number of data formats supported in removing duplication.
Price is another quality of service metric considered in this paper. A service with a high price is not feasible for particular users. At the time of writing, the license prices of the applications vary widely: WinPure costs $949.00, DoubleTake costs $5,900.00, WizSame costs $2,495.00 and DQGlobal costs $3,850.00. This comparison shows that DoubleTake has the highest price among the services. However, since we wrap all these applications as services, the cost is minimized by paying only for the executed services.
7. Conclusion
In data warehousing, the data cleaning service plays an important role in many domains. If the data are not clean and are full of anomalies, the resulting data cause many issues, such as data integration and query errors. In order to obtain the best form of the extracted data, it is important to clean the data as an initial step, and data redundancy should be removed to maintain data integrity. This research provides an overview of the quality of data to be used in data warehousing, and analyzes, practices and experiments with the concept of data quality by utilizing real local data. Hence, this research has two contributions. First, it surveyed data quality in the data warehouse environment and the analysis of data integrity. Second, it compared the services that can remove duplicated data through some real experiments. The experiments were conducted based on performance measures so that it could be determined which service is more effective for the removal of data duplication. The comparison is considered an aid for users to select the best services depending on their needs, especially in the scope of Saudi Arabia.
Acknowledgement
This work was supported by the Deanship of Scientific Research (DSR), King Abdulaziz
University, Jeddah, Saudi Arabia. The author, therefore, gratefully acknowledges the DSR techni-
cal and financial support. The author also thanks Mshari AlTuraifi for conducting the experiments
in Saudi Arabia.
References
[1] B. Moustaid and M. Fakir, “Implementation of business intelligence for sales management,”
IAES International Journal of Artificial Intelligence (IJ-AI), vol. 5, no. 1, pp. 22–34, 2016.
[2] I. Khliad, “Data warehouse design and implementation based on quality requirements,” Inter-
national Journal of Advances in Engineering and Technology, pp. 642–651, 2014.
[3] L. Robert, “Data quality in healthcare data warehouse environments,” 34th Hawaii International Conference on System Sciences, 2001.
[4] G. Shankaranarayanan, “Towards implementing total data quality management in a data
warehouse,” Journal of Information Technology Management, vol. 16, no. 1, pp. 21–30, 2005.
[5] A. Amine, R. A. Daoud, and B. Bouikhalene, “Efficiency comparaison and evaluation between
two etl extraction tools,” Indonesian Journal of Electrical Engineering and Computer Science,
vol. 3, no. 1, pp. 174–181, 2016.
[6] R. Archana, R. S. Hegadi, and T. Manjunath, “A big data security using data masking meth-
ods,” Indonesian Journal of Electrical Engineering and Computer Science, vol. 7, no. 2, pp.
449–456, 2017.
[7] H. Marcus, K. Mathias, and K. Bernard, “How to measure data quality? A metric approach,” Twenty-Eighth International Conference on Information Systems, Montreal, pp. 1–15, 2007.
[8] H. Frederik, Z. Dennis, and L. Anders, “The cost of poor quality,” Journal of Industrial Engineering and Management, pp. 163–193, 2011.
[9] K. Rahul, “Data quality in data warehouse problems and solution,” IOSR Journal of Computer Engineering (IOSR-JCE), vol. 16, no. 1, pp. 18–24, 2014.
[10] B. Inmon, “Data warehousing 2.0 architecture for next generation of data warehousing,” Tech.
Rep., 2010.
[11] R. Kimball, M. Ross, J. Mundy, and W. Thornthwaite, The Kimball Group Reader: Relent-
lessly Practical Tools for Data Warehousing and Business Intelligence Remastered Collec-
tion. John Wiley & Sons, 2015.
[12] United States Department of Interior CIO, “Data quality management guide,” Tech. Rep.,
2008.
[13] Q. Sun and Q. Xu, “Research on collaborative mechanism of data warehouse in sharing
platform,” Indonesian Journal of Electrical Engineering and Computer Science, vol. 12, no. 2,
pp. 1100–1108, 2014.
[14] Lloyd's, “Solvency II, Section 4: Statistical quality standards,” Tech. Rep., 2010.

More Related Content

PDF
A ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREE
PDF
A simplified approach for quality management in data warehouse
PDF
ijcatr04081001
PDF
A Data Quality Model for Asset Management in Engineering Organisations
PDF
IRJET- A Review of Data Cleaning and its Current Approaches
PDF
Data Ware House Testing
PDF
CXAIR for Data Migration
PPTX
km ppt neew one
A ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREE
A simplified approach for quality management in data warehouse
ijcatr04081001
A Data Quality Model for Asset Management in Engineering Organisations
IRJET- A Review of Data Cleaning and its Current Approaches
Data Ware House Testing
CXAIR for Data Migration
km ppt neew one

What's hot (17)

PDF
H1803014347
PDF
Designing a Framework to Standardize Data Warehouse Development Process for E...
PDF
Value of information integration to supply chain management role of internal...
PDF
CPT-OPEX Returned Mail Case Study
PDF
N ETWORK F AULT D IAGNOSIS U SING D ATA M INING C LASSIFIERS
PDF
META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING
PPTX
Data warehouse,data mining & Big Data
PDF
Case study top braid
PDF
Challenges in integrating various DBMS during SAP implementation
PDF
Efficient Cost Minimization for Big Data Processing
PDF
Data Warehouse Development Standardization Framework (DWDSF): A Way to Handle...
PPT
What is the price of bad customer data?
PDF
Data Profiling, Data Catalogs and Metadata Harmonisation
PDF
Data Quality MDM
PDF
A DATA GOVERNANCE MATURITY ASSESSMENT: A CASE STUDY OF SAUDI ARABIA
PPTX
WITSML to PPDM mapping project
H1803014347
Designing a Framework to Standardize Data Warehouse Development Process for E...
Value of information integration to supply chain management role of internal...
CPT-OPEX Returned Mail Case Study
N ETWORK F AULT D IAGNOSIS U SING D ATA M INING C LASSIFIERS
META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING
Data warehouse,data mining & Big Data
Case study top braid
Challenges in integrating various DBMS during SAP implementation
Efficient Cost Minimization for Big Data Processing
Data Warehouse Development Standardization Framework (DWDSF): A Way to Handle...
What is the price of bad customer data?
Data Profiling, Data Catalogs and Metadata Harmonisation
Data Quality MDM
A DATA GOVERNANCE MATURITY ASSESSMENT: A CASE STUDY OF SAUDI ARABIA
WITSML to PPDM mapping project
Ad

Similar to Data Cleaning Service for Data Warehouse: An Experimental Comparative Study on Local Data (20)

PDF
Multidimensional Challenges and the Impact of Test Data Management
PDF
Making Data Quality a Way of Life
PDF
Boosting Business with Data Integration.
PDF
Enhancing Data Staging as a Mechanism for Fast Data Access
PDF
Enhancing Data Staging as a Mechanism for Fast Data Access
PDF
Enhancing Data Staging as a Mechanism for Fast Data Access
PDF
Enhancing Data Staging as a Mechanism for Fast Data Access
PPTX
data collection, data integration, data management, data modeling.pptx
PDF
AUTO-CDD: automatic cleaning dirty data using machine learning techniques
PPTX
Collaborate 2012-accelerated-business-data-validation-and-managemet
PDF
Leveraging Automated Data Validation to Reduce Software Development Timeline...
PDF
Creating a Successful DataOps Framework for Your Business.pdf
PPTX
Modern trends in information systems
PPTX
Data Processing and its Types
PPTX
Different types of data processing
PPTX
DataOps Best Practices for Real-Time Big Data Management
PDF
Data warehouse-testing
PDF
Quality Assurance in Knowledge Data Warehouse
PDF
Improving Data Extraction Performance
PDF
Master data management gfoa
Multidimensional Challenges and the Impact of Test Data Management
Making Data Quality a Way of Life
Boosting Business with Data Integration.
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
data collection, data integration, data management, data modeling.pptx
AUTO-CDD: automatic cleaning dirty data using machine learning techniques
Collaborate 2012-accelerated-business-data-validation-and-managemet
Leveraging Automated Data Validation to Reduce Software Development Timeline...
Creating a Successful DataOps Framework for Your Business.pdf
Modern trends in information systems
Data Processing and its Types
Different types of data processing
DataOps Best Practices for Real-Time Big Data Management
Data warehouse-testing
Quality Assurance in Knowledge Data Warehouse
Improving Data Extraction Performance
Master data management gfoa
Ad

More from TELKOMNIKA JOURNAL (20)

PDF
Earthquake magnitude prediction based on radon cloud data near Grindulu fault...
PDF
Implementation of ICMP flood detection and mitigation system based on softwar...
PDF
Indonesian continuous speech recognition optimization with convolution bidir...
PDF
Recognition and understanding of construction safety signs by final year engi...
PDF
The use of dolomite to overcome grounding resistance in acidic swamp land
PDF
Clustering of swamp land types against soil resistivity and grounding resistance
PDF
Hybrid methodology for parameter algebraic identification in spatial/time dom...
PDF
Integration of image processing with 6-degrees-of-freedom robotic arm for adv...
PDF
Deep learning approaches for accurate wood species recognition
PDF
Neuromarketing case study: recognition of sweet and sour taste in beverage pr...
PDF
Reversible data hiding with selective bits difference expansion and modulus f...
PDF
Website-based: smart goat farm monitoring cages
PDF
Novel internet of things-spectroscopy methods for targeted water pollutants i...
PDF
XGBoost optimization using hybrid Bayesian optimization and nested cross vali...
PDF
Convolutional neural network-based real-time drowsy driver detection for acci...
PDF
Addressing overfitting in comparative study for deep learningbased classifica...
PDF
Integrating artificial intelligence into accounting systems: a qualitative st...
PDF
Leveraging technology to improve tuberculosis patient adherence: a comprehens...
PDF
Adulterated beef detection with redundant gas sensor using optimized convolut...
PDF
A 6G THz MIMO antenna with high gain and wide bandwidth for high-speed wirele...
Earthquake magnitude prediction based on radon cloud data near Grindulu fault...
Implementation of ICMP flood detection and mitigation system based on softwar...
Indonesian continuous speech recognition optimization with convolution bidir...
Recognition and understanding of construction safety signs by final year engi...
The use of dolomite to overcome grounding resistance in acidic swamp land
Clustering of swamp land types against soil resistivity and grounding resistance
Hybrid methodology for parameter algebraic identification in spatial/time dom...
Integration of image processing with 6-degrees-of-freedom robotic arm for adv...
Deep learning approaches for accurate wood species recognition
Neuromarketing case study: recognition of sweet and sour taste in beverage pr...
Reversible data hiding with selective bits difference expansion and modulus f...
Website-based: smart goat farm monitoring cages
Novel internet of things-spectroscopy methods for targeted water pollutants i...
XGBoost optimization using hybrid Bayesian optimization and nested cross vali...
Convolutional neural network-based real-time drowsy driver detection for acci...
Addressing overfitting in comparative study for deep learningbased classifica...
Integrating artificial intelligence into accounting systems: a qualitative st...
Leveraging technology to improve tuberculosis patient adherence: a comprehens...
Adulterated beef detection with redundant gas sensor using optimized convolut...
A 6G THz MIMO antenna with high gain and wide bandwidth for high-speed wirele...

Recently uploaded (20)

PPTX
communication and presentation skills 01
PDF
22EC502-MICROCONTROLLER AND INTERFACING-8051 MICROCONTROLLER.pdf
PDF
distributed database system" (DDBS) is often used to refer to both the distri...
PPTX
Feature types and data preprocessing steps
PDF
737-MAX_SRG.pdf student reference guides
PPTX
Fundamentals of Mechanical Engineering.pptx
PDF
ChapteR012372321DFGDSFGDFGDFSGDFGDFGDFGSDFGDFGFD
PDF
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
PPTX
Fundamentals of safety and accident prevention -final (1).pptx
PDF
Visual Aids for Exploratory Data Analysis.pdf
PDF
Influence of Green Infrastructure on Residents’ Endorsement of the New Ecolog...
PPTX
Software Engineering and software moduleing
PPTX
Module 8- Technological and Communication Skills.pptx
PDF
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
PPTX
Current and future trends in Computer Vision.pptx
PDF
Improvement effect of pyrolyzed agro-food biochar on the properties of.pdf
PPTX
Sorting and Hashing in Data Structures with Algorithms, Techniques, Implement...
PPTX
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
PPTX
introduction to high performance computing
PPTX
Information Storage and Retrieval Techniques Unit III
communication and presentation skills 01
22EC502-MICROCONTROLLER AND INTERFACING-8051 MICROCONTROLLER.pdf
distributed database system" (DDBS) is often used to refer to both the distri...
Feature types and data preprocessing steps
737-MAX_SRG.pdf student reference guides
Fundamentals of Mechanical Engineering.pptx
ChapteR012372321DFGDSFGDFGDFSGDFGDFGDFGSDFGDFGFD
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
Fundamentals of safety and accident prevention -final (1).pptx
Visual Aids for Exploratory Data Analysis.pdf
Influence of Green Infrastructure on Residents’ Endorsement of the New Ecolog...
Software Engineering and software moduleing
Module 8- Technological and Communication Skills.pptx
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
Current and future trends in Computer Vision.pptx
Improvement effect of pyrolyzed agro-food biochar on the properties of.pdf
Sorting and Hashing in Data Structures with Algorithms, Techniques, Implement...
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
introduction to high performance computing
Information Storage and Retrieval Techniques Unit III

Data Cleaning Service for Data Warehouse: An Experimental Comparative Study on Local Data

  • 1. TELKOMNIKA, Vol. 16, No. 2, April 2018, pp. 834 ∼ 842 ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013 DOI: 10.12928/telkomnika.v16.i2.7669 834 Data Cleaning Service for Data Warehouse: An Experimental Comparative Study on Local Data Arif Bramantoro Faculty of Computing and Information Technology in Rabigh King Abdulaziz University, Jeddah, Saudi Arabia e-mail: asoegihad@kau.edu.sa Abstract Data warehouse is a collective entity of data from various data sources. Data are prone to several complications and irregularities in data warehouse. Data cleaning service is non trivial activity to ensure data quality. Data cleaning service involves identification of errors, removing them and improve the quality of data. One of the common methods is duplicate elimination. This research focuses on the service of duplicate elimination on local data. It initially surveys data quality focusing on quality problems, cleaning methodology, involved stages and services within data warehouse environment. It also provides a comparison through some experiments on local data with different cases, such as different spelling on different pronunciation, misspellings, name abbreviation, honorific prefixes, common nicknames, splitted name and exact match. All services are evaluated based on the proposed quality of service metrics such as performance, capability to process the number of records, platform support, data heterogeneity, and price; so that in the future these services are reliable to handle big data in data warehouse. Keyword: Data Cleaning Service, Data Warehouse, Data Quality, Local Data Copyright c 2018 Universitas Ahmad Dahlan. All rights reserved. 1. Introduction Data warehouse is a relational database for questioning and analyzing by further pro- cessing. It is obtained from several transactions from other sources. Integrated data warehouse is an integration of files, sources and other records. Several services are used to ensure good data, such as data cleaning and data integration service within enterprise. Subject oriented data warehouse is a subject centric model involving several subjects, such as vendor, product, sales and customer [1]. A good data warehouse must focus on proper analysis and cleaning of data rather than daily service transaction and operations. This kind of model is required by most enterprises. The model must be simple and related to the data cleaning objective. It should also avoid data which are not required for transaction and decision making operations. Nonvolatile nature of data is important for these operations. It should be physically separated away from the application. This separation helps data for recovery and other time consuming mechanisms, such as loading and retrieval of data [2]. Time variant is the period of time which is involved in the data storage in data warehouse. This is the element of time. The decisions taken during the process is very important, therefore the generated trend reports are significant elements in data warehousing. A proper decision support system works successfully with this trend report. There are several commercial applications such as customer relationship and business applications using utilizing data cleaning service. A proper scheme is involved during the development of data warehouse. The questions and analysis are completed in the designing stage. Meaningful access of relevant data is required together with the generated values. 
The extraction of the source is important, therefore it must be very clean without any unrelated sources. The provided service is data cleaning in any big enterprise. The input data input are rechecked and tested before they are allocated to a specific data warehouse. The loaded data are separated from technical specification and process [3]. Received October 25, 2017; Revised February 13, 2018; Accepted March 14, 2018
  • 2. TELKOMNIKA ISSN: 1693-6930 835 Several automatic executions are also made to eliminate error in the data. Identifying the incomplete data is a hindrance for processing of data. It makes the corrections more complicated. A service called back flushing is used to recheck the data cleaning frequently. Installation of data occurs in the first instance for the model from other sources. Monitoring service is used for the recovery of data at different levels from huge to small quantity. The amount of load is also related to the process in warehouse. Hence, caution steps are noteworthy to process the data smoothly. ETL is the process of Extract, Transform, and Load of data. It means that the extraction of relevant data is followed by the transformation and consequently the loading of data in the warehouse. Extract is a method of data extraction which occurs in data warehouse from the allocated resources. The consolidation of these resources also takes places in the separated system that is allocated for each level of processing. This step of extraction takes the data into another level called transforming [4]. Transform is a mechanism to involve the data extraction which is converted from previous form and placed in the data warehouse without any errors in the data. The source of data needs a proper manipulation of all methods. It follows a set of functions to extract data into the warehouse without any modification to the existing data. The technical and other requirements are validated to meet the requirements. There are several transformations involved in the process, such as selecting only particular information and assigning them with specific functions. Coding the data with values is a concrete service of data cleaning which occurs automatically. Another form of data cleaning service is to encode the result into a new value and to combine data values of two different methods. The form of data can be simple or complex method. The path of data may be failed or successful, which both methods involve in handling data in a specific program. For example, the model can be a translated code in an extracted data. Load is a process of data handling which is important together with the targeting range of information. Few data can be overwritten with other non-updated or updated data. The selec- tion of the design is also important together with a proper understanding of the available choices related to time and business requirements. The complex model of system is to allow several changes to be updated and uploaded. The overall quality of data in the data warehouse environ- ment is validated by utilizing ETL mechanism [5]. The objective of this paper is to identify the causes for data quality problem. Particular methodology and experiments are adopted to address the problem. It is expected that it con- tributes to better data quality in data warehouse. Data cleaning and duplicate elimination services are the appropriate methods to improve the data quality. Experiments are conducted to provide the results and comparisons between de-duplication services. The research approach as it is presented here is novel due to the level of implementation by utilizing service oriented approach as an evident support to the conducted survey. In this research, only data cleansing service is considered as the main task. The rest of the data quality services such as completeness and historical reputation services remain as a future work to compose more services in a service-oriented system. 2. 
Data Quality in Data Warehouse Data warehousing is a promising industry for several government organization and pri- vate institutions, which involves several confidential data storage with regards to internal security. With the enormous amount of data, the responsibility of organization becomes critical when it comes to security concerns [6]. The assurance of data quality is the primary objective of any management levels. There is an increased potential data quality and its irregularities. Data ware- house is adopted by the organization to improve the relationship between customers, client and management. Thus, improving the efficiency for the entire organization is required. Data quality is defined as the measured performance or the loss of data in an organization [2]. The purpose of data quality measurement is to identify the missing data from the system. The quality of data attained for the data warehouse model assures the inputs on the client side. However, one user is different from another user. The data must be simple, consistent and full of understanding. The abundance of data increase the burden on the system side. The quality Data Cleaning Service for Data Warehouse: An Experimental ... (Arif Bramantoro)
  • 3. 836 ISSN: 1693-6930 of data is critical together with the identification of irregularities. The key quality of data and its dimension metrics are important to understand the effective quality improvement. Data quality has the importance due to the use of data warehouse system. Data quality is measured in each phase of operations. Metrics are selected to ensure measurement of data quality and analysis. The selection of metrics is critical to the final result which directly affects to the customer relationship. Quantifying data is important to save the cost and improve market standards in a competitive economy [7]. In this paper, data quality and quality of service metrics are combined to improve the confidence of data quality process. With an increasing technology and enormous data inputs in industry, the authorities need to improve the quality of data in enterprise. There are several problems faced by the enterprises in order to maintain and sustain their quality of service in delivering the project. The types of data are classified into intrinsic, contextual, representative and accessible. Performance is the factor of quality standard in any enterprise trademarks. The addition of data must be static rather than dynamic in order to efficiently avoid irregularities during monitoring the process of quality standard improvements. The consumer must be carefully considered when there are data sent by the client to the enterprise [8]. A common data quality framework includes a loop of activity in weighting its cost and benefit as illustrated in Figure 1. Figure 1. Data Quality Process Loop 3. Classification of Data Quality Modern data quality improvement approach requires a real time scenario with the pref- erence to avoid operational and analytical models [9]. The correctness and assurance of data quality is measured after during the improvement. The data quality issues requires to be han- dled for designing services for data warehouse without any quality problems. The identification of problems caused by poor data is examined to derive a proper procedure. Inaccurate information by the customer is another cause for a decline in quality. Unlike conventional approach, there are several other proximity and time variant issues that must be given a consideration in modern approach. The source of data in modern data warehouse is related to the data quality improve- ment in data warehouses. The fields are filled by the ones in unstructured forms. These issues are improvements and advancement for modern research in data warehouse compared to the conventional method of Inmon [10] and Kimball research [11]. According to Data Warehouse Institute [12], data quality improvement includes the cor- rection on defective data to ensure the achievement of minimum level of data quality standard. It is also mentioned that data are required to be flawless without any irregularities. It has to meet the standard requirement of the compatible application. The quality of data required by user is different from the one required by the organization. Strict rules are used to avoid an improper data processing. Validation is made at particular level where data are equipped with pin numbers or passwords. The frequent data errors are considered as a common phenomenon, however the model developed for data quality in data warehouse is regularly adaptive to all changes. Hence, TELKOMNIKA Vol. 16, No. 2, April 2018 : 834 ∼ 842
  • 4. TELKOMNIKA ISSN: 1693-6930 837 data in high quality can be used in operations, decision making process and modeling. In addition, the quality information indicates which data model needed by data warehouse. The probability of errors that can lead to a decline in data quality is required for records and protocol distribution in a network. The calculation of technical information and the require- ment protocol proposed by the enterprise have to be fulfilled to achieve data quality. The assured mechanism for development of these protocols can benefit the enterprise by providing data qual- ity management in a large scale. The goal is to meet market standards rather than to adopt low cost protocols that may lead to a failure of the suggested model. Many organization and govern- ment agencies are involved with huge database collection [13]. The importance of data quality becomes a big concern to achieve results and experiments required by a client. If this is not taken seriously, several complications may arise due to a failure in data quality which affects the customer relationship model at any level of processes in enterprise. An effective risk management is needed for a system to learn from its deficiencies. The designed protocols must be in such a way to cope up with the risk and deliver the required stan- dard results. The policy makers must decide the risk strategies to comprehend the desired data quality standards. Further management of the risk mitigation protocols for data quality improve- ment and the desired policy formulation play a major role based on data quality requirements. The agent for risk mitigation approach is assigned after several testing levels, since they are going to play a major role in the enterprise working level. The decisions are taken from the policy of the risk mitigation for data quality approaches. The management of any enterprise should pay an attention to Llyods approach [14] of data quality model and risk mitigation standards. 4. Duplicate Elimination Test Bed Duplicate elimination is one of the important concrete services in data cleaning service composition. The main objective of data cleaning service is to maintain data quantity. It is a service-oriented method to remove duplicated data which may be represented by the user more than one time. The general idea is a matching process that enables to identify duplicated data. One important aspect during the search of the duplication of the same records is the ambiguity of data. There are several experiments conducted to convince the duplicate elimination. Several services are used during the matching process on those experiments. However, only few of them give the desired results. The duplicate information are displayed and recorded in the form of a table together with the indication of its percentage. The effective services are generally chosen to get successful results of the improvement on data quality standards. The aim of this research is to compare the duplicate elimination services and find out which ones perform better. The comparison is generally based on two parameters. The first parameter is the time to detect the errors in the data that alter the system and environment. Additional time is required to improve the quality of data in the process system of any functionality. The second parameter is the memory that determines on the effectiveness of the data quality. 
The services required for the experiments are available from the following service providers: WinPure Clean and Match (referred as WinPure), DoubleTake3 Dedupe & Merge (referred as DoubleTake), WizSame (referred as the same name), and Dedupe Express (referred as DQ- Global). Before the comparison between services are conducted, the experimental test bed needs to be developed on real data in local Saudi Arabia. During the first experiment, there are eight duplicates from the data set manually selected from data warehouse and further examined by the duplicate detection services as shown in Figure 2. It is important to note that the data with high privacy are preserved. Due to the limitation of the page, only the result of the duplicate data detected by DoubleTake service is presented in this paper as illustrated in Figure 3 which has seven duplicate data. In this figure, DoubleTake service provides some information that might be different from other services, such as the number of suppressed records and the rate of records per hour. Hence, this research standardizes the service output as the number of duplicate records to ease the comparison. The quality of service is included for the performance analysis as well. Data Cleaning Service for Data Warehouse: An Experimental ... (Arif Bramantoro)
  • 5. 838 ISSN: 1693-6930 Figure 2. First Experiment Dataset Figure 3. Duplicate Data Detected By DoubleTake Due to the limitation of the page, only one experimental test bed is presented in this paper. A summary of total five experiments is presented in Table 1. Table 1. Summary of Five Experiments WinPure DoubleTake Wizsame DQGlobal Experiment 1 50% 88% 75% 88% Experiment 2 25% 75% 67% 33% Experiment 3 50% 90% 90% 80% Experiment 4 88% 50% 75% 63% Experiment 5 17% 100% 92% 83% TELKOMNIKA Vol. 16, No. 2, April 2018 : 834 ∼ 842
  • 6. TELKOMNIKA ISSN: 1693-6930 839 5. Comparisons Between Services in Duplicate Detection In this comparison, there is a finer granularity based on the previous experiment test bed. Each service processes the same set of records so that the detection capability of all services can be justified. All the records for the comparison are based on the duplicate types. The comparisons are made based on the predefined duplication types. There are seven duplication types as follows: 1. Different spelling and pronunciation comparison. The duplicated records examined in this comparison and the examination result by running four services are illustrated in Figure 4. Due to the existence of different languages in Saudi Arabia, inconsistent name transliterated from another language is not uncommon. It is interesting to note that the service provided by WinPure is unable to detect any records with different spelling and pronunciation. Figure 4. Different Spelling and Pronunciation Duplicated Records and Examination Results 2. Comparison based on misspellings. The duplicated records examined in this comparison and the examination result by running four services are illustrated in Figure 5. This comparison provides less percentage of the detected records than the previous comparison. It can be inferred that the misspelling cases have more variants in the records. The next comparisons are not shown as a figure due to the limitation of the page. 3. Comparison based on name abbreviation. The duplicated records are examined in this comparison and the examination results by running four services. In this comparison, the records with name abbreviation are handled more accurately by the services, except for WinPure service. 4. Comparison based on honorific prefixes. The duplicated records are examined in this comparison and the examination result by run- ning four services. It is interesting to note that DQGlobal service underperforms in this experiment. 5. Comparison based on common nicknames. The duplicated records are examined in this comparison and the examination results by running four services. In this comparison, DoubleTake service is unable to detect. Data Cleaning Service for Data Warehouse: An Experimental ... (Arif Bramantoro)
Figure 5. Misspellings Duplicated Records and Examination Results

6. Comparison based on split names. The duplicated records are examined in this comparison together with the examination results of the four services. In this comparison, the WinPure service underperforms again.

7. Comparison based on exact match. The duplicated records are examined in this comparison together with the examination results of the four services. The exact-match feature is important for cases that need specific handling, such as investigating internal mistakes in data warehousing.

A complete comparison summary for the seven duplicate types is illustrated in Figure 6. It can be inferred that the WizSame service has the highest reliability across all comparison criteria, although it does not achieve peak performance in terms of the number of detected records.

Figure 6. Seven Type Duplicates Examination Results
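As a rough illustration of how several of these duplication types can be approximated programmatically, the following Python sketch strips honorific prefixes, expands common nicknames, and falls back to fuzzy string similarity for misspellings and transliteration variants. The lookup tables, names, and threshold are illustrative assumptions, not the matching logic of any of the compared services.

from difflib import SequenceMatcher

# Assumed, illustrative lookup tables; commercial services use richer rules.
HONORIFICS = {"mr", "mr.", "dr", "dr.", "eng", "eng.", "sheikh"}
NICKNAMES = {"mo": "mohammed"}  # hypothetical nickname mapping

def normalize(name: str) -> str:
    """Lowercase, strip honorific prefixes, and expand known nicknames."""
    tokens = [t for t in name.lower().split() if t not in HONORIFICS]
    tokens = [NICKNAMES.get(t, t) for t in tokens]
    return " ".join(tokens)

def is_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Exact match after normalization, or fuzzy match above a threshold."""
    na, nb = normalize(a), normalize(b)
    if na == nb:  # covers exact match, honorific prefixes, and nicknames
        return True
    return SequenceMatcher(None, na, nb).ratio() >= threshold  # misspellings

print(is_duplicate("Dr. Mohammed Alghamdi", "Mohammed Alghamdy"))  # fuzzy: True
print(is_duplicate("Mo Alghamdi", "Mohammed Alghamdi"))            # nickname: True

Although simplified, the same normalize-then-compare structure underlies all seven duplication types examined in this section.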
6. Quality of Data Cleaning Services

In addition to the evaluation of data quality, the data cleaning services need to be assessed based on quality of service. Several quality of service metrics are taken into account in this paper: performance, capability to process a number of records, data heterogeneity, and price.

The performance of a service is broken down into two metrics: processing time and memory. Time is an important factor in most algorithm comparisons and is calculated based on the processed records; 1000 records are considered sufficient for this comparison, so the time spent by each service on processing 1000 records is measured. Since the results of such record manipulation depend on the system environment, the comparison is conducted in the same environment, kept consistent across all four services; modifying the environment may affect their overall performance. The result of the processing time evaluation is presented in Figure 7 (a). It shows that WizSame and WinPure utilized less CPU time for processing 1000 records, while DQGlobal took the most time. Figure 7 (b) presents the comparison of memory utilization among the examined services. In this evaluation, both DoubleTake and WizSame performed optimally, while DQGlobal and WinPure consumed more memory when processing 1000 records.

Figure 7. Time Spent and Memory Utilization on the Processing of 1000 Records

The capability of each service to process records is another important metric for a data cleaning service. This comparison describes how many records each service can process for duplicate removal. WinPure was able to process 250,000 records at maximum, DoubleTake twenty million records, WizSame one million records, and DQGlobal one million records.

Each service also differs in the data formats it can process, which is considered as the data heterogeneity metric. WinPure is able to process plain text files, MS Excel, MS Access, dBase, and MS SQL Server. DoubleTake is able to process MS Excel, MS Access, dBase, plain text files, ODBC, FoxPro, MS SQL Server, DB2, and Oracle. WizSame is able to process dBase, MS SQL Server, MS Access, Oracle, plain text files, ODBC, and OLE DB. DQGlobal is able to process MS Access, Paradox, MS Excel, DBF, Lotus, FoxPro, and plain text files. This comparison shows that DoubleTake supports more data formats than the other services, with WizSame second.

Price is another quality of service metric considered in this paper, as a high-priced service is not feasible for some users. The license prices of the applications vary widely at the time of writing: WinPure costs $949.00, DoubleTake $5,900.00, WizSame $2,495.00, and DQGlobal $3,850.00. This comparison implies that DoubleTake has the highest price of all the services. However, since all these applications are wrapped as services, the cost is minimized by paying only for the executed services.
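The processing-time and memory measurements above can be reproduced in spirit with standard-library instrumentation. The Python sketch below times a stand-in deduplication routine on 1000 synthetic records and reports peak memory; run_dedup and the record generator are hypothetical placeholders rather than the actual service interfaces.

import time
import tracemalloc

def run_dedup(records):
    """Hypothetical stand-in for an invocation of a wrapped dedup service."""
    seen, duplicates = set(), 0
    for rec in records:
        key = rec.lower().strip()
        if key in seen:
            duplicates += 1
        seen.add(key)
    return duplicates

# 1000 synthetic records, mirroring the size used in the paper's comparison.
records = [f"customer {i % 900}" for i in range(1000)]

tracemalloc.start()
start = time.perf_counter()
dups = run_dedup(records)
elapsed = time.perf_counter() - start
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"duplicates={dups}, time={elapsed:.4f}s, peak_memory={peak / 1024:.1f} KiB")

Keeping the hardware, operating system, and background load identical across runs, as done in the experiments, is what makes such timings comparable between services.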
7. Conclusion

In data warehousing, the data cleaning service plays an important role in many domains. If the data are not clean and are full of anomalies, the resulting data cause many issues, such as data integration and query errors. To obtain the best form of the extracted data, it is important to clean the data as an initial step, and data redundancy should be removed to maintain data integrity. This research provides an overview of data quality for use in data warehousing and analyzes, practices, and experiments with the concept of data quality on real local data. Hence, this research has two contributions. First, it surveyed data quality in the data warehouse environment and analyzed data integrity. Second, it compared services that can remove data duplication through a set of real experiments. The experiments were conducted based on performance measures so that the most effective service for the removal of data duplication could be determined. The comparison is intended as an aid for users to select the best service depending on their needs, especially within the scope of Saudi Arabia.

Acknowledgement

This work was supported by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, Saudi Arabia. The author, therefore, gratefully acknowledges the DSR technical and financial support. The author also thanks Mshari AlTuraifi for conducting the experiments in Saudi Arabia.

References
[1] B. Moustaid and M. Fakir, "Implementation of business intelligence for sales management," IAES International Journal of Artificial Intelligence (IJ-AI), vol. 5, no. 1, pp. 22-34, 2016.
[2] I. Khliad, "Data warehouse design and implementation based on quality requirements," International Journal of Advances in Engineering and Technology, pp. 642-651, 2014.
[3] L. Robert, "Data quality in healthcare data warehouse environments," 34th Hawaii International Conference on System Sciences, 2001.
[4] G. Shankaranarayanan, "Towards implementing total data quality management in a data warehouse," Journal of Information Technology Management, vol. 16, no. 1, pp. 21-30, 2005.
[5] A. Amine, R. A. Daoud, and B. Bouikhalene, "Efficiency comparaison and evaluation between two ETL extraction tools," Indonesian Journal of Electrical Engineering and Computer Science, vol. 3, no. 1, pp. 174-181, 2016.
[6] R. Archana, R. S. Hegadi, and T. Manjunath, "A big data security using data masking methods," Indonesian Journal of Electrical Engineering and Computer Science, vol. 7, no. 2, pp. 449-456, 2017.
[7] H. Marcus, K. Mathias, and K. Bernard, "How to measure data quality: a metric approach," Twenty-Eighth International Conference on Information Systems, Montreal, pp. 1-15, 2007.
[8] H. Frederik, Z. Dennis, and L. Anders, "The cost of poor quality," Journal of Industrial Engineering and Management, pp. 163-193, 2011.
[9] K. Rahul, "Data quality in data warehouse: problems and solution," IOSR Journal of Computer Engineering (IOSR-JCE), ISSN 2278-0661, vol. 16, no. 1, pp. 18-24, 2014.
[10] B. Inmon, "Data warehousing 2.0: architecture for the next generation of data warehousing," Tech. Rep., 2010.
[11] R. Kimball, M. Ross, J. Mundy, and W. Thornthwaite, The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence Remastered Collection. John Wiley & Sons, 2015.
[12] United States Department of Interior CIO, "Data quality management guide," Tech. Rep., 2008.
[13] Q. Sun and Q. Xu, "Research on collaborative mechanism of data warehouse in sharing platform," Indonesian Journal of Electrical Engineering and Computer Science, vol. 12, no. 2, pp. 1100-1108, 2014.
[14] Lloyd's, "Solvency II - Section 4 - Statistical quality standards," Tech. Rep., 2010.