International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 03 | Mar 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 948
An User Friendly Interface for Data Preprocessing and Visualization
using Machine Learning Models
Mr. S. Yoganand1, Bharathi Kannan R2, Daya Meenakshi B2
1Assistant Professor, Department of Computer Science and Engineering, Agni College of Technology Chennai-130,
Tamil Nadu, India.
2,3UG Student, Department of Computer Science and Engineering, Agni College of Technology Chennai-130,
Tamil Nadu, India.
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract – Machine learning is one of the most efficient
techniques for prediction and classification problems.
Today, most industries around the world depend on
machine learning models, which has ushered in an era of
data analytics. However, there is no suitable and efficient tool for
handling datasets that are used with machine learning models for
data prediction and visualization. This paper therefore proposes a
novel, user-friendly approach to applying
machine learning models for data prediction and
visualization. A tool is developed that performs data
cleaning, which is a prerequisite for data analysis, and
then provides a visual representation of the cleansed data.
The tool takes as input a structured dataset
that contains both textual and numerical data, which is then
processed using machine learning algorithms to obtain a
pre-processed dataset. The process runs through a series of steps to
produce visualized and predicted data according to the
algorithm chosen by the user.
Key Words: Machine learning, visualization, pre-
processing, Tool, user interface.
1. INTRODUCTION
An organisation uses datasets for predictive
analysis, and an important concern in such cases is data
quality. Noisy data can compromise the correctness of the
analysis. Common errors include missing values and duplicates.
These errors need to be corrected for
reliable decisions and analytics. Users must understand
the effects of using noisy data before proceeding with the
cleaning process. Noise removal improves model
performance because noise may obscure the
discovery of important information.
Machine learning is a widely valued application of
Artificial Intelligence. It allows a system to learn automatically,
without human assistance, from large datasets
with a large number of data fields. With the results
produced by machine
learning algorithms, organizations are able to work more
effectively and gain an advantage over their competitors. A
system that uses machine learning techniques is able to
predict the structure of the data and adjust the data
accordingly. The main challenge in a machine
learning model is dealing with large data sources during the data
cleaning process. Data cleaning is carried out by taking
in large datasets, which are checked for possible errors
using data pre-processing techniques. Other challenges
include avoiding learning from noisy data, avoiding
building a biased model, and leaving no room for
compromising on the quality of the data. The best practices
for data cleaning using machine learning techniques are
filling in missing values, removing unnecessary rows, reducing
the size of the data and implementing a good quality plan.
The success of machine learning applications
depends on the amount of good-quality data given to
them. Yet cleaning is often not treated as a
main area of data pre-processing. Even a system that uses
powerful algorithms can yield bad
results if it is given an irrelevant or incorrect training set. In
the proposed model, ML algorithms find the different
patterns in the data and group it into clean and
noisy data, which helps to reduce execution time.
2. Related Work:
Data pre-processing is used to convert raw data
into a pre-processed dataset [1]. In machine learning,
data pre-processing transforms or encodes the data
so that the algorithm can handle it easily. It consists of the
following interactive steps. Data cleaning detects
inaccurate records in a record set or table and then
replaces, modifies or deletes this noisy data. Data
integration combines data residing in different
sources and provides the user with a unified view of the data
[2]. The process of selecting suitable data for a research
project affects data integrity, while data transformation
converts data from a source format into a target format
[3].
The tools available for
data processing and visualization include KNIME, Shogun, Oryx 2,
TensorFlow, Weka, RapidMiner, Trifacta Wrangler and Python
[12] [13]. In this paper, we focus on removing noisy
data: identifying the numerical values, predicting and
filling in missing values, and detecting outliers that hamper
data analysis [11]. We propose a system that simplifies
the process for the user and allows for better processing. In
summary, machine learning for data cleaning might be the
only way to provide complete and trustworthy data sets for
effective analytics, so we provide a user-friendly interface
for pre-processing and model analysis with visualization, for
the ease of the user.
3. System Design:
Data pre-processing is done with three methods:
Data Cleaning, Data Transformation and Data
Reduction. The data cleaning application processes the
raw dataset, which contains both textual and numerical data, and
converts it into a cleaned dataset that can be used for data
analysis. Initially, users upload the dataset on which
they want to perform the analysis. They can then choose the operations
they want to perform on their dataset from the modules
provided. The application performs a series of operations,
which include removing columns with little or
no information, removing unnecessary rows, identifying the
numerical values, filling in the missing fields and identifying
the outliers. Some columns may contain little or
no information, which makes it hard to rely on them
for analysis; such columns can be removed without
causing significant damage to the data.
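The column-removal step described above can be sketched as follows. This is a minimal illustration rather than the tool's actual code: the function name drop_sparse_columns, the min_fill threshold of 0.5 and the sample rows are all assumptions made for the example, and the dataset is assumed to be a list of dictionaries with identical keys.

```python
def drop_sparse_columns(rows, min_fill=0.5):
    """Remove columns whose share of non-missing values is below min_fill.

    `rows` is a list of dicts that share the same keys. A value counts as
    missing when it is None or an empty string. The threshold is a
    hypothetical default, not a value taken from the paper.
    """
    if not rows:
        return []
    keep = []
    for col in rows[0].keys():
        filled = sum(1 for row in rows if row.get(col) not in (None, ""))
        if filled / len(rows) >= min_fill:
            keep.append(col)
    # Rebuild every row with only the columns that carry enough information.
    return [{col: row.get(col) for col in keep} for row in rows]


# Hypothetical sample data: the "notes" column is empty in every row.
rows = [
    {"age": "34", "city": "Chennai", "notes": ""},
    {"age": "29", "city": "", "notes": ""},
    {"age": "41", "city": "Madurai", "notes": ""},
]
cleaned = drop_sparse_columns(rows, min_fill=0.5)
print(cleaned[0])  # {'age': '34', 'city': 'Chennai'}
```

The "city" column survives because two of its three values are present, while "notes" is dropped because it carries no information.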
Some rows may contain empty fields, which
interfere with the proper pre-processing of the dataset.
Hence such values are identified and removed. The dataset
will contain features ranging from numerical to
non-numerical values. The application requires
numerical data for analysis and prediction,
so the fields containing numeric values are identified.
Removing every field with a missing value, however, might reduce the amount of
data that is available, so these fields need to be filled in with
appropriate values.
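One common way to fill such fields is to substitute the mean of the values that are present. The paper does not state which filling strategy the tool uses, so the sketch below is only an assumed example; the helper names is_number and fill_missing_with_mean and the sample column are hypothetical.

```python
from statistics import mean


def is_number(value):
    """Return True when the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def fill_missing_with_mean(values):
    """Replace entries that are not numeric (None, "", text) with the column mean.

    Mean imputation is an assumed strategy chosen for illustration only.
    If no entry is numeric, the column is returned unchanged.
    """
    numeric = [float(v) for v in values if is_number(v)]
    if not numeric:
        return list(values)
    fallback = mean(numeric)
    return [float(v) if is_number(v) else fallback for v in values]


# Hypothetical column with two missing entries.
ages = ["30", "", "40", None, "50"]
filled = fill_missing_with_mean(ages)
print(filled)  # [30.0, 40.0, 40.0, 40.0, 50.0]
```

The same is_number check also serves the step of identifying which fields hold numeric values.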
4. Implementation:
Outliers are data points that lie far from the
rest of the data. Mathematically, an outlier is usually
defined as an observation more than three standard deviations
from the mean. Outliers can appear because of errors in data entry
or measurement, or simply because of natural variation in the
population. Identifying and handling outliers is an important
part of data cleaning.
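The three-standard-deviation rule above can be sketched in a few lines. The function name find_outliers and the sample readings are assumptions for illustration, and the population standard deviation is used here as one possible choice.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumed choice)
    if sigma == 0:
        return []  # all values are identical, so nothing can be an outlier
    return [v for v in values if abs(v - mu) > threshold * sigma]


# Hypothetical readings: twenty typical values and one extreme value.
readings = [10.0] * 20 + [100.0]
print(find_outliers(readings))  # [100.0]
```

Note that with very small samples a single extreme value can inflate the standard deviation enough that no point exceeds three deviations, so this rule is more reliable on larger datasets.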
In data analysis, we use the following
algorithms to analyse the cleansed data: Linear Regression,
SVM (Support Vector Machine), KNN (K-Nearest
Neighbours), Logistic Regression, Decision Tree, K-Means,
Random Forest, Naive Bayes, Dimensionality Reduction
algorithms and Gradient Boosting algorithms.
The Linear Regression algorithm uses the
data points to find the best-fit line to model the
data. A line can be represented by the equation y = m*x +
c, where y is the dependent variable and x is the
independent variable. Basic calculus is
applied to find the values of m and c from the given
data set. The SVM separates the data points using a line
(in higher dimensions, a hyperplane).
KNN predicts an unknown data point from its k nearest
neighbours. The value of k is a critical factor for the
accuracy of the prediction. KNN determines the nearest neighbours
using basic distance functions such as the Euclidean distance. This algorithm
requires high computational power, and the data must be
normalized first to bring every value
within the same range. The Decision Tree algorithm is used
to solve classification problems. Criteria such as Gini,
Chi-square and entropy are used to split the data.
K-Means is an unsupervised algorithm that provides a solution
to the clustering problem. It groups the data into
clusters, each of which contains homogeneous
data.
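Two of the algorithms above are small enough to sketch directly. For the line y = m*x + c, setting the derivatives of the squared error to zero gives the standard least-squares result m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2) and c = mean_y - m * mean_x. For KNN, the sketch applies min-max normalization before measuring Euclidean distance. The function names fit_line and knn_predict and all sample data are assumptions for illustration, not the tool's implementation.

```python
import math
from collections import Counter


def fit_line(xs, ys):
    """Least-squares fit of y = m*x + c; returns (m, c)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx  # assumes the x values are not all identical
    c = y_bar - m * x_bar
    return m, c


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority label among its k nearest training points."""
    # Min-max normalization so that every feature lies in the same range.
    columns = list(zip(*train_points))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]

    def scale(point):
        return [(v - lo) / span for v, lo, span in zip(point, lows, spans)]

    q = scale(query)
    # Euclidean distance from the query to every training point.
    distances = sorted(
        (math.dist(scale(p), q), label)
        for p, label in zip(train_points, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical data lying exactly on y = 2*x + 1.
m, c = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(m, c)  # 2.0 1.0

# Hypothetical two-feature data whose features are on very different scales.
points = [(1, 100), (2, 120), (8, 900), (9, 950)]
labels = ["low", "low", "high", "high"]
prediction = knn_predict(points, labels, (2, 150), k=3)
print(prediction)  # low
```

Without the normalization step, the second feature would dominate the distance because its values are roughly a hundred times larger than the first.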
A random forest is a collection of
decision trees. Every tree estimates a classification,
and this estimate is called a vote. We consider the vote from every
tree and choose the classification with the most votes. Naive
Bayes assumes that the features are independent of
each other. The Gradient Boosting algorithm combines multiple weak
learners to form an accurate one; instead of relying on a
single estimator, it creates a more stable and robust
model. The algorithm is selected based on the data set
and provides an efficient result for the data analysis process.
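The voting step of a random forest can be sketched as below. The three "trees" are hypothetical stand-ins written as simple threshold rules, not trained decision trees, and the feature names are invented for the example; only the majority-vote logic reflects the description above.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that received the most votes."""
    return Counter(votes).most_common(1)[0][0]


def forest_predict(trees, sample):
    """Collect one vote per tree and return the most voted classification."""
    return majority_vote([tree(sample) for tree in trees])


# Hypothetical stand-in trees: each is a single rule on one invented feature.
trees = [
    lambda s: "noisy" if s["missing_fields"] > 2 else "clean",
    lambda s: "noisy" if s["z_score"] > 3 else "clean",
    lambda s: "noisy" if s["is_duplicate"] else "clean",
]

sample = {"missing_fields": 3, "z_score": 1.2, "is_duplicate": True}
result = forest_predict(trees, sample)
print(result)  # noisy
```

Two of the three rules vote "noisy", so that label wins even though one tree disagrees.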
5. Results:
The user can click on the Submit button that is
provided and then select the operations they wish to
perform on their dataset from the list of operations
provided. The user can then upload the dataset into the
application by clicking the Upload button to start the
pre-processing. Initially the original dataset is displayed, and
then the dataset after operation 1 is displayed as the cleansed
dataset. The selected operations are performed on the
cleansed dataset; finally, the user performs the data
analysis with the required algorithm to obtain the result as a
visualization, which can be downloaded by the user.
Upload Noisy Dataset:
Displaying the noisy Dataset:
Preprocessing the Data:
Applying Machine Learning Model:
Output:
6. Conclusion:
Our system performs Data Cleaning, Data
Transformation and Data Reduction as part of data pre-processing.
It takes raw datasets into the application,
pre-processes them to clean up the noisy data
using pre-processing techniques, and presents the cleansed data
to the users as a visualization once all the pre-processing is done.
The system saves a lot of time since manual cleaning can be
avoided. After cleansing, the user can select the
machine learning model, which provides its results
as plots. The system is useful for users
who want to clean large datasets and visualize the analysis
of the pre-processed data. In future, the accuracy and
comparison of the machine learning algorithms can be added
within the user-friendly interface.
REFERENCES
[1] Cristian Felix, Anshul Vikram Pandey, and Enrico Bertini,
“TextTile: An Interactive Visualization Tool for Seamless
Exploratory Analysis of Structured Data and Unstructured
Text”, IEEE-2018.
[2] Huawen Liu, Xuelong Li, Jiuyong Li, and Shichao
Zhang, “Efficient Outlier Detection for High-Dimensional Data”,
IEEE-2019.
[3] M. Bostock, V. Ogievetsky, and J. Heer, “Data-driven
documents,” IEEE-2011.
[4] F. Beck, S. Koch, and D. Weiskopf, “Visual Analysis and
Dissemination of Scientific Literature Collections with
SurVis”, IEEE-2016.
[5] Parke Godfrey, Jarek Gryz and Pioter Lasek, “Interactive
visualisation of large datasets”, IEEE-2016.
[6] Dileep Kumar Koshley and Raju Hadler, “Data Cleaning: An
Abstraction-based approach”, IEEE-2015.
[7] Mehmet Adil Yalçın, Niklas Elmqvist, and Benjamin B.
Bederson, “Keshif: Rapid and Expressive Tabular Data
Exploration for Novices”, IEEE-2018.
[8] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C.
Faloutsos, “LOCI: Fast outlier detection using the local
correlation integral,” IEEE 19th Int. Conf. Data Eng. (ICDE),
Bengaluru, India, 2003, pp. 315–326.
[9] Y. Pang, J. Cao, and X. Li, “Learning sampling distributions
for efficient object detection”, IEEE Trans. Cybern., vol. 47,
no. 1, pp. 117–129, Jan. 2017.
[10] M. Gupta, J. Gao, C. C. Aggarwal, and J. Han, “Outlier
detection for temporal data: A survey”, IEEE Trans. Knowl.
Data Eng., vol. 26, no. 9, pp. 2250–2267, Sep. 2014.
[11] S. F. Roth and J. Mattis, “Automating the presentation of
information,” in Artificial Intelligence Applications, 1991.
Proceedings., Seventh IEEE Conference on, vol. 1. IEEE,
1991, pp. 90–97.
[12] M. Bostock and J. Heer, “Protovis: A graphical toolkit
for visualization,” Visualization and Computer Graphics,
IEEE Transactions on, vol. 15, no. 6, pp. 1121–1128, 2009.
[13] A. Dziedzic, J. Duggan, A. J. Elmore, V. Gadepally, and M.
Stonebraker, “Bigdawg: a polystore for diverse interactive
applications,” in IEEE Viz Data Systems for Interactive
Analysis, 2015.
[14] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A.
Kementsietsidis, “Conditional functional dependencies for
data cleaning,” in Data Engineering, IEEE 23rd International
Conference on, pages 746–755. IEEE, 2007.

More Related Content

PDF
Comparative Analysis of Machine Learning Algorithms for their Effectiveness i...
PDF
COMPARATIVE ANALYSIS OF DIFFERENT MACHINE LEARNING ALGORITHMS FOR PLANT DISEA...
PDF
IRJET- Missing Data Imputation by Evidence Chain
PDF
Comparative Study on Machine Learning Algorithms for Network Intrusion Detect...
PDF
Comparative Analysis: Effective Information Retrieval Using Different Learnin...
PDF
MACHINE LEARNING ALGORITHMS FOR HETEROGENEOUS DATA: A COMPARATIVE STUDY
PDF
IRJET- A Detailed Study on Classification Techniques for Data Mining
PDF
IRJET- A Review of Data Cleaning and its Current Approaches
Comparative Analysis of Machine Learning Algorithms for their Effectiveness i...
COMPARATIVE ANALYSIS OF DIFFERENT MACHINE LEARNING ALGORITHMS FOR PLANT DISEA...
IRJET- Missing Data Imputation by Evidence Chain
Comparative Study on Machine Learning Algorithms for Network Intrusion Detect...
Comparative Analysis: Effective Information Retrieval Using Different Learnin...
MACHINE LEARNING ALGORITHMS FOR HETEROGENEOUS DATA: A COMPARATIVE STUDY
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Review of Data Cleaning and its Current Approaches

What's hot (20)

PDF
IRJET - An Overview of Machine Learning Algorithms for Data Science
PDF
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
PDF
N ETWORK F AULT D IAGNOSIS U SING D ATA M INING C LASSIFIERS
PDF
Distributed Digital Artifacts on the Semantic Web
PDF
A Study on Machine Learning and Its Working
PDF
Cross Domain Recommender System using Machine Learning and Transferable Knowl...
PDF
A02610104
PDF
[IJET-V1I3P11] Authors : Hemangi Bhalekar, Swati Kumbhar, Hiral Mewada, Prati...
PDF
EDGE DETECTION IN DIGITAL IMAGE USING MORPHOLOGY OPERATION
PDF
IRJET - Encoded Polymorphic Aspect of Clustering
PDF
IRJET- Machine Learning Classification Algorithms for Predictive Analysis in ...
PDF
IRJET-Scaling Distributed Associative Classifier using Big Data
PPTX
Comparative Study of Machine Learning Algorithms for Sentiment Analysis with ...
PDF
Ijatcse71852019
PDF
Survey on semi supervised classification methods and feature selection
PDF
Recommendation system using bloom filter in mapreduce
PDF
Data mining techniques
PDF
IRJET- A Survey on Mining of Tweeter Data for Predicting User Behavior
PDF
Survey on semi supervised classification methods and
PDF
Survey on Feature Selection and Dimensionality Reduction Techniques
IRJET - An Overview of Machine Learning Algorithms for Data Science
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
N ETWORK F AULT D IAGNOSIS U SING D ATA M INING C LASSIFIERS
Distributed Digital Artifacts on the Semantic Web
A Study on Machine Learning and Its Working
Cross Domain Recommender System using Machine Learning and Transferable Knowl...
A02610104
[IJET-V1I3P11] Authors : Hemangi Bhalekar, Swati Kumbhar, Hiral Mewada, Prati...
EDGE DETECTION IN DIGITAL IMAGE USING MORPHOLOGY OPERATION
IRJET - Encoded Polymorphic Aspect of Clustering
IRJET- Machine Learning Classification Algorithms for Predictive Analysis in ...
IRJET-Scaling Distributed Associative Classifier using Big Data
Comparative Study of Machine Learning Algorithms for Sentiment Analysis with ...
Ijatcse71852019
Survey on semi supervised classification methods and feature selection
Recommendation system using bloom filter in mapreduce
Data mining techniques
IRJET- A Survey on Mining of Tweeter Data for Predicting User Behavior
Survey on semi supervised classification methods and
Survey on Feature Selection and Dimensionality Reduction Techniques
Ad

Similar to IRJET - An User Friendly Interface for Data Preprocessing and Visualization using Machine Learning Models (20)

PDF
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...
PDF
Email Spam Detection Using Machine Learning
PDF
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
PDF
IRJET- Road Accident Prediction using Machine Learning Algorithm
PDF
Efficiently Detecting and Analyzing Spam Reviews Using Live Data Feed
PDF
IRJET - Customer Churn Analysis in Telecom Industry
PDF
Comparative Study of Enchancement of Automated Student Attendance System Usin...
DOCX
Machine Learning Approaches and its Challenges
PDF
IRJET - House Price Prediction using Machine Learning and RPA
PDF
13_Data Preprocessing in Python.pptx (1).pdf
PDF
A Hierarchical Feature Set optimization for effective code change based Defec...
PPTX
1) Introduction to Data Analyticszz.pptx
PDF
AIRLINE FARE PRICE PREDICTION
PDF
IRJET- Probability based Missing Value Imputation Method and its Analysis
PDF
A Firefly based improved clustering algorithm
PDF
IRJET- Machine Learning
PDF
IRJET- Identify the Human or Bots Twitter Data using Machine Learning Alg...
PDF
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
PDF
Fast Range Aggregate Queries for Big Data Analysis
PDF
IRJET- Intelligence Extraction using Various Machine Learning Algorithms
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...
Email Spam Detection Using Machine Learning
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET- Road Accident Prediction using Machine Learning Algorithm
Efficiently Detecting and Analyzing Spam Reviews Using Live Data Feed
IRJET - Customer Churn Analysis in Telecom Industry
Comparative Study of Enchancement of Automated Student Attendance System Usin...
Machine Learning Approaches and its Challenges
IRJET - House Price Prediction using Machine Learning and RPA
13_Data Preprocessing in Python.pptx (1).pdf
A Hierarchical Feature Set optimization for effective code change based Defec...
1) Introduction to Data Analyticszz.pptx
AIRLINE FARE PRICE PREDICTION
IRJET- Probability based Missing Value Imputation Method and its Analysis
A Firefly based improved clustering algorithm
IRJET- Machine Learning
IRJET- Identify the Human or Bots Twitter Data using Machine Learning Alg...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
Fast Range Aggregate Queries for Big Data Analysis
IRJET- Intelligence Extraction using Various Machine Learning Algorithms
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
PDF
Kiona – A Smart Society Automation Project
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
PDF
Breast Cancer Detection using Computer Vision
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
Kiona – A Smart Society Automation Project
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
BRAIN TUMOUR DETECTION AND CLASSIFICATION
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
Breast Cancer Detection using Computer Vision
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...

Recently uploaded (20)

PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PDF
Well-logging-methods_new................
PPTX
CH1 Production IntroductoryConcepts.pptx
PPTX
Internet of Things (IOT) - A guide to understanding
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PDF
PPT on Performance Review to get promotions
PDF
composite construction of structures.pdf
PPT
Mechanical Engineering MATERIALS Selection
PPTX
Geodesy 1.pptx...............................................
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
Structs to JSON How Go Powers REST APIs.pdf
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PPTX
Construction Project Organization Group 2.pptx
PPTX
additive manufacturing of ss316l using mig welding
PPTX
Lecture Notes Electrical Wiring System Components
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
Well-logging-methods_new................
CH1 Production IntroductoryConcepts.pptx
Internet of Things (IOT) - A guide to understanding
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPT on Performance Review to get promotions
composite construction of structures.pdf
Mechanical Engineering MATERIALS Selection
Geodesy 1.pptx...............................................
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Structs to JSON How Go Powers REST APIs.pdf
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
Operating System & Kernel Study Guide-1 - converted.pdf
Construction Project Organization Group 2.pptx
additive manufacturing of ss316l using mig welding
Lecture Notes Electrical Wiring System Components

IRJET - An User Friendly Interface for Data Preprocessing and Visualization using Machine Learning Models

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 03 | Mar 2020 www.irjet.net p-ISSN: 2395-0072 © 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 948 An User Friendly Interface for Data Preprocessing and Visualization using Machine Learning Models Mr. S. Yoganand1, Bharathi Kannan R2, Daya Meenakshi B2 1Assistant Professor, Department of Computer Science and Engineering, Agni College of Technology Chennai-130, Tamil Nadu, India. 2,3UG Student, Department of Computer Science and Engineering,Agni College of Technology Chennai-130, Tamil Nadu, India. ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract – Machine learning is one of the most efficient techniques for prediction and classification related problems. In this modern era, most of the industries all over the world depend upon the machine learning models which leadintothe data analytics century. There is no properandefficienttool for handling the datasets which use machine learning models for data prediction and Visualization. So, in this paper a novel idea is proposed for making the user-friendly approach to handle the machine learning models for data prediction and visualisation. A tool is developed, such that it performs data cleaning which will be a prerequisite for data analysis and then provides a visible representation of the cleansed data. The developed tool will take the input as structured dataset that contains both textual and numerical data which are then processed using machine learning algorithms to obtain a pre- processed dataset. This process may undergo series of steps to produce visualized and predicted data as per the chosen effective algorithm to obtain efficient result. Key Words: Machine learning, visualization, pre- processing, Tool, user interface. 1. 
INTRODUCTION An organisation uses the dataset for predictive analysis and an important concern in these cases is data quality. Using noisy data can hamper with the correctness of analysis. The common errors are missing values, duplicates and other errors. These errors need to be corrected for reliable decisions and analytics. The users must know that the effects of using the noisy data before proceeding with the cleaning process. Noise removal will improve the model performance, due to the fact that noises may disturb the discovery of important information. Machine learning is the appreciated application of Artificial Intelligence. It is used to learn automatically without any human assistancethatprovideshugedataset for analysing with a large number of data fields. With the data provided by the system after implementing the machine learning algorithms, organizations are able to work more effective and acquire profit over their competitors. The system that uses machine learning technique will be able to predict how the structure looks like and adjust the data according to their structure. The mainchallengesinmachine learning model is to deal with large data sources for data cleaning process. Data cleaning process is carried by taking in huge datasets which are checkedforthepossible errorsby using data pre-processing techniques. The other challenges include avoiding learning process from noisy data, avoiding building a prejudiced model, not giving reasons for compromising with the qualityofthedata.The bestpractices for data cleaning using machine learning techniquesthatare filling missing values, removing unnecessary rows,reducing the size of the data and implementing a good quality plan. The success of machine learning applications depends on the amount of good quality data that is given to it. But this process of cleaning may not be considered as a main area in data pre-processing. 
The system that uses powerful algorithms to process the noisy data can yield bad results if irrelevant or wrong training set of data is given. In the proposed model ML algorithms to find out the different patterns in the data and group it by itself into clean and noisy data which will help in reducing execution time. 2. Related Work: Data Pre-processing is used to convert the raw data into pre-processed data set. [1] In Machine Learning, the data pre-processing is used to transform or encode the data easily by their algorithm. It consists of interactive steps as follows. Data cleaning is used to detect and correct inaccurate records from a record or tables, and then replacing, modifying or deleting this noisy data. Data integration will combines the data residing indifferent sources that provides user with a unified view of these data [2]. The process of selecting suitable data for a research project will impact data integritywhereData transformation converts data from a source data format into resultant data [3]. The tools which are available to process the data in data processing and visualizing are Knime, Shogun, Oryx 2, Tensor flow, Weka, RapidMiner, Trifacta Wrangler, Python [12] [13]. In this paper, we will focus on removing the noisy data that identifies the numerical values, predicting and filling in missing values and detect outliers which hamper with data analysis [11]. We propose a system that simplifies the process for the user and allows for better processing. In summary, Machine learning for data cleaning might be the only way to provide complete and trustworthy data sets for
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 07 Issue: 03 | Mar 2020 www.irjet.net p-ISSN: 2395-0072 © 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 949 effective analytics, so we provide an user friendly interface for pre-processing and model analysis with visualizationfor the ease of user. 3. System Design: The Data Pre-processingisdone withthreemethods they are Data Cleaning, Data Transformation and Data Reduction. The data cleaning application is to process the raw dataset containing both textual and numerical data that convert it into a cleaned dataset which can be used for data analysis. Initially, users must upload the dataset in which they perform the analysis. They can choose the operations that they want to perform on their dataset from themodules provided. This application performs a series of operations which includes removing columns with less information or no information, removing unnecessary rows, identifyingthe numerical values, filling in the missing fields and identifying the outliers. Some columns may contain less information or no information that makes it hard to rely on such columns for analysis and so such columns can be removed and they don’t cause significant damage to the data. Some rows may contain empty fields which will again tamper with the proper pre-processing of the dataset. Hence such values are identified and removed. The dataset will contain categorical features ranging from numerical to non-numerical values. This application requires only numerical data which is used for analysis and prediction, such that the fields containing numeric values areidentified. If you try to remove them, you might reduce the amount of data that is available. So, these fields need to be filled in appropriate values. 4. Implementation: The outliers with data points are really far from the rest of your data points. 
Mathematically, an outlier is usually defined as an observation more than three standard deviations from the mean. Outliers can show up due to errors in data entry or measurement, or simply because of variation in the population. Identifying and handling outliers is an important part of data cleaning.

For data analysis, the following algorithms are used to analyse the cleansed data: Linear Regression, SVM (Support Vector Machine), KNN (K-Nearest Neighbours), Logistic Regression, Decision Tree, K-Means, Random Forest, Naive Bayes, Dimensionality Reduction Algorithms and Gradient Boosting Algorithms.

The Linear Regression algorithm uses the data points to find the best-fit line to model the data. A line can be represented by the equation y = m*x + c, where y is the dependent variable and x is the independent variable. Basic calculus is applied to find the values of m and c from the given dataset. The SVM separates the data points of different classes using a line.

The KNN algorithm predicts an unknown data point from its k nearest neighbours. The value of k is a critical factor in the accuracy of prediction. The nearest neighbours are determined using basic distance functions such as the Euclidean distance. This algorithm requires high computation power, and the data must first be normalized to bring every value within the same range.

The Decision Tree algorithm is used to solve classification problems. Techniques such as Gini, Chi-square and entropy are used to categorize the data. K-Means is an unsupervised algorithm that provides a solution to the clustering problem. The algorithm follows a procedure that forms clusters, each of which contains homogeneous data.

Random Forest is a collection of decision trees. Every tree estimates a classification, which is called a vote. We consider the vote from every tree and choose the classification with the most votes. Naive Bayes can be applied only if the features are independent of each other.
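To make the cleaning operations of Section 3 and the three-standard-deviation rule stated above more concrete, the following is a minimal Python sketch. It is not the implementation of the proposed tool. It assumes that the dataset is held as a list of dictionaries (one per row) with None marking a missing field; the function names, the 0.5 missing-value threshold and the choice of the column mean as the fill value are illustrative assumptions, not details taken from this paper.

```python
from statistics import mean, pstdev

# Assumption: a dataset is a list of dicts (one per row); None marks a missing field.
# All function names and default thresholds below are hypothetical.


def drop_sparse_columns(rows, max_missing_ratio=0.5):
    """Remove columns whose share of missing values exceeds max_missing_ratio."""
    if not rows:
        return []
    keep = [
        col for col in rows[0]
        if sum(row[col] is None for row in rows) / len(rows) <= max_missing_ratio
    ]
    return [{col: row[col] for col in keep} for row in rows]


def drop_empty_rows(rows):
    """Remove rows in which every field is missing."""
    return [row for row in rows if any(v is not None for v in row.values())]


def numeric_columns(rows):
    """Return the columns whose non-missing values are all numbers."""
    if not rows:
        return []
    result = []
    for col in rows[0]:
        present = [row[col] for row in rows if row[col] is not None]
        if present and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        ):
            result.append(col)
    return result


def fill_missing_with_mean(rows, columns):
    """Return a copy of rows with missing values in columns set to the column mean."""
    filled = [dict(row) for row in rows]
    for col in columns:
        present = [row[col] for row in rows if row[col] is not None]
        if not present:
            continue  # nothing to average, leave the column untouched
        col_mean = mean(present)
        for row in filled:
            if row[col] is None:
                row[col] = col_mean
    return filled


def flag_outliers(values, threshold=3.0):
    """Return indices of values more than threshold standard deviations from the mean."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]
```

The functions are meant to be chained in the order in which the operations are listed in Section 3, with flag_outliers applied to one numeric column at a time. Note that with the population standard deviation no value in a sample of n points can lie more than sqrt(n - 1) standard deviations from the mean, so the three-standard-deviation rule can only flag points in columns with at least eleven values.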
The Gradient Boosting algorithm combines multiple weak algorithms to form an accurate one. Instead of relying on a single estimator, it creates a more stable and robust algorithm. The algorithm is selected based on the dataset and provides an efficient result for the data analysis process.

5. Results:
The user can click on the Submit button provided and then select the operations they wish to perform on their dataset from the list of operations. The user can then upload the dataset into the application by clicking on the Upload button to start the pre-processing. Initially the original dataset is displayed, and the dataset after operation 1 is then displayed as the cleansed dataset. The selected operations are performed on the cleansed dataset; finally, the user performs the data analysis with the required algorithm to obtain the result as a visualization, which can be downloaded by the user.

Screenshots (not reproduced here): Upload Noisy Dataset; Displaying the Noisy Dataset; Preprocessing the Data; Applying Machine Learning Model; Output.

6. Conclusion:
The developed system performs Data Cleaning, Data Transformation and Data Reduction as part of data pre-processing. It takes raw datasets into the application, cleans up the noisy data using pre-processing techniques, and presents the cleansed data to the user as a visualization once all the pre-processing is done. The system saves a lot of time since manual cleaning can be avoided. After cleansing, the user can select the machine learning model, and the results are provided as plots. The system therefore serves users who want to clean huge datasets and visualize the analysis of the pre-processed data. In future, measuring the accuracy of the machine learning algorithms and comparing them can be done within the user-friendly interface.

REFERENCES
[1] Cristian Felix, Anshul Vikram Pandey, and Enrico Bertini, “TextTile: An Interactive Visualization Tool for Seamless Exploratory Analysis of Structured Data and Unstructured Text”, IEEE-2018.
[2] Huawen Liu, Xuelong Li, Jiuyong Li, and Shichao Zhang, “Efficient Outlier Detection for High-Dimensional Data”, IEEE-2019.
[3] M. Bostock, V. Ogievetsky, and J. Heer, “Data-driven documents”, IEEE-2011.
[4] F. Beck, S. Koch, and D. Weiskopf, “Visual Analysis and Dissemination of Scientific Literature Collections with SurVis”, IEEE-2016.
[5] Parke Godfrey, Jarek Gryz and Pioter Lasek, “Interactive visualisation of large datasets”, IEEE-2016.
[6] Dileep Kumar Koshley and Raju Hadler, “Data Cleaning: An Abstraction-based approach”, IEEE-2015.
[7] Mehmet Adil Yalçın, Niklas Elmqvist and Benjamin B. Bederson, “Keshif: Rapid and Expressive Tabular Data Exploration for Novices”, IEEE-2018.
[8] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos, “LOCI: Fast outlier detection using the local correlation integral,” IEEE 19th Int. Conf. Data Eng. (ICDE), Bengaluru, India, 2003, pp. 315–326.
[9] Y. Pang, J. Cao, and X. Li, “Learning sampling distributions for efficient object detection”, IEEE Trans. Cybern., vol. 47, no. 1, pp. 117–129, Jan. 2017.
[10] M. Gupta, J. Gao, C. C. Aggarwal, and J. Han, “Outlier detection for temporal data: A survey”, IEEE Trans. Knowl. Data Eng., vol. 26, no. 9, pp. 2250–2267, Sep. 2014.
[11] S. F. Roth and J. Mattis, “Automating the presentation of information,” in Artificial Intelligence Applications, 1991. Proceedings., Seventh IEEE Conference on, vol. 1. IEEE, 1991, pp. 90–97.
[12] M. Bostock and J. Heer, “Protovis: A graphical toolkit for visualization,” Visualization and Computer Graphics, IEEE Transactions on, vol. 15, no. 6, pp. 1121–1128, 2009.
[13] A. Dziedzic, J. Duggan, A. J. Elmore, V. Gadepally, and M. Stonebraker, “Bigdawg: a polystore for diverse interactive applications,” in IEEE Viz Data Systems for Interactive Analysis, 2015.
[14] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for data cleaning,” in Data Engineering, IEEE 23rd International Conference on, pages 746–755. IEEE, 2007.