International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 04 | Apr 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 704
Comparative Analysis of Various Tools for Data Mining and
Big Data Mining
Mrs. G. SangeethaLakshmi1, Ms. M. Jayashree2
1 Asst. Prof., Department of Computer Science and Application, DKM College for Women (Autonomous), Vellore.
2 Research Scholar, Department of Computer Science, DKM College for Women (Autonomous), Vellore, Tamil Nadu.
---------------------------------------------------------------------***----------------------------------------------------------------------
Abstract - Data mining and knowledge discovery have emerged as
ways to extract useful, interesting, and previously unknown
patterns and knowledge from large databases. Big data is the
term used to describe massive amounts of both structured and
unstructured data. Data mining techniques can be classified as
classification, association, clustering, anomaly detection,
regression analysis, prediction, and pattern tracking. Data
mining tools help to apply these techniques. This research
analyzes various data mining and big data mining tools from
different perspectives. It will help researchers to select an
appropriate data mining tool or tools for their research.
Keywords: Big data; association; clustering; anomaly detection.
1. INTRODUCTION:
Data mining involves six common classes of tasks:
Anomaly detection (outlier/change/deviation detection): The
identification of unusual data records that might be
interesting, or of data errors that require further
investigation.
Association rule learning (dependency modelling): Searches for
relationships between variables. For example, a supermarket
might gather data on customer purchasing habits. Using
association rule learning, the supermarket can determine which
products are frequently bought together and use this
information for marketing purposes. This is sometimes referred
to as market basket analysis.
Clustering: The task of discovering groups and structures in
the data that are in some way "similar", without using known
structures in the data.
Classification: The task of generalizing known structure to
apply it to new data. For example, an e-mail program might
attempt to classify an e-mail as "legitimate" or as "spam".
Regression: Attempts to find a function that models the data
with the least error, that is, to estimate the relationships
among data or datasets.
Summarization: Provides a more compact representation of the
data set, including visualization and report generation.
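The market basket idea described under association rule learning can be made concrete with a small sketch. The transactions and item names below are hypothetical and chosen only for illustration; the code computes the support of each item pair (the fraction of baskets containing both items) and the confidence of a rule such as "butter implies bread". It is a minimal illustration of the two measures, not a full rule-mining algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions; the item names are illustrative only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered item pair."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes every pair key appear in a consistent order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(transactions)
butter_implies_bread = confidence(transactions, "butter", "bread")
bread_implies_butter = confidence(transactions, "bread", "butter")
```

In this toy data, bread and milk appear together in half of the baskets, and every basket with butter also contains bread, so the rule "butter implies bread" has higher confidence than the reverse rule.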
The rapid and inevitable development of technology is causing
a substantial global increase in the volume of data. Such data
mean better information, and information is wealth, because
information makes it possible for mankind to have a safer and
better future, which is the primary goal of scientists and
researchers. Thanks to the large amount of information that can
be obtained from big data, humanity is able to make
considerable progress in diverse fields ranging from health and
safety to education and the economy.
The analysis and modeling of big data are not new subjects for
actuaries, bankers, and insurers. Data mining (DM) helps them
overcome many difficulties in their aim to manage money more
effectively, control the system, reduce or transfer potential
risks, understand client requirements, improve funds
management, and increase market share. Specifically, DM can be
used in the banking and insurance industries to determine
default risks and risk groups, specify the correct insurance
options for individual customers, increase customer
satisfaction, and identify credit card fraud.
Big data can be described by the following characteristics:
Volume: The quantity of generated and stored data. The size of
the data determines its value and potential insight, and
whether it can be considered big data or not.
Variety: The type and nature of the data. Knowing this helps
the people who analyze the data to use the resulting insight
effectively. Big data draws from text, images, audio, and
video, and it completes missing pieces through data fusion.
Velocity: The speed at which the data is generated and
processed to meet the demands and challenges that lie in the
path of growth and development. Big data is often available in
real time. Compared to small data, big data is produced more
continually. Two kinds of velocity related to big data are the
frequency of generation and the frequency of handling,
recording, and publishing.
Veracity: An extension of the definition of big data that
refers to data quality and data value. The quality of captured
data can vary greatly, which affects the accuracy of the
analysis.
Data must be processed with advanced tools (analytics and
algorithms) to reveal meaningful information. For example, to
manage a factory one must consider both visible and invisible
issues with various components. Information generation
algorithms must detect and address invisible issues such as
machine degradation and component wear on the factory floor.
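As a rough illustration of how an algorithm might flag an "invisible" issue in machine data, the sketch below marks sensor readings that lie far from the mean, measured in standard deviations (a z-score rule). The vibration values, the function name `find_anomalies`, and the threshold of 2.0 are assumptions made for this example; a real monitoring system would need domain-specific features and thresholds.

```python
from statistics import mean, stdev


def find_anomalies(readings, threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds threshold."""
    mu = mean(readings)
    sigma = stdev(readings)
    if sigma == 0:
        # All readings are identical, so nothing stands out.
        return []
    return [
        (i, x) for i, x in enumerate(readings)
        if abs(x - mu) / sigma > threshold
    ]


# Hypothetical vibration readings from one machine; the spike is synthetic.
vibration = [0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 0.95, 0.51, 0.49, 0.50]
anomalies = find_anomalies(vibration)
```

With these synthetic readings, only the seventh value (0.95) is flagged. A z-score rule is a deliberately simple choice: it assumes the normal readings cluster around one level and would miss slow drifts, which call for trend-based methods.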
2. LITERATURE REVIEW:
One line of prior work presents a comparative analysis of data
mining tools and observes their behavior based on selected
parameters, which helps to find the most appropriate tool for
a given data set and set of parameters. M. Hall et al.
described the importance of the WEKA tool, which is open source
and implemented in Java. WEKA implements most of the common
data mining techniques. Another study focused on comparing
data mining tools across traditional data mining tools,
dashboards, text mining, and standalone applications. A
further study compared four open source data mining tools:
KNIME, Orange, Rapid Miner, and Weka.
The objective of that research was to reveal the most accurate
tool and technique for the classification task, so that
analysts may use the results to reach a good result quickly.
In another study, various frequently used open source data
mining tools, and tools with open source algorithm
implementations, were selected and compared with respect to
user groups, data structures, algorithms included,
visualization capabilities, platforms, programming languages,
and import and export options.
In addition, publicly available datasets were evaluated using
the selected tools. Wang et al. (2008), in their comparison of
leading data mining software packages, compared them on
several software criteria, such as portability, reliability,
efficiency, human engineering, understandability,
modifiability, price, training, and support.
3. METHODOLOGY
A. TRADITIONAL OPEN SOURCE DATA MINING
TOOLS
Orange: Orange is an open source tool for data analysis and
visualization. Data mining is done through Python scripting or
visual programming, with components for machine learning,
feature selection, and text mining. Python is growing in
popularity because it is simple and easy to learn, yet
powerful. Hence, for a Python developer looking for a data
mining tool, Orange is a natural choice: a Python-based,
powerful, open source tool for both novices and experts.
R: R is a free, open source programming language and software
environment for statistical computing and graphics. The R
language is widely used among data miners for developing
statistical software and for data analysis. Ease of use and
extensibility have raised R’s popularity substantially in
recent years. Besides data mining, it provides statistical and
graphical techniques, including linear and nonlinear modeling,
classical statistical tests, time-series analysis,
classification, clustering, and others.
In addition to data mining, RapidMiner also provides
functionality such as data preprocessing and visualization,
predictive analytics and statistical modeling, evaluation, and
deployment. What makes it even more powerful is that it
provides learning schemes, models, and algorithms from WEKA
and R scripts.
Weka: Weka, an open source data mining package, is a collection
of machine learning algorithms for data mining tasks such as
data preprocessing, classification, regression, clustering,
association rules, and visualization. The algorithms can either
be applied directly to a data set or called from your own Java
code.
Shogun: Shogun is a free, open source software toolbox written
in C++. It offers many algorithms and data structures for
machine learning problems. Shogun focuses on Support Vector
Machines (SVM), regression, and classification problems.
Rapid Miner: Rapid Miner operates through visual programming
and is capable of manipulating, analyzing, and modelling data.
KNIME: Data preprocessing has three main components:
extraction, transformation, and loading. KNIME does all three.
It provides a graphical user interface for assembling nodes
for data processing. It is an open source data analytics,
reporting, and integration platform. KNIME also integrates
various components for machine learning and data mining
through its modular data pipelining concept, and it has caught
the eye of business intelligence and financial data analysis.
Written in Java and based on Eclipse, KNIME is easy to extend
with plugins, so additional functionality can be added on the
go. Plenty of data integration modules are already included in
the core version.
The algorithms with similar accuracy rates were compared again
using different statistical criteria, such as ROC (receiver
operating characteristic), precision, recall, F-measure, and
the root mean squared error (RMSE), to achieve the best
results. As a result, the logistic regression algorithm was
found to be the most appropriate algorithm for this dataset.
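To make the comparison criteria above concrete, the following sketch computes precision, recall, F-measure, and RMSE for a binary classifier in plain Python. The labels, the predicted probabilities, and the 0.5 decision threshold are hypothetical values chosen for illustration and are not taken from the study; the ROC curve is left out to keep the example short.

```python
from math import sqrt


def classification_metrics(y_true, y_pred):
    """Compute precision, recall, and F-measure for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall else 0.0
    )
    return precision, recall, f_measure


def rmse(y_true, y_prob):
    """Root mean squared error between true labels and predicted probabilities."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true))


# Hypothetical labels and model outputs, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

precision, recall, f_measure = classification_metrics(y_true, y_pred)
error = rmse(y_true, y_prob)
```

Two algorithms with the same accuracy can differ on these measures, for example when one produces more false positives than the other, which is why such criteria are useful as tie-breakers.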
The aim of this study is to use DM classification algorithms to
investigate the effects of certain demographic and
socioeconomic characteristics on the probability of
individuals’ default risk, as well as to predict their future
payment challenges by determining individual attributes using
a logistic regression classification algorithm.
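A minimal sketch of logistic regression for a default-risk setting is shown below. It assumes a single hypothetical feature (income scaled to the range 0 to 1) and synthetic labels, and it fits the model with plain batch gradient descent; the learning rate, number of epochs, and data are all assumptions for illustration and do not reproduce the study's data or model.

```python
from math import exp


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(default) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log-loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical, synthetic data: income scaled to [0, 1]; 1 = defaulted.
income = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
default = [1, 1, 1, 0, 1, 0, 0, 0]

w, b = fit_logistic(income, default)
p_low_income = sigmoid(w * 0.15 + b)
p_high_income = sigmoid(w * 0.85 + b)
```

On this synthetic data the fitted weight is negative, so the estimated default probability falls as income rises. The sign and size of each fitted weight is what lets logistic regression describe the effect of an attribute on default risk, in addition to producing predictions.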
4. TRADITIONAL COMMERCIAL DATA MINING
TOOLS
Sisense: Sisense is a business intelligence platform that lets
users join, analyze, and visualize the information they need
to make better and more intelligent business decisions and to
craft workable plans and strategies.
Neural Designer: This is a desktop application for data mining
that uses neural networks and machine learning.
SharePoint: SharePoint is a Microsoft-hosted cloud service that
empowers companies to store, access, share, and manage
documented information from all devices.
Cognos: IBM Cognos is a set of smart self-service capabilities
that enable users to quickly and confidently identify insights
and make decisions based on them. The engaging experience
provided by Cognos Analytics encourages business users to
create and configure dashboards and reports on their own, while
providing IT with a proven and scalable platform that can be
deployed either on premises or in the cloud.
Board: Board is a Management Intelligence Toolkit that combines
features of business intelligence (BI) and corporate
performance management (CPM) into a comprehensive and compact
software package. Board enables users to collect data from
almost any source, as well as to create full self-service
reporting. These reports can be delivered in different formats
if needed, such as CSV, HTML, and more.
5. BIG DATA MINING TOOLS
Sisense: Sisense is a business intelligence platform that lets
users join, analyze, and visualize the information they need
to make better and more intelligent business decisions and to
craft workable plans and strategies.
KEEL: KEEL stands for "Knowledge Extraction based on
Evolutionary Learning," and it aims to help users assess
evolutionary algorithms for data mining problems such as
regression, classification, clustering, and pattern mining. It
includes a large collection of existing algorithms that can be
compared with new algorithms. Operating System: OS
Independent.
MAHOUT: This Apache project offers algorithms for clustering,
classification, and batch-based collaborative filtering that
run on top of Hadoop. The project's goal is to build scalable
machine learning libraries. Operating System: OS Independent.
6. CONCLUSION:
This study compared traditional free data mining tools,
traditional commercial data mining tools, and big data mining
tools. The traditional free data mining tools were described
from perspectives such as business size, category, platform,
data visualization, and the language used to develop the tool.
The traditional commercial data mining tools were described
from perspectives such as the official URL of the tool,
business size, features, category, data visualization, and
whether a free trial is available. The big data mining tools
were described from perspectives such as the official URL of
the tool, business size, deployment, the big data features
offered, and whether a free version is available. This study
will help upcoming researchers in data mining and machine
learning to choose the right data mining tools.
REFERENCES:
[1] Dr. Anil Sharma, Balrajpreet Kaur, "A Research Review on
Comparative Analysis of Data Mining Tools, Techniques and
Parameters," International Journal of Advanced Research in
Computer Science, Volume 8, No. 7, July-August 2017.
[2] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann,
I. H. Witten, "The WEKA Data Mining Software: An Update,"
SIGKDD Explorations, 2009.
[3] Mrs. Parminder Kaur, Dr. Qamar Parvez Rana, "Comparison of
Various Tools for Data Mining,"
International Journal of Engineering Research & Technology
(IJERT), Volume 3, Issue 10, 2014.
[4] Luís C. Borges, Viriato M. Marques, Jorge Bernardino,
"Comparison of data mining techniques and tools for data
classification," C3S2E '13: Proceedings of the International
C* Conference on Computer Science and Software Engineering,
pp. 113-116.
[5] Dakić Dušanka et al., "A Comparison of Contemporary Data
Mining Tools," XVII International Scientific Conference on
Industrial Systems (IS'17), Novi Sad, Serbia, October 4-6.
[6] J. Wang, X. Hu, K. Hollister, D. Zhu, "A comparison and
scenario analysis of leading data mining software," Int J
Knowl Manage, 2008, 4:17-34.
[7] M. Antonie, A. Coman, and O. R. Zaiane, "Application of
Data Mining Techniques for Medical Image Classification," in
Proceedings of the Second International Workshop on Multimedia
Data Mining (MDM/KDD'2001), 2001, pp. 94-101.
[8] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan, "Web
usage mining: Discovery and applications of usage patterns
from web data," Acm Sigkdd …, vol. 1, no. 2, pp. 12-23, 2000.
[9] J. Han and M. Kamber, Data Mining, Southeast Asia Edition:
Concepts and Techniques. Morgan Kaufmann, 2006.
[10] J. Hipp, U. Güntzer, and G. Nakhaeizadeh, "Algorithms for
association rule mining - a general survey and comparison,"
ACM SIGKDD Explor. Newsl., vol. 2, no. 1, pp. 58-64, 2000.
[11] C. Zhang and S. Zhang, Association rule mining: models
and algorithms, vol. 2307. Springer-Verlag, 2002.
[12] R. Agrawal, T. Imieliński, and A. Swami, "Mining
association rules between sets of items in large databases,"
ACM SIGMOD Rec., 1993.

More Related Content

DOCX
Big Data Analytics
PDF
A COMPREHENSIVE STUDY ON POTENTIAL RESEARCH OPPORTUNITIES OF BIG DATA ANALYTI...
PDF
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
PDF
A Survey of Agent Based Pre-Processing and Knowledge Retrieval
PDF
Lect 1 introduction
PPTX
Data analytics
PDF
11.0005www.iiste.org call for paper. data mining tools and techniques- a revi...
PDF
5. data mining tools and techniques a review--31-39
Big Data Analytics
A COMPREHENSIVE STUDY ON POTENTIAL RESEARCH OPPORTUNITIES OF BIG DATA ANALYTI...
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
A Survey of Agent Based Pre-Processing and Knowledge Retrieval
Lect 1 introduction
Data analytics
11.0005www.iiste.org call for paper. data mining tools and techniques- a revi...
5. data mining tools and techniques a review--31-39

What's hot (19)

PDF
Integrating Structure and Analytics with Unstructured Data
PDF
Real World Application of Big Data In Data Mining Tools
PDF
A Survey on Data Mining
PDF
Comparison Between WEKA and Salford System in Data Mining Software
PDF
DOCUMENT SELECTION USING MAPREDUCE
PDF
Big Data Analytics: Recent Achievements and New Challenges
PPT
Analysis of ‘Unstructured’ Data
PDF
11.challenging issues of spatio temporal data mining
PDF
IRJET- Information Reterival of Text-based Deep Stock Prediction
PDF
Data Mining vs Statistics
PPTX
Data Mining
PDF
BigData Analytics_1.7
PDF
Big data – A Review
DOC
An introduction to data mining
PPTX
PPTX
Real-time Big Data Analytics: From Deployment to Production
PDF
The 17 V’s of Big Data
PDF
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
Integrating Structure and Analytics with Unstructured Data
Real World Application of Big Data In Data Mining Tools
A Survey on Data Mining
Comparison Between WEKA and Salford System in Data Mining Software
DOCUMENT SELECTION USING MAPREDUCE
Big Data Analytics: Recent Achievements and New Challenges
Analysis of ‘Unstructured’ Data
11.challenging issues of spatio temporal data mining
IRJET- Information Reterival of Text-based Deep Stock Prediction
Data Mining vs Statistics
Data Mining
BigData Analytics_1.7
Big data – A Review
An introduction to data mining
Real-time Big Data Analytics: From Deployment to Production
The 17 V’s of Big Data
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ...
Ad

Similar to IRJET- Comparative Analysis of Various Tools for Data Mining and Big Data Mining (20)

PDF
IRJET- Big Data: A Study
PDF
Fundamentals of data mining and its applications
PDF
IRJET- A Study on Data Mining in Software
PDF
Data Science: A Revolution of Data
PDF
KIT-601 Lecture Notes-UNIT-1.pdf
DOCX
Tools for Unstructured Data Analytics
PDF
Data Analytics: Tools, Techniques &Trend
PDF
Introduction to Data Science: data science process
PDF
IRJET- Big Data Management and Growth Enhancement
PPTX
1 UNIT-DSP.pptx
PDF
Introduction to Data Analytics and data analytics life cycle
PDF
The Data Scientist’s Toolkit: Key Techniques for Extracting Value
DOCX
Big data (word file)
DOCX
Nikita rajbhoj(a 50)
PDF
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
PDF
Big Data: Review, Classification and Analysis Survey
PDF
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
PDF
Data Mining – A Perspective Approach
PDF
Survey on MapReduce in Big Data Clustering using Machine Learning Algorithms
PDF
Computing for Data Analysis: Theory and Practices 1st Edition Sanjay Chakraborty
IRJET- Big Data: A Study
Fundamentals of data mining and its applications
IRJET- A Study on Data Mining in Software
Data Science: A Revolution of Data
KIT-601 Lecture Notes-UNIT-1.pdf
Tools for Unstructured Data Analytics
Data Analytics: Tools, Techniques &Trend
Introduction to Data Science: data science process
IRJET- Big Data Management and Growth Enhancement
1 UNIT-DSP.pptx
Introduction to Data Analytics and data analytics life cycle
The Data Scientist’s Toolkit: Key Techniques for Extracting Value
Big data (word file)
Nikita rajbhoj(a 50)
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
Big Data: Review, Classification and Analysis Survey
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
Data Mining – A Perspective Approach
Survey on MapReduce in Big Data Clustering using Machine Learning Algorithms
Computing for Data Analysis: Theory and Practices 1st Edition Sanjay Chakraborty
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
PDF
Kiona – A Smart Society Automation Project
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
PDF
Breast Cancer Detection using Computer Vision
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
Kiona – A Smart Society Automation Project
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
BRAIN TUMOUR DETECTION AND CLASSIFICATION
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
Breast Cancer Detection using Computer Vision
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...

Recently uploaded (20)

PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PPTX
additive manufacturing of ss316l using mig welding
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PPTX
OOP with Java - Java Introduction (Basics)
PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PDF
Digital Logic Computer Design lecture notes
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PDF
Well-logging-methods_new................
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PDF
composite construction of structures.pdf
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
DOCX
573137875-Attendance-Management-System-original
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Operating System & Kernel Study Guide-1 - converted.pdf
additive manufacturing of ss316l using mig welding
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
OOP with Java - Java Introduction (Basics)
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
Digital Logic Computer Design lecture notes
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Well-logging-methods_new................
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
CYBER-CRIMES AND SECURITY A guide to understanding
composite construction of structures.pdf
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
bas. eng. economics group 4 presentation 1.pptx
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
573137875-Attendance-Management-System-original
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf

IRJET- Comparative Analysis of Various Tools for Data Mining and Big Data Mining

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 04 | Apr 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 704 Comparative Analysis of Various Tools for Data Mining and Big Data Mining Mrs. G. SangeethaLakshmi1, Ms. M. Jayashree2 1Asst Prof,Department of Computer science and Application, DKM College for Women (Autonomous), Vellore. 2Research scholar, Department of Computer Science, DKM College for Women (Autonomous), Vellore, TamilNadu. ---------------------------------------------------------------------***---------------------------------------------------------------------- Abstract - Data mining and knowledge discovery has emerged to extract useful, interesting, and unknown patterns and knowledge from huge amount of database. Big data is the term used to delineate massive amountsofinformationofbothstructuredand unstructured data types. Data mining techniques can be classified as classification, association, clustering, anomaly detection,regressionanalysis,prediction,and tracking patterns. Data mining tools which are helpful to achieve abovedataminingtechniques.Thisresearch analysis various datamining and big data mining tools with different perspectives.This research will help for researchers to select appropriate datamining tool or tools for their research. Keywords—Big data; association; clustering; anomalyDetection. 1. INTRODUCTION: Data mining involves six common classes: Anomaly detection (outlier/change/deviation detection) – The identification of unusualdatarecords, that might be interesting or data errors that require further investigation. Association rule learning (dependency modelling) – Searches for relationships between variables. For example,asupermarketmightgatherdataoncustomer purchasing habits. 
Using association rule learning, the supermarket can determine which products are frequentlyboughttogetherandusethisinformationfor marketing purposes. This is sometimes referred to as market basket analysis. Clustering – is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data. Classification – is the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam". Regression – attempts to find a function which models the data with the least error that is, for estimating the relationships among data or datasets. Summarization – providing a more compact representation of the data set, including visualization and report generation. The rapid and inevitable development of technology is causing a substantial global increase in the volume of data. Such data mean better information, and information is wealth. This is because information makes it possible for mankind to have a safer and better future, which is the primary goal of scientists and researchers. Due to this incredible amount of information that can be obtained from Big Data, humanity is able to make considerable progress in diverse fields ranging from health and safety to education and economy. The analysis and modeling of big data are not new subjects for actuaries, bankers, andinsurers;DMhelps them overcome many difficulties in their aim to manage money more effectively, control the system, reduce or transfer potential risks, understand client requirements, improve funds management, increase market share, and reduce or transfer potential risks . Specifically, DM can be used in the banking and insuranceindustriestodeterminedefaultrisksandrisk groups, specify the correct insurance options for individual customers, increase customer satisfaction, and identify credit card fraud.
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 04 | Apr 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 705 Big data can be described by the following characteristics: Volume-The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not. Variety-The type and nature of the data. This helps people who analyze it to effectively use the resulting insight. Big data draws from text, images, audio, video; plus it completes missing pieces through data fusion. Velocity-In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real-time. Compared to small data, big data are produced more continually. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling, recording, and publishing. Veracity-It is the extended definition for big data, which refers to the data quality and the data value.The data quality of captured data can varygreatly,affecting the accurate analysis. Data must be processed with advancedtools(analytics and algorithms) to reveal meaningful information. For example, to manage a factory one must consider both visible and invisible issues with various components. Information generation algorithms must detect and address invisible issues such as machine degradation, component wear, etc. on the factory floor. 2. LITERATURE REVIEW: A comparative analysis of data mining tools and to observe their behavior based on some selected parameters which will further be helpful to find the most appropriate tool for the given data set and the parameters. M.Hall et al, expressed the importance of WEKA tool which is an open source implemented in Java language. 
WEKA is used for implementing most data mining techniques. Other research focused on comparing various data mining tools, grouped into traditional data mining tools, dashboards, text mining tools, and standalone applications. Another study compared four open source data mining tools: KNIME, Orange, RapidMiner and Weka. Its objective was to reveal the most accurate tool and technique for the classification task, so that analysts can use the results to reach a good outcome quickly. In a further study, various frequently used open-source data mining tools and tools with open source algorithm implementations were selected and compared with respect to user groups, data structures, algorithms included, visualization capabilities, platforms, programming languages, and import and export options. In addition, publicly available datasets were evaluated using the selected tools. Wang et al. (2008), in their comparison of leading data mining software packages, compared them on several criteria, such as portability, reliability, efficiency, human engineering, understandability, modifiability, price, training and support.
3. METHODOLOGY
A. TRADITIONAL OPEN SOURCE DATA MINING TOOLS
Orange: Orange is an open source tool for data analysis and visualization. Data mining is done through Python scripting or visual programming, with components for machine learning, feature selection, and text mining. Python is growing in popularity because it is simple and easy to learn yet powerful. For Python developers, Orange is therefore a natural choice: a Python-based, powerful, open source tool for both novices and experts.
R: R is a free, open source programming language and software environment for statistical computing and graphics. The R language is widely used among data miners for developing statistical software and for data analysis. Ease of use and extensibility have raised R’s popularity substantially in recent years. Besides data mining, it provides statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others.
Weka: Weka is an open source data mining package: a collection of machine learning algorithms for data mining tasks such as data pre-processing, classification, regression, clustering, association rules, and visualization. The algorithms can either be applied directly to a data set or called from your own Java code.
Shogun: Shogun is a free, open source software toolbox written in C++. It offers many algorithms and data structures for machine learning problems. Shogun focuses on Support Vector Machines (SVM), regression, and classification problems.
RapidMiner: RapidMiner operates through visual programming and is capable of manipulating, analyzing and modelling data. In addition to data mining, RapidMiner also provides functionality like data preprocessing and visualization, predictive analytics and statistical modeling, evaluation, and deployment. What makes it even more powerful is that it provides learning schemes, models and algorithms from WEKA and R scripts.
KNIME: Data preprocessing has three main components: extraction, transformation and loading. KNIME does all three. It provides a graphical user interface for assembling nodes for data processing. It is an open source data analytics, reporting and integration platform. KNIME also integrates various components for machine learning and data mining through its modular data pipelining concept, and it has caught the eye of the business intelligence and financial data analysis communities. Written in Java and based on Eclipse, KNIME is easy to extend with plugins, so additional functionality can be added on the go. Plenty of data integration modules are already included in the core version.
The algorithms that have similar accuracy rates were compared again using different statistical criteria such as ROC (receiver operating characteristic), precision, recall, F-measure, and the root mean squared error (RMSE) to achieve the best results. As a result, the most appropriate algorithm for this dataset was found to be logistic regression. The aim of this study is to use DM classification algorithms to investigate the effects of certain demographic and socioeconomic characteristics on the probability of individuals’ default risk, as well as to predict their future payment challenges by determining individual attributes using a logistic regression classification algorithm.
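The evaluation criteria named above can be made concrete with a small Python sketch. It assumes binary labels (1 = default, 0 = no default) and predicted probabilities such as a logistic regression model would produce; the data, the 0.5 decision threshold, and the function names classification_metrics and rmse are hypothetical and serve only as an illustration. The ROC curve is omitted for brevity.

```python
import math


def classification_metrics(y_true, y_pred):
    """Return (precision, recall, f_measure) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


def rmse(y_true, y_score):
    """Root mean squared error between true labels and predicted scores."""
    squared = [(t - s) ** 2 for t, s in zip(y_true, y_score)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical true labels and predicted default probabilities.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

precision, recall, f_measure = classification_metrics(y_true, y_pred)
error = rmse(y_true, y_prob)
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"F-measure={f_measure:.3f} RMSE={error:.3f}")
```

Comparing several classifiers on these numbers side by side is one way to separate algorithms whose plain accuracy is nearly identical.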
4. TRADITIONAL COMMERCIAL DATA MINING TOOLS
Sisense: Sisense is a business intelligence platform that lets users join, analyze, and visualize the information they require to make better and more intelligent business decisions and to craft workable plans and strategies.
Neural Designer: A desktop application for data mining that uses neural networks and machine learning.
SharePoint: SharePoint is a Microsoft-hosted cloud service that empowers companies to store, access, share, and manage documented information from all devices.
Cognos: IBM Cognos is a set of smart self-service capabilities that enable users to make decisions on insight quickly and confidently. The engaging experience provided by Cognos Analytics encourages business users to create and configure dashboards and reports on their own, while providing IT with a proven and scalable platform that can be deployed either on premises or in the cloud.
Board: Board is a Management Intelligence Toolkit that combines features of business intelligence (BI) and corporate performance management (CPM) into a comprehensive and compact software package. Board enables users to collect data from almost any source and to create full self-service reporting. These reports can be delivered in different formats if needed, such as CSV, HTML and more.
5. BIG DATA MINING TOOLS
Sisense: Sisense is a business intelligence platform that lets users join, analyze, and visualize the information they require to make better and more intelligent business decisions and to craft workable plans and strategies.
KEEL: KEEL stands for "Knowledge Extraction based on Evolutionary Learning," and it aims to help users assess evolutionary algorithms for data mining problems like regression, classification, clustering and pattern mining. It includes a large collection of existing algorithms against which new algorithms can be compared. Operating System: OS Independent.
Mahout: This Apache project offers algorithms for clustering, classification and batch-based collaborative filtering that run on top of Hadoop. The project's goal is to build scalable machine learning libraries. Operating System: OS Independent.
6. CONCLUSION:
This study compared traditional free data mining tools, traditional commercial data mining tools, and big data mining tools. Traditional free data mining tools were described from perspectives such as business size, category, platform, data visualization, and the language used to develop each tool. Traditional commercial data mining tools were described from perspectives such as the official URL of the tool, business size, features, category, data visualization, and whether a free trial is available. Big data mining tools were described from perspectives such as the official URL of the tool, business size, deployment, the big data features offered, and whether a free version is available. This study will help upcoming researchers in data mining and machine learning choose the right data mining tools.
REFERENCES:
[1] Dr. Anil Sharma, Balrajpreet Kaur, A Research Review on Comparative Analysis of Data Mining Tools, Techniques and Parameters, International Journal of Advanced Research in Computer Science, Volume 8, No. 7, July–August 2017.
[2] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, "The WEKA Data Mining Software: An Update," SIGKDD Explorations, 2009.
[3] Mrs.
Parminder Kaur, Dr. Qamar Parvez Rana, Comparison of Various Tools for Data Mining,
International Journal of Engineering Research & Technology (IJERT), Volume 3, Issue 10, 2014.
[4] Luís C. Borges, Viriato M. Marques, Jorge Bernardino, Comparison of data mining techniques and tools for data classification, C3S2E '13 Proceedings of the International C* Conference on Computer Science and Software Engineering, pp. 113–116.
[5] Dakić Dušanka et al., A Comparison of Contemporary Data Mining Tools, XVII International Scientific Conference on Industrial Systems (IS'17), Novi Sad, Serbia, October 4–6.
[6] Wang J, Hu X, Hollister K, Zhu D. (2008) "A comparison and scenario analysis of leading data mining software". Int J Knowl Manage 2008, 4:17–34.
[7] M. Antonie, A. Coman, and O. R. Zaiane, “Application of Data Mining Techniques for Medical Image Classification,” in Proceedings of the Second International Workshop on Multimedia Data Mining (MDM/KDD’2001), 2001, pp. 94–101.
[8] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan, “Web usage mining: Discovery and applications of usage patterns from web data,” ACM SIGKDD …, vol. 1, no. 2, pp. 12–23, 2000.
[9] J. Han and M. Kamber, Data Mining, Southeast Asia Edition: Concepts and Techniques. Morgan Kaufmann, 2006.
[10] J. Hipp, U. Güntzer, and G. Nakhaeizadeh, “Algorithms for association rule mining - a general survey and comparison,” ACM SIGKDD Explor. Newsl., vol. 2, no. 1, pp. 58–64, 2000.
[11] C. Zhang and S. Zhang, Association rule mining: models and algorithms, vol. 2307. Springer-Verlag, 2002.
[12] R. Agrawal, T. Imieliński, and A. Swami, “Mining association rules between sets of items in large databases,” ACM SIGMOD Rec., 1993.