USING WEKA FOR CLUSTERING AND
     REGRESSION ANALYSIS
                 ( ITB PAPER )




          ANURADHA CHAKRABORTY
              ROLL NO: 10BM60014




  VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR
WEKA (Waikato Environment for Knowledge Analysis) is a popular suite of machine
learning software written in Java, developed at the University of Waikato, New Zealand. WEKA
is free software available under the GNU General Public License. Compared with MS Excel,
WEKA makes it easy to run multivariate regression, and its output shows the fitted equation
for the dependent variable along with other statistical data.

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can
either be applied directly to a dataset or called from your own Java code. Weka contains tools for
data pre-processing, classification, regression, clustering, association rules, and visualization. It
is also well-suited for developing new machine learning schemes.

The initial versions of WEKA used only Attribute-Relation File Format (ARFF) files, saved
as *.arff. Newer versions also support other formats, such as XRFF, binary serialized
files, LIBSVM, SVM Light, CSV, and C4.5, among others.

USING WEKA:

The WEKA GUI Chooser offers the following four options:
   1. Weka Explorer
   2. Weka Experimenter
   3. Weka Knowledge Flow
   4. Simple CLI




Weka Explorer has the following tabs:
  1. Preprocess
  2. Classify
  3. Cluster
  4. Associate
  5. Select Attributes
  6. Visualize
Apart from performing these statistical operations, the data can be visualized graphically and
filtered as required.




Weka Experimenter:
There are several algorithms for each task, so the key challenge in using the software lies in
identifying the best algorithm. For regression and classification, the Experimenter compares
algorithms by statistical analysis to identify the best one. Unfortunately, no such option exists
for clustering algorithms.
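
To make the idea of a statistical comparison concrete, the following is a minimal sketch of a plain paired t-statistic computed on per-run scores of two algorithms. The function name and the scores are hypothetical, and this is the textbook uncorrected statistic; WEKA's own tester may apply a correction for resampling that is not reproduced here.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return the paired t-statistic for two equal-length lists of scores."""
    # Work on the per-run differences between the two algorithms.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    # Mean difference divided by its standard error.
    return mean_diff / (sd_diff / math.sqrt(n))


# Hypothetical scores from five runs of two algorithms (not from the paper).
algorithm_a = [0.80, 0.82, 0.78, 0.85, 0.81]
algorithm_b = [0.75, 0.80, 0.74, 0.79, 0.78]

t_value = paired_t_statistic(algorithm_a, algorithm_b)
print(f"t = {t_value:.4f} with {len(algorithm_a) - 1} degrees of freedom")
```

A large absolute t-value relative to the critical value for n - 1 degrees of freedom suggests that the difference between the two algorithms is unlikely to be due to chance alone.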

Import of data:
Data is imported as a CSV file, which is converted into ARFF format automatically during
import. The data is imported through the Preprocess tab of WEKA, as shown in the picture above.
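
As an illustration of what the CSV-to-ARFF conversion produces, here is a minimal sketch that builds ARFF text by hand. It is not WEKA's converter: the function name, the relation name, and the sample rows are hypothetical, and every column is assumed to be numeric.

```python
import csv
import io


def csv_to_arff(csv_text, relation="survey"):
    """Convert CSV text (header row + numeric columns) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    # Assumption: all attributes are numeric. Real data may need nominal
    # or string attribute declarations instead.
    lines += [f"@attribute {name.strip()} numeric" for name in header]
    lines += ["", "@data"]
    lines += [",".join(cell.strip() for cell in row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical sample rows (not taken from the paper's data set).
sample_csv = """Sales,Price,Promotion
4100,59,200
3900,64,180
"""

arff_text = csv_to_arff(sample_csv, relation="sales_demo")
print(arff_text)
```

The header section declares the relation and one attribute per column, and the rows follow the @data marker unchanged.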



                                      CLUSTERING
Definition: “Cluster analysis is a class of statistical techniques that can be applied to data that
exhibit “natural” groupings. Cluster analysis sorts through the raw data and groups them into
clusters. A cluster is a group of relatively homogeneous cases or observations. Objects in a
cluster are similar to each other. They are also dissimilar to objects outside the cluster,
particularly objects in other clusters.”
DATA SET USED FOR CLUSTERING

The example used is a survey report on traditional sweets. It had:
Instances: 76
Attributes: 33

The questions or attributes were as follows:
Age
Profession
Diabetesstop
Obesitystop
Otherstop
Cadburynchocl
Homemadesweets
Sweetfrmshop
Cakepastry
Sugarcube
Celebration
Gifts
Beginningauspicious
Yummyfood
Healthconcern
Lunchdinnerafter
Tastytraditn
Abroad
Frequencyeating
Inflnearby
Inflfrndrelative
Inflblogonline
Advert
Quality
Packaging
Ambience
Price
Imptraditonsweet
Newexperimentswt
Newvariety
Homedeliveryimp
Impchitchatplace
Packagdsweetslngtime
PROCEDURE AND RESULT:

The data set is taken from my AMRP project survey on the interest and motivation of
consumers towards traditional sweets.

The Simple K-Means algorithm was used to cluster the data set into two clusters.
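
For readers unfamiliar with the method, the following is a minimal sketch of the k-means idea: assign each point to its nearest centroid, then recompute each centroid as the mean of its members. It is not WEKA's implementation (which, among other things, handles attribute normalization and nominal values), and the points below are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Return (centroids, assignment) after running basic k-means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct starting points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, new_assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
    return centroids, assignment


# Hypothetical two-attribute data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignment = k_means(points, k=2)
print(centroids)
print(assignment)
```

The centroids returned here play the same role as the per-cluster columns in the WEKA output: each is the average profile of the instances assigned to that cluster.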

The output is as follows:

 Attribute              Full Data    0          1
                        (76)         (44)       (32)
 =======================================================
 Age                    1.6711       1.6364     1.7188
 Profession             1.7632       1.6818     1.875
 Diabetesstop           2.3553       2.3636     2.3438
 Obesitystop            1.9605       1.9545     1.9688
 otherstop              1.9474       1.8636     2.0625
 Cadburynchocl          4.2895       4.25       4.3438
 homemadesweets         4.3421       4.3636     4.3125
 sweetfrmshop           4.0395       4.1136     3.9375
 cakepastry             3.9342       4.0455     3.7813
 sugarcube              2.4605       2.5        2.4063
 celebration            4.1447       4.3409     3.875
 gifts                  3.7632       3.7955     3.7188
 beginningauspicious    3.7763       3.8636     3.6563
 yummyfood              3.8158       3.9318     3.6563
 healthconcern          2.9868       3          2.9688
 lunchdinnerafter       3.9737       4.0909     3.8125
 tastytraditn           3.7632       4.0227     3.4063
 abroad                 1.8684       1.8864     1.8438
 frequencyeating        2.5658       2.4318     2.75
 inflnearby             3.0          4.0        3.0
 inflfrndrelative       4.0          4.0        3.0
 inflblogonline         3.0          3.0        2.0
 advert                 3.0          3.0        2.0
 quality                5.0          5.0        5.0
 packaging              3.0          3.0        4.0
 ambience               3.0          3.0        4.0
 price                  3.0          4.0        3.0
 imptraditonsweet       5.0          5.0        3.0
 newexperimentswt       3.0          3.0        3.0
 newvariety             3.0          3.0        4.0
 homedeliveryimp        2.8158       2.8409     2.7813
 impchitchatplace       3.3421       3.3182     3.375
 packagdsweetslngtime   3.1579       3          3.375

Note: The significant values in the above table, on which the cluster characteristics are based,
are marked in red.

Clustered Instances

0    44 ( 58%)
1    32 ( 42%)
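
The cluster shares and the "Full Data" column can be cross-checked with simple arithmetic. The sketch below uses only figures from the output above. It assumes that for a numeric attribute such as Age each centroid value is a mean, so the full-data value should equal the size-weighted average of the two cluster values. The variable names are illustrative.

```python
# Cluster sizes as reported under "Clustered Instances".
cluster_sizes = {0: 44, 1: 32}
total = sum(cluster_sizes.values())

# Percentage share of each cluster, rounded to whole percent.
shares = {c: round(100 * n / total) for c, n in cluster_sizes.items()}
print(total, shares)

# Age centroids per cluster, taken from the output table.
age_centroids = {0: 1.6364, 1: 1.7188}

# Size-weighted average of the cluster centroids.
weighted_age = sum(cluster_sizes[c] * age_centroids[c] for c in cluster_sizes) / total
print(round(weighted_age, 4))
```

The weighted value agrees with the Full Data entry for Age (1.6711). Rows whose three values are all whole numbers, such as inflnearby, do not follow this rule, which suggests they may be reported as modes rather than means.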


INTERPRETATION:

ASPECTS                          CLUSTER ‘0’                      CLUSTER ‘1’
Traditionality                   Loves traditional sweets.        Loves experiments and newer
                                 Considers sweets a symbol of     varieties of sweets.
                                 tradition. Wants sweets after
                                 lunch or dinner.
Frequency of consumption         High                             Medium
Price                            More price sensitive             Less price sensitive
Influence of friends, relatives  High                             Medium. Generally tries a new
or advertisements to try a                                        shop on own instinct.
new shop
Ambience of shop and packaging   Matters less                     Matters significantly
Food court for chatting          Preferred                        Preferred
(like Haldiram)
Packaged/tinned sweets           Medium                           Good demand


INFERENCE AND SUGGESTION DERIVED FROM THE CLUSTERING:

There are two distinct clusters of consumers in the sweet industry.

Cluster ‘0’ (58%) considers sweets the “symbol of tradition”, typically savored
after lunch and dinner. They enjoy the most traditional sweets and prefer not to try new
variants. They stick to old shops unless inspired by external agents (friends, relatives,
blogs, advertisements, etc.) to try otherwise. Quality is an important factor, but ambience and
packaging do not play a major role. So, shops like Nokur or Girish Dey will be their typical
favorites.

Cluster ‘1’ (42%) are the true connoisseurs of sweets. They appreciate both traditional
and experimental sweets (the new variants). They often prefer trying out new shops and
brands. Packaged sweets, which can be savored later, are also preferred. Apart from quality,
ambience and packaging play a vital role, whereas price is of medium importance. This
cluster seems to consist of more impulsive consumers, who would probably not mind paying a
premium for new and creative sweets. So, brands like K.C. Das will be their preferred choice.


                               REGRESSION
The next procedure is regression analysis.

We obtain data from stores on the monthly sales of a celebration chocolate pack, together with
its price and the amount spent on its promotion (posters used around the block or any other effort).

We then select all attributes, go to the Classify tab, and run the linear regression function.




OUTPUT

The output obtained is given below
=== Run information ===

Scheme:    weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: Problem_2-weka.filters.unsupervised.attribute.Remove-R1
Instances: 46
Attributes: 3
         Sales
         Price
         Promotion
Test mode: split 80.0% train, remainder test

=== Classifier model (full training set) ===


Linear Regression Model

Sales = -53.2173 * Price +       3.6131 * Promotion + 5837.5208

Time taken to build model: 0 seconds

=== Evaluation on test split ===
=== Summary ===

Correlation coefficient        0.8066
Mean absolute error          543.6332
Root mean squared error         711.4575
Relative absolute error       48.288 %
Root relative squared error     59.6886 %
Total Number of Instances         5
Ignored Class Unknown Instances          4
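
The fitted equation in the output can be evaluated directly. The sketch below copies the coefficients from the model line above. The function name and the input values are hypothetical, and the units of Price and Promotion are not stated in the output.

```python
def predict_sales(price, promotion):
    """Evaluate the linear model reported in the WEKA output."""
    # Coefficients and intercept copied from the "Linear Regression Model" line.
    return -53.2173 * price + 3.6131 * promotion + 5837.5208


# Hypothetical inputs, chosen only to show how the equation is used.
baseline = predict_sales(price=50, promotion=200)
higher_price = predict_sales(price=51, promotion=200)
more_promotion = predict_sales(price=50, promotion=201)

print(round(baseline, 4))
print(round(higher_price - baseline, 4))    # effect of one extra unit of price
print(round(more_promotion - baseline, 4))  # effect of one extra unit of promotion
```

Each coefficient is the change in predicted sales for a one-unit change in that attribute while the other is held fixed.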


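INTERPRETATION


The model shows a correlation coefficient of 0.8066 between predicted and actual sales on the
test split. Squaring it gives about 0.65, so the predictions account for roughly 65% of the
variation in sales. This is not an accuracy figure in the classification sense.
As expected, we find that sales decrease with an increase in price and increase with an increase in
promotion budget.
This illustrates how WEKA can be used for multivariate regression.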
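
The 65% figure quoted above can be traced with a short calculation: it is the square of the reported correlation coefficient. The sketch below uses only numbers from the WEKA summary, and the variable names are illustrative.

```python
# Correlation coefficient reported in the evaluation summary.
correlation = 0.8066

# Squared correlation: share of variation in actual sales that moves
# together with the predictions on the test split.
r_squared = correlation ** 2

print(f"r^2 = {r_squared:.4f} (about {r_squared:.0%})")
```

Note that the summary reports only 5 evaluated test instances, so this value rests on a very small sample.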



REFERENCE

http://en.wikipedia.org/wiki/Weka_(machine_learning)

http://www.cs.waikato.ac.nz/ml/weka/

http://en.wikipedia.org/wiki/Cluster_analysis_(in_marketing)
