Weka for clustering and regression itb vgsom

USING WEKA FOR CLUSTERING AND
REGRESSION ANALYSIS
( ITB PAPER )

ANURADHA CHAKRABORTY
ROLL NO: 10BM60014

VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR

WEKA (Waikato Environment for Knowledge Analysis) is a popular suite of machine
learning software written in Java, developed at the University of Waikato, New Zealand. WEKA
is free software available under the GNU General Public License. Compared with MS Excel,
WEKA makes it easy to run multivariate regression, and its output shows the fitted equation
for the dependent variable along with other statistical data.

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can
either be applied directly to a dataset or called from your own Java code. Weka contains tools for
data pre-processing, classification, regression, clustering, association rules, and visualization. It
is also well-suited for developing new machine learning schemes.

The initial versions of WEKA used only Attribute-Relation File Format (ARFF) files, saved
as *.arff. Newer versions also support several other formats, such as XRFF, binary serialized
instances, LIBSVM, SVM Light, CSV, and C4.5, among others.

USING WEKA:

The WEKA GUI Chooser has the following four options:
1. Weka Explorer
2. Weka Experimenter
3. Weka Knowledge Flow
4. Simple CLI

Weka Explorer has the following tabs:
1. Preprocess
2. Classify
3. Cluster
4. Associate
5. Select Attributes
6. Visualize

Apart from performing these statistical operations, the data can be visualized graphically and
filtered as required.

Weka Experimenter:
Several algorithms are available for each task, so the critical step in using the software is
identifying the most suitable algorithm. For regression and classification, the Experimenter
compares algorithms by statistical analysis to identify the best one. Unfortunately, no such
option exists for clustering algorithms.

Import of data:
Data is imported as a CSV file, which is converted into ARFF format automatically during
import. The data is imported through the Preprocess tab of WEKA, as shown in the picture above.
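
To give a feel for what the CSV-to-ARFF conversion produces, here is a minimal sketch in Python. It is not WEKA's own converter: the function name csv_to_arff, the relation name "survey" and the three sample rows are assumptions made for illustration, and every column is assumed to be numeric.

```python
import csv
import io


def csv_to_arff(csv_text, relation="survey"):
    """Convert CSV text (with a header row) into ARFF text.

    Assumption: every column is numeric. Real survey data may also need
    nominal, string or date attribute declarations.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]

    lines = [f"@relation {relation}", ""]
    # One @attribute declaration per CSV column.
    for name in header:
        lines.append(f"@attribute {name} numeric")
    # The @data section holds the rows, still comma-separated.
    lines += ["", "@data"]
    for row in data:
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Hypothetical sample with three of the survey attributes.
sample = "Age,Profession,Quality\n1,2,5\n2,1,4\n"
arff = csv_to_arff(sample)
print(arff)
```

The header section (@relation and @attribute lines) is what distinguishes an ARFF file from the plain CSV it was built from; the data rows themselves are unchanged.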

CLUSTERING
Definition: Cluster analysis is a class of statistical techniques that can be applied to data that
exhibit “natural” groupings. Cluster analysis sorts through the raw data and groups them into
clusters. A cluster is a group of relatively homogeneous cases or observations. Objects in a
cluster are similar to each other. They are also dissimilar to objects outside the cluster,
particularly objects in other clusters.

DATA SET USED FOR CLUSTERING

The example used is a survey on consumer interest in traditional sweets. It had:
Instances: 76
Attributes: 33

The questions or attributes were as follows:
Age
Profession
Diabetesstop
Obesitystop
Otherstop
Cadburynchocl
Homemadesweets
Sweetfrmshop
Cakepastry
Sugarcube
Celebration
Gifts
Beginningauspicious
Yummyfood
Healthconcern
Lunchdinnerafter
Tastytraditn
Abroad
Frequencyeating
Inflnearby
Inflfrndrelative
Inflblogonline
Advert
Quality
Packaging
Ambience
Price
Imptraditonsweet
Newexperimentswt
Newvariety
Homedeliveryimp
Impchitchatplace
Packagdsweetslngtime

PROCEDURE AND RESULT:

The data set is taken from my AMRP project survey on the interest and motivation of
consumers towards traditional sweets.

The SimpleKMeans (k-means) algorithm was used to cluster the data set.
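
The idea behind k-means can be sketched in a few lines of Python. This is a plain textbook version, not WEKA's implementation: the function names, the six two-attribute sample points and the explicitly supplied starting centroids are all assumptions made for illustration. Implementations usually pick the starting centroids at random; they are passed in here only to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_centroids, max_iterations=100):
    """Basic k-means: alternate assignment and centroid update."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # No point changed cluster, so the result is stable.
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, labels


# Hypothetical responses on two attributes (not the survey data).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
centroids, labels = k_means(points, initial_centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

The centroids returned at the end play the same role as the per-cluster columns in the WEKA output: each is the average profile of the instances assigned to that cluster.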

The output is as follows:

Attribute               Full Data   Cluster 0   Cluster 1
                        (76)        (44)        (32)
=======================================================
Age                     1.6711      1.6364      1.7188
Profession              1.7632      1.6818      1.875
Diabetesstop            2.3553      2.3636      2.3438
Obesitystop             1.9605      1.9545      1.9688
otherstop               1.9474      1.8636      2.0625
Cadburynchocl           4.2895      4.25        4.3438
homemadesweets          4.3421      4.3636      4.3125
sweetfrmshop            4.0395      4.1136      3.9375
cakepastry              3.9342      4.0455      3.7813
sugarcube               2.4605      2.5         2.4063
celebration             4.1447      4.3409      3.875
gifts                   3.7632      3.7955      3.7188
beginningauspicious     3.7763      3.8636      3.6563
yummyfood               3.8158      3.9318      3.6563
healthconcern           2.9868      3           2.9688
lunchdinnerafter        3.9737      4.0909      3.8125
tastytraditn            3.7632      4.0227      3.4063
abroad                  1.8684      1.8864      1.8438
frequencyeating         2.5658      2.4318      2.75
inflnearby              3.0         4.0         3.0
inflfrndrelative        4.0         4.0         3.0
inflblogonline          3.0         3.0         2.0
advert                  3.0         3.0         2.0
quality                 5.0         5.0         5.0
packaging               3.0         3.0         4.0
ambience                3.0         3.0         4.0
price                   3.0         4.0         3.0
imptraditonsweet        5.0         5.0         3.0
newexperimentswt        3.0         3.0         3.0
newvariety              3.0         3.0         4.0
homedeliveryimp         2.8158      2.8409      2.7813
impchitchatplace        3.3421      3.3182      3.375
packagdsweetslngtime    3.1579      3           3.375

Note: The significant values in the above table, on which the cluster characteristics are based,
are marked in red.

Clustered Instances

0 44 ( 58%)
1 32 ( 42%)

INTERPRETATION:

ASPECTS | CLUSTER ‘0’ | CLUSTER ‘1’
Traditionality | Loves traditional sweets. Considers sweets a traditional symbol. Wants sweets after lunch or dinner. | Loves experiments and newer varieties of sweets.
Frequency of consumption | High | Medium
Price | More price sensitive | Less price sensitive
Influence by friends and relatives or advertisements to try a new shop | High | Medium. Generally tries a new shop by own instinct.
Ambience of shop and packaging | Matters less | Matters significantly
Food court for chatting (like Haldiram) | Preferred | Preferred
Packaged/tinned sweets | Medium | Good demand

INFERENCE AND SUGGESTION DERIVED FROM THE CLUSTERING:

There are two distinct clusters of consumers in the sweets market.

Cluster ‘0’ (58%) considers sweets a “symbol of tradition”, typically savored after lunch
and dinner. These consumers enjoy the most traditional sweets and prefer not to try new
variants. They stick to familiar shops unless external influences (friends, relatives, blogs,
advertisements, etc.) prompt them to try another. Quality is an important factor, but ambience
and packaging do not play a major role. Shops like Nokur or Girish Dey will therefore be their
typical favorites.

Cluster ‘1’ (42%) are the true connoisseurs of sweets. They appreciate both traditional and
experimental sweets (the new variants). They often prefer trying out new shops and brands.
Packaged sweets, which can be savored later, are also preferred. Apart from quality, ambience
and packaging play a vital role, whereas price is of medium importance. This cluster seems to
consist of more impulsive consumers, who would probably not mind paying a premium for new and
creative sweets. Brands like K.C. Das will therefore be their preferred choice.

REGRESSION
The next procedure is regression analysis.

We obtain data from stores on the monthly sales of a celebration chocolate pack, together with
its price and the amount spent on its promotion (posters around the block or any other
promotional effort).

We then select all attributes, go to the Classify tab, and run the linear regression function.

OUTPUT

The output obtained is given below:
=== Run information ===

Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: Problem_2-weka.filters.unsupervised.attribute.Remove-R1
Instances: 46

Attributes: 3
Sales
Price
Promotion
Test mode: split 80.0% train, remainder test

=== Classifier model (full training set) ===

Linear Regression Model

Sales = -53.2173 * Price + 3.6131 * Promotion + 5837.5208

Time taken to build model: 0 seconds

=== Evaluation on test split ===
=== Summary ===

Correlation coefficient 0.8066
Mean absolute error 543.6332
Root mean squared error 711.4575
Relative absolute error 48.288 %
Root relative squared error 59.6886 %
Total Number of Instances 5
Ignored Class Unknown Instances 4
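
The summary measures above can be reproduced from a list of actual and predicted values. The sketch below shows one way to compute them in Python. The function name evaluation_summary and the five sample values are hypothetical (they are not the test split used above), and the relative errors here use the mean of the actual test values as the baseline, which is an assumption: WEKA's own baseline may be derived from the training data, so its figures need not match exactly.

```python
import math


def evaluation_summary(actual, predicted):
    """Compute regression summary measures for paired value lists."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    errors = [p - a for a, p in zip(actual, predicted)]

    abs_error = sum(abs(e) for e in errors)
    sq_error = sum(e * e for e in errors)

    # Baseline (assumption): always predict the mean of the actual values.
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)

    # Pearson correlation between actual and predicted values.
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)

    return {
        "correlation": cov / math.sqrt(base_sq * var_p),
        "mae": abs_error / n,
        "rmse": math.sqrt(sq_error / n),
        "rae_percent": 100 * abs_error / base_abs,
        "rrse_percent": 100 * math.sqrt(sq_error / base_sq),
    }


# Hypothetical sales figures, for illustration only.
actual = [3000, 4000, 5000, 6000, 7000]
predicted = [3200, 3900, 5300, 5800, 7100]
summary = evaluation_summary(actual, predicted)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The two relative errors compare the model with the baseline: a value of 100% would mean the model does no better than always predicting the mean.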

INTERPRETATION

The test split gives a correlation coefficient of 0.8066 between predicted and actual sales.
Squaring it gives about 0.65, so the predictions share roughly 65% of the variance in actual
sales; this is a measure of fit rather than an accuracy rate. As expected, sales decrease as
price increases and increase as the promotion budget increases.
This shows how WEKA can be used for multivariate regression.
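
To make the fitted equation concrete, the sketch below plugs values into it using Python. The coefficients and the correlation coefficient are copied from the WEKA output; the function name predict_sales and the example price and promotion values are hypothetical, chosen only for illustration.

```python
# Coefficients copied from the WEKA linear regression output.
INTERCEPT = 5837.5208
PRICE_COEF = -53.2173
PROMOTION_COEF = 3.6131


def predict_sales(price, promotion):
    """Predicted monthly sales from the fitted linear model."""
    return PRICE_COEF * price + PROMOTION_COEF * promotion + INTERCEPT


# Hypothetical inputs: price 10, promotion spend 100.
base = predict_sales(price=10, promotion=100)
price_effect = predict_sales(price=11, promotion=100) - base
promotion_effect = predict_sales(price=10, promotion=101) - base

# Squared correlation coefficient from the test split.
r = 0.8066
r_squared = r ** 2

print(f"Predicted sales: {base:.2f}")
print(f"Change per unit of price: {price_effect:.4f}")
print(f"Change per unit of promotion: {promotion_effect:.4f}")
print(f"r squared: {r_squared:.4f}")
```

Holding promotion fixed, each unit increase in price lowers predicted sales by the price coefficient, and each unit of promotion spend raises them by the promotion coefficient.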

REFERENCE

http://en.wikipedia.org/wiki/Weka_(machine_learning)

http://www.cs.waikato.ac.nz/ml/weka/

http://en.wikipedia.org/wiki/Cluster_analysis_(in_marketing)
