Data mining - Weka
                  Submitted as a part of the course
                    ‘IT for Business Intelligence’

                                      Ramya Krishna P
                                        10BM60056
                                         4/19/2012




This paper briefly introduces Weka and then demonstrates the application of two data mining
techniques – association rules and regression.
Table of Contents
Weka – Introduction
   Requirements
Getting started
Data sets
Association rules
   Business application
   Data set
   Preprocess
   Associate
Regression
   Business applications
   Data set
   Preprocess
   Linear regression
   Non-numeric input variables
References
Weka – Introduction
Weka is a rich tool for data mining: a collection of machine learning algorithms that supports
classification, regression, clustering, association rule mining, and visualization. It is open-source
software.

Requirements
Recent versions of Weka (3.7.x) require Java 1.6 to be installed on your system. I have used
Weka 3.7.5 for this short tutorial. The latest and other editions of Weka can be downloaded from the
Weka website (reference 1).


Getting started
You can run Weka from the command prompt or through the GUI. We use the GUI. Here is what it looks
like.




For all our purposes, the ‘Explorer’ application is sufficient. On clicking ‘Explorer’, we have




To load a data set into Weka, choose ‘Open file’ under the ‘Preprocess’ tab. First, a short note about data
sets.


Data sets
The default format of a Weka data set is .arff (Attribute-Relation File Format), which is an ASCII text file. A
snapshot of a .arff file is like this.
So, you can either prepare your data in this form or, if you have a spreadsheet (.xls or .xlsx), save
your data in .csv format.
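
To make the .arff layout concrete, here is a minimal Python sketch that writes a small table as ARFF text: an @relation line, one @attribute line per column, then @data followed by comma-separated records. The function name to_arff and the toy relation are hypothetical and are not part of Weka or of the data sets used in this tutorial.

```python
def to_arff(relation, attributes, rows):
    """Return ARFF text for a small table.

    attributes: list of (name, type) pairs, where type is either the string
    "numeric" or a list of the allowed nominal values.
    rows: list of records, one value per attribute.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append(f"@attribute {name} {kind}")
        else:
            # Nominal attributes list their allowed values inside braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("every record needs one value per attribute")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical toy data, only for illustration.
arff_text = to_arff(
    "toy_baskets",
    [("bread", ["true", "false"]), ("milk", ["true", "false"]), ("amount", "numeric")],
    [["true", "false", 12.5], ["false", "true", 3]],
)
print(arff_text)
```

Attribute names that contain spaces would need quoting in a real file; this sketch assumes simple names.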

Now, on clicking ‘Open file’, select the .csv file containing your data and click ‘Open’.

I will proceed with the rest of the tutorial through examples.


Association rules
To give a little introduction, association rule mining is a method for discovering relations between
variables in data sets. From these relations we derive rules that meet a chosen minimum level of support
and confidence. Such rules can sometimes be of great business value. One typical business
application of association rules is ‘market basket analysis’.

Business application
The market-basket problem assumes we have a large number of items, e.g., bread and milk. Customers
fill their market baskets with some subset of the items, and we get to know which items people buy
together, even if we don't know who they are. By developing association rules of the form
{X1, X2, ..., Xn} -> Y

we learn that a basket containing X1, X2, ..., Xn has a good chance of also containing Y. So, the next
time a retailer stocks up on X1, X2, ..., Xn, he might also stock up on Y based on our prediction. Now,
without going too much into the theory, let us see our data set.
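
Support and confidence can be computed directly from a list of baskets. The following Python sketch uses four hypothetical baskets (the items and counts are made up for illustration): support is the fraction of baskets containing all items of a set, and the confidence of {X} -> Y is the support of X together with Y divided by the support of X.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Share of the transactions containing the antecedent that also contain the consequent."""
    together = support(transactions, set(antecedent) | {consequent})
    return together / support(transactions, antecedent)


# Hypothetical baskets, only for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(baskets, {"bread", "milk"})        # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, "milk")  # 2 of the 3 bread baskets
print(rule_support, rule_confidence)
```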

Data set
The format of my data set is like this

TID1         ID2    ID5    ID6
TID2         ID3    ID4    ID6    ID7    ID9
TID3         ID4    ID5
TID4         ID1    ID4    ID5    ID7    ID9    ID10

...

where the first column gives the transaction id and the rest of each row lists the products
purchased in that transaction. Unfortunately, Weka cannot accept the data
set in this form because the rows are of unequal lengths. Both .arff and .csv require each data record to
have the same number of fields.

To change the data format, create one attribute per item and use "true" and "false" field values
in each data row to indicate whether the item was purchased. We can't use 0 and 1 because Apriori (the
algorithm we will be using) does not work on numeric attributes; it works only on nominal values. The
data now looks like

TID, ID1, ID2, ID3, ID4, ID5, ID6, ID7, ID8, ID9, ID10

1, false, true, false, false, true, true, false, false, false, false
2, false, false, true, true, false, true, true, false, true, false
3, false, false, false, true, true, false, false, false, false, false
4, true, false, false, true, true, false, true, false, true, true
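
This conversion is easy to script. Here is a small Python sketch, assuming the four sample transactions listed earlier and items named ID1 to ID10; the helper name to_rows is hypothetical.

```python
import csv
import io

# The four sample transactions (transaction id -> purchased items).
transactions = {
    1: ["ID2", "ID5", "ID6"],
    2: ["ID3", "ID4", "ID6", "ID7", "ID9"],
    3: ["ID4", "ID5"],
    4: ["ID1", "ID4", "ID5", "ID7", "ID9", "ID10"],
}
items = [f"ID{i}" for i in range(1, 11)]


def to_rows(transactions, items):
    """Build a header row plus one fixed-length true/false row per transaction."""
    rows = [["TID"] + items]
    for tid, bought in transactions.items():
        bought = set(bought)
        rows.append([str(tid)] + ["true" if item in bought else "false" for item in items])
    return rows


rows = to_rows(transactions, items)
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Every output row has the same number of fields, which is what both .csv and .arff require.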

Now, I have a sample data set (downloaded from the URL in reference 4) that is, thankfully, already in
.csv form. This is a huge data set with 300+ products and 1300+ rows. When you try to run it in
Weka, you get an error that the heap size is not sufficient. You can change the heap size by changing the
value of ‘maxheap’ in the RunWeka.ini configuration file. However, even with a heap
size of 1 GB, this data set is too large to run. So, I have cropped the data set to about 20 attributes and
400 rows. A snapshot of the data set is like this.
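
As for the cropping step, one way to do it is a short script. The Python sketch below keeps the header, the first few columns and the first few records; which columns and rows to keep is an assumption here, and the function name crop_csv and the sample content are hypothetical.

```python
import csv
import io


def crop_csv(source, max_columns, max_rows):
    """Keep the header, the first `max_columns` fields and the first `max_rows` records."""
    reader = csv.reader(source)
    header = next(reader)[:max_columns]
    cropped = [header]
    for index, record in enumerate(reader):
        if index >= max_rows:
            break
        cropped.append(record[:max_columns])
    return cropped


# Hypothetical stand-in for the large file; a real run would pass an opened .csv file instead.
sample = io.StringIO(
    "a,b,c,d\n"
    "true,false,true,false\n"
    "false,false,true,true\n"
    "true,true,true,true\n"
)
small = crop_csv(sample, max_columns=2, max_rows=2)
print(small)
```

For the data set above, the call would use max_columns=20 and max_rows=400.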
Preprocess
Once you choose this file under ‘Open file’, this is what it looks like.
Weka lists all the attributes present in the data set. It also provides visualizations of the data and
other statistics. For example, we can see that ‘fat free hamburger’ is true only 41 times out of 400. Now, we
can select the attributes we want for our analysis one by one, check ‘All’, or write a
Perl regular expression to choose the attributes matching a rule by selecting ‘Pattern’ and typing the
expression. We check ‘All’. Then we go to the ‘Associate’ tab.
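
To illustrate selecting attributes by pattern, here is a small Python sketch. It uses Python's re module, whose syntax is close to but not identical to Perl's, and it assumes the expression must match the whole attribute name; apart from ‘fat free hamburger’, the attribute names are hypothetical.

```python
import re

# Hypothetical attribute names; only 'fat free hamburger' appears in the tutorial.
attribute_names = [
    "fat free hamburger",
    "plain english muffins",
    "40 watt lightbulb",
    "fat free milk",
]


def select_by_pattern(names, pattern):
    """Return the names that the regular expression matches in full."""
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


selected = select_by_pattern(attribute_names, r"fat free .*")
print(selected)
```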

Associate
We go to the ‘Associate’ tab and click ‘Choose’. Out of the algorithms listed, we select Apriori. Clicking
the text box beside ‘Choose’ (i.e., on Apriori) lists the various parameters used by Apriori.
We can change these parameters as per requirements. To know what each parameter stands for, click
on ‘More’. After changing the parameters, click on ‘OK’.

Now, click on ‘Start’ to start building the model. Depending on the size of the data set, this takes a while;
meanwhile, the Weka bird icon moves back and forth to show that the job is running.

A part of the output is shown here.




Since we have set ‘numRules’ to 10, only the 10 best rules are shown here. The first rule is

 Plain English Muffins= false 396 ==> 40 Watt Lightbulb= false 396        <conf:(1)> lift:(1.01) lev:(0) [1] conv:(1.98)

That is, all 396 transactions that do not contain Plain English Muffins also do not contain a 40 Watt
Lightbulb. The output also reports the confidence, lift, leverage and conviction of each rule (an
explanation of each can be found under ‘More’, shown above).
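
To see where such numbers come from, the following Python sketch computes confidence, lift and leverage from raw counts using their standard definitions. The two counts of 396 are taken from the rule above; the total of 400 transactions and the count of 398 transactions without a 40 Watt Lightbulb are assumptions made for illustration. Conviction is left out because Weka's exact formula is not covered here.

```python
def rule_metrics(n_total, n_premise, n_consequent, n_both):
    """Confidence, lift and leverage of a rule premise ==> consequent from raw counts."""
    p_premise = n_premise / n_total
    p_consequent = n_consequent / n_total
    p_both = n_both / n_total
    confidence = n_both / n_premise                 # share of premise rows that satisfy the rule
    lift = confidence / p_consequent                # confidence relative to the base rate
    leverage = p_both - p_premise * p_consequent    # excess over statistical independence
    return confidence, lift, leverage


# 396 and 396 come from the rule above; 400 and 398 are assumed for illustration.
confidence, lift, leverage = rule_metrics(
    n_total=400, n_premise=396, n_consequent=398, n_both=396
)
print(round(confidence, 2), round(lift, 2), round(leverage, 2))
```

Under these assumed counts the values round to a confidence of 1, a lift of 1.01 and a leverage of 0. A lift this close to 1 suggests the rule says little beyond the fact that both items are rarely bought.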

The model can be rerun with different parameters, and each of the results can be seen under the ‘Result
list’. The results can also be saved for later.


Regression
Regression, as one knows, models the relation between a dependent variable and one or more independent
variables. As there is not much need to explain regression further, we jump into the process.
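
As a quick reminder of the idea, here is a minimal Python sketch of ordinary least squares with a single independent variable. The data points are hypothetical, and this is not how Weka implements linear regression internally.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical points that lie exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```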

Business applications
Before we start with the tutorial, here are some areas where regression can be used:
         Trend line analysis - to show the movement of financial or product attributes over time. Stock
         prices and oil prices can be analyzed using trend lines.
         Risk analysis for investments - the capital asset pricing model was developed using linear
         regression analysis.
         Sales or market forecasts - multivariate regression is a good method to forecast sales volumes or
         market shares.
         Total quality control - quality control methods frequently use linear regression to analyze key
         product specifications and other measurable parameters of a product or organization (e.g.,
         customer complaints over time).
         Human resources - to predict the demographics and types of future work forces for large
         companies.

Data set
I have used a data set provided by the Weka website for this. A number of data sets for different techniques
can be found on the Weka datasets page (reference 2).

The data set I am using is ‘strike.arff’, extracted from ‘numeric.jar’. The data consists of days lost due to
industrial disputes per 1000 wage and salary earners in 18 OECD countries from 1951 to 1985. The independent
variables are


    1.   country code
    2.   year
    3.   unemployment
    4.   inflation
    5.   parliamentary representation of social democratic and labor parties and
    6.   a time-invariant measure of union centralization.


If your data is not in .csv or .arff, it needs to be preprocessed as explained above.

Preprocess
After the data is loaded into Weka, it looks like this.
For each numeric attribute, Weka gives statistics such as the mean, maximum, minimum and standard deviation.

On clicking ‘Visualize All’, the graphs of all variables are shown.
We check ‘All’ to select all variables and then click on the ‘Classify’ tab.

Linear regression
We click ‘choose’ under Classifier and select ‘Linear Regression’ as shown.




Click on the box beside ‘Choose’ to set the parameters for linear regression.
Then, click on ‘OK’. Now, we have to tell Weka how to evaluate the model. Apart from ‘Use training set’,
which evaluates the model on the same data we have loaded, we have 3 more choices: ‘Supplied test set’,
where we supply a separate set of data on which the model is tested; ‘Cross-validation’, where Weka
repeatedly builds a model on all but one subset (fold) of the data, tests it on the held-out fold and
averages the results; and ‘Percentage split’, where Weka builds the model on a given percentage of the
data and tests it on the remainder. For this example, we choose ‘Use training set’.
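
The difference between cross-validation and a percentage split is easiest to see on record indices. The Python sketch below is only an illustration: it does not shuffle the data and makes no claim to reproduce Weka's exact splitting procedure, and both function names are hypothetical.

```python
def k_fold_indices(n_records, k):
    """Yield (train, test) index lists; each record lands in exactly one test fold."""
    indices = list(range(n_records))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def percentage_split(n_records, train_percent):
    """Use the first train_percent% of the records for training and the rest for testing."""
    cut = round(n_records * train_percent / 100)
    indices = list(range(n_records))
    return indices[:cut], indices[cut:]


folds = list(k_fold_indices(10, 5))
train, test = percentage_split(10, 66)
print(folds[0])
print(train, test)
```

With cross-validation every record is used for testing exactly once, whereas a percentage split tests on a single held-out portion.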

By default, Weka takes the last attribute as the dependent attribute. If that is not correct for the data, we
change it to the required variable by choosing from the drop-down. We choose ‘volume’ as
the dependent variable and click on ‘Start’.

A part of the output is shown below.
The first line of the model is

175.7183 * country=5,3,13,17,7,1,18,6,9,4,10

It means that if the country code is one of those listed (for example, 5), you put a ‘1’ in this term of the
equation, and if it is not listed (for example, 8), you put a ‘0’.
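
In other words, this term acts as an indicator variable. A small Python sketch, with the coefficient and country codes copied from the model line above (the function name country_term is hypothetical):

```python
# Coefficient and country codes copied from the first line of the model.
COEFFICIENT = 175.7183
LISTED_COUNTRIES = {5, 3, 13, 17, 7, 1, 18, 6, 9, 4, 10}


def country_term(country_code):
    """Contribution of the country term: the coefficient if the code is listed, else 0."""
    indicator = 1 if country_code in LISTED_COUNTRIES else 0
    return COEFFICIENT * indicator


print(country_term(5), country_term(8))
```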

By default, Weka employs attribute selection, which means it may not include all of the attributes in the
regression equation. Hence not all the independent variables appear in the above model. To
turn off attribute selection, we change the ‘attributeSelectionMethod’ parameter to "No attribute
selection" and run the model again.

Now the model is as follows
Non-numeric input variables
If we have a non-numeric input variable that is binary (yes/no or true/false), we can
convert the two values to 0 and 1.

However, there are also techniques that handle both numeric and non-numeric (categorical) attributes.

    1. One way is to build a decision tree and have each classification be a numeric value that is the
       average of the values for the training examples in that subgroup - the result is called a
       regression tree
    2. Another option is to have a separate regression equation for each classification in the tree –
       based on the training examples in that subgroup – this is called a model tree.
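
The regression tree idea in item 1 can be sketched for the simplest case, a single split on one categorical attribute, where each leaf predicts the average of its training examples. This Python sketch uses hypothetical training data and does not choose splits the way a real tree learner does.

```python
from collections import defaultdict


def fit_group_means(records):
    """Map each category to the mean target value of its training examples."""
    totals = defaultdict(lambda: [0.0, 0])
    for category, target in records:
        totals[category][0] += target
        totals[category][1] += 1
    return {category: total / count for category, (total, count) in totals.items()}


# Hypothetical training examples: (categorical input, numeric target).
training = [("yes", 10.0), ("yes", 14.0), ("no", 3.0), ("no", 5.0), ("no", 7.0)]
leaf_values = fit_group_means(training)
print(leaf_values)
```

A model tree would instead fit a separate regression equation to the examples in each subgroup.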
References
  1. http://www.cs.waikato.ac.nz/ml/weka/
  2. http://www.cs.waikato.ac.nz/ml/weka/index_datasets.html
  3. http://inf.abdn.ac.uk/~hnguyen/teaching/CS5553/prac05.php
  4. http://inf.abdn.ac.uk/~hnguyen/teaching/CS5553/marketbasket.csv
  5. "The WEKA Data Mining Software: An Update" by Mark Hall, Eibe Frank, Geoffrey Holmes,
     Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten
  6. http://www.ehow.com/about_6160819_application-regression-analysis-business.html
  7. http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html
  8. http://cs-people.bu.edu/dgs/courses/cs105/lectures/data_mining_estimation.pdf
