Term paper on Data mining
How to use Weka for data analysis
Submitted by: Shubham Gupta (10BM60085)

Vinod Gupta School of Management
The first technique we will apply in Weka is classification. The data below describes the financial
situation in Japan and was collected for the years 1970-2009. The columns represent:

    1)   BROAD: Broad money supplied in the economy
    2)   DOMC: Domestic consumption
    3)   PSC: Payment securities
    4)   CLAIMS: Claims on the government
    5)   TOTRES: Total reserves
    6)   GDP: Gross domestic product
    7)   LIQLB: Liquid liabilities

We want a decision tree that helps us decide which values of the independent variables lead to which
final rule or outcome. For example, if we knew that whenever DOMC > 140 and PSC > 150.3 the GDP is
always greater than, say, 3 trillion yen, that rule would support better decision making. To obtain such
rules we run this analysis and generate a decision tree.


  YEAR       BROAD     CLAIMS     DOMC       PSC         TOTRES        LIQLB           GDP
  1970       83.65      61.88     134.25    111.75     4876114550     104.73      205,995,000,000
  1971       106.70     21.37     147.59    123.72    15469150615     118.21      232,681,000,000
  1972       116.14     23.17     160.29    133.47    18932675966     129.03      308,137,000,000
  1973       116.02     19.84     157.87    132.20    13723930639     126.07      418,640,000,000
  1974       113.08     13.72     154.00    126.49    16551248298     120.50      464,705,000,000
  1975       118.31     13.02     164.40    129.96    14910849997     127.56      505,317,000,000
  1976       122.40     12.09     169.96    130.63    18590784646     131.20      567,926,000,000
  1977       125.82      8.76     172.45    128.49    25907710023     133.90      698,968,000,000
  1978       130.36      8.56     178.29    127.71    37824744320     139.12      982,078,000,000
  1979       135.51      8.19     183.05    129.23    31926244737     142.67    1,022,190,000,000
  1980       137.95      8.09     188.44    131.29    38918848626     144.30    1,071,000,000,000
  1981       142.13      8.04     194.09    134.10    37839039769     150.03    1,183,790,000,000
  1982       149.54      7.67     203.99    139.59    34403732201     156.18    1,100,410,000,000
  1983       156.55      6.72     213.12    145.03    33844549531     162.92    1,200,190,000,000
  1984       159.31      6.69     217.77    147.43    33898638541     165.34    1,275,560,000,000
  1985       160.68      7.66     220.09    149.90    34641202378     167.41    1,364,160,000,000
  1986       167.30      7.67     230.23    156.30    51727320082     174.65    2,020,890,000,000
  1987       175.85     12.27     243.85    173.48    92701641597     183.77    2,448,670,000,000
  1988       178.70     10.66     251.68    182.52    1.06668E+11     186.47    2,971,030,000,000
  1989       182.62     10.13     258.13    190.28    93672771034     192.14    2,972,670,000,000
  1990       184.06      8.46     259.15    194.81    87828362969     190.16    3,058,040,000,000
  1991       184.35      5.20     257.54    195.40    80625855126     189.32    3,484,770,000,000
  1992       187.89      4.16     265.33    199.63    79696644593     190.93    3,796,110,000,000
  1993       193.97      1.33     274.00    202.14    1.07989E+11     198.16    4,350,010,000,000
  1994       200.35      1.88     281.02    204.58    1.35146E+11     204.45    4,778,990,000,000
  1995       205.79       1.26     287.13    203.90     1.9262E+11      209.90    5,264,380,000,000
  1996       209.72       1.81     292.42    205.21     2.25594E+11     213.63    4,642,540,000,000
  1997       215.31       6.47     276.47    217.76     2.26679E+11     221.38    4,261,840,000,000
  1998       229.64       1.80     298.40    228.01     2.22443E+11     233.17    3,857,030,000,000
  1999       239.91      -1.20     309.92    231.08     2.93948E+11     243.22    4,368,730,000,000
  2000       242.24      -1.58     308.91    222.28     3.61639E+11     243.84    4,667,450,000,000
  2001       225.31     -33.25     299.43    193.01     4.01958E+11     187.41    4,095,480,000,000
  2002       207.79      -4.32     299.16    182.40     4.69618E+11     190.79    3,918,340,000,000
  2003       209.70      -1.99     307.26    180.71     6.73554E+11     191.84    4,229,100,000,000
  2004       207.51      -1.10     303.48    174.12     8.44667E+11     189.79    4,605,920,000,000
  2005       207.24       1.79     312.85    182.87     8.46896E+11     189.30    4,552,200,000,000
  2006       204.73      -0.14     304.96    179.99     8.95321E+11     186.06    4,362,590,000,000
  2007       201.50       0.16     294.31    172.56     9.73297E+11     184.17    4,377,940,000,000
  2008       207.14       0.76     295.42    165.48     1.03076E+12     189.52    4,879,860,000,000
  2009       223.76      -1.12     320.53    171.00     1.04899E+12     206.13    5,032,980,000,000
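
Weka expects its input in ARFF (or CSV) format. As a point of reference, the following is a minimal
sketch of what an ARFF file for this table could look like. The relation name and layout are illustrative
only (the run information later on shows the actual file was called "Copy of Data_Rudra"), and the
attributes mirror the seven used in the Weka runs below; GDP could be added as a further numeric
attribute in the same way.

@relation japan_economy

@attribute YEAR   numeric
@attribute BROAD  numeric
@attribute CLAIMS numeric
@attribute DOMC   numeric
@attribute PSC    numeric
@attribute TOTRES numeric
@attribute LIQLB  numeric

@data
1970, 83.65, 61.88, 134.25, 111.75, 4876114550, 104.73
1971, 106.70, 21.37, 147.59, 123.72, 15469150615, 118.21
(remaining rows follow in the same comma-separated format)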


Loading data into Weka is quite easy: in the Explorer's Preprocess tab, just click the Open file... button and give the location of the file.




Figure 1 Shows how to load data in Weka
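
The same step can also be scripted against Weka's Java API. The snippet below is only a sketch: the file
name japan_economy.arff is a hypothetical stand-in for wherever the data has been saved, and LIQLB is
assumed to be the last attribute so that it can serve as the class (target) in the later runs.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadData {
    public static void main(String[] args) throws Exception {
        // DataSource handles ARFF, CSV and other supported formats
        DataSource source = new DataSource("japan_economy.arff");
        Instances data = source.getDataSet();
        // Treat the last attribute (assumed here to be LIQLB) as the class
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Loaded " + data.numInstances() + " instances with "
                + data.numAttributes() + " attributes");
    }
}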

Weka is used to classify the above data to find out how these economic factors could be modified or
fixed so as to achieve 11% growth over the previous year's GDP.
Figure 2 Shows where to find the tree classifiers in Weka

The following shows the output obtained by running the above data through Weka. The classifier used to
create the required decision tree is M5P. Weka's M5P algorithm is a rational reconstruction of M5 with
some enhancements; it builds on M5Base, which implements the base routines for generating M5 model
trees and rules. The original M5 algorithm was invented by R. Quinlan, and Yong Wang made
improvements. M5P (where the P stands for 'prime') generates M5 model trees using the M5' algorithm,
which was introduced in Wang & Witten (1997) and enhances the original M5 algorithm by Quinlan
(1992). The output of the analysis is shown below:

=== Run information ===

Scheme: weka.classifiers.trees.M5P -M 4.0

Relation:    Copy of Data_Rudra-weka.filters.unsupervised.attribute.Remove-R1

Instances: 945

Attributes: 6

        BROAD, CLAIMS, DOMC, PSC, TOTRES, LIQLB

Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===

M5 pruned model tree:

(Using smoothed linear models)

BROAD <= 153.045 : LM1 (13/5.644%)
BROAD > 153.045 :
|   PSC <= 203.02 :
|   |   BROAD <= 177.275 : LM2 (5/0.653%)
|   |   BROAD > 177.275 :
|   |   |   TOTRES <= 871108500000 : LM3 (11/8.309%)
|   |   |   TOTRES > 871108500000 : LM4 (4/1.446%)
|   PSC > 203.02 : LM5 (7/2.741%)



LM num: 1

LIQLB = 0.7447 * BROAD + 0.1474 * PSC - 0 * TOTRES + 22.3168

LM num: 2

LIQLB = 0.586 * BROAD + 0.2788 * PSC - 0 * TOTRES + 30.3097

LM num: 3

LIQLB = 0.4606 * BROAD + 0.2504 * PSC - 0 * TOTRES + 58.87

LM num: 4

LIQLB = 0.4996 * BROAD + 0.2504 * PSC - 0 * TOTRES + 50.7563

LM num: 5

LIQLB = 0.7016 * BROAD + 0.2497 * PSC - 0 * TOTRES + 15.2517

Number of Rules: 5



Time taken to build model: 0.08 seconds
=== Cross-validation ===

=== Summary ===



Correlation coefficient          0.9882

Mean absolute error              3.412

Root mean squared error            5.4145

Relative absolute error           11.529 %

Root relative squared error       15.1993 %

Total Number of Instances          40

Ignored Class Unknown Instances     905




Interpretation of the Results:

Based on the data above, the M5 algorithm generates a model tree built from five linear models (LMs).
The first split is on broad money (BROAD): if BROAD <= 153.045 we follow linear model 1 (LM1) to
estimate liquid liability (LIQLB) in the economy. If BROAD > 153.045 we check PSC (and, further down,
BROAD and TOTRES) and move down the tree, choosing the corresponding linear model to estimate the
liquidity, which in turn informs the GDP outcome we are interested in, as shown in the tree above.
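
As a hand-worked illustration (the arithmetic here is ours, not part of the Weka output): the 1975 row has
BROAD = 118.31, which is <= 153.045, so LM1 applies. With PSC = 129.96 and the TOTRES coefficient of 0,
LM1 gives LIQLB = 0.7447 * 118.31 + 0.1474 * 129.96 + 22.3168, which is roughly 129.6 and close to the
actual 1975 value of 127.56.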



Linear Regression with Weka
The second technique is to run linear regression on the same data in Weka. When the outcome, or class,
is numeric and all the attributes are numeric, linear regression is a natural technique to consider. The
idea is to express the class as a linear combination of the attributes, with weights estimated from the
training data. Whereas M5P above fit five separate linear models to different regions of the data,
ordinary linear regression fits a single model to all of it, and its performance here turns out to be slightly
worse (compare the error figures below with those of M5P). From the same data we can also obtain a
linear regression equation relating the various parameters that determine GDP. To run the regression,
go to the Classify tab in Weka and choose LinearRegression under functions, as shown.




Figure 3 Shows where to find LR in Weka
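
For completeness, the same regression and 10-fold cross-validation can also be run from the Weka Java
API. This is a minimal sketch under the same assumptions as before (hypothetical file name, LIQLB as the
last attribute); the -S 0 -R 1.0E-8 options simply mirror the scheme line in the output below.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class RunRegression {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("japan_economy.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // predict LIQLB

        LinearRegression lr = new LinearRegression();
        lr.setOptions(Utils.splitOptions("-S 0 -R 1.0E-8"));
        lr.buildClassifier(data);
        System.out.println(lr);                         // prints the fitted equation

        // 10-fold cross-validation, as in the Explorer run below
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new LinearRegression(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}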

The following output is generated by the above analysis:

=== Run information ===

Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8

Relation:    Copy of Data_Rudra

Instances: 945

Attributes: 7

        YEAR

        BROAD
CLAIMS

        DOMC

        PSC

        TOTRES

        LIQLB

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Linear Regression Model

LIQLB = 1.2523 * BROAD + 0.6062 * CLAIMS - 0.1407 * DOMC + 0 * TOTRES - 6.9705

Time taken to build model: 0.2 seconds

=== Cross-validation ===

=== Summary ===



Correlation coefficient              0.9738

Mean absolute error                  4.8731

Root mean squared error               8.0404

Relative absolute error              16.4661 %

Root relative squared error          22.5707 %

Total Number of Instances               40

Ignored Class Unknown Instances         905

The above analysis gives us a (linear) mathematical relationship between the variables. The equation
also tells us how the variables are related: a negative coefficient indicates an inverse relationship, and a
positive coefficient a direct one. To see the same relationships in pictorial form, simply go to the
Visualize tab in the Weka Explorer; the same is shown in the figure below. Finally, the value of the
dependent variable (LIQLB) can be computed once the values of the independent variables are known,
as sketched below.
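
To make that concrete, a learned model can be asked for a prediction on a new row through the Java
API. The sketch below builds the regression on the full data and then predicts LIQLB for a hypothetical
new year; all attribute values in the new instance are made-up placeholders, and the class value is left
missing so that the model fills it in.

import weka.classifiers.functions.LinearRegression;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictLiqlb {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("japan_economy.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);    // LIQLB is the class

        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(data);

        // New instance: set the independent attributes, leave LIQLB missing
        Instance row = new DenseInstance(data.numAttributes());
        row.setDataset(data);
        row.setValue(data.attribute("YEAR"), 2010);
        row.setValue(data.attribute("BROAD"), 225.0);
        row.setValue(data.attribute("CLAIMS"), -1.0);
        row.setValue(data.attribute("DOMC"), 322.0);
        row.setValue(data.attribute("PSC"), 172.0);
        row.setValue(data.attribute("TOTRES"), 1.05e12);

        double predictedLiqlb = lr.classifyInstance(row);
        System.out.println("Predicted LIQLB: " + predictedLiqlb);
    }
}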
CLUSTERING IN WEKA
Clustering is a technique used to group similar instances (rows) in terms of Euclidean distance. We have
used the SimpleKMeans clustering algorithm to analyse our initial data. In the SimpleKMeans
implementation the number of clusters is specified by the user (here -N 2); Weka's EM clusterer, by
contrast, can choose the number of clusters itself using cross-validation, in which case the number of
folds is fixed at 10. The output below shows the result of SimpleKMeans on the above data. The result is
shown as a table whose rows are attribute names and whose columns correspond to the cluster
centroids; an additional column at the beginning describes the entire data set. The number of instances
in each cluster appears in parentheses at the top of its column. Each table entry is either the mean or
the mode of the corresponding attribute for the cluster in that column. The bottom of the output shows
the result of applying the learned cluster model: in this case each training instance was assigned to one
of the clusters, giving the same counts as the parenthetical numbers at the top of each column. An
alternative is to use a separate test set or a percentage split of the training data, in which case the
figures would differ. This technique could also be applied to data from other countries in addition to the
present data for Japan.
=== Run information ===

Scheme:weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10

Relation:    Copy of Data_Rudra

Instances: 945

Attributes: 7

       YEAR

       BROAD

       CLAIMS

       DOMC

       PSC

       TOTRES

       LIQLB

Test mode: evaluate on training data

=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 5

Within cluster sum of squared errors: 12.988387913678944

Missing values globally replaced with mean/mode



Cluster centroids:

                                        Cluster#
Attribute           Full Data                   0                   1
                        (945)               (929)                (16)
======================================================================
YEAR                   1989.5           1989.2933              2001.5
BROAD                174.1633            173.4625            214.8525
CLAIMS                 6.6645              6.8103             -1.7981
DOMC                 242.2808            241.2956            299.4794
PSC                  168.2627            167.8077             194.685
TOTRES      248907476505.9463   243675387834.3592        552695625000
LIQLB                175.2342            174.7166            205.2875

Time taken to build model (full training data) : 0.14 seconds

=== Model and evaluation on training set ===

Clustered    Instances

0           929 (98%)

1           16 (2%)
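
The same clustering run can be reproduced programmatically. Again this is only a sketch under the
earlier assumptions (hypothetical file name), with the options matching the -N 2 -I 500 -S 10 scheme
shown above. No class index is set, since clustering is unsupervised.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunKMeans {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("japan_economy.arff").getDataSet();

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2);       // -N 2
        km.setMaxIterations(500);   // -I 500
        km.setSeed(10);             // -S 10
        km.buildClusterer(data);

        System.out.println(km);     // centroids and cluster sizes

        // Cluster assignment for each instance
        for (int i = 0; i < data.numInstances(); i++) {
            System.out.println("Instance " + i + " -> cluster "
                    + km.clusterInstance(data.instance(i)));
        }
    }
}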

We can also visualize the clusters formed: right-click the entry in the result list and select "Visualize
cluster assignments". We get the following output:
  • 11. === Model and evaluation on training set ===kMeans====== Number of iterations: 5 Within cluster sum of squared errors: 12.988387913678944 Missing values globally replaced with mean/mode Cluster centroids: Cluster# Attribute Full Data 0 1 (945) (929) (16) ================================================================= YEAR 1989.5 1989.2933 2001.5 BROAD 174.1633 173.4625 214.8525 CLAIMS 6.6645 6.8103 -1.7981 DOMC 242.2808 241.2956 299.4794 PSC 168.2627 167.8077 194.685 TOTRES 248907476505.9463 243675387834.3592 552695625000 LIQLB 175.2342 174.7166 205.2875 Time taken to build model (full training data) : 0.14 seconds === Model and evaluation on training set === Clustered Instances 0 929 (98%) 1 16 (2%) We can also visualize the clusters formed. Right click on the result-list output and select cluster visualize. We get the following output: