Weka tutorial
Speaker: 楊明翰
What is Weka?
A collection of machine learning algorithms for data
mining tasks
Weka contains tools for
• data pre-processing,
• classification, regression,
• clustering,
• association rules, and
• visualization.
Suggestion: Version 3.5.8
What can it help with in your hw1?
• Visualization
• Data analysis
• Easy to try different classifiers
But…
If you want better performance, you still have to implement many things yourself, such as cross-validation, parameter selection, and clustering.
P.S. You are free to use anything to complete the
homework.
Explorer
Classifier
Black: built in
Red: supported, but must be downloaded by the user
Installation guide for LibSVM:
http://www.cs.iastate.edu/~yasser/wlsvm/
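To actually run LibSVM from Weka, the WLSVM wrapper jar and the libsvm jar must be on the classpath together with weka.jar. A possible invocation on Windows (the jar names and locations are assumptions that depend on your setup; use ':' as the separator on Linux):

java -cp weka.jar;wlsvm.jar;libsvm.jar weka.gui.GUIChooser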
Use Weka in your Java code
The most common components you might want to use are:
– Instances - your data
– Filter - for pre-processing the data
– Classifier/Clusterer - built on the processed data
– Evaluating - how good is the classifier/clusterer?
– Attribute selection - removing irrelevant attributes from your data (a short sketch follows below)
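As a minimal sketch of the last item (attribute selection), assuming the standard CfsSubsetEval evaluator and GreedyStepwise search, and that data already holds your Instances with the class index set:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
...
AttributeSelection attsel = new AttributeSelection();
attsel.setEvaluator(new CfsSubsetEval());    // correlation-based subset evaluator
attsel.setSearch(new GreedyStepwise());      // greedy stepwise search
attsel.SelectAttributes(data);               // run the selection on the data
int[] indices = attsel.selectedAttributes(); // indices of the kept attributes (class included)
System.out.println(java.util.Arrays.toString(indices));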
ARFF format
Weka's text data format: @relation names the dataset, each @attribute declares a column (numeric, or nominal with its values listed in braces), and each line after @data is one comma-separated instance.
@relation KDDCUP
@attribute Ground-Truth {-1.0,1.0}
@attribute Image-Finding-ID numeric
@attribute Study-Finding-ID numeric
@attribute Image-ID numeric
@attribute Study-ID numeric
@attribute LeftBreast {0.0,1.0}
@attribute MLO {0.0,1.0}
@attribute X-location numeric
@attribute Y-location numeric
@attribute X-nipple-location numeric
@attribute Y-nipple-location numeric
@attribute att1 numeric
@attribute att2 numeric
…
@attribute att117 numeric
@attribute serialNumber numeric
@data
-1.0,0.0,0.0,0,150,0.0,0.0,1732.0,2380.0,1356.0,2106.0,-1.196111E-1,4.764423E-2,2.27225E-1,2.511147E-1,-6.94537E-2,-7.478557E-2,5.444844E-1,8.050464E-1,4.708327E-2,1.310514E0,-1.871811E-1,-4.098435E-1,-2.669971E-1,2.50289E-1,-2.438625E-1,8.022098E-2,8.098504E-1,9.880441E-2,3.374689E-4,-6.384426E-1,1.108627E0,1.043443E0,-1.612419E0,-5.633943E-1,-4.357306E-1,-4.572176E-1,8.236916E-2,5.218327E-1,1.922271E-1,4.565068E-1,-8.969028E-1,-4.403602E-1,1.41807E-1,-2.252249E-1,2.34936E-1,6.527024E-1,-5.750284E-1,-5.676962E-1,-5.344064E-1,-1.513411E-1,7.280352E-1,7.21983E-1,6.978422E-1,5.667439E-1,3.273161E-3,-6.958107E-2,7.912039E-1,1.659563E0,1.192391E0,1.173782E0,1.145927E0,1.645195E0,-5.52926E-1,-1.424765E-1,-1.416166E-1,-1.396449E-1,-1.374919E-1,-5.500465E-1,-3.0028E-2,2.788235E-1,1.178261E0,2.937468E-1,3.483202E-1,3.941773E-1,4.250069E-1,3.226059E-1,2.569432E-1,5.522287E-1,1.811639E0,1.844379E0,1.188755E0,1.86738E0,-1.05269E0,1.434895E-2,5.235738E-3,-4.779273E-3,-9.884836E-2,-9.526174E-1,-3.106309E-1,1.434759E0,1.486669E0,3.402836E-1,5.323643E-1,-3.38767E-1,-3.644332E-1,7.650664E-3,3.811143E-2,5.595391E-2,-3.589534E-1,-6.765502E-1,-6.669187E-1,-6.591878E-1,-2.893004E-1,1.048242E0,-7.317548E-1,-1.985699E-1,4.513422E-1,1.06145E0,4.777854E-1,1.267896E0,1.350758E0,1.337705E0,1.385917E0,1.091785E0,1.289325E0,5.511991E-1,-8.125907E-1,1.050196E0,-4.338815E-1,-4.664211E-1,6.203229E-1,-6.020947E-1,5.299978E-1,2.989034E-1,-7.676021E-2,1.5216E-1,-3.001498E-1,0
Instances
import weka.core.Instances;
import java.io.BufferedReader;
import java.io.FileReader;
...
Instances data = new Instances(
    new BufferedReader(new FileReader("/some/where/data.arff")));
// setting class attribute
data.setClassIndex(data.numAttributes() - 1);
// The class index indicates the target attribute used for classification.
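Newer Weka versions (including 3.5.x) also offer a converter that picks the right loader from the file extension; a small sketch (the path is just a placeholder):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
...
DataSource source = new DataSource("/some/where/data.arff");
Instances data = source.getDataSet();
if (data.classIndex() == -1)
    data.setClassIndex(data.numAttributes() - 1); // use the last attribute as class if none is set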
filters
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
...
String[] options = new String[2];
options[0] = "-R"; // "range"
options[1] = "1"; // first attribute
Remove remove = new Remove(); // new instance of filter
remove.setOptions(options); // set options
remove.setInputFormat(data); // inform filter about dataset AFTER setting options
Instances newData = Filter.useFilter(data, remove); // apply filter
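A filter can also be tied directly to a classifier via FilteredClassifier, so the same pre-processing is applied to both training and test data automatically; a minimal sketch (J48 is used here only as an example classifier; the cross-validation example later uses the same idea with LibSVM):

import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
...
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(remove);         // the Remove filter configured above
fc.setClassifier(new J48());  // any classifier works here
fc.buildClassifier(data);     // the filter is applied internally before training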
classifier
import weka.classifiers.functions.LibSVM;
...
String[] options = weka.core.Utils.splitOptions("-S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 40.0 -C 1.0 -E 0.0010 -P 0.1 -B");
LibSVM classifier = new LibSVM();  // new instance of the classifier
classifier.setOptions(options);    // set the options
classifier.buildClassifier(data);  // build classifier
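To check how good the trained classifier is, Weka's Evaluation class can run the cross-validation for you; a minimal sketch, assuming data already has its class index set:

import weka.classifiers.Evaluation;
import java.util.Random;
...
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(classifier, data, 10, new Random(1)); // 10-fold cross-validation
System.out.println(eval.toSummaryString());                   // accuracy, error rates, ...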
Classifying instances
import weka.core.Instance;
...
Instances unlabeled = … // load from somewhere
…
for (int i = 0; i < unlabeled.numInstances(); i++) {
    Instance ins = unlabeled.instance(i);
    double clsLabel = classifier.classifyInstance(ins);            // get the predicted label
    double[] prob_array = classifier.distributionForInstance(ins); // probability for each category
}
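If you want to keep the predictions, you can copy the unlabeled set, fill in the predicted class values, and save the copy as a new ARFF file; a sketch (the output path is a placeholder):

import weka.core.converters.ArffSaver;
import java.io.File;
...
Instances labeled = new Instances(unlabeled);        // copy of the unlabeled data
for (int i = 0; i < unlabeled.numInstances(); i++) {
    double clsLabel = classifier.classifyInstance(unlabeled.instance(i));
    labeled.instance(i).setClassValue(clsLabel);     // write the prediction into the copy
}
ArffSaver saver = new ArffSaver();
saver.setInstances(labeled);
saver.setFile(new File("/some/where/labeled.arff"));
saver.writeBatch();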
Example: Weka + LibSVM + 5-fold CV
import java.io.*;
import java.util.Random;
import weka.classifiers.functions.LibSVM;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.Remove;
...
public static void main(String[] args) throws Exception {
    PrintWriter pw_score = new PrintWriter(new FileOutputStream("c:\\temp\\score.txt"));
    PrintWriter pw_label = new PrintWriter(new FileOutputStream("c:\\temp\\label.txt"));
    PrintWriter pw_pid = new PrintWriter(new FileOutputStream("c:\\temp\\pid.txt"));
    Instances data = new Instances(
        new BufferedReader(
            new FileReader("C:\\temp\\TrainSet_sn.arff")));
    Remove remove = new Remove(); // new instance of filter
    remove.setOptions(weka.core.Utils.splitOptions("-R 2-11,129")); // set options
    remove.setInputFormat(data);  // inform filter about dataset AFTER setting options
    int seed = 2;  // the seed for randomizing the data
    int folds = 5; // the number of folds to generate, >= 2
    data.setClassIndex(0); // first attribute is the ground truth
    Instances randData;
    Random rand = new Random(seed); // create seeded number generator
    randData = new Instances(data); // create copy of original data
    randData.randomize(rand);       // randomize data with number generator
    for (int n = 0; n < folds; n++) {
        Instances train = randData.trainCV(folds, n);
        Instances test = randData.testCV(folds, n);
        System.out.println("Fold " + n + " train " + train.numInstances() + " test " + test.numInstances());
        String[] options = weka.core.Utils.splitOptions(
            "-S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 40.0 -C 1.0 -E 0.0010 -P 0.1 -B");
        LibSVM classifier = new LibSVM();
        classifier.setOptions(options);
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(remove);
        fc.setClassifier(classifier);
        fc.buildClassifier(train);
        for (int i = 0; i < test.numInstances(); i++) {
            double[] tmp = fc.distributionForInstance(test.instance(i));
            // tmp[0]: prob of negative
            // tmp[1]: prob of positive
            pw_label.println(test.instance(i).attribute(0).value((int) test.instance(i).value(0))); // ground truth
            pw_score.println(tmp[1]); // predicted score
            pw_pid.println((int) test.instance(i).value(4)); // study-ID
        }
    }
    pw_score.close();
    pw_label.close();
    pw_pid.close();
}
FROC
Algorithm:
1. Load “predicted score”, “ground truth”, and “patient id”.
2. Initialize:
   Detected_patients = [ ]
   Sort the rows by "predicted score", then "ground truth", then "patient id", all in descending order.
3. For each row:
   If the ground truth is negative, x += 1
   Else // a positive point
      If the patient is not in Detected_patients, // a new positive patient
         y += 1 and add patient_id to Detected_patients
      else // the patient was found before
         do nothing
4. Normalize:
   x => 0 ~ average false alarms per image, i.e. x is divided by the total number of images
   y => 0 ~ 1, i.e. y is divided by the number of patients
5. Calculate the area under the curve.
FROC tools-JAVA
java -cp bin mslab.kddcup2008.roc.ROC score.txt label.txt pid.txt
score.txt : predicted score for each point, i.e. the probability of being positive
label.txt : ground truth for each point
pid.txt : patient ID for each point
FROC tools-Matlab
• Matlab function
– [Pd_patient_wise,FA_per_image,AUC] = get_ROC_KDD(p,Y,PID,fa_low,fa_high)
• Pd_patient_wise
– The y location of each point on the curve.
• FA_per_image
– The x location of each point on the curve.
• AUC
– The area under the curve.
• p – Predicted score (probability of being positive)
• Y – Ground truth
• PID – Patient ID
– plot(FA_per_image, Pd_patient_wise);
FROC curve example
The result of the above example:
• AUC = 0.0782
Measurements by Points:
• TP = 237
• FN = 386
• FP = 108
• TN = 101563
• precision = 0.6870
• recall = 0.3804
• FScore = 0.4897
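These point-wise measurements follow from the usual definitions: precision = TP / (TP + FP) = 237 / 345 ≈ 0.6870, recall = TP / (TP + FN) = 237 / 623 ≈ 0.3804, and F-score = 2 * precision * recall / (precision + recall) ≈ 0.4897.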
Reference:
Use Weka in your Java code
Generating cross-validation folds
Download:
Example code
Java ROC code
Matlab ROC code