Guide for reproducing
results of Bioassay paper
using Weka
Important points to remember before
starting a run:
•   All datasets must be in ARFF format; otherwise Weka will report an incompatible
    format during training and testing.
•   Standard classifiers are used for confirmatory screen data, as it is smaller and less
    imbalanced, whereas cost-sensitive classifiers are used with primary and mixed datasets,
    as they are more imbalanced.
•   We have two goals:
       1. To find the most robust and versatile classifier for imbalanced bioassay data.
       2. To find the optimal misclassification cost setting for a classifier.
•   The misclassification cost for False Negatives is set so as to achieve the maximum
    number of True Positives with a False Positive rate below 20%.
•   The datasets are randomly split into an 80% training and validation set and a 20%
    independent test set, so there should be two files for each dataset: one for training the
    classifier and one for testing the model built by that classifier.
•   Use 5-fold cross-validation for larger datasets (primary and mixed screens) and
    10-fold cross-validation for smaller datasets (confirmatory screens).
•   CostSensitiveClassifier is used for the base classifiers Naïve Bayes, SMO (Sequential Minimal
    Optimization) and Random Forest, as it outperforms other meta-learners.
•   MetaCost with J48 produces better results than other meta-learners.
•   For Naïve Bayes and Random Forest, default options are used.
•   For SMO, the option BuildLogisticModels was set to true.
•   For J48, the option Unpruned was set to true.
•   For more details, please refer to the paper.
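
The 80%/20% split mentioned above can be sketched in a few lines of Python. This is only an illustration of a seeded random split, not the procedure used in the paper; the function name `split_80_20`, the seed and the compound IDs are all hypothetical.

```python
import random


def split_80_20(instances, seed=1):
    """Shuffle a copy of the instances and split it into 80% / 20% parts."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(instances)         # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data: 100 compound IDs standing in for the rows of a dataset.
compounds = [f"cmpd_{i}" for i in range(100)]
train, test = split_80_20(compounds)
print(len(train), len(test))
```

Each part would then be saved as its own ARFF file, one for training and one for testing.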
Step-by-step guide to setting up a Weka run:
1. Start the Weka Explorer.
2. In the Preprocess tab, click Open file…
3. Select a training file in ARFF format and click Open.
4. For example, AID1608red_train.arff.
5. After opening, the file should look like:
   [Screenshot: Preprocess tab with the training file loaded]
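
For readers who have not seen an ARFF file, the sketch below writes a tiny one from Python. The general layout (`@relation`, `@attribute`, `@data`) is standard ARFF, but the relation name, attribute names and values are invented for illustration and are not taken from AID1608red_train.arff.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from (name, type) attribute pairs and data rows.

    A type is either an ARFF keyword such as "numeric" or a list of
    nominal values, which is written as {value1,value2}.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical descriptors and class labels.
attributes = [
    ("mol_weight", "numeric"),
    ("logp", "numeric"),
    ("outcome", ["active", "inactive"]),
]
rows = [(312.4, 2.1, "active"), (198.2, 0.7, "inactive")]
arff_text = to_arff("toy_bioassay", attributes, rows)
print(arff_text)
```

The class attribute is listed last here because Weka treats the last attribute as the class by default.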
6. Now click on the Classify tab in the menu bar.
7. We will first train a model using the Naïve Bayes classifier. As we are using the confirmatory
   screen AID1608, we first apply standard classifiers; if the False Positive rate is below
   20%, cost-sensitive classifiers are then used.
8. Click the Choose button to select a classifier. From the bayes folder, choose NaiveBayes.
9. Your window should appear as below, with Cross-validation selected with 10 folds:
   [Screenshot: Classify tab with NaiveBayes chosen and Cross-validation set to 10 folds]
10. Now click the Start button; the model will start building.
11. Since we selected 10-fold cross-validation, Weka builds a model for each of the 10 folds.
    Check the status area for progress until the run has completed.
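
To make the idea of 10 folds concrete, here is a minimal sketch of how k-fold cross-validation partitions a dataset. It is a plain, unstratified partition written for illustration; Weka's own fold assignment (which shuffles and stratifies the data) will differ, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) index lists for each of k folds over n rows."""
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


# Hypothetical dataset size of 103 rows, split into 10 folds.
folds = list(k_fold_indices(n=103, k=10))
print(len(folds))
print([len(test_idx) for _, test_idx in folds])
```

Every row is used for testing exactly once and for training in the other nine folds.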
12. Look at the output section and scroll to the bottom, as shown:
    [Screenshot: bottom of the classifier output]
13. This is the model generated by the Naïve Bayes classifier using the training set
    AID1608red_train.
14. The next step is to test this model on the independent test set AID1608red_test.
15. In the Test options section, select Supplied test set and click Set.
16. Open the test file AID1608red_test.
17. After the file has been read, close the Test instances dialog by clicking Close.
18. Now right-click on your model in the Result list and choose Re-evaluate model on current
    test set.
19. Within a fraction of a second, results are produced in the same output window.
    [Screenshot: confusion matrix in the output, with the True positive, False negative,
    False positive and True negative cells labelled]
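
The four confusion-matrix cells are all that is needed to compute the two rates this guide tracks. The sketch below uses the standard definitions (TP rate = TP / (TP + FN), FP rate = FP / (FP + TN)); the counts are invented for illustration and are not the AID1608 results.

```python
def rates(tp, fn, fp, tn):
    """Return (true positive rate, false positive rate) from the four cells."""
    tp_rate = tp / (tp + fn)   # fraction of actual actives predicted active
    fp_rate = fp / (fp + tn)   # fraction of actual inactives predicted active
    return tp_rate, fp_rate


# Hypothetical counts only.
tp_rate, fp_rate = rates(tp=30, fn=20, fp=15, tn=135)
meets_goal = fp_rate < 0.20    # the 20% False Positive limit used in this guide
print(f"TP rate = {tp_rate:.1%}, FP rate = {fp_rate:.1%}, within limit: {meets_goal}")
```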




20. We have obtained a False Positive rate of 14.5%, which is less than 20%, and a True
    Positive rate of 15.4%, which is very low. We will now set up a cost-sensitive classifier
    to improve the results.
21. As mentioned in the important points at the start of this tutorial, for Naïve Bayes we
    will use Weka's CostSensitiveClassifier.
22. The author used incremental costing, where the cost was increased in stages from 2 to
    1000000 until a 20% False Positive rate was reached.
23. So we will set up a cost matrix, starting with a misclassification cost of 2.
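
The incremental costing loop can be pictured as follows. This is a sketch under stated assumptions: the stages between 2 and 1000000 are not given here, so the cost schedule is a made-up example, and `toy_evaluate` is a hypothetical stand-in for "train in Weka at this cost and read off the rates", returning invented numbers.

```python
def search_cost(evaluate, costs, max_fp_rate=0.20):
    """Raise the False Negative cost in stages; stop once the FP limit is hit.

    `evaluate(cost)` must return (tp_rate, fp_rate). The result is the
    (cost, tp_rate, fp_rate) with the highest TP rate seen below the limit,
    or None if even the first cost exceeds it.
    """
    best = None
    for cost in costs:
        tp_rate, fp_rate = evaluate(cost)
        if fp_rate >= max_fp_rate:
            break                      # limit reached, stop increasing the cost
        if best is None or tp_rate > best[1]:
            best = (cost, tp_rate, fp_rate)
    return best


def toy_evaluate(cost):
    """Hypothetical results table standing in for real Weka runs."""
    table = {2: (0.30, 0.05), 5: (0.45, 0.09), 10: (0.60, 0.15), 50: (0.80, 0.27)}
    return table[cost]


best = search_cost(toy_evaluate, costs=[2, 5, 10, 50])
print(best)
```

In practice each call to `evaluate` corresponds to one manual pass through the Explorer steps that follow.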
24. Click the Choose button and select CostSensitiveClassifier from the meta folder.
25. Click on the text box showing the classifier name to open the GenericObjectEditor
    dialog box, as shown:
    [Screenshot: GenericObjectEditor dialog for CostSensitiveClassifier]
26. In this dialog box, use the Choose button for classifier to select Naïve Bayes.
27. Next, click on costMatrix to set up the misclassification cost.
28. We have 2 classes in our dataset (actives and inactives), so we will set up a 2x2
    matrix (for TP, FP, TN, FN).
    •   In Classes, enter 2.
    •   Click Resize to create a 2x2 matrix.
    •   Change the misclassification cost for false negatives to 2 (write 2 in place of 1).
    •   Then close the dialog box.
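
To see why raising the False Negative cost recovers more actives, consider the general minimum-expected-cost rule sketched below. It illustrates the principle only and is not a description of what Weka's CostSensitiveClassifier does internally with default options. The matrix layout (rows are actual classes, columns are predicted classes) and the class order are assumptions of this sketch.

```python
CLASSES = ("active", "inactive")


def make_cost_matrix(fn_cost, fp_cost=1.0):
    """Return cost[actual][predicted] with zero cost for correct predictions."""
    return {
        "active": {"active": 0.0, "inactive": fn_cost},    # false negative
        "inactive": {"active": fp_cost, "inactive": 0.0},  # false positive
    }


def min_expected_cost_class(p_active, cost):
    """Pick the prediction whose expected misclassification cost is lowest."""
    probs = {"active": p_active, "inactive": 1.0 - p_active}
    expected = {
        pred: sum(probs[actual] * cost[actual][pred] for actual in CLASSES)
        for pred in CLASSES
    }
    return min(CLASSES, key=lambda pred: expected[pred])


# A hypothetical compound that the model considers 40% likely to be active.
with_cost_2 = min_expected_cost_class(0.40, make_cost_matrix(fn_cost=2))
with_cost_1 = min_expected_cost_class(0.40, make_cost_matrix(fn_cost=1))
print(with_cost_2, with_cost_1)
```

With equal costs the compound is called inactive; with a False Negative cost of 2 the same compound is called active, which raises True Positives at the price of more False Positives.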
29. Leave all other options at their defaults and close the GenericObjectEditor dialog by clicking OK.
30. Click Start to begin building the cost-sensitive model.
31. Repeat steps 13-19 as described above for testing.
32. See the improved results: True Positives have increased while False Positives remain
    within the 20% limit.
33. We stop here, as we have achieved our goal.
34. Similarly, you can build models using SMO, Random Forest and J48. Check their
    settings, as listed in the important points above, before starting the run.
