Modeling using WEKA
Index
• WEKA Introduction
• WEKA file formats
• Loading data
• Univariate analysis
• Data Manipulation
• Feature Selection
• Creating Training, Validation and Test Sets
• Model Execution - Logistic Regression
• Model Analysis - ROC Curve
• Model Analysis – Cost/Benefit Analysis
• Re-apply model on new data
• WEKA Pluses and Limitations
Introduction
• Weka is a collection of machine learning algorithms for data mining tasks
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• ARFF (Attribute-Relation File Format) is Weka's native dataset file format.
Dataset File formats
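As a rough illustration of the ARFF layout (a header with `@relation` and `@attribute` lines, followed by `@data` rows), the sketch below renders a tiny dataset as ARFF text. The function name `to_arff` and the sample "loans" data are hypothetical and not from the slides; the sketch also assumes attribute names and values need no quoting.

```python
def to_arff(relation, attributes, rows):
    """Render a dataset as ARFF text.

    attributes: list of (name, type) pairs, where type is either the string
    "numeric" or a list of nominal labels. Missing values are passed as None
    and written as "?", the ARFF missing-value marker.
    """
    lines = [f"@relation {relation}", ""]
    for name, attr_type in attributes:
        if isinstance(attr_type, list):
            # Nominal attributes list their allowed labels in braces.
            attr_type = "{" + ",".join(attr_type) + "}"
        lines.append(f"@attribute {name} {attr_type}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical two-attribute dataset with a nominal class attribute.
arff_text = to_arff(
    "loans",
    [("income", "numeric"), ("defaulted", ["yes", "no"])],
    [(42000, "no"), (None, "yes")],
)
print(arff_text)
```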
Load Data Set
Univariate Analysis
• Current relation – dataset name, number of records, and number of attributes in the dataset.
• Attributes – the list of all attributes, from which one is selected for univariate analysis.
Univariate Analysis
• Selected attribute – provides information about the attribute's type, missing values, distinct values, etc.
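The counts shown for a selected attribute can be approximated by hand. The sketch below is only an illustration: `summarize` and the sample column are made up, `None` stands in for a missing value, and the "unique" count (values occurring exactly once) is an extra included on the assumption that it is useful next to the distinct count.

```python
from collections import Counter


def summarize(values):
    """Summarize one attribute; None stands for a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "distinct": len(counts),
        # Values that occur exactly once in the column.
        "unique": sum(1 for c in counts.values() if c == 1),
    }


ages = [25, 31, None, 25, 47, None, 52]  # made-up sample column
summary = summarize(ages)
print(summary)
```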
Univariate Analysis
• Selected attribute – histogram
• Shows the distribution (dispersion) of the attribute's values.
Univariate Analysis
• All attribute visualization/plots
Data Manipulation
• Changing data type of field
• Missing values update
• Creating BINS from data
• Standardize data
• Outlier Treatment
• Creating new calculated fields
1. Convert NA to 0
• Flow > Preprocess > Edit > Right Click on Attribute > Replace Values
2. Changing data types of attribute
• Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute >
2. Changing data types
3. Creating BINS
• Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > Discretize
3. Creating BINS
• Provide the attribute number and the number of bins to be created, then click on ‘Apply’
• Click on the attribute to see the bins and their distribution
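What binning does to a numeric column can be sketched as follows, assuming equal-width bins (the range between the minimum and maximum is cut into intervals of the same width). The function name and the sample values are hypothetical.

```python
def equal_width_bins(values, num_bins):
    """Assign each value to one of num_bins equal-width intervals (0-based)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        # All values are identical, so everything lands in the first bin.
        return [0 for _ in values]
    # The maximum falls exactly on the last edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


incomes = [10, 12, 25, 31, 48, 50]  # made-up sample
bins = equal_width_bins(incomes, 4)
print(bins)
```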
3. Creating Custom BINS
• Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > AddExpression
• Example expression, where a2 refers to the second attribute: ifelse(a2 > 0, ifelse(a2 > 10, ifelse(a2 > 20, 4, 3), 2), 1)
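To make the nested ifelse expression easier to read, here is the same logic written out as a plain function. This is only a mirror of the expression for checking which value lands in which bin; the name `custom_bin` and the test values are illustrative.

```python
def custom_bin(a2):
    """Mirror of ifelse(a2 > 0, ifelse(a2 > 10, ifelse(a2 > 20, 4, 3), 2), 1)."""
    if a2 > 0:
        if a2 > 10:
            return 4 if a2 > 20 else 3
        return 2
    return 1


# Boundary values (0, 10, 20) fall into the lower bin because the tests are strict.
labels = [custom_bin(v) for v in (-5, 0, 7, 10, 15, 20, 35)]
print(labels)
```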
4. Standardize data
• Converts all numeric attributes in the data to zero mean and unit variance.
• Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > Standardize
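A minimal sketch of standardization for a single numeric column: subtract the mean and divide by the standard deviation. Using the sample standard deviation (n - 1 denominator) is an assumption made here; the function name and data are hypothetical, and a constant column would need special handling because its standard deviation is zero.

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale values to zero mean and unit (sample) variance."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mu) / sigma for v in values]


scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
z = standardize(scores)
print(z)
```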
4. Standardize data - Log values
• To convert specific numeric attributes to their log values.
• Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > NumericTransform
4. Standardize data - Log values
• Provide the number of the attribute that is to be converted to its log value.
• Also provide the method name: log. Other methods such as abs, round, or floor can be provided here as well.
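The effect of a log transform on one column can be sketched like this. The helper name, the column position, and the data are hypothetical; the natural logarithm is used, and all values in the chosen column are assumed to be positive.

```python
import math


def log_transform(rows, column):
    """Return copies of rows with the natural log applied to one column."""
    out = []
    for row in rows:
        new_row = list(row)  # copy so the input rows stay unchanged
        new_row[column] = math.log(new_row[column])
        out.append(new_row)
    return out


rows = [[1, 1.0], [2, math.e], [3, 100.0]]  # made-up data; column 1 is transformed
transformed = log_transform(rows, 1)
print(transformed)
```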
5. Identify Outliers
• Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > InterquartileRange
• Outliers can be identified for individual attributes or for all attributes together
• The filter marks outliers in a new nominal Outlier attribute, which the removal step below uses
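The idea behind interquartile-range outlier detection can be sketched as below: a value is flagged when it lies more than `factor` times the IQR outside the quartiles. The factor of 1.5 is the common textbook convention and is an assumption here, not necessarily the setting used by the filter; the function name and sample values are hypothetical.

```python
from statistics import quantiles


def flag_outliers(values, factor=1.5):
    """Return a yes/no flag per value based on the interquartile range."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return ["yes" if v < low or v > high else "no" for v in values]


amounts = [10, 12, 11, 13, 12, 11, 10, 300]  # made-up sample with one extreme value
flags = flag_outliers(amounts)
print(flags)
```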
5. Remove Outliers
• Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Instance > RemoveWithValues
• Parameters: attributeIndex = number of the Outlier attribute; nominalIndices = index of the nominal value that marks an outlier in the Outlier attribute
5. Transform Outliers
• Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > AddExpression – this option creates a new field, e.g.: ifelse(a2 > 1000, 200, 1)
6. New Calculated fields
• This is helpful in case a new field is to be derived from existing fields
• Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > AddExpression
• Provide the expression/equation and the new field name
Feature Selection/Attribute selection
1. Info Gain
2. Correlation
Feature Selection – Info Gain
• Flow > Select Attribute > Attribute evaluator > Choose >
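Information gain, the criterion named on this slide, is the entropy of the class minus the expected entropy of the class after splitting on an attribute. The sketch below computes it for a nominal attribute; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def info_gain(feature, labels):
    """Entropy of the class minus its expected entropy after splitting on feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


outlook = ["sunny", "sunny", "rain", "rain"]  # made-up nominal attribute
play = ["no", "no", "yes", "yes"]             # made-up class labels
gain = info_gain(outlook, play)
print(gain)
```

An attribute that separates the classes perfectly, as in this toy data, has a gain equal to the class entropy; an attribute unrelated to the class has a gain of zero.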
Feature Selection – Correlation
• Flow > Select Attribute > Attribute evaluator > Choose >
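Correlation-based selection ranks attributes by how strongly they correlate with the class. A minimal sketch using Pearson's correlation coefficient follows; the function name and data are hypothetical, and the class is assumed to be coded as 0/1.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


income = [20, 30, 40, 50]  # made-up numeric attribute
defaulted = [1, 1, 0, 0]   # made-up class coded as 0/1
r = pearson(income, defaulted)
print(r)
```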
Features for Model
• Features selected for model
Creating Training, Validation and Test Sets
Creating data sets
1. Dividing data into 60-20-20 % (Train-Test-Evaluate)
2. Weka inbuilt methods
Creating data sets
• For 60%-20%-20%
• Step 1 – Flow > Preprocess > Filter > Choose > Unsupervised > Instance > Resample
Creating data sets
• Step 2 – Parameters for Resample
• Flow > Preprocess > Filter > Choose > Unsupervised > Instance > Resample >
• Set noReplacement = True, sampleSizePercent = 60 > OK > Apply
Creating data sets
• Step 3 – Check: after Apply, check the current relation for the number of records selected
• Step 4 – Save the result as filename_train.arff
• Step 5 – Click on ‘Undo’ to get back to the original data set
• Step 6 – Change the Resample parameters again
• Parameters: invertSelection = True, noReplacement = True, sampleSizePercent = 60
Creating data sets
• Step 7 – Apply and check results as below
Creating data sets
• Step 8 – Don’t save the results
• Step 9 – Open the Resample parameters and set the parameters below
• invertSelection = False, noReplacement = True, sampleSizePercent = 50
• OK > Apply, then check the results
Creating data sets
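• Step 10 – Check the results
• Step 11 – Save the results as Test Data.
• Step 12 – Click on ‘Undo’ to get back to the earlier 40% of the data set
• Step 13 – Parameters: invertSelection = True, noReplacement = True, sampleSizePercent = 50
• Step 14 – OK > Apply
• Step 15 – Save as Evaluation Data
• Note: keep the Resample randomSeed unchanged between steps, so that invertSelection = True returns exactly the records that the preceding sample left out.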
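The outcome of the Resample steps above, three disjoint sets of 60%, 20%, and 20%, can be sketched in a few lines. This is not how the filter works internally; it is a plain shuffle-and-slice under an assumed fixed seed, with a hypothetical function name and 100 dummy records.

```python
import random


def split_60_20_20(records, seed=1):
    """Shuffle once, then cut into train (60%), test (20%) and evaluation (20%)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = len(shuffled) * 60 // 100
    n_test = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]
    return train, test, evaluation


train, test, evaluation = split_60_20_20(range(100))
print(len(train), len(test), len(evaluation))
```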
Creating data sets
2. Weka inbuilt – Flow > Classify > Test Options >
Use Training Set – With this option, the loaded data set is used as the training set to create the model, and the model is evaluated on that same data
Creating data sets
2. Weka inbuilt – Flow > Classify > Test Options >
Use Supplied Test Set – With this option, a separately supplied data set is used as the test set to test the model
Creating data sets
2. Weka inbuilt – Flow > Classify > Test Options >
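Cross Validation – With this option, the data set is divided into folds (10 by default). For each fold, Weka internally builds a model on the remaining folds and tests it on the held-out fold, then combines the evaluation results across folds. The final model shown on the UI is built on the full data set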
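How records are rotated through folds in cross-validation can be sketched as follows. This simplified version assigns records to folds by position and does no shuffling or stratification, which is an assumption of the sketch; the function name is hypothetical.

```python
def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for each of k folds over n records."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


folds = list(k_fold_indices(25, k=10))
for train_idx, test_idx in folds[:2]:
    print(len(train_idx), test_idx)
```

Every record is used for testing exactly once and for training in all the other folds.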
Creating data sets
2. Weka inbuilt – Flow > Classify > Test Options >
Percentage Split – With this option, the data set is divided by the given percentage into a training set, used to create the model, and a test set, used to test it
Logistic Regression
Logistic Regression
• Flow > Classify > Classifier > Choose > functions > Logistic
Logistic Regression
• Parameter selection
Logistic Regression
• Model Results:
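To help read logistic regression results, the sketch below shows how coefficients turn into a predicted probability and into odds ratios for a binary model. The intercept, coefficients, and feature values are hypothetical and are not taken from the slides.

```python
from math import exp


def predict_probability(intercept, coefficients, features):
    """P(class = 1) under a binary logistic model."""
    score = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + exp(-score))


def odds_ratios(coefficients):
    """exp(coefficient): multiplicative change in the odds per unit increase."""
    return [exp(c) for c in coefficients]


# Hypothetical fitted values, not taken from the slides.
intercept, coefficients = -1.0, [0.5, 0.0]
p = predict_probability(intercept, coefficients, [2.0, 3.0])
ratios = odds_ratios(coefficients)
print(p, ratios)
```

A coefficient of zero gives an odds ratio of 1, meaning the attribute does not change the odds.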
Model Analysis
• Flow > Classify > right-click on the model in the result list > Visualize threshold curve > ROC Curve
Model Analysis- ROC Curve
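An ROC curve plots the true positive rate against the false positive rate as the classification threshold is lowered. The sketch below builds the curve points and the area under them from predicted scores; the function names and data are made up, and the scores are assumed to be distinct (ties are not handled).

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) points, sweeping the threshold from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # made-up predicted probabilities
labels = [1, 1, 0, 1, 0, 0]              # made-up true classes (1 = positive)
area = auc(roc_points(scores, labels))
print(area)
```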
Model Analysis- Cost/Benefit Analysis
• Flow > Classify > Right click model > Cost/Benefit Analysis
Model Analysis- Cost/Benefit Analysis
• Flow > Classify > Right click model > Cost/Benefit Analysis > Threshold Bar
• Sliding the bar under the Threshold label changes the classification threshold; the accuracy and the threshold curve update accordingly
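The trade-off explored with the threshold bar can be sketched as a search for the threshold with the lowest total misclassification cost. The cost values, the function name, and the data below are hypothetical; candidate thresholds are simply the observed scores.

```python
def total_cost(scores, labels, threshold, fp_cost, fn_cost):
    """Cost of classifying as positive every record with score >= threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 0:
            cost += fp_cost  # false positive
        elif predicted == 0 and label == 1:
            cost += fn_cost  # false negative
    return cost


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # made-up predicted probabilities
labels = [1, 1, 0, 1, 0, 0]              # made-up true classes
# Hypothetical costs: a missed positive is five times worse than a false alarm.
best = min(scores, key=lambda t: total_cost(scores, labels, t, fp_cost=1, fn_cost=5))
print(best, total_cost(scores, labels, best, fp_cost=1, fn_cost=5))
```

With these costs the cheapest threshold accepts one false positive in order to avoid any false negative.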
Save prediction output to file
• Flow > Classify > Test Options > More Options > Output Predictions > text bar to provide the file name
• Parameters: Choose – to provide the file type; attributes – first-last to get all fields; outputFile – file name to save the data
Re-apply model on new data
WEKA Pluses:
• Platform independent and portable; the Java library can be invoked from Java and other JVM languages, or from other programs through the command line
• User-friendly GUI with built-in visualization; simpler to use than R; large collection of different data mining algorithms
• Good results for classification and cluster modeling
• Ease of designing solutions
• Provides 3 ways to use the software: the GUI, a Java API, and a command line interface (CLI)
• Can work with Spark and big data through additional packages, in the Experimenter or in batch mode
WEKA Limitations:
• Visualizations can be managed better in R with packages like ggplot
• Not really flexible for data manipulation
• Accepts only a limited set of file formats, like CSV and ARFF
• Limited documentation available on the Explorer
THANK YOU !
Decision Tree
Decision Tree: Algorithm selection
Decision Tree: Setting params for algo
Decision Tree: Execution
Tree Visualization

Analytics machine learning in weka
