CMP: Data Mining and Statistics within the Health Services                                                                                                                                        19/02/2010




        Data Mining and Statistics
        Within the Health Services

                                     Tutorial for Weka
                                             a data mining tool

                                              Dr. Wenjia Wang
                                        School of Computing Sciences
                                          University of East Anglia

        [Figure: Data, Pre-processing, Data Mining, Knowledge]

        Content
        1.      Introduction to Weka
        2.      Data Mining Functions and Tools
        3.      Data Format
        4.      Hands-on Demos
             4.1 Weka Explorer
                 • Classification
                 • Attribute (feature) Selection
             4.2 Weka Experimenter
             4.3 Weka KnowledgeFlow
        5.      Summary


       Data Mining & Statistics within the Health Services
                                                                                                               Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)      2




        1. Introduction to WEKA
        • An open source collection of data mining and machine learning
          algorithms, including
              – pre-processing of data
              – classification
              – clustering
              – association rule extraction
        • Created by researchers at the University of Waikato in New Zealand
        • Java based (also open source).

        Weka Main Features
        • 49 data preprocessing tools
        • 76 classification/regression algorithms
        • 8 clustering algorithms
        • 15 attribute/subset evaluators + 10 search algorithms for feature selection.
        • 3 algorithms for finding association rules
        • 3 graphical user interfaces
              – “The Explorer” (exploratory data analysis)
              – “The Experimenter” (experimental environment)
              – “The KnowledgeFlow” (new process model inspired interface)




        Weka: Download and Installation

        • Download Weka (the stable version) from
               http://www.cs.waikato.ac.nz/ml/weka/
               – Choose a self-extracting executable (including Java VM)
               – (If you are interested in modifying/extending Weka, there
                 is a developer version that includes the source code)
        • After the download is completed, run the self-extracting file
          to install Weka, and use the default settings.

        Start Weka

        • From the Windows desktop,
               – click “Start”, choose “All programs”,
               – Choose “Weka 3.6” to start Weka
               – Then the first interface window appears: Weka GUI Chooser.





Dr. Wenjia Wang: Tutorial for DM tool Weka                                                                                                                                                                    1




       WEKA Application Interfaces                                                                   Weka Application Interfaces
                                                                                                    • Explorer
                                                                                                        – preprocessing, attribute selection, learning, visualization
                                                                                                    • Experimenter
                                                                                                        – testing and evaluating machine learning algorithms
                                                                                                    • Knowledge Flow
                                                                                                        – visual design of KDD process
                                                                                                        – Explorer
                                                                                                    • Simple Command-line
                                                                                                        – A simple interface for typing commands






        2. Weka Functions and Tools
        •    Preprocessing Filters
        •    Attribute selection
        •    Classification/Regression
        •    Clustering
        •    Association discovery
        •    Visualization

        Load data file and Preprocessing
        • Load data file in formats: ARFF, CSV, C4.5, binary
        • Import from URL or SQL database (using JDBC)
        • Preprocessing filters
              –    Adding/removing attributes
              –    Attribute value substitution
              –    Discretization
              –    Time series filters (delta, shift)
              –    Sampling, randomization
              –    Missing value management
              –    Normalization and other numeric transformations
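
To make two of the preprocessing filters listed above more concrete, here is a minimal Python sketch of min-max normalization and equal-width discretization. It is an illustration only, not Weka's implementation: the function names and the sample ages are hypothetical, and Weka's own Normalize and Discretize filters offer more options and may differ in detail.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, bins):
    """Map each value to the index of one of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]


ages = [23, 35, 47, 59, 71]  # hypothetical numeric attribute
print(min_max_normalize(ages))   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(equal_width_bins(ages, 3))  # [0, 0, 1, 2, 2]
```

Discretization of this kind is what turns a numeric attribute such as age into ranges like '40-49' in the breast cancer data used later in the tutorial.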




        Feature Selection
       • Very flexible: arbitrary combination of search and evaluation methods
       • Search methods
            – best-first
            – genetic
            – ranking ...
       • Evaluation measures
            – ReliefF
            – information gain
            – gain ratio
       • Demo data: weather_nominal.arff

        Classification
       • Predicted target must be categorical
       • Implemented methods
            –    decision trees (J48, etc.) and rules
            –    Naïve Bayes
            –    neural networks
            –    instance-based classifiers …
       • Evaluation methods
            – test data set
            – cross-validation
       • Demo data: iris, contact lenses, labor, soybeans, etc.
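
As an illustration of the "information gain" evaluation measure listed under Feature Selection, the following Python sketch scores a nominal attribute by how much it reduces the entropy of the class. It is a textbook formulation under simple assumptions (nominal attributes, no missing values), not Weka's code, and the four-row toy data set is hypothetical rather than the contents of weather_nominal.arff.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Class entropy minus the weighted entropy after splitting on the attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: four instances, two attributes, one class.
play = ["yes", "yes", "no", "no"]
outlook = ["sunny", "sunny", "rainy", "rainy"]  # separates the classes perfectly
windy = ["true", "false", "true", "false"]      # carries no class information

print(information_gain(outlook, play))  # 1.0
print(information_gain(windy, play))    # 0.0
```

A ranking search method would simply sort the attributes by this score and keep the top ones.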




        Clustering
       • Implemented methods
           –    k-Means
           –    EM
           –    Cobweb
           –    X-means
           –    FarthestFirst…
       • Clusters can be visualized and compared to “true” clusters (if given)
       • Demo data:
           – any classification data may be used for clustering when
             its class attribute is filtered out.

        Regression
       • Predicted target is continuous
       • Methods
           – linear regression
           – neural networks
           – regression trees …
       • Demo data: cpu.arff,







        Weka: Pros and cons
        • Pros
               – Open source,
                     • Free
                     • Extensible
                     • Can be integrated into other Java packages
               – GUIs (Graphical User Interfaces)
                     • Relatively easy to use
               – Features
                     • Run individual experiments, or
                     • Build KDD phases
        • Cons
               – Lack of proper and adequate documentation
               – Systems are updated constantly (Kitchen Sink Syndrome)

        3. WEKA data formats
        • Data can be imported from a file in various formats:
               – ARFF (Attribute Relation File Format) has two sections:
                     • the Header information defines attribute name, type and relations.
                     • the Data section lists the data records.
               – CSV: Comma Separated Values (text file)
               – C4.5: the format used by the decision tree induction algorithm C4.5,
                 which requires two separate files
                     • Name file: defines the names of the attributes
                     • Data file: lists the records (samples)
               – binary
        • Data can also be read from a URL or from an SQL database (using JDBC)
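
The relationship between the CSV and ARFF formats can be sketched in a few lines of Python. This is a simplified, hypothetical converter and not Weka's own CSV loader: it assumes a header row, treats every column as nominal, and does no quoting of values that contain spaces or commas. The function name, relation name and sample rows are made up for the example.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row to ARFF text, treating all columns as nominal."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, records = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        # Distinct values of the column, in order of first appearance.
        values = list(dict.fromkeys(record[i] for record in records))
        lines.append("@attribute %s {%s}" % (name, ",".join(values)))
    lines += ["", "@data"]
    lines += [",".join(record) for record in records]
    return "\n".join(lines) + "\n"


sample_csv = "breast,irradiat\nleft,no\nright,yes\nleft,yes\n"  # hypothetical rows
print(csv_to_arff(sample_csv, "demo"))
```

The CSV header row becomes the list of @attribute declarations, and the remaining rows become the @data section unchanged.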




        Attribute Relation File Format (ARFF)

        An ARFF file consists of two distinct sections:

        • the Header section defines attribute name, type and relations;
          each declaration starts with a keyword.
               @relation <data-name>
               @attribute <attribute-name> <type> or {range}
        • the Data section lists the data records and starts with
               @data
               list of data instances
        • Any line starting with % is a comment.

        Breast Cancer data in ARFF

        % Breast Cancer data*: 286 instances (no-recurrence-events: 201, recurrence-events: 85)
        % Part 1: Definitions of attribute name, types and relations
        @relation breast-cancer
        @attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'}
        @attribute menopause {'lt40','ge40','premeno'}
        @attribute tumor-size {'0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59'}
        @attribute inv-nodes {'0-2','3-5','6-8','9-11','12-14','15-17','18-20','21-23','24-26','27-29','30-32','33-35','36-39'}
        @attribute node-caps {'yes','no'}
        @attribute deg-malig {'1','2','3'}
        @attribute breast {'left','right'}
        @attribute breast-quad {'left_up','left_low','right_up','right_low','central'}
        @attribute 'irradiat' {'yes','no'}
        @attribute 'Class' {'no-recurrence-events','recurrence-events'}

        % Part 2: data section
        @data
        '40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
        '50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
        '50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
        ……

        * source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer
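
To show how the two ARFF sections fit together, here is a minimal Python reader for files like the breast cancer listing. It is a sketch under stated assumptions, not Weka's parser: it handles only comments, @relation, @attribute and @data, and it assumes that names and values contain no spaces or commas. The function name parse_arff and the shortened sample (three of the attributes, two records) are chosen for illustration.

```python
def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()  # keywords are matched case-insensitively
        if in_data:
            rows.append([v.strip().strip("'") for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: turn {'a','b'} into ['a', 'b'].
                spec = [v.strip().strip("'") for v in spec.strip("{}").split(",")]
            attributes.append((name.strip("'"), spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


SAMPLE = """% Shortened excerpt of the breast cancer listing
@relation breast-cancer
@attribute node-caps {'yes','no'}
@attribute deg-malig {'1','2','3'}
@attribute 'Class' {'no-recurrence-events','recurrence-events'}
@data
'yes','3','recurrence-events'
'no','1','no-recurrence-events'
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation)       # breast-cancer
print(attributes[0])  # ('node-caps', ['yes', 'no'])
print(rows[0])        # ['yes', '3', 'recurrence-events']
```

Everything before @data describes the columns; everything after it is one comma-separated record per line, in the same order as the attribute declarations.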





        4.1 WEKA Explorer

            • Click the Explorer on Weka GUI Chooser
            • On the Explorer window,
                  – click button “Open File” to open a data file from
                         • the folder where your data files are stored,
                           e.g. Breast Cancer data: breast_cancer.arff
                         Or (if you don’t have this data set),
                         • the data folder provided by the Weka package:
                           e.g. C:\Program Files\Weka-3-6\data
                                 using “iris.arff” or “weather_nominal.arff”

        Weka Explorer: open data file

            • Open Breast Cancer data
            • Click an attribute, e.g. age, then its distribution will be
              displayed in a histogram.






      Weka Explorer: training classifiers

        After loading a data file, click “Classify”
        • Choose a classifier,
              – Under “Classifier”: click “Choose”, then a drop-down
                menu appears,
              – Click “trees” and select “J48” – a decision tree
                algorithm
        • Select a test option
              – Select “percentage split”
                     • with default ratio 66% for training and 34% for testing
        • Click “Start” to train and test the classifier.
              – The training and testing information will be displayed
                in the classifier output window.

        Results

        • Testing results:
        • 97 cases used in test.
              Correct: 66 (68%)
              Wrong: 31 (32%)
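
The figures on the Results slide can be checked with a little arithmetic. The Python sketch below assumes that the training set size is the instance count times the split percentage, rounded to the nearest integer, with the remainder used for testing; that rounding rule is an assumption made for this example, and the helper name is hypothetical. The instance count (286) and the correct/wrong counts come from the slides.

```python
import math


def percentage_split(n_instances, train_percent):
    """Return (train_size, test_size), rounding the training size to the nearest integer."""
    train = math.floor(n_instances * train_percent / 100 + 0.5)
    return train, n_instances - train


train_size, test_size = percentage_split(286, 66)  # breast cancer data, 66% split
correct, wrong = 66, 31                            # counts reported on the Results slide

print(train_size, test_size)              # 189 97
print(round(100 * correct / test_size))   # 68
print(round(100 * wrong / test_size))     # 32
```

So a 66% split of the 286 breast cancer instances leaves 97 test cases, and 66 correct out of 97 is the reported 68% accuracy.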




        Options for results and model
        •    Point to the result list window, and right click the mouse.
        •    A menu will pop out to show all the options available about
             the model.

        View the tree
        •    Point to the result list window, and right click the mouse,
        •    Choose “visualize tree”, then the tree will be displayed in
             another window.




         View classifier errors
         •    Right click the result list,
         •    Choose “visualize classifier error”, then a new window will
              be popped out to display the classifier’s error.
               – Correctly predicted cases
               – Wrong cases

         Save the model and results
         •    Right click on the result list
         •    Choose “save model” and “save result buffer” to save the
              classifier and the results to the disk folder.




         Train a neural net
      Click “Choose” to select another function,
      e.g. “Multilayer Perceptron” - a type of neural net.
      Then click “Start” to train and test it. (Note: the training may
          take much longer.)
      The results seem better than the tree classifier.

         View the model’s ROC curve
         •     Right click the result: “MultilayerPerceptron”
         •     Choose “visualize threshold curve” and “recurrence-events”;
         •     The ROC curve will be displayed.
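
For readers who want to see what a ROC curve plots, the following Python sketch computes the curve's points and the area under it from predicted scores. It is a generic illustration, not what Weka runs internally: the scores and labels are hypothetical, tied scores are not merged into a single threshold, and the function names are made up for the example.

```python
def roc_points(scores, labels, positive):
    """Return (false positive rate, true positive rate) points, one per ranked instance."""
    pos = sum(1 for label in labels if label == positive)
    neg = len(labels) - pos
    # Rank instances from the highest to the lowest score for the positive class.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical predicted probabilities of the class 'recurrence-events'.
scores = [0.9, 0.8, 0.7, 0.4]
labels = ["recurrence-events", "no-recurrence-events",
          "recurrence-events", "no-recurrence-events"]

points = roc_points(scores, labels, "recurrence-events")
print(points)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(points))  # 0.75
```

Choosing a class value in the threshold curve menu corresponds to the `positive` argument here: the curve is always drawn for one class against the rest.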




         Select Attributes
       • Click “Select Attributes”
       • Choose an “attribute evaluator”
             – e.g. chi-squared
       • Choose a “Search Method”
       • Then click “Start”
       • The selected attributes are listed.

         4.2 Weka Experimenter
       • You can use the Experimenter to carry out experiments on multiple data sets
         using multiple methods, e.g. classifying
       • two data sets
             – Breast cancer
             – Iris
       • using two methods
             – Decision Tree: J48
             – Logistic
       • The experiment is configured under “Setup”, as shown in the screenshot.
       • Then click “Run”
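The idea behind a chi-squared attribute evaluator on the “Select Attributes” slide can be sketched in Python as below. This is a minimal illustration under stated assumptions, not Weka's code: the eight records are invented (they are not rows of the real breast cancer data set), and the helper names `chi_squared` and `rank_attributes` are hypothetical. Each nominal attribute is scored by the chi-squared statistic of its contingency table against the class, and attributes are then listed from highest to lowest score.

```python
from collections import Counter


def chi_squared(values, classes):
    """Chi-squared statistic between one nominal attribute and the class."""
    n = len(values)
    observed = Counter(zip(values, classes))
    value_totals = Counter(values)
    class_totals = Counter(classes)
    stat = 0.0
    for v, v_count in value_totals.items():
        for c, c_count in class_totals.items():
            expected = v_count * c_count / n  # count expected under independence
            stat += (observed[(v, c)] - expected) ** 2 / expected
    return stat


def rank_attributes(rows, names):
    """Score every attribute (the last column is the class), best first."""
    classes = [row[-1] for row in rows]
    scored = [(name, chi_squared([row[i] for row in rows], classes))
              for i, name in enumerate(names)]
    return sorted(scored, key=lambda item: -item[1])


# Invented records: (node-caps, breast, class). Illustrative only.
rows = [
    ("yes", "left", "recurrence-events"),
    ("yes", "right", "recurrence-events"),
    ("yes", "left", "recurrence-events"),
    ("no", "right", "recurrence-events"),
    ("no", "left", "no-recurrence-events"),
    ("no", "right", "no-recurrence-events"),
    ("no", "left", "no-recurrence-events"),
    ("no", "right", "no-recurrence-events"),
]
for name, score in rank_attributes(rows, ["node-caps", "breast"]):
    print(f"{name}: {score:.2f}")
```

In this toy data the attribute whose values line up with the class receives the larger score, which is the behaviour a ranking search method relies on when listing the selected attributes.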








        Analysis of the results
        • Click “Analyse” to analyse the results,
          e.g. with a paired t-test for significance
        • Click “Experiment”
        • Configure the test: choose an appropriate test and parameters
        • Click “Perform test” and the test results are listed.

        4.3 KnowledgeFlow
        • Click KnowledgeFlow on the Weka GUI Chooser
        • A new window opens for building a KDD process.
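The paired t-test mentioned under “Analysis of the results” compares two methods on the same runs. The Python sketch below shows the plain textbook version of that test; it is an assumption-laden illustration, not the Experimenter's tester, which may apply a correction for resampled runs that is omitted here. The accuracy figures are invented and the function name `paired_t` is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Invented per-run accuracies (%) of two methods on the same five runs.
j48 = [70.0, 72.0, 68.0, 74.0, 71.0]
logistic = [73.0, 74.0, 71.0, 75.0, 74.0]
t, df = paired_t(logistic, j48)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

The t value is then compared with the critical value of the t distribution for the chosen significance level and degrees of freedom; a large absolute t suggests the difference between the two methods is unlikely to be due to chance.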




        Steps for building a KDD process
       Major steps for building a process
       1. Add the required nodes
            1) Add a data source node from “DataSources”
                   – Right click to configure it with a data set
            2) Add a ClassAssigner node from “Evaluation” and a CrossValidationFoldMaker node
            3) Add a classifier, e.g. J48, from “Classifiers”
            4) Add a ClassifierPerformanceEvaluator node from “Evaluation”
            5) Add a text viewer from “Visualisation”
       2. Connect the nodes
            – Right click the “DataSource” node and choose DataSet, then connect it to the
              ClassAssigner node.
            – Do the same or similar to connect the other nodes.
       3. Run the process (using the default setups for each node)
            – Right click the DataSource node and choose “Start loading”; the process should run and
              the “Status” window should indicate whether the run completed correctly.
       4. View the results:
            – If the run completed correctly, right click the “Text Viewer” node and choose “Show
              results”; another window pops up to show the results.

        A KDD process for Breast Cancer
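The flow of data through the nodes listed in the steps above can be mimicked in plain Python. This is a rough sketch, not Weka code: every function name is hypothetical, the nine records are invented, a majority-class baseline stands in for J48, and the folds are taken in file order without the shuffling or stratification a real tool would normally apply.

```python
from collections import Counter


def assign_class(rows):
    """ClassAssigner step: treat the last column as the class label."""
    return [(tuple(row[:-1]), row[-1]) for row in rows]


def make_folds(instances, k):
    """CrossValidationFoldMaker step: yield (train, test) for each of k folds."""
    for i in range(k):
        test = instances[i::k]
        train = [inst for j, inst in enumerate(instances) if j % k != i]
        yield train, test


def train_majority(train):
    """Classifier step: majority-class baseline (a stand-in, not J48)."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: majority


def evaluate(instances, k):
    """ClassifierPerformanceEvaluator step: pool correct counts over folds."""
    correct = total = 0
    for train, test in make_folds(instances, k):
        model = train_majority(train)
        correct += sum(1 for features, label in test if model(features) == label)
        total += len(test)
    return correct, total


def text_view(correct, total):
    """Text viewer step: format the pooled result for display."""
    return f"Correctly classified: {correct}/{total} ({100.0 * correct / total:.1f}%)"


# Data source step: invented (node-caps, class) records, illustrative only.
rows = [("no", "no-rec"), ("yes", "rec"), ("no", "no-rec"),
        ("no", "no-rec"), ("no", "no-rec"), ("yes", "rec"),
        ("no", "no-rec"), ("no", "no-rec"), ("no", "no-rec")]
correct, total = evaluate(assign_class(rows), k=3)
print(text_view(correct, total))
```

Each function consumes the output of the previous one, which is the same dependency the connections between nodes express in the KnowledgeFlow layout.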







        Results of the KDD process
        •    Right click the “Text Viewer” node and choose “Show results”;
             another window pops up to show the results.

        5. Weka Tutorial Summary
        Weka is open source data mining software that offers
        • Several graphical user interfaces for data mining
               – Explorer
               – Experimenter
               – KnowledgeFlow
        • Many functions and tools, including
               – Methods for classification:
                      decision trees, rule learners, naive Bayes, decision tables, locally weighted
                      regression, SVMs, instance-based learners, logistic regression, multi-layer
                      perceptron
               – Methods for regression/prediction:
                      linear regression, model tree generators, locally weighted regression,
                      instance-based learners, decision tables, multi-layer perceptron
               – Ensemble schemes:
                      Bagging, boosting, stacking, RandomForest
               – Methods for clustering:
                      K-means, EM and Cobweb
               – Methods for feature selection





