Dealing with large datasets
Avoiding the dangers
Adrien Ickowicz, Ross Sparks




MATHEMATICS, INFORMATICS AND STATISTICS
www.csiro.au
Managing the data

       Can the input be massaged to make it more amenable to learning
       methods? (And how can you do it safely?)


          Attribute Selection                       Attribute Discretization
                  – Scheme-independent selection        – Unsupervised discretization
                  – Searching the attribute space       – Entropy-based discretization
                  – Scheme-specific selection           – Other methods




          Data Transformation                       Data Cleansing
                  – Linear and Non-linear PCA           – Improving Decision Tree

                  – Random projections                  – Robust Regression

                  – Time Series                         – Detecting anomalies




Dealing with large datasets: Slide 2 of 17
Attribute Selection




       Justification

       An irrelevant attribute will often degrade the performance
       of state-of-the-art decision tree and rule learners...

                    • Example: Random binary attribute
                             – Deteriorates classification performance by 5% to 10%




       But a relevant attribute can be harmful as well...


                    • Example: Binary attribute that matches the class value 65% of the time
                             – Deteriorates classification performance by 1% to 5%




Attribute Selection

   1 - Scheme-independent selection
       • No universal relevance measure
       • Beware of overfitting and model redundancy
       • Make sure that the attribute scales are the same
                                    2 - Searching the attribute space
                                        • Exhaustive search impractical
                                        • Forward, backward, ...: needs an expert to set the algorithm parameters


                                                           3 - Scheme-specific selection
                                                              • Time consuming
                                                              • “Burns” one classification method
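
A minimal sketch of scheme-independent (filter) selection, assuming absolute Pearson correlation with the target as the relevance measure. That choice is an assumption made for illustration; as the slide notes, there is no universal relevance measure. The attribute names and values below are invented toy data, not taken from the talk. Standardizing each column first puts all attributes on the same scale.

```python
import math

def standardize(column):
    """Rescale a numeric column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    if std == 0.0:
        return [0.0] * n  # a constant attribute carries no information
    return [(x - mean) / std for x in column]

def pearson(xs, ys):
    """Pearson correlation computed from standardized columns."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

def rank_attributes(columns, target):
    """Attribute names sorted by |correlation| with the target, best first."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (hypothetical values, for illustration only).
columns = {
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
    "constant": [1.0] * 6,
}
quality = [5, 5, 6, 6, 7, 7]
ranking = rank_attributes(columns, quality)
print(ranking)
```

With only six rows, the unrelated "noise" column still scores a sizeable correlation, which is one way the overfitting warning above shows up in practice.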




Attribute Discretization




       Justification

       Deal with both continuous and discretized data

       Handle the extreme values


       Some algorithms make unrealistic assumptions about
       the attribute values...
                                             • Example: normal distribution assumption

       ... or are slowed down by continuous attributes.


                                             • Example: need to sort the attribute values



Attribute Discretization

   1 - Unsupervised discretization
       • Avoid big differences in bin frequencies
       • Avoid small bins


                                    2 - Entropy-based discretization
                                        • Recursive, so it needs a stopping criterion


                                                        3 - Other methods
                                                           • In practice, do not perform better than entropy-based discretization
                                                           • Some are time consuming
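
The first two approaches can be sketched as follows. This is an illustration under stated assumptions, not the method used in the talk: `equal_frequency_cuts` is one common form of unsupervised binning, and `best_entropy_cut` finds a single class-entropy-minimizing split. A full entropy-based discretizer would apply it recursively to each side until a stopping criterion fires, which is left out here. The function names and toy values are hypothetical.

```python
import math
from collections import Counter

def equal_frequency_cuts(values, n_bins):
    """Cut points that put roughly the same number of values in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for k in range(1, n_bins):
        i = k * n // n_bins
        cuts.append((ordered[i - 1] + ordered[i]) / 2)
    return cuts

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_entropy_cut(values, labels):
    """Single cut point minimizing the size-weighted class entropy of both sides."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut, best_score

# Toy data (hypothetical): two well-separated classes.
values = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
cuts = equal_frequency_cuts(values, 4)
cut, score = best_entropy_cut(values, labels)
```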




Data Transformation




       Justification

       Data often calls for general mathematical transformations
       of a set of attributes...

                    • Example: Two date attributes may lead to a third attribute
                         representing age


       Test the robustness of a learning algorithm...


                    • Example: add noise or change a given percentage of a nominal
                         attribute's values
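
Both examples on this slide can be sketched in a few lines. The helper names `age_in_years` and `corrupt_nominal` are hypothetical, as are the toy values. The first derives an age attribute from two dates; the second replaces a chosen fraction of a nominal attribute's values so that a learner's robustness can be tested on the corrupted copy.

```python
import random
from datetime import date

def age_in_years(birth, reference):
    """Whole years elapsed between two dates."""
    years = reference.year - birth.year
    # Subtract one if the anniversary has not yet occurred in the reference year.
    if (reference.month, reference.day) < (birth.month, birth.day):
        years -= 1
    return years

def corrupt_nominal(values, categories, fraction, seed=0):
    """Return a copy with the given fraction of values switched to another category."""
    rng = random.Random(seed)
    noisy = list(values)
    n_changes = round(fraction * len(noisy))
    for i in rng.sample(range(len(noisy)), n_changes):
        alternatives = [c for c in categories if c != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy

colours = ["red"] * 10
noisy = corrupt_nominal(colours, ["red", "white"], fraction=0.2)
n_changed = sum(a != b for a, b in zip(colours, noisy))
```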




Data Transformation

   1 - Linear and Non-linear PCA
       • Dimension reduction technique: there is a loss of information
       • Very costly in high dimensions


                                    2 - Random projections
                                        • Perform worse than PCA
                                        • Preserve distance relationships well on average


                                                           3 - Time Series
                                                             • Pay attention to the sampling
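
The distance-preservation property of random projections can be checked with a short sketch. It assumes a Gaussian projection matrix with entries scaled by 1/sqrt(k), which is one standard construction and not necessarily the one used in the talk. The dimensions (500 reduced to 100) and the random points are arbitrary illustrative choices.

```python
import math
import random

def random_projection_matrix(n_features, n_components, seed=0):
    """Gaussian matrix scaled so squared distances are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
            for _ in range(n_features)]

def project(point, matrix):
    """Map a point from n_features dimensions down to n_components dimensions."""
    n_components = len(matrix[0])
    return [sum(point[i] * matrix[i][j] for i in range(len(point)))
            for j in range(n_components)]

def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(1)
a = [rng.random() for _ in range(500)]
b = [rng.random() for _ in range(500)]
R = random_projection_matrix(500, 100)

# A ratio close to 1 means the pairwise distance survived the projection.
ratio = distance(project(a, R), project(b, R)) / distance(a, b)
print(round(ratio, 3))
```

Unlike PCA, the matrix is drawn without looking at the data, so the projection is cheap to build; the price is that the ratio is only close to 1 on average, not for every pair.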




Application Example

       - What is the difference between theory and practice?
       - There is no difference ... in theory. But in practice, there is.


                    • Example 1: Attribute Selection (Backward vs Filter)
                    • Example 2: Attribute Discretization (Chi-2 based vs Top-down)
                    • Example 3: Data Transformation




Example 1

       Data set: Wine Quality data


                  Description of the data: 1599 observations of 12 variables




                                              Question: What makes a good (red) wine?




Example 1

       How many features do we keep?

              [Figure: RMSE under backward selection]




                     Number of features: 5
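
Backward selection, as named on this slide, can be sketched generically. The version below is an assumption about the general technique, not the authors' procedure: it starts from all features, repeatedly drops the one whose removal gives the lowest error, and remembers the best subset seen. The `evaluate` callable stands in for "fit a model on this subset and return its validation RMSE"; `toy_error` and the feature names are hypothetical placeholders so the sketch runs on its own.

```python
def backward_elimination(features, evaluate):
    """Greedy backward search; returns the lowest-error subset encountered."""
    current = list(features)
    best_subset, best_error = list(current), evaluate(current)
    while len(current) > 1:
        # Try removing each remaining feature and keep the best reduced subset.
        candidates = [[f for f in current if f != drop] for drop in current]
        current = min(candidates, key=evaluate)
        error = evaluate(current)
        if error < best_error:
            best_subset, best_error = list(current), error
    return best_subset, best_error

# Hypothetical stand-in for a real model's validation RMSE:
# useful features lower the error, useless ones raise it slightly.
USEFUL = {"alcohol", "sulphates"}

def toy_error(subset):
    useful = len(USEFUL & set(subset))
    useless = len(set(subset) - USEFUL)
    return 1.0 - 0.3 * useful + 0.05 * useless

subset, error = backward_elimination(
    ["alcohol", "sulphates", "noise_1", "noise_2"], toy_error)
print(subset, error)
```

Each pass evaluates one model per remaining feature, which is why the search is far cheaper than trying every subset but still costly with many attributes.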




Example 1

        How many features do we keep?




              [Figure: RMSE under filter selection]




Example 2

       How do we discretize the features?

                     [Figures: Chi-2 discretization; MDL discretization]




Example 2

       How do we discretize the features?

                     [Figures: Chi-2 Merge discretization; Top-down discretization]




Example 3

       How do we transform the data?




       [Figure: Principal Component Analysis]




Example 3
   How do we transform the data?



       [Figure: Projection Pursuit Regression]




CSIRO Mathematics, Informatics and Statistics   CSIRO Mathematics, Informatics and Statistics
Adrien Ickowicz                                 Ross Sparks
t   +61 2 9325 3260                             t   +61 2 9325 3262
e Adrien.Ickowicz@csiro.au                      e   Ross.Sparks@csiro.au
w Mathematics, Informatics and Statistics web   w   Mathematics, Informatics and Statistics web



