Effective and Unsupervised Fractal-based Feature Selection for Very Large Datasets
Removing linear and non-linear attribute correlations
Antonio Canabrava Fraideinberze
Jose F Rodrigues-Jr
Robson Leonardo Ferreira Cordeiro
Databases and Images Group
University of São Paulo
São Carlos - SP - Brazil
2
Terabytes?
…
How to analyze that data?
3
Terabytes?
Parallel processing and dimensionality reduction, for sure...
…
How to analyze that data?
4
Terabytes?
...but how to remove linear and non-linear attribute correlations, besides irrelevant attributes?
…
How to analyze that data?
5
Terabytes?
...and how to reduce dimensionality without human supervision, in a task-independent way?
…
6
Terabytes?
Curl-Remover
Medium dimensionality
…
How to analyze that data?
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
7
Fundamental Concepts
Fractal Theory
[figure-only slides 9-12]
Fundamental Concepts
Fractal Theory
Embedded, Intrinsic and Fractal Correlation Dimension
Fractal Correlation Dimension ≅ Intrinsic Dimension
13
Fundamental Concepts
Fractal Theory
Embedded, Intrinsic and Fractal Correlation Dimension
Embedded dimension ≅ 3, intrinsic dimension ≅ 1 (e.g., a curve embedded in 3-D space)
Embedded dimension ≅ 3, intrinsic dimension ≅ 2 (e.g., a surface embedded in 3-D space)
14
Fundamental Concepts
Fractal Theory
Fractal Correlation Dimension - Box Counting
15
Fundamental Concepts
Fractal Theory
Fractal Correlation Dimension - Box Counting
[log-log plot: log(sum of squared cell counts) vs. log(r); D2 is the slope]
17
Fundamental Concepts
Fractal Theory
Fractal Correlation Dimension - Box Counting
19
Multidimensional Quad-tree [Traina Jr. et al., 2000]
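To make box counting concrete, here is a minimal serial sketch in Python (illustrative only, not the paper's Hadoop implementation): it lays grids of halving cell sides over the normalized data, sums the squared point counts of the occupied cells, and takes D2 as the slope of the resulting log-log plot.

```python
import numpy as np

def correlation_fractal_dimension(points, n_scales=10):
    """Estimate D2 by box counting: for each grid resolution r, sum the
    squared point counts of the occupied cells; D2 is the slope of
    log(sum of squared counts) versus log(r)."""
    points = np.asarray(points, dtype=float)
    # Normalize every attribute to [0, 1) so a single grid fits all axes.
    mins, maxs = points.min(axis=0), points.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)
    norm = (points - mins) / span * 0.999999

    log_r, log_s2 = [], []
    for level in range(1, n_scales + 1):
        r = 1.0 / (2 ** level)                       # cell side at this level
        cells = np.floor(norm / r).astype(np.int64)  # grid coordinates per point
        _, counts = np.unique(cells, axis=0, return_counts=True)
        log_r.append(np.log(r))
        log_s2.append(np.log((counts.astype(float) ** 2).sum()))
    slope, _ = np.polyfit(log_r, log_s2, 1)          # slope of the log-log plot
    return slope

# Points on a line embedded in 3-D: embedded dimension 3, D2 close to 1.
t = np.random.rand(100_000)
print(correlation_fractal_dimension(np.stack([t, 2 * t, -t], axis=1)))
```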
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
20
Related Work
Dimensionality Reduction - Taxonomy 1
Dimensionality Reduction: Supervised Algorithms vs. Unsupervised Algorithms
Unsupervised examples: Principal Component Analysis, Singular Value Decomposition, Fractal Dimension Reduction
21
Related Work
Dimensionality Reduction - Taxonomy 2
Dimensionality Reduction: Feature Selection vs. Feature Extraction
Feature Extraction: Principal Component Analysis, Singular Value Decomposition
Feature Selection: Fractal Dimension Reduction; sub-types: Wrapper, Filter, Embedded
22
Related Work
23
Terabytes?
Existing methods need supervision, miss non-linear correlations, cannot handle Big Data, or work for classification only
…
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
24
General Idea
25
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
General Idea
26
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Builds partial trees for the full dataset and for its E (E-1)-dimensional projections
General Idea
27
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Shuffle key: TreeID + cell spatial position; value: partial count of points
General Idea
28
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Sums partial point counts and reports log(r) and log(sum2) for each tree
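As a rough sketch of what this reduce step computes (the key layout and names are assumptions, not the paper's actual code): partial counts arriving under the same (tree, level, cell) key are summed, and one (log(r), log(sum2)) pair is reported per tree and level.

```python
import math
from collections import defaultdict

def reduce_counts(shuffled_pairs):
    """Sum partial counts per (tree, level, cell) key, then report one
    (log(r), log(sum of squared cell counts)) pair per tree and level;
    each tree's D2 is later read off as the slope of these pairs."""
    totals = defaultdict(int)
    for (tree, level, cell), partial_count in shuffled_pairs:
        totals[(tree, level, cell)] += partial_count
    sum2 = defaultdict(float)
    for (tree, level, _cell), count in totals.items():
        sum2[(tree, level)] += float(count) ** 2
    for (tree, level), s2 in sorted(sum2.items()):
        r = 1.0 / (2 ** level)
        yield tree, (math.log(r), math.log(s2))
```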
General Idea
29
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Computes D2 for the full dataset and pD2 for each of its E (E-1)-dimensional projections
General Idea
30
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
The least relevant attribute is the one not in the projection that minimizes | D2 - pD2 |
General Idea
31
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Spots the second least relevant attribute …
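Putting the general idea together, a serial sketch of the selection loop (the real Curl-Remover evaluates all E projections in one parallel MapReduce iteration; this sketch reuses the correlation_fractal_dimension function from the box-counting example above):

```python
import numpy as np

def fractal_feature_selection(data):
    """Drop the E - ceil(D2) least relevant attributes, one at a time, in
    ascending order of relevance: at each step, remove the attribute NOT
    in the projection that minimizes |D2 - pD2|, i.e., the one whose
    removal changes the fractal dimension the least."""
    kept = list(range(data.shape[1]))   # attribute indices still kept
    target = int(np.ceil(correlation_fractal_dimension(data)))  # ceil(D2)
    removed = []
    while len(kept) > target:
        d2 = correlation_fractal_dimension(data[:, kept])
        # |D2 - pD2| for every (len(kept)-1)-dimensional projection.
        diffs = [(abs(d2 - correlation_fractal_dimension(
                      data[:, [b for b in kept if b != a]])), a)
                 for a in kept]
        _, least_relevant = min(diffs)
        kept.remove(least_relevant)
        removed.append(least_relevant)
    return kept, removed                # removed lists ascending relevance
```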
General Idea
3 Main Issues
32
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
General Idea
3 Main Issues
33
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
1st: too much data to be shuffled - one data pair per cell/tree
General Idea
3 Main Issues
34
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
2nd: one data pass per irrelevant attribute
General Idea
3 Main Issues
35
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
3rd: not enough memory for mappers
Proposed Method
Curl-Remover
36
1st issue - too much data to be shuffled; one data pair per cell/tree.
Our solution - two-phase dimensionality reduction:
a) serial feature selection on a tiny data sample (one reducer), used only to speed up processing;
b) all mappers project the data into a fixed subspace.
37
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Builds/reports the N (2 or 3) lowest-resolution tree levels…
Proposed Method
Curl-Remover
38
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
… plus the points projected onto the M (2 or 3) most relevant attributes of the sample
Proposed Method
Curl-Remover
39
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Builds the full trees from their low-resolution cells and the projected points
Proposed Method
Curl-Remover
40
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Proposed Method
Curl-Remover
High-resolution cells are never shuffled
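A sketch of the resulting map function (a plain-Python stand-in for a Hadoop mapper; the key names and the [0, 1) normalization are assumptions). Only the full-space tree is shown, but Curl-Remover also builds one tree per (E-1)-dimensional projection:

```python
from collections import Counter

def map_partition(points, top_attrs, n_coarse_levels=3):
    """Emit partial counts for the N coarsest tree levels only, plus each
    point projected onto the M most relevant attributes of the sample.
    High-resolution cells are rebuilt on the reduce side from the
    projected points, so they are never shuffled."""
    coarse = Counter()
    for p in points:                      # p: tuple of values in [0, 1)
        for level in range(1, n_coarse_levels + 1):
            r = 1.0 / (2 ** level)
            cell = tuple(int(v / r) for v in p)
            coarse[('full', level, cell)] += 1
        # Shuffle the projected point itself instead of its fine cells.
        yield ('projected', tuple(p[a] for a in top_attrs)), 1
    for key, partial_count in coarse.items():
        yield key, partial_count          # key: tree/level + cell position
```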
Proposed Method
Curl-Remover
41
2nd issue - one data pass per irrelevant attribute.
Our solution - stores/reads the highest-resolution tree level instead of the original data.
42
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Rdb = cost to read the dataset;
TWRtree = cost to transfer, write and read the last tree level in the next reduce step;
if Rdb > TWRtree, then write the tree.
Proposed Method
Curl-Remover
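A sketch of that decision rule (the per-byte cost constants are illustrative placeholders, not values from the paper):

```python
def should_write_tree(dataset_bytes, tree_level_bytes,
                      read_cost=1.0, transfer_write_read_cost=3.0):
    """Persist the highest-resolution tree level only when re-reading the
    raw dataset in the next iteration (Rdb) would cost more than
    transferring, writing and reading that level (TWRtree)."""
    r_db = dataset_bytes * read_cost
    twr_tree = tree_level_bytes * transfer_write_read_cost
    return r_db > twr_tree

# e.g., a 1 TB dataset versus a 40 GB last tree level: write the tree.
print(should_write_tree(dataset_bytes=1e12, tree_level_bytes=4e10))  # True
```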
43
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Proposed Method
Curl-Remover
44
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Writes the tree’s last level to HDFS
Proposed Method
Curl-Remover
45
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Reads the tree’s last level from HDFS
Proposed Method
Curl-Remover
46
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Proposed Method
Curl-Remover
Reads the dataset only twice
Proposed Method
Curl-Remover
47
3rd issue - not enough memory for mappers.
Our solution - sorts data in the mappers and reports “tree slices” whenever needed.
48
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Sorts its local points and builds “tree slices”, monitoring memory consumption
Proposed Method
Curl-Remover
Proposed Method
Curl-Remover
49
[figure: a tree slice over the X-Y plane]
Proposed Method
Curl-Remover
50
Reports “tree slices” with very little overlap
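A sketch of the slicing loop inside one mapper (the fixed cell budget stands in for the real memory monitoring; the tree level and threshold are illustrative):

```python
def build_tree_slices(local_points, level=12, max_cells=1_000_000):
    """Sort local points, grow a counting 'tree slice' cell by cell, and
    flush the slice whenever the cell budget is exhausted. Sorting keeps
    each slice spatially compact, so flushed slices barely overlap."""
    r = 1.0 / (2 ** level)
    slice_cells = {}
    for p in sorted(local_points):        # sort by spatial position
        cell = tuple(int(v / r) for v in p)
        slice_cells[cell] = slice_cells.get(cell, 0) + 1
        if len(slice_cells) >= max_cells:
            yield slice_cells             # report one tree slice
            slice_cells = {}
    if slice_cells:
        yield slice_cells                 # report the final slice
```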
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
51
Evaluation
Datasets
Sierpinski - Sierpinski Triangle + 1 linearly correlated attribute + 2 non-linearly correlated attributes. 5 attributes, 1.1 billion points;
Sierpinski Hybrid - Sierpinski Triangle + 1 non-linearly correlated attribute + 2 random attributes. 5 attributes, 1.1 billion points;
Yahoo! Network Flows - communication patterns between end-users on the web. 12 attributes, 562 million points;
Astro - high-resolution cosmological simulation. 6 attributes, 1 billion points;
Hepmass - physics-related dataset with particles of unknown mass. 28 attributes, 10.5 million points;
Hepmass Duplicated - Hepmass + 28 correlated attributes. 56 attributes, 10.5 million points.
52
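To illustrate how a dataset like Sierpinski can be built: a chaos-game Sierpinski triangle plus correlated attributes. The exact correlations used in the paper are not specified here, so the extra attributes below are illustrative choices.

```python
import numpy as np

def sierpinski_dataset(n_points, seed=42):
    """Chaos-game Sierpinski triangle (fractal dimension ~1.58) plus one
    linearly and two non-linearly correlated attributes: 5 attributes in
    total, mirroring the structure of the 'Sierpinski' dataset."""
    rng = np.random.default_rng(seed)
    vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
    xy = np.empty((n_points, 2))
    p = rng.random(2)
    for i in range(n_points):
        p = (p + vertices[rng.integers(3)]) / 2.0   # one chaos-game step
        xy[i] = p
    x, y = xy[:, 0], xy[:, 1]
    return np.column_stack([
        x, y,
        3.0 * x + 0.5,            # linearly correlated attribute
        np.sin(2 * np.pi * x),    # non-linearly correlated attribute
        y ** 2,                   # non-linearly correlated attribute
    ])

data = sierpinski_dataset(100_000)  # the paper scales this to 1.1 billion points
```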
Evaluation
Fractal Dimension
Hepmass
53
Evaluation
Fractal Dimension
Hepmass Duplicated
54
Evaluation
Comparison with sPCA - Classification
55
Evaluation
Comparison with sPCA - Classification
56
8% more accurate, 7.5% faster
Evaluation
Comparison with sPCA
Percentage of the fractal dimension preserved after selection
57
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
58
Conclusions
✓ Accuracy - eliminates both linear and non-linear attribute correlations, besides irrelevant attributes; 8% more accurate than sPCA;
✓ Scalability - linear scalability with the data size (theoretical analysis); experiments with up to 1.1 billion points;
✓ Unsupervised - requires neither a guess of how many attributes to remove nor a training set;
✓ Semantics - it is a feature selection method, thus maintaining the semantics of the attributes;
✓ Generality - it suits analytical tasks in general, not only classification.
63
Questions?
robson@icmc.usp.br