Effective and Unsupervised Fractal-based Feature Selection for Very Large Datasets
Removing linear and non-linear attribute correlations
Antonio Canabrava Fraideinberze
Jose F Rodrigues-Jr
Robson Leonardo Ferreira Cordeiro
Databases and Images Group
University of São Paulo
São Carlos - SP - Brazil
2
Terabytes?
…
How to analyze that data?
3
Terabytes?
Parallel processing and dimensionality reduction, for sure...
…
How to analyze that data?
4
Terabytes?
...but how to remove linear and non-linear attribute correlations, besides irrelevant attributes?
…
How to analyze that data?
5
Terabytes?
...and how to reduce dimensionality without human supervision, in a task-independent way?
…
6
Terabytes?
Curl-Remover
Medium dimensionality
…
How to analyze that data?
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
7
Fundamental Concepts
Fractal Theory
[figure-only slides 9-12]
Fundamental Concepts
Fractal Theory
Embedded, Intrinsic and Fractal Correlation Dimension
Fractal Correlation Dimension ≅ Intrinsic Dimension
13
Fundamental Concepts
Fractal Theory
Embedded, Intrinsic and Fractal Correlation Dimension
Embedded dimension ≅ 3, intrinsic dimension ≅ 1 (e.g., a curve embedded in 3-D space)
Embedded dimension ≅ 3, intrinsic dimension ≅ 2 (e.g., a surface embedded in 3-D space)
14
Fundamental Concepts
Fractal Theory
Fractal Correlation Dimension - Box Counting
15
Fundamental Concepts
Fractal Theory
Fractal Correlation Dimension - Box Counting
[log-log plot: log(sum of squared cell counts) vs. log(r); D2 is the slope]
17
Fundamental Concepts
Fractal Theory
Fractal Correlation Dimension - Box Counting
19
Multidimensional Quad-tree [Traina Jr. et al., 2000]
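To make box counting concrete, here is a minimal serial sketch in Python (illustrative only, not the paper's Hadoop implementation): it lays grids of halving cell sides over the normalized data, sums the squared point counts of the occupied cells, and takes D2 as the slope of the resulting log-log plot.

```python
import numpy as np

def correlation_fractal_dimension(points, n_scales=10):
    """Estimate D2 by box counting: for each grid resolution r, sum the
    squared point counts of the occupied cells; D2 is the slope of
    log(sum of squared counts) versus log(r)."""
    points = np.asarray(points, dtype=float)
    # Normalize every attribute to [0, 1) so a single grid fits all axes.
    mins, maxs = points.min(axis=0), points.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)
    norm = (points - mins) / span * 0.999999

    log_r, log_s2 = [], []
    for level in range(1, n_scales + 1):
        r = 1.0 / (2 ** level)                       # cell side at this level
        cells = np.floor(norm / r).astype(np.int64)  # grid coordinates per point
        _, counts = np.unique(cells, axis=0, return_counts=True)
        log_r.append(np.log(r))
        log_s2.append(np.log((counts.astype(float) ** 2).sum()))
    slope, _ = np.polyfit(log_r, log_s2, 1)          # slope of the log-log plot
    return slope

# Points on a line embedded in 3-D: embedded dimension 3, D2 close to 1.
t = np.random.rand(100_000)
print(correlation_fractal_dimension(np.stack([t, 2 * t, -t], axis=1)))
```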
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
20
Related Work
Dimensionality Reduction - Taxonomy 1
Dimensionality Reduction: Supervised Algorithms vs. Unsupervised Algorithms
Unsupervised examples: Principal Component Analysis, Singular Value Decomposition, Fractal Dimension Reduction
21
Related Work
Dimensionality Reduction - Taxonomy 2
Dimensionality Reduction: Feature Selection vs. Feature Extraction
Feature Extraction: Principal Component Analysis, Singular Value Decomposition
Feature Selection: Fractal Dimension Reduction; sub-types: Wrapper, Filter, Embedded
22
Related Work
23
Terabytes?
Existing methods need supervision, miss non-linear correlations, cannot handle Big Data, or work for classification only
…
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
24
General Idea
25
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
General Idea
26
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Builds partial trees for the full dataset and for its E (E-1)-dimensional projections
General Idea
27
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Shuffle key: TreeID + cell spatial position; value: partial count of points
General Idea
28
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Sums partial point counts and reports log(r) and log(sum2) for each tree
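As a rough sketch of what this reduce step computes (the key layout and names are assumptions, not the paper's actual code): partial counts arriving under the same (tree, level, cell) key are summed, and one (log(r), log(sum2)) pair is reported per tree and level.

```python
import math
from collections import defaultdict

def reduce_counts(shuffled_pairs):
    """Sum partial counts per (tree, level, cell) key, then report one
    (log(r), log(sum of squared cell counts)) pair per tree and level;
    each tree's D2 is later read off as the slope of these pairs."""
    totals = defaultdict(int)
    for (tree, level, cell), partial_count in shuffled_pairs:
        totals[(tree, level, cell)] += partial_count
    sum2 = defaultdict(float)
    for (tree, level, _cell), count in totals.items():
        sum2[(tree, level)] += float(count) ** 2
    for (tree, level), s2 in sorted(sum2.items()):
        r = 1.0 / (2 ** level)
        yield tree, (math.log(r), math.log(s2))
```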
General Idea
29
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Computes D2 for the full dataset and pD2 for each of its E (E-1)-dimensional projections
General Idea
30
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
The least relevant attribute is the one not in the projection that minimizes | D2 - pD2 |
General Idea
31
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Spots the second least relevant attribute …
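Putting the general idea together, a serial sketch of the selection loop (the real Curl-Remover evaluates all E projections in one parallel MapReduce iteration; this sketch reuses the correlation_fractal_dimension function from the box-counting example above):

```python
import numpy as np

def fractal_feature_selection(data):
    """Drop the E - ceil(D2) least relevant attributes, one at a time, in
    ascending order of relevance: at each step, remove the attribute NOT
    in the projection that minimizes |D2 - pD2|, i.e., the one whose
    removal changes the fractal dimension the least."""
    kept = list(range(data.shape[1]))   # attribute indices still kept
    target = int(np.ceil(correlation_fractal_dimension(data)))  # ceil(D2)
    removed = []
    while len(kept) > target:
        d2 = correlation_fractal_dimension(data[:, kept])
        # |D2 - pD2| for every (len(kept)-1)-dimensional projection.
        diffs = [(abs(d2 - correlation_fractal_dimension(
                      data[:, [b for b in kept if b != a]])), a)
                 for a in kept]
        _, least_relevant = min(diffs)
        kept.remove(least_relevant)
        removed.append(least_relevant)
    return kept, removed                # removed lists ascending relevance
```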
General Idea
3 Main Issues
32
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
General Idea
3 Main Issues
33
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
1st: too much data to be shuffled - one data pair per cell/tree
General Idea
3 Main Issues
34
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
2nd: one data pass per irrelevant attribute
General Idea
3 Main Issues
35
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
3rd: not enough memory for mappers
Proposed Method
Curl-Remover
36
1st issue - too much data to be shuffled; one data pair per cell/tree.
Our solution - two-phase dimensionality reduction:
a) serial feature selection on a tiny data sample (one reducer), used only to speed up processing;
b) all mappers project the data into a fixed subspace.
37
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Builds/reports the N (2 or 3) lowest-resolution tree levels…
Proposed Method
Curl-Remover
38
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
… plus the points projected onto the M (2 or 3) most relevant attributes of the sample
Proposed Method
Curl-Remover
39
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Builds the full trees from their low-resolution cells and the projected points
Proposed Method
Curl-Remover
40
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Proposed Method
Curl-Remover
High-resolution cells are never shuffled
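A sketch of the resulting map function (a plain-Python stand-in for a Hadoop mapper; the key names and the [0, 1) normalization are assumptions). Only the full-space tree is shown, but Curl-Remover also builds one tree per (E-1)-dimensional projection:

```python
from collections import Counter

def map_partition(points, top_attrs, n_coarse_levels=3):
    """Emit partial counts for the N coarsest tree levels only, plus each
    point projected onto the M most relevant attributes of the sample.
    High-resolution cells are rebuilt on the reduce side from the
    projected points, so they are never shuffled."""
    coarse = Counter()
    for p in points:                      # p: tuple of values in [0, 1)
        for level in range(1, n_coarse_levels + 1):
            r = 1.0 / (2 ** level)
            cell = tuple(int(v / r) for v in p)
            coarse[('full', level, cell)] += 1
        # Shuffle the projected point itself instead of its fine cells.
        yield ('projected', tuple(p[a] for a in top_attrs)), 1
    for key, partial_count in coarse.items():
        yield key, partial_count          # key: tree/level + cell position
```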
Proposed Method
Curl-Remover
41
2nd issue - one data pass per irrelevant attribute.
Our solution - stores/reads the highest-resolution tree level instead of the original data.
42
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Rdb = cost to read the dataset;
TWRtree = cost to transfer, write and read the last tree level in the next reduce step;
if Rdb > TWRtree, then write the tree.
Proposed Method
Curl-Remover
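A sketch of that decision rule (the per-byte cost constants are illustrative placeholders, not values from the paper):

```python
def should_write_tree(dataset_bytes, tree_level_bytes,
                      read_cost=1.0, transfer_write_read_cost=3.0):
    """Persist the highest-resolution tree level only when re-reading the
    raw dataset in the next iteration (Rdb) would cost more than
    transferring, writing and reading that level (TWRtree)."""
    r_db = dataset_bytes * read_cost
    twr_tree = tree_level_bytes * transfer_write_read_cost
    return r_db > twr_tree

# e.g., a 1 TB dataset versus a 40 GB last tree level: write the tree.
print(should_write_tree(dataset_bytes=1e12, tree_level_bytes=4e10))  # True
```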
43
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Proposed Method
Curl-Remover
44
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Writes the tree’s last level to HDFS
Proposed Method
Curl-Remover
45
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Reads the tree’s last level from HDFS
Proposed Method
Curl-Remover
46
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Proposed Method
Curl-Remover
Reads the dataset only twice
Proposed Method
Curl-Remover
47
3rd issue - not enough memory for mappers.
Our solution - sorts data in the mappers and reports “tree slices” whenever needed.
48
Removes the E - ⌈D2⌉ least relevant attributes, one at a time
in ascending order of relevance.
Sorts its local points and builds “tree slices”, monitoring memory consumption
Proposed Method
Curl-Remover
Proposed Method
Curl-Remover
49
[figure: a tree slice over the X-Y plane]
Proposed Method
Curl-Remover
50
Reports “tree slices” with very little overlap
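A sketch of the slicing loop inside one mapper (the fixed cell budget stands in for the real memory monitoring; the tree level and threshold are illustrative):

```python
def build_tree_slices(local_points, level=12, max_cells=1_000_000):
    """Sort local points, grow a counting 'tree slice' cell by cell, and
    flush the slice whenever the cell budget is exhausted. Sorting keeps
    each slice spatially compact, so flushed slices barely overlap."""
    r = 1.0 / (2 ** level)
    slice_cells = {}
    for p in sorted(local_points):        # sort by spatial position
        cell = tuple(int(v / r) for v in p)
        slice_cells[cell] = slice_cells.get(cell, 0) + 1
        if len(slice_cells) >= max_cells:
            yield slice_cells             # report one tree slice
            slice_cells = {}
    if slice_cells:
        yield slice_cells                 # report the final slice
```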
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
51
Evaluation
Datasets
Sierpinski - Sierpinski Triangle + 1 linearly correlated attribute + 2 non-linearly correlated attributes. 5 attributes, 1.1 billion points;
Sierpinski Hybrid - Sierpinski Triangle + 1 non-linearly correlated attribute + 2 random attributes. 5 attributes, 1.1 billion points;
Yahoo! Network Flows - communication patterns between end-users on the web. 12 attributes, 562 million points;
Astro - high-resolution cosmological simulation. 6 attributes, 1 billion points;
Hepmass - physics-related dataset with particles of unknown mass. 28 attributes, 10.5 million points;
Hepmass Duplicated - Hepmass + 28 correlated attributes. 56 attributes, 10.5 million points.
52
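To illustrate how a dataset like Sierpinski can be built: a chaos-game Sierpinski triangle plus correlated attributes. The exact correlations used in the paper are not specified here, so the extra attributes below are illustrative choices.

```python
import numpy as np

def sierpinski_dataset(n_points, seed=42):
    """Chaos-game Sierpinski triangle (fractal dimension ~1.58) plus one
    linearly and two non-linearly correlated attributes: 5 attributes in
    total, mirroring the structure of the 'Sierpinski' dataset."""
    rng = np.random.default_rng(seed)
    vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
    xy = np.empty((n_points, 2))
    p = rng.random(2)
    for i in range(n_points):
        p = (p + vertices[rng.integers(3)]) / 2.0   # one chaos-game step
        xy[i] = p
    x, y = xy[:, 0], xy[:, 1]
    return np.column_stack([
        x, y,
        3.0 * x + 0.5,            # linearly correlated attribute
        np.sin(2 * np.pi * x),    # non-linearly correlated attribute
        y ** 2,                   # non-linearly correlated attribute
    ])

data = sierpinski_dataset(100_000)  # the paper scales this to 1.1 billion points
```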
Evaluation
Fractal Dimension
Hepmass
53
Evaluation
Fractal Dimension
Hepmass Duplicated
54
Evaluation
Comparison with sPCA - Classification
55
Evaluation
Comparison with sPCA - Classification
56
8% more accurate, 7.5% faster
Evaluation
Comparison with sPCA
Percentage of the fractal dimension preserved after selection
57
Agenda
Fundamental Concepts
Related Work
Proposed Method
Evaluation
Conclusion
58
Conclusions
✓ Accuracy - eliminates both linear and non-linear attribute correlations, besides irrelevant attributes; 8% more accurate than sPCA;
✓ Scalability - linear scalability with the data size (theoretical analysis); experiments with up to 1.1 billion points;
✓ Unsupervised - requires neither a guess of how many attributes to remove nor a training set;
✓ Semantics - it is a feature selection method, thus maintaining the semantics of the attributes;
✓ Generality - it suits analytical tasks in general, not only classification.
63
Questions?
robson@icmc.usp.br