Heatmaps best practices Strata Hadoop

Alex Priem (@_alex_priem_)
Edwin de Jonge (@edwindjonge)
Strata, 21 nov 2014, Barcelona
Patterns and meta patterns
in Income Tax Data

Who are we?
Statistical consultants / Data scientists
working @ R&D department of Statistics Netherlands
Statistics Netherlands (SN):
-Government agency
-Produces all official statistics of The Netherlands
3

Income statistics based on Tax data
4

Income Tax data
–Contains all income tax records for the Netherlands
–Approx 17M records with 550 variables.
–Used to produce income statistics!
Analysis is not trivial
–Income Tax is complex (at least in the Netherlands)
‐stages of progressive tax
‐Complex Tax deductions (mortgage, flex workers)
‐Complex Tax benefits (child care, social benefits)
5

Tax data (2)
-550 variables (for each person in NL):
-15 identifiers/unique keys
-Dwelling, person id, etc.
-70 categorical
-250 numerical variables from the income tax form
-More than 200 derived variables (useful for analysis)
-E.g. disposable income, income of dwelling/household
6

Income/tax distributions
Income (re)distribution hot topic since Piketty
So how are income/tax/benefits distributed?
-Look at 1D distributions: histograms
-Look at 2D distributions: heatmaps
-Problem: for n variables, potentially n(n-1)/2 > 100k heatmaps!
-even more when categorical included
7
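To get a feel for the size of the problem, the pair count can be computed directly. This is a minimal sketch; the value 450 is an assumption (roughly the 250 tax-form variables plus about 200 derived ones), not a figure stated in the talk.

```python
from math import comb


def n_heatmaps(n_variables: int) -> int:
    """Number of unordered variable pairs, i.e. n(n-1)/2."""
    return comb(n_variables, 2)


# Assumption: only numerical and derived variables are paired.
print(n_heatmaps(450))  # 101025
# Upper bound if all 550 variables were paired.
print(n_heatmaps(550))  # 150975
```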

Heatmap Patterns
–What defines a pattern in heatmap?
‐Peak/Spike? (mode, 0D point)
‐Stripe (1D):
•Horizontal Line?
•Vertical Line?
•Band?
•Ridge?
‐Blob (2D)
‐Similarity between distributions (2D)
9

Meta pattern?
A meta pattern consists of a pattern that repeats in:
‐different subpopulations
•E.g. male/female, socioeconomic status, branch of industry a person works in
‐different pairs of variables
•Income x age
•Benefits x age
•Etc.
So patterns that are generic over different heatmaps.
10

Looking for patterns
Subpopulations:
–Generate a heatmap per category, e.g. Age x Gross Income per socioeconomic status
–Automatically cluster heatmaps on distribution similarity
Pairs of variables:
-Generate heatmaps for all pairs
-Prune: remove heatmaps with low support
1. Use image classification to cluster them
2. Or cluster on extracted mode/line (work in progress)
You will still need to look at the result!
11
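The prune-then-cluster idea can be sketched in a few lines. This is an illustration only: it swaps in a simple greedy threshold clustering on total variation distance, not the image classification mentioned on the slide. The names `min_support` and `max_dist` are hypothetical parameters, and all heatmaps are assumed to have the same shape.

```python
from typing import Dict, List

Heatmap = List[List[int]]  # matrix of bin counts


def support(h: Heatmap) -> int:
    """Total number of records in the heatmap."""
    return sum(sum(row) for row in h)


def normalize(h: Heatmap) -> List[float]:
    """Flatten the counts into a probability distribution."""
    total = support(h)
    return [c / total for row in h for c in row]


def tv_distance(p: List[float], q: List[float]) -> float:
    """Total variation distance: 0 = identical, 1 = disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def prune_and_cluster(heatmaps: Dict[str, Heatmap],
                      min_support: int,
                      max_dist: float) -> List[List[str]]:
    # Prune: drop heatmaps with too few records.
    kept = {name: normalize(h) for name, h in heatmaps.items()
            if support(h) >= min_support}
    # Greedy clustering: join the first cluster whose first member is close.
    clusters: List[List[str]] = []
    for name, dist in kept.items():
        for cluster in clusters:
            if tv_distance(dist, kept[cluster[0]]) <= max_dist:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


maps = {
    "a": [[10, 0], [0, 10]],
    "b": [[9, 1], [1, 9]],
    "c": [[0, 10], [10, 0]],
    "d": [[1, 0], [0, 0]],  # low support, pruned
}
print(prune_and_cluster(maps, min_support=5, max_dist=0.2))
```

The result still has to be inspected by eye; the clustering only reduces how many heatmaps need to be looked at.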

Anscombe's quartet…
13
DS1          DS2          DS3          DS4
   x      y     x      y     x      y     x      y
  10   8.04    10   9.14    10   7.46     8   6.58
   8   6.95     8   8.14     8   6.77     8   5.76
  13   7.58    13   8.74    13  12.74     8   7.71
   9   8.81     9   8.77     9   7.11     8   8.84
  11   8.33    11   9.26    11   7.81     8   8.47
  14   9.96    14   8.10    14   8.84     8   7.04
   6   7.24     6   6.13     6   6.08     8   5.25
   4   4.26     4   3.10     4   5.39    19  12.50
  12  10.84    12   9.13    12   8.15     8   5.56
   7   4.82     7   7.26     7   6.42     8   7.91
   5   5.68     5   4.74     5   5.73     8   6.89

Anscombe’s quartet
Property                                   Value
Mean of x1, x2, x3, x4                     All equal: 9
Variance of x1, x2, x3, x4                 All equal: 11
Mean of y1, y2, y3, y4                     All equal: 7.50
Variance of y1, y2, y3, y4                 All equal: 4.1
Correlation for ds1, ds2, ds3, ds4         All equal: 0.816
Linear regression for ds1, ds2, ds3, ds4   All equal: y = 3.00 + 0.500x
Looks the same, right?
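The summary statistics of the quartet can be checked with a short script. This is a sketch using plain Python; the helper name `summary` is illustrative, and the variance is the sample variance (dividing by n-1).

```python
from math import sqrt

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x for DS1..DS3
QUARTET = {
    "DS1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                   7.24, 4.26, 10.84, 4.82, 5.68]),
    "DS2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                   6.13, 3.10, 9.13, 7.26, 4.74]),
    "DS3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "DS4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
             5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summary(xs, ys):
    """Mean, sample variance, correlation and least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx, "var_x": sxx / (n - 1),
        "mean_y": my, "var_y": syy / (n - 1),
        "corr": sxy / sqrt(sxx * syy),
        "slope": slope, "intercept": my - slope * mx,
    }


for name, (xs, ys) in QUARTET.items():
    s = summary(xs, ys)
    print(name, {k: round(v, 3) for k, v in s.items()})
```

The four rows of output agree to two or three decimals, which is exactly why the numbers alone cannot tell the data sets apart.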

Machine learning
So is clustering (machine learning) any different?
16

Visualization helps to …
–Test your (hidden model) assumptions!
– To find structure in data, e.g.
“How is my data distributed?”
–Visually explore patterns:
‐Are there clusters?
‐Are there outliers?
18

20
1. Take two numerical variables x and y
2. Determine range r_x = [min(x), max(x)]
3. Chop r_x in n_x equal pieces
4. Repeat for y
5. We now have n_x · n_y bins
6. Count # records in each bin
7. Assign colors to counts
8. Plot matrix
9. Enjoy!
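Steps 1 to 6 of the recipe can be sketched in plain Python as follows. The function names are illustrative, and color assignment and plotting (steps 7 and 8) are left out.

```python
from typing import List, Sequence


def bin_index(value: float, lo: float, hi: float, n_bins: int) -> int:
    """Index of the equal-width bin in [lo, hi] that contains value."""
    if hi == lo:
        return 0
    i = int((value - lo) / (hi - lo) * n_bins)
    return min(i, n_bins - 1)  # the maximum falls in the last bin


def heatmap_counts(xs: Sequence[float], ys: Sequence[float],
                   nx: int, ny: int) -> List[List[int]]:
    """Count records per bin; rows follow y, columns follow x."""
    x_lo, x_hi = min(xs), max(xs)  # step 2: range of x
    y_lo, y_hi = min(ys), max(ys)  # step 4: same for y
    counts = [[0] * nx for _ in range(ny)]  # step 5: nx * ny bins
    for x, y in zip(xs, ys):  # step 6: count records
        row = bin_index(y, y_lo, y_hi, ny)
        col = bin_index(x, x_lo, x_hi, nx)
        counts[row][col] += 1
    return counts


print(heatmap_counts([0, 1, 2, 3], [0, 0, 3, 3], nx=2, ny=2))
# [[2, 0], [0, 2]]
```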

Easy as pie?
Best practices and problems with heatmaps:
-Resolution
-Rescaling
-Zooming
-Outliers
-Color scales
21

23
Recipe, step 2: Determine range r_x = [min(x), max(x)]

Range: Outliers? (1D)
24
+5M€
-1M€
Gross Income

Range: outliers removed (1% removed)
25
Gross Income
+150k€

Range: outliers…
Does your data contain outliers?
-If so: most pixels are empty
-Most cases: outliers have low mass and are barely visible
Truncate range in x or y direction, e.g. at the 99% quantile
-Interactively: allow for zoom and pan.
26
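Truncating at a quantile can be sketched like this. The nearest-rank quantile definition and the function names are assumptions made for the example; other quantile definitions give slightly different cut points.

```python
import math
from typing import List, Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Nearest-rank quantile for 0 <= q <= 1."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def truncate_range(values: Sequence[float],
                   lower_q: float = 0.0,
                   upper_q: float = 0.99) -> List[float]:
    """Keep only values between the two quantiles (inclusive)."""
    lo = quantile(values, lower_q)
    hi = quantile(values, upper_q)
    return [v for v in values if lo <= v <= hi]


# 99 ordinary incomes and one extreme outlier (made-up numbers).
incomes = list(range(1, 100)) + [5_000_000]
kept = truncate_range(incomes)
print(len(kept), max(kept))
```

With the outlier gone, the remaining range fills the plot instead of leaving most pixels empty.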

Range: data skewed?
27
–Many variables are not normally distributed:
‐Power law: x^α
‐Exponential: e^(ax+b)
So rescale x or y or both
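One possible rescaling is sketched below. Because income variables can be negative, a plain logarithm does not work; the signed log used here is an assumption for illustration, not necessarily the transform used in the talk.

```python
import math


def signed_log10(value: float) -> float:
    """log10(1 + |value|) with the sign of value; maps 0 to 0."""
    return math.copysign(math.log10(1.0 + abs(value)), value)


for v in (-1_000_000, -9, 0, 9, 150_000, 5_000_000):
    print(v, round(signed_log10(v), 2))
```

The transform is monotonic, so the order of the records is preserved while the long tails are compressed.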

28
Recipe, step 3: Chop r_x in n_x equal pieces

30
Chop: resolution
Resolution matters

Chop: Too small / Too big
If #bins too small:
- patterns are hidden
If #bins too large:
- heatmap is noisy (signal vs noise)
The optimal number of bins depends on the data (kernel-based approximations exist), but always play with bin size / resolution!
36
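As a starting point for the number of bins, a rule of thumb can help. The sketch below uses the Freedman-Diaconis rule (bin width = 2 · IQR / n^(1/3)), which is a swapped-in technique and not the kernel-based approach named on the slide; the quartiles are taken crudely by index and `max_bins` is a hypothetical safety cap.

```python
import math
from typing import Sequence


def freedman_diaconis_bins(values: Sequence[float],
                           max_bins: int = 1000) -> int:
    """Suggested number of equal-width bins for one variable."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        return 1
    iqr = ordered[(3 * n) // 4] - ordered[n // 4]  # crude quartiles
    span = ordered[-1] - ordered[0]
    if iqr == 0 or span == 0:
        return 1
    width = 2 * iqr / n ** (1 / 3)
    return max(1, min(max_bins, math.ceil(span / width)))


print(freedman_diaconis_bins(range(1000)))
print(freedman_diaconis_bins(range(8000)))
```

Treat the outcome as an initial guess only and vary the resolution around it.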

38
Recipe, step 6: Count # records in each bin

Count: zero counts
Not every variable is relevant for each person!
39

Count: exclude zero values
40
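Excluding zeros before binning can be sketched as below. Whether a record is dropped when either value is zero (as here) or only when both are zero is a choice; this version is an assumption for illustration.

```python
from typing import List, Sequence, Tuple


def drop_zero_pairs(xs: Sequence[float],
                    ys: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Keep only records where both variables are non-zero."""
    kept = [(x, y) for x, y in zip(xs, ys) if x != 0 and y != 0]
    return [x for x, _ in kept], [y for _, y in kept]


xs, ys = drop_zero_pairs([0, 100, 200, 0], [5, 0, 7, 0])
print(xs, ys)  # [200] [7]
```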

42
Recipe, step 7: Assign colors to counts

Colors: scales
–Color ‘intensity’ implies value
–Perceived response depends on ‘color’ and ‘color lightness’ (compare #00ff00 with #0000ff)
–Different models for color response:
‐RGB (models computer monitor)
‐HSV
‐HCL
‐CIELAB (models human eye)
–Gradient generator: http://davidjohnstone.net/pages/lch-lab-colour-gradient-picker
44
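The difference between #00ff00 and #0000ff can be made concrete by computing the CIELAB lightness L* of both. This sketch assumes the hex codes are sRGB colors with a D65 white point; the function names are illustrative.

```python
def srgb_to_linear(channel_8bit: int) -> float:
    """Undo the sRGB gamma curve for one 0..255 channel."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def cielab_lightness(hex_color: str) -> float:
    """CIELAB L* (0 = black, 100 = white) of a '#rrggbb' color."""
    h = hex_color.lstrip("#")
    r, g, b = (srgb_to_linear(int(h[i:i + 2], 16)) for i in (0, 2, 4))
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b  # relative luminance
    delta = 6 / 29
    f = y ** (1 / 3) if y > delta ** 3 else y / (3 * delta ** 2) + 4 / 29
    return 116 * f - 16


print(round(cielab_lightness("#00ff00"), 1))  # pure green
print(round(cielab_lightness("#0000ff"), 1))  # pure blue
```

Both colors have the same RGB 'intensity' (one channel at full strength), yet green comes out far lighter than blue, which is why RGB gradients are not perceptually uniform.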

Colors
–Color has two functions in heatmap:
‐Show ‘counts’ in your data
‐Show ‘patterns’
At least, use a perceptually uniform gradient
-Libs: chroma.js, colorbrewer (R)
…but patterns need distinct colors
45

Color scales
–Range of color scale depends on distribution of data.
–Often have multiple populations/distributions in data
–Severe spikes/stripes drown the smaller distributions:
‐We suggest log scale
‐Sometimes log scale is not enough
–In practice, linear scale with low maximum cut-off works well
–Effect is best understood in 3D (!).
46
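The two scaling options can be compared with a small sketch. Both functions map a bin count to a value between 0 and 1 that would then index a color gradient; the function names and the example numbers are made up for illustration.

```python
import math


def scale_linear_cutoff(count: float, cutoff: float) -> float:
    """Linear scale; every count at or above cutoff gets the top color."""
    return min(count, cutoff) / cutoff


def scale_log(count: float, max_count: float) -> float:
    """Log scale relative to the largest count in the heatmap."""
    if max_count <= 0:
        return 0.0
    return math.log1p(count) / math.log1p(max_count)


spike, small = 100_000, 50  # one severe spike, one small population
print(small / spike)                    # plain linear: practically invisible
print(scale_log(small, spike))          # log scale
print(scale_linear_cutoff(small, 100))  # linear with a low cut-off
```

With a plain linear scale the small population sits at 0.0005 of the color range; the log scale and the cut-off both lift it into the visible part of the gradient.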

Linear gradient with cut-off
50

Perceptually uniform gradient
51

Colors: background/missings matters
52

Heatmaps side-by-side: gross income, men vs women
53
men

Meta pattern
54

Heatmaps decomposed in subpopulations:
55

Gross income by socioeconomic status
56

Gross income, men, categorized by socioeconomic status
57

Patterns
–Stripes are real, not outliers:
–Corresponds with benefits, tax breaks
–Needs paradigm shift: data is not normally distributed (but we knew that).
58

Meta pattern
59

Image classification of heatmaps
60

No Domain knowledge required?
61

Pattern removal: Effect of weighting
65

Summary
Heatmaps:
–ideal tool for analyzing big datasets
–Be aware of perceptual and data biases!
66

Questions?
Thank you for your attention!
More info?
ah.priem@cbs.nl / @_alex_priem
e.dejonge@cbs.nl / @edwindjonge
Heatmapping code available at
https://github.com/alexpriem/heatmapr
67
