Alex Priem (@_alex_priem_) 
Edwin de Jonge (@edwindjonge) 
Strata, 21 nov 2014, Barcelona 
Patterns and meta patterns 
in Income Tax Data
Age vs mortgage debt (men)
Who are we? 
Statistical consultants / Data scientists 
working @ R&D department of Statistics Netherlands 
Statistics Netherlands (SN): 
-Government agency 
-Produces all official statistics of The Netherlands 
Income statistics based on Tax data 
Income Tax data 
–Contains all income tax records for the Netherlands 
–Approx 17M records with 550 variables. 
–Used to produce income statistics! 
Analysis is not trivial 
–Income Tax is complex (at least in the Netherlands) 
‐stages of progressive tax 
‐Complex Tax deductions (mortgage, flex workers) 
‐Complex Tax benefits (child care, social benefits) 
Tax data (2) 
-550 variables (for each person in NL):
-15 identifiers/unique keys
-Dwelling, person id, etc.
-70 categorical
-250 numerical variables from the income tax form
- >200 derived variables (useful for analysis)
-E.g. disposable income, income of dwelling/household
Income/tax distributions 
Income (re)distribution hot topic since Piketty 
So how are income/tax/benefits distributed? 
-Look at 1D distributions: histograms 
-Look at 2D distributions: heatmaps 
-Problem: potentially 0.5 n(n-1) > 100k heatmaps! 
-even more when categorical included 
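As a rough check on the pair count above, here is a minimal Python sketch. The variable counts (250 numerical plus at least 200 derived, and 550 in total) come from the slides; the function name is only illustrative.

```python
from math import comb


def n_pairs(n_variables: int) -> int:
    """Number of unordered variable pairs, i.e. 0.5 * n * (n - 1)."""
    return comb(n_variables, 2)


# 250 numerical + at least 200 derived variables gives a lower bound of 450.
print(n_pairs(450))  # 101025
# All 550 variables, if every one of them could be put on an axis.
print(n_pairs(550))  # 150975
```

Either way the number of candidate heatmaps is far too large to inspect by hand, which motivates the pruning and clustering discussed later.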
Let's look at patterns…
Heatmap Patterns 
–What defines a pattern in a heatmap?
‐Peak/Spike? (mode, 0D point) 
‐Stripe (1D): 
•Horizontal Line? 
•Vertical Line? 
•Band? 
•Ridge? 
‐Blob (2D) 
‐Similarity between distributions (2D) 
Meta pattern? 
A meta pattern consists of a pattern that repeats in:
‐different subpopulations
•E.g. male/female, socioeconomic status, branch of industry
‐different pairs of variables 
•Income x age 
•Benefits x age 
•Etc. 
So patterns that are generic over different heatmaps. 
Looking for patterns 
Subpopulations: 
–Generate a heatmap per category, e.g. Age x Gross Income per socioeconomic status
–Automatically cluster heatmaps on distribution similarity
Pairs of variables:
-Generate heatmaps for all pairs
-Prune: remove heatmaps with low support
1. Use image classification to cluster them
2. Or cluster on extracted mode/line (work in progress)
You will still need to look at the result! 
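The slides do not say how the pruning and similarity clustering were implemented, so the following is only a sketch under stated assumptions: a heatmap is a list of rows of bin counts, "support" is the total record count, similarity is the total variation distance between normalised heatmaps, and the greedy grouping is a simple stand-in for a real clustering method. The thresholds and category names are hypothetical.

```python
def support(counts):
    """Total number of records in a heatmap (a list of rows of bin counts)."""
    return sum(sum(row) for row in counts)


def normalise(counts):
    """Flatten a heatmap and scale it so that all bins sum to 1."""
    total = support(counts)
    return [c / total for row in counts for c in row]


def distance(a, b):
    """Total variation distance: 0 means identical, 1 means no overlap."""
    return 0.5 * sum(abs(p - q) for p, q in zip(a, b))


def group_heatmaps(heatmaps, min_support=100, max_distance=0.2):
    """Prune low-support heatmaps, then greedily group the rest by similarity."""
    kept = {name: normalise(c) for name, c in heatmaps.items()
            if support(c) >= min_support}
    groups = []  # each group is a list of names; the first one is its reference
    for name, dist in kept.items():
        for group in groups:
            if distance(dist, kept[group[0]]) <= max_distance:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical 2 x 2 heatmaps per subpopulation.
heatmaps = {
    "employee": [[50, 10], [10, 30]],
    "civil servant": [[48, 12], [11, 29]],
    "pensioner": [[5, 60], [30, 5]],
    "tiny group": [[1, 0], [0, 2]],
}
print(group_heatmaps(heatmaps))
```

Here the two similar distributions end up in one group, the pensioners form their own group, and the heatmap with only three records is pruned. The result still has to be inspected visually.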
Why Visualization?
Anscombe's quartet…
DS1           DS2           DS3           DS4
x    y        x    y        x    y        x    y
10   8.04     10   9.14     10   7.46     8    6.58
8    6.95     8    8.14     8    6.77     8    5.76
13   7.58     13   8.74     13   12.74    8    7.71
9    8.81     9    8.77     9    7.11     8    8.84
11   8.33     11   9.26     11   7.81     8    8.47
14   9.96     14   8.1      14   8.84     8    7.04
6    7.24     6    6.13     6    6.08     8    5.25
4    4.26     4    3.1      4    5.39     19   12.5
12   10.84    12   9.13     12   8.15     8    5.56
7    4.82     7    7.26     7    6.42     8    7.91
5    5.68     5    4.74     5    5.73     8    6.89
Anscombe’s quartet 
Property                                   Value
Mean of x1, x2, x3, x4                     All equal: 9
Variance of x1, x2, x3, x4                 All equal: 11
Mean of y1, y2, y3, y4                     All equal: 7.50
Variance of y1, y2, y3, y4                 All equal: 4.1
Correlation for ds1, ds2, ds3, ds4         All equal: 0.816
Linear regression for ds1, ds2, ds3, ds4   All equal: y = 3.00 + 0.500x
Looks the same, right?
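The summary statistics quoted for the quartet can be checked numerically. The sketch below uses only the Python standard library; the data values are the ones listed on the slide, while the names `QUARTET`, `summarise` and `SUMMARY` are illustrative.

```python
from math import sqrt
from statistics import mean, variance

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x values of DS1-DS3
QUARTET = {
    "DS1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "DS2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "DS3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "DS4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(x, y):
    """Mean, sample variance, correlation and least-squares line of one dataset."""
    mx, my = mean(x), mean(y)
    vx, vy = variance(x), variance(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / vx
    return {
        "mean_x": mx, "var_x": vx, "mean_y": my, "var_y": vy,
        "corr": cov / sqrt(vx * vy),
        "slope": slope, "intercept": my - slope * mx,
    }


SUMMARY = {name: summarise(x, y) for name, (x, y) in QUARTET.items()}
for name, stats in SUMMARY.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

All four datasets print nearly identical numbers, which is exactly why the summary alone cannot tell them apart.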
Let's plot!
Machine learning 
So is clustering (machine learning) any different?
Visualization helps to … 
–Test your (hidden model) assumptions! 
–Find structure in data, e.g. “How is my data distributed?”
–Visually explore patterns: 
‐Are there clusters? 
‐Are there outliers? 
Heatmap recipe
1. Take two numerical variables x and y 
2. Determine range $r_x = [\min(x), \max(x)]$
3. Chop $r_x$ in $n_x$ equal pieces
4. Repeat for y
5. We now have $n_x \cdot n_y$ bins
6. Count # records in each bin 
7. Assign colors to counts 
8. Plot matrix 
9. Enjoy!
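The recipe above can be sketched in a few lines of plain Python. This is not the authors' heatmapr code; the function names, the text-based "plot" and the character ramp standing in for a color scale are all assumptions made for illustration.

```python
def bin_index(value, lo, hi, n_bins):
    """Index of the equal-width bin of [lo, hi] that contains value."""
    if hi == lo:                       # degenerate range: everything in bin 0
        return 0
    i = int((value - lo) / (hi - lo) * n_bins)
    return min(i, n_bins - 1)          # value == hi belongs to the last bin


def heatmap_counts(x, y, nx, ny):
    """Steps 2-6: ranges, equal-width bins, and record counts per bin."""
    x_lo, x_hi = min(x), max(x)
    y_lo, y_hi = min(y), max(y)
    counts = [[0] * nx for _ in range(ny)]     # counts[row][col]; rows follow y
    for xi, yi in zip(x, y):
        row = bin_index(yi, y_lo, y_hi, ny)
        col = bin_index(xi, x_lo, x_hi, nx)
        counts[row][col] += 1
    return counts


def render(counts, shades=" .:*#"):
    """Steps 7-8: map counts linearly to characters and lay out the matrix."""
    peak = max(max(row) for row in counts) or 1
    lines = []
    for row in reversed(counts):               # highest y bin is drawn on top
        lines.append("".join(shades[c * (len(shades) - 1) // peak] for c in row))
    return "\n".join(lines)


x = [0, 1, 2, 3, 4, 4]
y = [0, 1, 2, 3, 4, 4]
counts = heatmap_counts(x, y, nx=2, ny=2)
print(counts)
print(render(counts))
```

Each later step of the talk (range, binning, counting, colors) refines one of these functions rather than changing the overall structure.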
Easy as pie? 
Best practices and problems with heatmaps: 
-Resolution 
-Rescaling 
-Zooming 
-Outliers 
-Color scales 
Range: Outliers? (1D) 
[Figure: 1D distribution of gross income, axis running from -1M€ to +5M€]
Range: outliers removed (1% removed) 
[Figure: 1D distribution of gross income with outliers removed, axis up to +150k€]
Range: outliers… 
Does your data contain outliers? 
-If so: most pixels are empty 
-In most cases outliers have low mass and are barely visible
Truncate the range in the x or y direction, e.g. at the 99% quantile
-Interactively: allow for zoom and pan. 
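A minimal sketch of quantile-based truncation, assuming a simple nearest-rank quantile (other conventions exist) and synthetic incomes; the function names and the data are illustrative, not taken from the tax data.

```python
def quantile(values, q):
    """q-quantile by nearest rank on the sorted data (one simple convention)."""
    ordered = sorted(values)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]


def truncate(values, lower_q=0.0, upper_q=0.99):
    """Keep only the values between the lower and upper quantile."""
    lo, hi = quantile(values, lower_q), quantile(values, upper_q)
    return [v for v in values if lo <= v <= hi]


# Synthetic example: 198 ordinary incomes plus two extreme outliers.
incomes = [20_000 + 100 * i for i in range(198)] + [5_000_000, -1_000_000]
trimmed = truncate(incomes, lower_q=0.01, upper_q=0.99)

print(min(incomes), max(incomes))                 # range dominated by outliers
print(min(trimmed), max(trimmed), len(trimmed))   # range after truncation
```

Only a handful of records are dropped, but the plotted range shrinks by orders of magnitude, so far fewer pixels stay empty.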
Range: data skewed? 
–Many variables are not normally distributed:
‐Power law: $x^{\alpha}$
‐Exponential: $e^{ax+b}$
So rescale x or y or both
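The slides do not state which transform was used. One common choice for skewed data that also contains zeros and negative values (such as incomes) is a signed logarithm; the sketch below assumes that choice, and `signed_log10` is a hypothetical helper name.

```python
import math


def signed_log10(value):
    """log10(1 + |v|) with the sign of v restored; maps 0 to 0."""
    return math.copysign(math.log10(1.0 + abs(value)), value)


incomes = [-1_000_000, -999, 0, 999, 150_000, 5_000_000]
rescaled = [signed_log10(v) for v in incomes]
print([round(r, 2) for r in rescaled])
```

The transform is applied to x, y or both before the range is determined and chopped into bins, so that equal-width bins on the new axis correspond to roughly equal ratios on the original one.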
Chop: AKA “Binning” 
Chop: resolution 
Resolution matters
[Figures: the same heatmap at 25 x 25, 50 x 50, 100 x 100, 250 x 250 and 500 x 500 bins]
Chop: Too small / Too big 
If there are too few bins:
- patterns are hidden
If there are too many bins:
- heatmap is noisy (signal vs noise)
The optimal number of bins depends on the data
(kernel based approx), but always play with bin size / resolution! 
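One cheap way to play with resolution, sketched here as an assumption rather than the authors' method: count once on a fine grid and merge blocks of bins to get coarser grids. With equal-width bins over the same range this is exact, and the resolutions 250, 100, 50 and 25 all divide 500. The function name `coarsen` is illustrative.

```python
def coarsen(counts, factor):
    """Merge factor x factor blocks of bins by summing their counts."""
    ny, nx = len(counts), len(counts[0])
    if ny % factor or nx % factor:
        raise ValueError("factor must divide both dimensions")
    return [
        [
            sum(counts[r][c]
                for r in range(big_r * factor, (big_r + 1) * factor)
                for c in range(big_c * factor, (big_c + 1) * factor))
            for big_c in range(nx // factor)
        ]
        for big_r in range(ny // factor)
    ]


fine = [
    [1, 0, 2, 0],
    [0, 1, 0, 2],
    [3, 0, 4, 0],
    [0, 3, 0, 4],
]
coarse = coarsen(fine, 2)
print(coarse)  # [[2, 4], [6, 8]]
```

The checkerboard structure visible in the fine grid disappears in the coarse one, a small-scale version of patterns being hidden when there are too few bins.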
Chop: integers… 
Count: zero counts 
Not every variable is relevant for each person! 
Count: exclude zero values 
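A minimal sketch of excluding zero values before counting, assuming that a zero means "not applicable" for that person and that a record is dropped when either variable is zero (dropping only when both are zero is an equally defensible choice). The helper name is hypothetical.

```python
def drop_zeros(x, y):
    """Keep only the records where both variables are non-zero."""
    pairs = [(a, b) for a, b in zip(x, y) if a != 0 and b != 0]
    return [a for a, _ in pairs], [b for _, b in pairs]


age = [25, 40, 0, 67]
mortgage_debt = [0, 180_000, 90_000, 40_000]
print(drop_zeros(age, mortgage_debt))  # ([40, 67], [180000, 40000])
```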
Assign colors! 
Colors: scales 
–Color ‘intensity’ implies value 
–Perceived response depends on ‘color’ and ‘color lightness’ (compare #00ff00 with #0000ff)
–Different models for color response: 
‐RGB (models computer monitor) 
‐HSV 
‐HCL 
‐CIELAB (models human eye) 
–Gradient generator: http://davidjohnstone.net/pages/lch-lab-colour-gradient-picker
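To make the #00ff00 versus #0000ff comparison concrete, the sketch below computes the CIELAB lightness L* of an sRGB color using the standard sRGB transfer function and luminance weights for a D65 white point. The function names are illustrative; a library such as chroma.js would normally do this conversion.

```python
def srgb_to_linear(channel):
    """Undo the sRGB gamma encoding of one 0-255 channel value."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def lightness(hex_colour):
    """CIELAB L* (0 = black, 100 = white) of an sRGB colour like '#00ff00'."""
    h = hex_colour.lstrip("#")
    r, g, b = (srgb_to_linear(int(h[i:i + 2], 16)) for i in (0, 2, 4))
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b    # relative luminance
    if y > (6 / 29) ** 3:
        return 116 * y ** (1 / 3) - 16
    return (29 / 3) ** 3 * y                    # linear segment near black


for colour in ("#00ff00", "#0000ff"):
    print(colour, round(lightness(colour), 1))
```

Both colors have the same RGB "intensity" (one channel at full value), yet pure green comes out far lighter than pure blue, which is why a gradient built naively in RGB does not look evenly spaced.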
Colors 
–Color has two functions in heatmap: 
‐Show ‘counts’ in your data 
‐Show ‘patterns’ 
At least, use a perceptually uniform gradient 
-Libs: chroma.js, colorbrewer (R) 
…but patterns need distinct colors 
Color scales 
–Range of color scale depends on distribution of data. 
–Often have multiple populations/distributions in data 
–Severe spikes/stripes drown the smaller distributions: 
‐We suggest log scale 
‐Sometimes log scale is not enough 
–In practice, linear scale with low maximum cut-off works well 
–Effect is best understood in 3D (!). 
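The three scalings mentioned above can be compared on a toy example. This is a sketch with hypothetical function names; each function maps a bin count to a position between 0 and 1 on the color gradient, and the counts are invented to mimic a severe spike.

```python
import math


def linear_scale(count, max_count):
    """Position on the gradient proportional to the count."""
    return count / max_count


def log_scale(count, max_count):
    """Logarithmic position; log1p keeps empty bins at 0."""
    return math.log1p(count) / math.log1p(max_count)


def cutoff_scale(count, cutoff):
    """Linear up to a low maximum; everything above it is saturated."""
    return min(count, cutoff) / cutoff


spike, ordinary = 10_000, 40      # one severe spike next to an ordinary bin
print(round(linear_scale(ordinary, spike), 3))   # nearly invisible
print(round(log_scale(ordinary, spike), 3))      # visible, but compressed
print(cutoff_scale(ordinary, 50), cutoff_scale(spike, 50))
```

With the linear scale the ordinary bin is drowned by the spike; the cut-off sacrifices detail inside the spike in exchange for contrast in the rest of the distribution.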
Peaks are best cut off
Example: Linear gradient 
Log-gradient 
Linear gradient with cut-off 
Perceptually uniform gradient 
Colors: background/missings matters 
Heatmaps side-by-side: gross income, men vs women 
Meta pattern 
A meta pattern consists of a pattern that repeats in:
‐different subpopulations 
‐different pairs of variables 
So patterns that are generic over different heatmaps. 
Heatmaps decomposed in subpopulations: 
Gross income by socioeconomic status 
Gross income, men, categorized by socioeconomic status 
Patterns 
–Stripes are real, not outliers:
‐They correspond with benefits and tax breaks
–This needs a paradigm shift: data is not normally distributed (but we knew that).
Meta pattern 
A meta pattern consists of a pattern that repeats in:
‐different subpopulations 
‐different pairs of variables 
So patterns that are generic over different heatmaps. 
Image classification of heatmaps 
No Domain knowledge required? 
Salary pay structure 
Domain knowledge, take II 
Pattern removal: Effect of weighting 
Summary 
Heatmaps: 
–ideal tool for analyzing big datasets 
–Be aware of perceptual and data biases! 
Questions? 
Thank you for your attention! 
More info? 
ah.priem@cbs.nl / @_alex_priem 
e.dejonge@cbs.nl / @edwindjonge 
Heatmapping code available at 
https://github.com/alexpriem/heatmapr

Heatmaps best practices Strata Hadoop
