1
AUTOMATING DATA EXPLORATION
A structured approach to analysing data
A TOOL AGNOSTIC APPROACH
2
AUTOMATING DATA EXPLORATION
A structured approach to analysing data
METADATA
UNIVARIATE ANALYSIS
BIVARIATE ANALYSIS
LET’S TAKE A DATASET
3
Each row has details about an employee who has left the organization.
Just “reading” the dataset is quite informative.
DESCRIBE THE DATA IN A STRUCTURED WAY
4
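As a rough sketch of what describing the data in a structured way could look like, the snippet below builds one summary record per column: an inferred type, the number of rows, missing values, distinct values, and the most common value. The sample rows and the column names (`Region`, `Band`, `Age`) are hypothetical and only illustrate the idea; they are not the dataset shown on the slide.

```python
from collections import Counter


def is_number(value):
    """Return True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def describe_columns(rows):
    """Return one summary dict per column of a list of row dicts."""
    summary = []
    for col in rows[0].keys():
        values = [row[col] for row in rows]
        # Treat empty strings and the literal "nan" as missing.
        present = [v for v in values if v not in (None, "", "nan")]
        numeric = bool(present) and all(is_number(v) for v in present)
        top = Counter(present).most_common(1)
        summary.append({
            "column": col,
            "type": "number" if numeric else "text",
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "top": top[0][0] if top else None,
        })
    return summary


# Hypothetical sample rows; the column names are made up for illustration.
sample = [
    {"Region": "India", "Band": "5A", "Age": "27"},
    {"Region": "India", "Band": "4B", "Age": "34"},
    {"Region": "China", "Band": "5A", "Age": ""},
]
for line in describe_columns(sample):
    print(line)
```

The same summary works with any tool that can iterate over rows, which is in keeping with the tool-agnostic framing of the deck.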
CATEGORICAL COLUMNS YIELD VERY LITTLE DATA
6
There’s not much information in one column. The values are not quantitative, so a distribution is not meaningful. The values are not even ordered.
In fact, the only thing we have is the list of values and their count.
... or is there more to this?
Region Count
India 10780
Headstrong 1554
China 1130
Philippines 1030
US 792
Romania 788
Mexico 324
Guatemala 233
Poland 124
Brazil 45
Hungary 41
Colombia 38
Netherlands 33
South Africa 30
UK 18
UAE 15
GMS India 15
Japan 11
CZECH Republic 10
Kenya 9
... BUT RANK FREQUENCY IS STILL POSSIBLE
7
The rank of the row provides additional information.
With this, we can explore the distribution of the rank against the count.
These distributions are called rank-frequency distributions.
Rank Region Count
1 India 10780
2 Headstrong 1554
3 China 1130
4 Philippines 1030
5 US 792
6 Romania 788
7 Mexico 324
8 Guatemala 233
9 Poland 124
10 Brazil 45
11 Hungary 41
12 Colombia 38
13 Netherlands 33
14 South Africa 30
15 UK 18
16 UAE 15
17 GMS India 15
18 Japan 11
19 CZECH Republic 10
20 Kenya 9
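A rank-frequency table like the one above can be derived from any categorical column by counting values and numbering them from the most frequent down. The following is a minimal sketch using a toy list of values; the function name `rank_frequency` is an illustrative choice, not something from the talk.

```python
from collections import Counter


def rank_frequency(values):
    """Return (rank, value, count) tuples, most frequent value first."""
    ordered = Counter(values).most_common()  # sorted by count, descending
    return [(rank, value, count)
            for rank, (value, count) in enumerate(ordered, start=1)]


# Toy column; in practice this would be one categorical column of the dataset.
regions = ["India"] * 5 + ["China"] * 3 + ["US"] * 2 + ["Kenya"]
for rank, region, count in rank_frequency(regions):
    print(rank, region, count)
```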
REGION SHOWS A POWER LAW DISTRIBUTION
8
[Plot: frequency on a log scale (y-axis) against rank on a log scale (x-axis)]
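One way to check how close a rank-frequency distribution is to a straight line on log-log axes is to fit a least-squares line to log(count) against log(rank). The sketch below does this for the Region counts listed on the earlier slide. A roughly straight line is suggestive of a power law rather than proof of one, and the helper name `loglog_fit` is an assumption made for this example.

```python
import math


def loglog_fit(counts):
    """Least-squares fit of log(count) against log(rank).

    Returns (slope, intercept, r_squared). All counts must be positive.
    """
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(c) for c in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Region counts from the slide, in rank order.
region_counts = [10780, 1554, 1130, 1030, 792, 788, 324, 233, 124, 45,
                 41, 38, 33, 30, 18, 15, 15, 11, 10, 9]
slope, intercept, r2 = loglog_fit(region_counts)
print(f"slope={slope:.2f}, r^2={r2:.2f}")
```

The slope is the estimated exponent of the rank-frequency relationship, and the r-squared value gives a rough sense of how straight the line is.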
COST CODE SHOWS A POWER LAW DISTRIBUTION
9
Cost Code Count
105 9542
121 1757
125 875
122 796
3001 654
3310 635
124 435
131 415
115 336
nan 207
101 205
127 173
109 148
116 91
126 66
...
LE SHOWS A POWER LAW DISTRIBUTION
10
LE Count
D84 11487
GPL 853
RM1 789
LC2 565
GMR 323
D95 247
GUT 233
ML1 223
CTK 184
AXE 127
A38 98
A21 79
EMP 61
BRL 45
A66 43
...
11
WHAT CAUSES POWER LAW DISTRIBUTIONS?
PREFERENTIAL ATTACHMENT
EXPONENTIAL GROWTH
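To illustrate preferential attachment, here is a small rich-get-richer simulation, offered as a sketch under stated assumptions rather than as the model used in the talk. At each step, either a new item appears (with an assumed probability `new_item_prob`) or an existing item is chosen in proportion to how often it has been chosen before. The parameter values and the function name are hypothetical.

```python
import random
from collections import Counter


def simulate_preferential_attachment(steps, new_item_prob=0.1, seed=0):
    """Simulate rich-get-richer growth and return a Counter of item counts."""
    rng = random.Random(seed)
    picks = [0]      # one entry per past choice; item 0 starts with one pick
    next_item = 1
    for _ in range(steps):
        if rng.random() < new_item_prob:
            picks.append(next_item)          # a brand-new item enters
            next_item += 1
        else:
            # Choosing uniformly from past picks selects each item in
            # proportion to its current popularity.
            picks.append(rng.choice(picks))
    return Counter(picks)


counts = simulate_preferential_attachment(20000)
print([count for _, count in counts.most_common(10)])
```

The resulting counts are heavily skewed: a few items collect most of the picks, while the typical item is picked only a handful of times.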
NO. OF FOLLOWERS ON GITHUB
12
Username Count
slidenerd 1700
astaxie 1320
MugunthKumar 1081
honcheng 870
arunoda 827
csjaba 670
cheeaun 658
timoxley 600
karlseguin 600
hemanth 514
arvindr21 400
yuvipanda 335
mbrochh 330
anandology 330
sayanee 314
zz85 314
sanand0 309
captn3m0 300
sameersbn 300
...
NO. OF MOVIES ACTED IN BY BOLLYWOOD PEOPLE
13
Person Count
Lata Mangeshkar 824
Asha Bhosle 810
Shakti Kapoor 589
Kishore Kumar 585
Mohammed Rafi 527
Sunidhi Chauhan 515
Alka Yagnik 451
Udit Narayan 435
Kader Khan 430
Sonu Nigam 405
Sameer 398
Asrani 397
Helen 395
Shaan 377
Aruna Irani 375
Anupam Kher 367
Shreya Ghoshal 357
Gulshan Grover 341
...
PARTIES IN PARLIAMENT ELECTIONS
14
Name Count
IND 44704
INC 7213
BJP 3354
BSP 2628
SP 1311
CPI 1102
JD 943
CPM 914
DDP 716
JNP 676
BJS 657
JP 563
NOTA 543
PSP 538
INC(I) 492
SHS 467
AAP 432
SWA 410
...
CANDIDATE NAMES IN ASSEMBLY ELECTIONS
15
Name Count
NONE OF THE ABOVE 629
OM PRAKASH 478
ASHOK KUMAR 411
RAM SINGH 362
RAJ KUMAR 294
ANIL KUMAR 271
AMAR SINGH 248
MOHAN LAL 235
RAM KUMAR 224
BABU LAL 218
RAM PRASAD 213
JAGDISH 210
VIJAY KUMAR 207
RAJENDRA SINGH 196
VINOD KUMAR 195
SHYAM LAL 193
RAJESH KUMAR 186
SITA RAM 186
RAM LAL 171
...
STUDENT NAMES IN SSA SURVEY
16
Name Count
M.MANIKANDAN 99
S.PAVITHRA 84
S.MANIKANDAN 84
R.RAMYA 82
S.SANGEETHA 70
R.MANIKANDAN 69
S.DIVYA 68
M.PAVITHRA 68
S.SANTHIYA 67
S.VIGNESH 67
M.PRIYA 67
M.MAHALAKSHMI 64
S.SARANYA 63
S.SURYA 60
K.MANIKANDAN 60
P.PAVITHRA 56
S.GAYATHRI 56
P.MANIKANDAN 55
...
[Figure labels: Jain, Harini, Shweta, Sneha, Pooja, Ashwin, Shah, Deepti, Sanjana, Varshini, Ezhumalai, Venkatesan, Silambarasan, Pandiyan, Kumaresan, Manikandan, Thirupathi, Agarwal, Kumar, Priya]
NOT EVERYTHING IS POWER-LAW, THOUGH
18
We need to understand the behaviours that drive these distributions.
ORDERED CATEGORICALS HAVE MORE INFORMATION
19
CORPORATE BAND
20
Band Count
5 12247
4 4449
3 205
2 63
Not Mapped 24
1 22
SVP 10
LOCAL BAND
21
Band Count
5A 7483
5B 4764
4A 1683
4B 1612
4C 747
4D 407
3 205
2 63
Not Mapped 24
1 22
SVP 10
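Because the levels of an ordered categorical have a meaningful sequence, the counts can be read along that sequence instead of by frequency, which allows cumulative shares. The sketch below uses the Corporate Band counts from the slide. The band order (`SVP`, then 1 through 5) is an assumption made for illustration, and "Not Mapped" is left out because it has no place in the order.

```python
# Corporate Band counts from the slide.
band_counts = {"5": 12247, "4": 4449, "3": 205, "2": 63,
               "Not Mapped": 24, "1": 22, "SVP": 10}

# Assumed ordering of the bands; the slide does not state it.
band_order = ["SVP", "1", "2", "3", "4", "5"]


def cumulative_share(counts, order):
    """Return (level, count, cumulative share) tuples following the given order."""
    total = sum(counts[level] for level in order)
    running = 0
    result = []
    for level in order:
        running += counts[level]
        result.append((level, counts[level], running / total))
    return result


for level, count, share in cumulative_share(band_counts, band_order):
    print(f"{level:>3} {count:>6} {share:.1%}")
```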
QUANTITIES HAVE EVEN MORE INFORMATION
22
AGE DISTRIBUTION IS LOG-NORMAL
23
DETECTING FRAUD
“We know meter readings are incorrect, for various reasons. We don’t, however, have the concrete proof we need to start the process of meter reading automation.
Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.”
ENERGY UTILITY
24
This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the tariff slab boundaries.
This clearly shows collusion of some form with the customers.
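A simple way to surface readings that occur unusually often is to count how many times each reading appears and compare that count with the average count of its immediate neighbours. The sketch below does this on made-up readings. The neighbourhood size, the threshold, and the use of the neighbour average as the "expected" count are all assumptions for illustration, not the method used in the original analysis.

```python
from collections import Counter


def find_spikes(readings, window=2, ratio=3.0):
    """Return {reading: percentage excess over the neighbour average}.

    A reading is flagged when its count is at least `ratio` times the
    average count of the readings within `window` units on either side.
    """
    counts = Counter(readings)
    spikes = {}
    for value, count in counts.items():
        neighbours = [counts.get(value + d, 0)
                      for d in range(-window, window + 1) if d != 0]
        baseline = sum(neighbours) / len(neighbours)
        if baseline > 0 and count / baseline >= ratio:
            spikes[value] = round(100 * (count - baseline) / baseline)
    return dict(sorted(spikes.items()))


# Made-up readings: a flat spread from 90 to 110 plus extra readings at exactly 100.
readings = [r for r in range(90, 111) for _ in range(10)] + [100] * 40
print(find_spikes(readings))  # {100: 400}
```

In this toy data, the reading 100 appears 50 times against a neighbour average of 10, so it is flagged with a 400% excess.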
This happens with specific customers, not randomly. Here are such customers’ meter readings, one customer per row.
Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
   217    219    200    200    200    200    200    200    200    350    200    200
   250    200    200    200    201    200    200    200    250    200    200    150
   250    150    150    200    200    200    200    200    200    200    200    150
   150    200    200    200    200    200    200    200    200    200    200     50
   200    200    200    150    180    150     50    100     50     70    100    100
   100    100    100    100    100    100    100    100    100    100    110    100
   100    150    123    123     50    100     50    100    100    100    100    100
     0    111    100    100    100    100    100    100    100    100     50     50
     0    100     27    100     50    100    100    100    100    100     70    100
     1      1      1    100     99     50    100    100    100    100    100    100
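The pattern in these readings can be quantified by counting, for each customer, how many of the twelve monthly readings fall exactly on a slab boundary. The sketch below uses the ten rows of readings from the slide and the boundaries of 50, 100, 200 and 300 units mentioned in the editor's note. The function name `months_on_boundary` is an illustrative choice.

```python
# Slab boundaries (units) as described in the editor's note.
SLAB_BOUNDARIES = {50, 100, 200, 300}

# Monthly readings from the slide, Apr-10 to Mar-11, one customer per row.
customer_readings = [
    [217, 219, 200, 200, 200, 200, 200, 200, 200, 350, 200, 200],
    [250, 200, 200, 200, 201, 200, 200, 200, 250, 200, 200, 150],
    [250, 150, 150, 200, 200, 200, 200, 200, 200, 200, 200, 150],
    [150, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 50],
    [200, 200, 200, 150, 180, 150, 50, 100, 50, 70, 100, 100],
    [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 110, 100],
    [100, 150, 123, 123, 50, 100, 50, 100, 100, 100, 100, 100],
    [0, 111, 100, 100, 100, 100, 100, 100, 100, 100, 50, 50],
    [0, 100, 27, 100, 50, 100, 100, 100, 100, 100, 70, 100],
    [1, 1, 1, 100, 99, 50, 100, 100, 100, 100, 100, 100],
]


def months_on_boundary(readings, boundaries=SLAB_BOUNDARIES):
    """Count how many readings sit exactly on a slab boundary."""
    return sum(1 for reading in readings if reading in boundaries)


for number, readings in enumerate(customer_readings, start=1):
    hits = months_on_boundary(readings)
    print(f"Customer {number}: {hits} of {len(readings)} months on a boundary")
```

Ranking all customers by this count is one possible way to shortlist accounts for a closer look.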
Section   Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
Section 1    70%    97%   136%    65%   110%   116%   121%   107%   114%    88%    74%   109%
Section 2    66%    92%    66%    87%    70%    64%    63%    50%    58%    38%    41%    54%
Section 3    90%    46%    47%    43%    28%    31%    50%    32%    19%    38%     8%    34%
Section 4    44%    24%    36%    39%    21%    18%    24%    49%    56%    44%    31%    14%
Section 5     4%    63%   -27%    20%    41%    82%    26%    34%    43%     2%    37%    15%
Section 6    18%    23%    30%    21%    28%    33%    39%    41%    39%    18%     0%    33%
Section 7    36%    51%    33%    33%    27%    35%    10%    39%    12%     5%    15%    14%
Section 8    22%    21%    28%    12%    24%    27%    10%    31%    13%    11%    22%    17%
Section 9    19%    35%    14%     9%    16%    32%    37%    12%     9%     5%    -3%    11%
If we define the “extent of fraud” as the percentage excess of the 100 unit meter reading, the value varies considerably across sections and time, with some explainable anomalies.
New section manager arrives … and is transferred out.
Why would these happen?
25
PREDICTING MARKS
“What determines a child’s marks?
Do girls score better than boys?
Does the choice of subject matter?
Does the medium of instruction matter?
Does community or religion matter?
Does their birthday matter?
Does the first letter of their name matter?”
EDUCATION
26
TN CLASS X: ENGLISH
[Histogram of marks: x-axis 0 to 100, y-axis 0 to 40,000]
27
TN CLASS X: SOCIAL SCIENCE
[Histogram of marks: x-axis 0 to 100, y-axis 0 to 40,000]
28
TN CLASS X: LANGUAGE
[Histogram of marks: x-axis 0 to 100, y-axis 0 to 40,000]
29
TN CLASS X: SCIENCE
[Histogram of marks: x-axis 0 to 100, y-axis 0 to 40,000]
30
TN CLASS X: MATHEMATICS
[Histogram of marks: x-axis 0 to 100, y-axis 0 to 40,000]
31
ICSE 2013 CLASS XII: TOTAL MARKS
32
CBSE 2013 CLASS XII: ENGLISH MARKS
33
CBSE 2013 CLASS XII: PHYSICS MARKS
34
LET’S TAKE ONE DAY CRICKET DATA
Country      Player                Runs ScoreRate MatchDate  Ground                               Versus
Australia    Michael J Clarke      99*  93.39     30-06-2010 The Oval                             England
Australia    Dean M Jones          99*  128.57    28-01-1985 Adelaide Oval                        Sri Lanka
Australia    Bradley J Hodge       99*  115.11    04-02-2007 Melbourne Cricket Ground             New Zealand
India        Virender Sehwag       99*  99        16-08-2010 Rangiri Dambulla International Stad. Sri Lanka
New Zealand  Bruce A Edgar         99*  72.79     14-02-1981 Eden Park                            India
Pakistan     Mohammad Yousuf       99*  95.19     15-11-2007 Captain Roop Singh Stadium           India
West Indies  Richard B Richardson  99*  70.21     15-11-1985 Sharjah CA Stadium                   Pakistan
West Indies  Ramnaresh R Sarwan    99*  95.19     15-11-2002 Sardar Patel Stadium                 India
Zimbabwe     Andrew Flower         99*  89.18     24-10-1999 Harare Sports Club                   Australia
Zimbabwe     Alistair D R Campbell 99*  79.83     01-10-2000 Queens Sports Club                   New Zealand
Zimbabwe     Malcolm N Waller      99*  133.78    25-10-2011 Queens Sports Club                   New Zealand
Australia    David C Boon          98*  82.35     08-12-1994 Bellerive Oval                       Zimbabwe
Australia    Graeme M Wood         98*  63.22     11-01-1981 Melbourne Cricket Ground             India
England      Ian J L Trott         98*  84.48     20-10-2011 Punjab Cricket Association Stadium   India
India        Yuvraj Singh          98*  89.09     01-08-2001 Sinhalese Sports Club Ground         Sri Lanka
Ireland      Kevin J O'Brien       98*  94.23     10-07-2010 VRA Ground                           Scotland
Kenya        Collins O Obuya       98*  75.96     13-03-2011 M.Chinnaswamy Stadium                Australia
Netherlands  Ryan N ten Doeschate  98*  73.68     01-09-2009 VRA Ground                           Afghanistan
New Zealand  James E C Franklin    98*  142.02    07-12-2010 M.Chinnaswamy Stadium                India
Pakistan     Ijaz Ahmed            98*  112.64    28-10-1994 Iqbal Stadium                        South Africa
South Africa Jacques H Kallis      98*  74.24     06-02-2000 St George's Park                     Zimbabwe
36
Against which countries are higher averages scored?
Which countries’ players score more per match?
37
Which player scores the most per ball?
The player with the highest strike rate is an obscure South African whose name most of us have never heard of.
In fact, this list is filled with players we have never heard of.
38
Most analysis answers the question “Which are the top 10 X?”
Which are my top products?
Which are my top branches?
Who are my best sales people?
Which vendors have the highest cost per unit?
Which divisions are spending the most money?
In which hours does the under 12 segment watch TV most?
Which customer segment has the highest revenue per user?
39
THIS QUESTION CAN BE ANSWERED SYSTEMATICALLY
Take every column in the data.
Find the top value by that column.
Country: South Africa has the highest strike rate of 76%
Player: Johann Louw has the highest strike rate of 329%
Runs: 164 runs has the highest strike rate of 156%
MatchDate: 12-03-2006 has the highest strike rate of 136%
Ground: AC-VDCA Stadium has the highest strike rate of 98%
Versus: United States has the highest strike rate of 104%
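The two steps above can be sketched as a loop over columns: group the rows by each column's values, aggregate the metric within each group, and report the group with the highest value. The example below uses five rows taken from the cricket table and averages `ScoreRate` per group. Using a simple mean as the aggregate is an assumption made for illustration (the slide does not say how strike rate was aggregated), so the output will not match the figures on the slide.

```python
from collections import defaultdict


def top_value_per_column(rows, metric, columns):
    """For each column, return (value, average metric) of the best-scoring group."""
    results = {}
    for col in columns:
        groups = defaultdict(list)
        for row in rows:
            groups[row[col]].append(row[metric])
        # Assumed aggregate: the plain mean of the metric within each group.
        averages = {value: sum(v) / len(v) for value, v in groups.items()}
        best = max(averages, key=averages.get)
        results[col] = (best, averages[best])
    return results


# A five-row subset of the cricket table, for illustration only.
rows = [
    {"Country": "Australia", "Player": "Michael J Clarke", "ScoreRate": 93.39,
     "Ground": "The Oval", "Versus": "England"},
    {"Country": "Australia", "Player": "Dean M Jones", "ScoreRate": 128.57,
     "Ground": "Adelaide Oval", "Versus": "Sri Lanka"},
    {"Country": "India", "Player": "Virender Sehwag", "ScoreRate": 99.0,
     "Ground": "Rangiri Dambulla International Stad.", "Versus": "Sri Lanka"},
    {"Country": "New Zealand", "Player": "James E C Franklin", "ScoreRate": 142.02,
     "Ground": "M.Chinnaswamy Stadium", "Versus": "India"},
    {"Country": "Kenya", "Player": "Collins O Obuya", "ScoreRate": 75.96,
     "Ground": "M.Chinnaswamy Stadium", "Versus": "Australia"},
]

columns = ["Country", "Player", "Ground", "Versus"]
for col, (value, rate) in top_value_per_column(rows, "ScoreRate", columns).items():
    print(f"{col}: {value} has the highest average ScoreRate of {rate:.2f}")
```

Because the loop does not depend on what the columns mean, the same function can be pointed at any table and any numeric metric.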
40
What do children in schools know, and what can they do, at different stages of elementary education?
Have the inputs made into the elementary education system had a beneficial effect or not?
41
HAVING BOOKS IMPROVES READING ABILITY
Having more books at home improves the performance of children when it comes to reading. (But children typically have only 1-10 books at home.)
Number of students sampled
What is the impact? How many more marks can having more books fetch?
Circle size indicates number of students with this response. Few students have no books.
Is this response (“25+ books”) good or bad? Small red bars indicate low marks. Large green bars indicate high marks. Students having 25+ books tend to score high marks.
The most common response is marked in blue. This is also the largest circle.
The graphic is summarized in words
Indicates whether the best response is the most popular. Blue means that it is not. Green means that it is. Red means that the worst level is the most popular response.
42
CHILDREN LIKE GAMES, AND THEY’RE GOOD
… but playing daily hurts reading ability
43
WATCHING TV OCCASIONALLY IS GOOD
Children who watch TV every day don’t do as well as children who watch TV only once a week.
But children who never watch TV fare the worst.
Watching TV every day helps improve children’s reading ability a little bit more…
… but mathematical abilities fall dramatically at that point
44
WE HAVE A WEBSITE THAT YOU CAN EXPLORE
GRAMENER.COM/NAS
45

Automating Data Exploration SciPy 2016

  • 1. 1 AUTOMATING DATA EXPLORATION A structured approach to analysing data A TOOL AGNOSTIC APPROACH
  • 2. 2 AUTOMATING DATA EXPLORATION A structured approach to analysing data METADATA UNIVARIATE ANALYSIS BIVARIATE ANALYSIS
  • 3. LET’S TAKE A DATASET 3 Each row has details about an employee who has left the organization. Just “reading” the dataset is quite informative.
  • 4. DESCRIBE THE DATA IN A STRUCTURED WAY 4
  • 5. 5 AUTOMATING DATA EXPLORATION A structured approach to analysing data METADATA UNIVARIATE ANALYSIS BIVARIATE ANALYSIS
  • 6. CATEGORICAL COLUMNS YIELD VERY LITTLE DATA 6 There’s not much information in one column. The values are not quantitative, so a distribution is not meaningful. The values are not even ordered. In fact, the only thing we have is the list of values and their count. ... or is there more to this? Region Count India 10780 Headstrong 1554 China 1130 Philippines 1030 US 792 Romania 788 Mexico 324 Guatemala 233 Poland 124 Brazil 45 Hungary 41 Colombia 38 Netherlands 33 South Africa 30 UK 18 UAE 15 GMS India 15 Japan 11 CZECH Republic 10 Kenya 9
  • 7. ... BUT RANK FREQUENCY IS STILL POSSIBLE 7 The rank of the row provides additional information. With this, we can explore the distribution of the rank against the count. These distributions are called rank- frequency distributions. Rank Region Count 1 India 10780 2 Headstrong 1554 3 China 1130 4 Philippines 1030 5 US 792 6 Romania 788 7 Mexico 324 8 Guatemala 233 9 Poland 124 10 Brazil 45 11 Hungary 41 12 Colombia 38 13 Netherlands 33 14 South Africa 30 15 UK 18 16 UAE 15 17 GMS India 15 18 Japan 11 19 CZECH Republic 10 20 Kenya 9
  • 8. REGION SHOWS A POWER LAW DISTRIBUTION 8 Region Count India 10780 Headstrong 1554 China 1130 Philippines 1030 US 792 Romania 788 Mexico 324 Guatemala 233 Poland 124 Brazil 45 Hungary 41 Colombia 38 Netherlands 33 South Africa 30 UK 18 UAE 15 GMS India 15 Japan 11 CZECH Republic 10 Kenya 9 Rank on a log scale Frequencyonalogscale
  • 9. COST CODE SHOWS A POWER LAW DISTRIBUTION 9 Cost Code Count 105 9542 121 1757 125 875 122 796 3001 654 3310 635 124 435 131 415 115 336 nan 207 101 205 127 173 109 148 116 91 126 66 ...
  • 10. LE SHOWS A POWER LAW DISTRIBUTION 10 LE Count D84 11487 GPL 853 RM1 789 LC2 565 GMR 323 D95 247 GUT 233 ML1 223 CTK 184 AXE 127 A38 98 A21 79 EMP 61 BRL 45 A66 43 ...
  • 11. 11 WHAT CAUSES POWER LAW DISTRIBUTIONS? PREFERENTIAL ATTACHMENT EXPONENTIAL GROWTH
  • 12. NO. OF FOLLOWERS ON GITHUB 12 Username Count slidenerd 1700 astaxie 1320 MugunthKumar 1081 honcheng 870 arunoda 827 csjaba 670 cheeaun 658 timoxley 600 karlseguin 600 hemanth 514 arvindr21 400 yuvipanda 335 mbrochh 330 anandology 330 sayanee 314 zz85 314 sanand0 309 captn3m0 300 sameersbn 300 ...
  • 13. NO. OF MOVIES ACTED IN BY BOLLYWOOD PEOPLE 13 Person Count Lata Mangeshkar 824 Asha Bhosle 810 Shakti Kapoor 589 Kishore Kumar 585 Mohammed Rafi 527 Sunidhi Chauhan 515 Alka Yagnik 451 Udit Narayan 435 Kader Khan 430 Sonu Nigam 405 Sameer 398 Asrani 397 Helen 395 Shaan 377 Aruna Irani 375 Anupam Kher 367 Shreya Ghoshal 357 Gulshan Grover 341 ...
  • 14. PARTIES IN PARLIAMENT ELECTIONS 14 Name Count IND 44704 INC 7213 BJP 3354 BSP 2628 SP 1311 CPI 1102 JD 943 CPM 914 DDP 716 JNP 676 BJS 657 JP 563 NOTA 543 PSP 538 INC(I) 492 SHS 467 AAP 432 SWA 410 ...
  • 15. CANDIDATE NAMES IN ASSEMBLY ELECTIONS 15 Name Count NONE OF THE ABOVE 629 OM PRAKASH 478 ASHOK KUMAR 411 RAM SINGH 362 RAJ KUMAR 294 ANIL KUMAR 271 AMAR SINGH 248 MOHAN LAL 235 RAM KUMAR 224 BABU LAL 218 RAM PRASAD 213 JAGDISH 210 VIJAY KUMAR 207 RAJENDRA SINGH 196 VINOD KUMAR 195 SHYAM LAL 193 RAJESH KUMAR 186 SITA RAM 186 RAM LAL 171 ...
  • 16. STUDENT NAMES IN SSA SURVEY 16 Name Count M.MANIKANDAN 99 S.PAVITHRA 84 S.MANIKANDAN 84 R.RAMYA 82 S.SANGEETHA 70 R.MANIKANDAN 69 S.DIVYA 68 M.PAVITHRA 68 S.SANTHIYA 67 S.VIGNESH 67 M.PRIYA 67 M.MAHALAKSHMI 64 S.SARANYA 63 S.SURYA 60 K.MANIKANDAN 60 P.PAVITHRA 56 S.GAYATHRI 56 P.MANIKANDAN 55 ...
  • 18. NOT EVERYTHING IS POWER-LAW, THOUGH 18 Need to understand what drives these distributions from their behaviours
  • 19. ORDERED CATEGORICALS HAVE MORE INFORMATION 19
  • 20. CORPORATE BAND 20 LE Count 5 12247 4 4449 3 205 2 63 Not Mapped 24 1 22 SVP 10
  • 21. LOCAL BAND 21 LE Count 5A 7483 5B 4764 4A 1683 4B 1612 4C 747 4D 407 3 205 2 63 Not Mapped 24 1 22 SVP 10
  • 22. QUANTITIES HAVE EVEN MORE INFORMATION 22
  • 23. AGE DISTRIBUTION IS LOG-NORMAL 23
  • 24. DETECTING FRAUD “ We know meter readings are incorrect, for various reasons. We don’t, however, have the concrete proof we need to start the process of meter reading automation. Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns. ENERGY UTILITY 24
  • 25. This plot shows the frequency of all meter readings from Apr- 2010 to Mar-2011. An unusually large number of readings are aligned with the tariff slab boundaries. This clearly shows collusion of some form with the customers. Apr-10 May-10Jun-10Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11 217 219 200 200 200 200 200 200 200 350 200 200 250 200 200 200 201 200 200 200 250 200 200 150 250 150 150 200 200 200 200 200 200 200 200 150 150 200 200 200 200 200 200 200 200 200 200 50 200 200 200 150 180 150 50 100 50 70 100 100 100 100 100 100 100 100 100 100 100 100 110 100 100 150 123 123 50 100 50 100 100 100 100 100 0 111 100 100 100 100 100 100 100 100 50 50 0 100 27 100 50 100 100 100 100 100 70 100 1 1 1 100 99 50 100 100 100 100 100 100 This happens with specific customers, not randomly. Here are such customers’ meter readings. Section Apr-10 May-10Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11 Section 1 70% 97% 136% 65% 110% 116% 121% 107% 114% 88% 74% 109% Section 2 66% 92% 66% 87% 70% 64% 63% 50% 58% 38% 41% 54% Section 3 90% 46% 47% 43% 28% 31% 50% 32% 19% 38% 8% 34% Section 4 44% 24% 36% 39% 21% 18% 24% 49% 56% 44% 31% 14% Section 5 4% 63% -27% 20% 41% 82% 26% 34% 43% 2% 37% 15% Section 6 18% 23% 30% 21% 28% 33% 39% 41% 39% 18% 0% 33% Section 7 36% 51% 33% 33% 27% 35% 10% 39% 12% 5% 15% 14% Section 8 22% 21% 28% 12% 24% 27% 10% 31% 13% 11% 22% 17% Section 9 19% 35% 14% 9% 16% 32% 37% 12% 9% 5% -3% 11% If we define the “extent of fraud” as the percentage excess of the 100 unit meter reading, the value varies considerably across sections, and time New section manager arrives … and is transferred out … with some explainable anomalies. Why would these happen? 25
  • 26. PREDICTING MARKS “ What determines a child’s marks? Do girls score better than boys? Does the choice of subject matter? Does the medium of instruction matter? Does community or religion matter? Does their birthday matter? Does the first letter of their name matter? EDUCATION 26
  • 27. TN CLASS X: ENGLISH 0 5,000 10,000 15,000 20,000 25,000 30,000 35,000 40,000 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 27
  • 28. TN CLASS X: SOCIAL SCIENCE 0 5,000 10,000 15,000 20,000 25,000 30,000 35,000 40,000 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 28
  • 29. TN CLASS X: LANGUAGE 0 5,000 10,000 15,000 20,000 25,000 30,000 35,000 40,000 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 29
  • 30. TN CLASS X: SCIENCE 0 5,000 10,000 15,000 20,000 25,000 30,000 35,000 40,000 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 30
  • 31. TN CLASS X: MATHEMATICS 0 5,000 10,000 15,000 20,000 25,000 30,000 35,000 40,000 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 31
  • 32. ICSE 2013 CLASS XII: TOTAL MARKS 32
  • 33. CBSE 2013 CLASS XII: ENGLISH MARKS 33
  • 34. CBSE 2013 CLASS XII: PHYSICS MARKS 34
  • 35. 35 AUTOMATING DATA EXPLORATION A structured approach to analysing data METADATA UNIVARIATE ANALYSIS BIVARIATE ANALYSIS
  • 36. LET’S TAKE ONE DAY CRICKET DATA Country Player Runs ScoreRate MatchDate Ground Versus Australia Michael J Clarke 99* 93.39 30-06-2010The Oval England Australia Dean M Jones 99* 128.57 28-01-1985Adelaide Oval Sri Lanka Australia Bradley J Hodge 99* 115.11 04-02-2007Melbourne Cricket Ground New Zealand India Virender Sehwag 99* 99 16-08-2010Rangiri Dambulla International Stad. Sri Lanka New Zealand Bruce A Edgar 99* 72.79 14-02-1981Eden Park India Pakistan Mohammad Yousuf 99* 95.19 15-11-2007Captain Roop Singh Stadium India West Indies Richard B Richardson 99* 70.21 15-11-1985Sharjah CA Stadium Pakistan West Indies Ramnaresh R Sarwan 99* 95.19 15-11-2002Sardar Patel Stadium India Zimbabwe Andrew Flower 99* 89.18 24-10-1999Harare Sports Club Australia Zimbabwe Alistair D R Campbell 99* 79.83 01-10-2000Queens Sports Club New Zealand Zimbabwe Malcolm N Waller 99* 133.78 25-10-2011Queens Sports Club New Zealand Australia David C Boon 98* 82.35 08-12-1994Bellerive Oval Zimbabwe Australia Graeme M Wood 98* 63.22 11-01-1981Melbourne Cricket Ground India England Ian J L Trott 98* 84.48 20-10-2011Punjab Cricket Association Stadium India India Yuvraj Singh 98* 89.09 01-08-2001Sinhalese Sports Club Ground Sri Lanka Ireland Kevin J O'Brien 98* 94.23 10-07-2010VRA Ground Scotland Kenya Collins O Obuya 98* 75.96 13-03-2011M.Chinnaswamy Stadium Australia Netherlands Ryan N ten Doeschate 98* 73.68 01-09-2009VRA Ground Afghanistan New Zealand James E C Franklin 98* 142.02 07-12-2010M.Chinnaswamy Stadium India Pakistan Ijaz Ahmed 98* 112.64 28-10-1994Iqbal Stadium South Africa South Africa Jacques H Kallis 98* 74.24 06-02-2000St George's Park Zimbabwe 36
  • 37. Against which countries are higher averages scored? Which countries’ players score more per match? 37
  • 38. Which player scores the most per ball? The player with the highest strike rate is an obscure South African whose name most of us have never heard of. In fact, this list is filled with players we have never heard of. 38
  • 39. Most analysis answers the question “Which is are the top 10 X”? Which are my top products? Which are my top branches? Who are my best sales people? Which vendors have the highest cost per unit? Which divisions are spending the most money? In which hours does the under 12 segment watch TV most? Which customer segment has the highest revenue per user? 39
  • 40. THIS QUESTION CAN BE ANSWERED SYSTEMATICALLY
    Country | Player | Runs | ScoreRate | MatchDate | Ground | Versus
    Australia | Michael J Clarke | 99* | 93.39 | 30-06-2010 | The Oval | England
    Australia | Dean M Jones | 99* | 128.57 | 28-01-1985 | Adelaide Oval | Sri Lanka
    Australia | Bradley J Hodge | 99* | 115.11 | 04-02-2007 | Melbourne Cricket Ground | New Zealand
    India | Virender Sehwag | 99* | 99 | 16-08-2010 | Rangiri Dambulla International Stad. | Sri Lanka
    New Zealand | Bruce A Edgar | 99* | 72.79 | 14-02-1981 | Eden Park | India
    Pakistan | Mohammad Yousuf | 99* | 95.19 | 15-11-2007 | Captain Roop Singh Stadium | India
    West Indies | Richard B Richardson | 99* | 70.21 | 15-11-1985 | Sharjah CA Stadium | Pakistan
    West Indies | Ramnaresh R Sarwan | 99* | 95.19 | 15-11-2002 | Sardar Patel Stadium | India
    Zimbabwe | Andrew Flower | 99* | 89.18 | 24-10-1999 | Harare Sports Club | Australia
    Zimbabwe | Alistair D R Campbell | 99* | 79.83 | 01-10-2000 | Queens Sports Club | New Zealand
    Zimbabwe | Malcolm N Waller | 99* | 133.78 | 25-10-2011 | Queens Sports Club | New Zealand
    Australia | David C Boon | 98* | 82.35 | 08-12-1994 | Bellerive Oval | Zimbabwe
    Australia | Graeme M Wood | 98* | 63.22 | 11-01-1981 | Melbourne Cricket Ground | India
    England | Ian J L Trott | 98* | 84.48 | 20-10-2011 | Punjab Cricket Association Stadium | India
    India | Yuvraj Singh | 98* | 89.09 | 01-08-2001 | Sinhalese Sports Club Ground | Sri Lanka
    Ireland | Kevin J O'Brien | 98* | 94.23 | 10-07-2010 | VRA Ground | Scotland
    Kenya | Collins O Obuya | 98* | 75.96 | 13-03-2011 | M.Chinnaswamy Stadium | Australia
    Netherlands | Ryan N ten Doeschate | 98* | 73.68 | 01-09-2009 | VRA Ground | Afghanistan
    New Zealand | James E C Franklin | 98* | 142.02 | 07-12-2010 | M.Chinnaswamy Stadium | India
    Pakistan | Ijaz Ahmed | 98* | 112.64 | 28-10-1994 | Iqbal Stadium | South Africa
    South Africa | Jacques H Kallis | 98* | 74.24 | 06-02-2000 | St George's Park | Zimbabwe
    Take every column in the data. Find the top value by that column.
    Country: South Africa has the highest strike rate of 76%
    Player: Johann Louw has the highest strike rate of 329%
    Runs: 164 runs has the highest strike rate of 156%
    MatchDate: 12-03-2006 has the highest strike rate of 136%
    Ground: AC-VDCA Stadium has the highest strike rate of 98%
    Versus: United States has the highest strike rate of 104%
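The "take every column, find the top value by that column" step can be sketched in a few lines of Python. This is only an illustration, not the tool used for the slides: the function name `top_value_per_column`, the `min_count` parameter, and the choice of a simple mean of `ScoreRate` per group are all assumptions (the slides do not say how the strike rate is aggregated). The sample rows are four rows taken from the slide's table.

```python
from collections import defaultdict

def top_value_per_column(rows, metric, min_count=1):
    """For every column except `metric`, group rows by that column's values
    and return the value with the highest mean of `metric`.

    Assumption: groups are compared by a plain mean. `min_count` (hypothetical)
    skips values that occur in fewer than that many rows.
    """
    result = {}
    columns = [c for c in rows[0] if c != metric]
    for col in columns:
        totals = defaultdict(lambda: [0.0, 0])  # value -> [sum, count]
        for row in rows:
            bucket = totals[row[col]]
            bucket[0] += row[metric]
            bucket[1] += 1
        means = {v: s / n for v, (s, n) in totals.items() if n >= min_count}
        if means:
            best = max(means, key=means.get)
            result[col] = (best, means[best])
    return result

# Four rows from the slide's table (only the columns needed here).
sample_rows = [
    {"Country": "Australia", "Player": "Michael J Clarke", "Versus": "England", "ScoreRate": 93.39},
    {"Country": "Australia", "Player": "Dean M Jones", "Versus": "Sri Lanka", "ScoreRate": 128.57},
    {"Country": "India", "Player": "Virender Sehwag", "Versus": "Sri Lanka", "ScoreRate": 99.0},
    {"Country": "New Zealand", "Player": "James E C Franklin", "Versus": "India", "ScoreRate": 142.02},
]

result = top_value_per_column(sample_rows, metric="ScoreRate")
for column, (value, mean_rate) in result.items():
    print(f"{column}: {value} has the highest score rate of {mean_rate:.2f}")
```

On real data, a higher `min_count` may be worth considering, since a value that appears in only one or two rows can top the list by chance.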
  • 41. What do children in schools know, and what can they do, at different stages of elementary education? Have the inputs made into the elementary education system had a beneficial effect or not?
  • 42. HAVING BOOKS IMPROVES READING ABILITY Having more books at home improves the performance of children when it comes to reading. (But children typically have only 1-10 books at home.) Number of students sampled. What is the impact? How many more marks can having more books fetch? Circle size indicates number of students with this response. Few students have no books. Is this response (“25+ books”) good or bad? Small red bars indicate low marks. Large green bars indicate high marks. Students having 25+ books tend to score high marks. The most common response is marked in blue. This is also the circle. The graphic is summarized in words. Indicates whether the best response is the most popular. Blue means that it is not. Green means that it is. Red means that the worst level is the most popular response.
  • 43. CHILDREN LIKE GAMES, AND THEY’RE GOOD … but playing daily hurts reading ability
  • 44. WATCHING TV OCCASIONALLY IS GOOD Children who watch TV every day don’t do as well as children who watch TV only once a week. But children who never watch TV fare the worst. Watching TV every day helps improve children’s reading ability a little bit more … but mathematical abilities fall dramatically at that point
  • 45. WE HAVE A WEBSITE THAT YOU CAN EXPLORE GRAMENER.COM/NAS
  • 46. AUTOMATING DATA EXPLORATION A structured approach to analysing data METADATA UNIVARIATE ANALYSIS BIVARIATE ANALYSIS

Editor's Notes

  • #26: We did the simplest possible thing: plot the number of customers who had meter readings of 0, 1, 2, 3, etc., all the way up to 300 and beyond. (Effectively, we drew a histogram.) As expected, it was log-normal, with relatively few users at low meter readings and few at high meter readings. But what was striking were the spikes at 50 units, 100 units, 200 units and 300 units, precisely at the slab boundaries.
    Given the metering system, there is a strong economic incentive to stay at or within a slab boundary. Exceeding it increases the unit rate. However, there are two ways this could happen. Either the consumer watches their meter carefully and, the instant it hits 100, stops using their lights and fans, or a certain amount of money changes hands.
    It was easy to see from this that there was fraud happening, but what stumped us were the spikes at 10, 20, 30, 40, etc. Here, there’s no economic incentive. There’s no significant difference between a meter reading of 10 vs 11, so there was no incentive to commit fraud. However, we later learnt that we were looking at this the wrong way. This was not a case of fraud, but of laziness. These were the meter readings taken by staff who never visited the premises and were cooking up numbers. When people cook up numbers, they cook up round numbers. (An official said that he had to let go of one person who had not taken readings in a colony of houses for as long as six months. “Sir, there’s a pack of dogs in the colony” was his official statement.)
    The other question is: what is the nature of this fraudulent contract? Is it monthly, where the meter reading guy appears and charges a small sum to adjust the reading? Or is it an annual contract that’s paid upfront? We looked at the meter readings of some of the people who were consistently at the slab boundaries. For example, the table in the middle has the readings of 10 customers, one per row. In the first row, the readings are consistently at 200 for 9 of the 12 months. However, there’s a spike in Jan-11 to 350 units. This indicated a monthly contract with a failure to pay in just one month. However, we later learnt that many of the people on this list were famous personalities. In fact, the lady in the first row had an event at her place in Jan-11, and the actual reading was expected to be well over a thousand units. But since the electricity board has a policy of not often auditing those who were in the highest slab (above 300), a more likely explanation was a collusion of the lineman with the customer to place her in the highest slab just this month, to avoid scrutiny.
    Lastly, we were examining the level at which fraud can be controlled. The last table above shows the extent of fraud of each section in one city, month on month. (The extent of fraud can be measured by the relative height of the spikes compared to the expected value.) Sections vary in the level of fraud, with Section 1 having significantly more fraud than Section 9. We also observe that fraud generally decreases in the winter season (Dec to Feb), when the need for cooling is less. But what’s most striking is the negative fraud in Section 5 in Jun-10. It stays low for a couple of months, and then, as if to compensate, shoots up to 82% in Sep-10. We learnt that this coincided with the appointment and transfer of a new section manager, under whose “regime” fraud seems to have been dramatically controlled. It appears that a good organisational level at which to control fraud is the 5,000-strong section manager level, rather than the 100,000-strong staff level.
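The note measures the extent of fraud as the relative height of a spike compared to the expected value. One way this could be computed is sketched below in Python. The note does not say how the expected value was estimated, so the sketch makes a loudly labelled assumption: the expected count at a slab boundary is the average count of the neighbouring readings. The function name `spike_excess` and the `window` parameter are hypothetical.

```python
from collections import Counter

def spike_excess(readings, boundary, window=3):
    """Relative excess of customers at `boundary` versus the expected count.

    Assumption: the expected count is the mean count of the readings within
    `window` units on either side of the boundary (the boundary itself is
    excluded). Returns (observed - expected) / expected, so 0.82 means 82%
    more customers than expected and a negative value means fewer than
    expected. Returns None when there are no neighbouring readings.
    """
    counts = Counter(readings)
    neighbours = [
        counts.get(boundary + offset, 0)
        for offset in range(-window, window + 1)
        if offset != 0
    ]
    expected = sum(neighbours) / len(neighbours)
    if expected == 0:
        return None
    return (counts.get(boundary, 0) - expected) / expected

# Synthetic readings for illustration only (not data from the study):
# 10 customers at each reading from 97 to 103, except 25 at exactly 100.
readings = [97, 98, 99, 101, 102, 103] * 10 + [100] * 25
excess_at_100 = spike_excess(readings, boundary=100)
print(f"Excess at 100 units: {excess_at_100:.0%}")
```

The same function could be pointed at the round-number readings (10, 20, 30, ...) to look for invented readings, or run per section and per month to build a table like the one the note describes.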