Big Data: the weakest link
Vivek Nair, Tim Menzies
{vivekaxl,tim.menzies}@gmail.com
HPCC Eng. Summit - Sept 29, 2015
Where is the weakest link?
Premise of Big Data
Analysis is a “systems” task?
• Better conclusions = same algorithms + more data + more CPU
• If so, then …
– No role for human error
– All insight is auto-generated from CPUs.
Analysis is a “human” task?
• Current results on “software analytics”
– A human-intensive process
Q: Is Big Data a “Systems” or “Human”-task?
A: Yes
Code used in my last paper
(1,100 LOC of Python calling scikit-learn)
Use a higher-level language?
• ECL solves this problem?
• But if you can write it quick,
– you can write it wrong, quick.
Is this really a problem?
• Q: What would we expect to see if…
– Top experts, publishing in top journals
– Many of the same data sets
– 8 years of trying
• A:
– Perhaps some upward progress
– Perhaps a little less variance
So, what do we see?
• Software analytics
– Defect prediction
– Many of the same learners
– Many of the same data sets
• 42 papers, top journals
• 23 author groups
• 2002 to 2010
• Y-axis measures mean performance
Martin Shepperd, David Bowes, and Tracy Hall. "Researcher Bias: The Use of Machine Learning in Software Defect Prediction." IEEE Transactions on Software Engineering, 40(6), June 2014.
http://fivethirtyeight.com/features/science-isnt-broken/
A little theory
• James D. Herbsleb, CMU
• Socio-Technical Coordination
• A predictor for higher defects:
– Groups of programmers work on similar functions,
– but do not share that expertise
Q: How to find expertise groups within the HPCC community?
A: Using data mining
Static features and commit history can act as a cue for expertise
● Our motivation
o “relation between embodiment and language acquisition by locating the ‘minimal set of necessary features’ that enable language of any kind to be learned” - The Philosophy of Expertise
Software analytics results: learn predictors for expertise
● “...counts of the cumulative number of different developers changing a file over its lifetime can help to improve defect predictions…” [1]
● “Quantify person's experience with a part of code using change history of the code” [2]
● “RevFinder, a file location-based code-reviewer recommendation approach” [3]
● “30% of its code entities has more than 0.3 of similarity with at least one developer vocabulary” [4]
[1] Ostrand, Thomas J., Elaine J. Weyuker, and Robert M. Bell. "Programmer-based fault prediction." Proceedings of the 6th International Conference on Predictive Models in Software Engineering. ACM, 2010.
[2] Mockus, Audris, and James D. Herbsleb. "Expertise browser: a quantitative approach to identifying expertise." Proceedings of the 24th International Conference on Software Engineering. ACM, 2002.
[3] Thongtanunam, Patanamon, et al. "Who should review my code? A file location-based code-reviewer recommendation approach for Modern Code Review." Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on. IEEE, 2015.
[4] Santos, Katyusco de F., Dalton DS Guerrero, and Jorge CA de Figueiredo. "Using Developers Contributions on Software Vocabularies to Identify Experts." Information Technology-New Generations (ITNG), 2015 12th International Conference on. IEEE, 2015.
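The measure quoted from [1], the cumulative number of different developers who have changed a file, can be sketched in a few lines. This is only an illustration: the commit log, the author names, and the function `cumulative_developer_counts` are made up for the example and are not taken from the cited papers.

```python
from collections import defaultdict

# Hypothetical commit log in chronological order: (author, files touched).
COMMITS = [
    ("alice", ["parser.py", "lexer.py"]),
    ("bob", ["parser.py"]),
    ("alice", ["parser.py"]),
    ("carol", ["parser.py", "cli.py"]),
]


def cumulative_developer_counts(commits):
    """For each file, record the running count of distinct authors after each change."""
    seen = defaultdict(set)      # file -> authors who have touched it so far
    history = defaultdict(list)  # file -> distinct-author count after each change
    for author, files in commits:
        for path in files:
            seen[path].add(author)
            history[path].append(len(seen[path]))
    return dict(history)


history = cumulative_developer_counts(COMMITS)
for path, counts in history.items():
    print(path, counts)
```

The last value in each list is the lifetime count for that file; the whole list shows how quickly new developers arrived.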
Q: And what data mining suite will we use to mine data about programmers?
• A: Need you ask?
Source Code
But what are we clustering?
Developer products
• Lightweight parsing of source code
• Developer profiles, accessed via social media sites
Languages Used
Skill Set (self promotion)
Data processing
1. GitHub repos (for code) ➔ social media (for years of work)
2. Static code analysis: frequency counts of AST features (e.g. counts of loops, returns, variable comparisons, map, etc.)
3. Bayes classifier
Early career
Later career
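Step 2 above, frequency counts of AST features, might look roughly like the sketch below for Python source. It uses the standard-library `ast` module; the feature set in `FEATURES`, the `map_calls` counter, and the `SAMPLE` snippet are assumptions chosen for illustration, not the feature list used in the talk.

```python
import ast
from collections import Counter

# Assumed feature set: which AST node types to count (hypothetical, for illustration).
FEATURES = {
    "loops": (ast.For, ast.While),
    "returns": (ast.Return,),
    "comparisons": (ast.Compare,),
}


def ast_feature_counts(source):
    """Count selected AST node types in a string of Python source code."""
    counts = Counter({name: 0 for name in FEATURES})
    counts["map_calls"] = 0
    for node in ast.walk(ast.parse(source)):
        for name, node_types in FEATURES.items():
            if isinstance(node, node_types):
                counts[name] += 1
        # Direct calls to the name `map`, e.g. map(f, xs).
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "map"):
            counts["map_calls"] += 1
    return counts


SAMPLE = '''
def evens(xs):
    out = []
    for x in xs:
        if x % 2 == 0:
            out.append(x)
    return out

doubled = list(map(lambda v: v * 2, evens(range(10))))
'''

counts = ast_feature_counts(SAMPLE)
print(dict(counts))
```

Each source file becomes one row of counts, which is the kind of input a classifier such as the one in step 3 can consume.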
Classification
- Features: nodes of AST
- Algorithms used: Simple CART, Random Forest, Naive Bayes, etc.
- Can distinguish expert from novice programmers
• precision = 78% early career
• precision = 74% later career
* Using Weka
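To make the classification step and the per-class precision figures concrete, here is a hand-rolled multinomial Naive Bayes in plain Python. It is a stand-in for illustration only: the talk used Weka, and the feature rows and labels below are invented toy data, so the precision values this sketch produces say nothing about the reported 78% and 74%.

```python
import math
from collections import defaultdict


def train_nb(rows, labels):
    """Fit multinomial Naive Bayes with add-one (Laplace) smoothing."""
    n_features = len(rows[0])
    class_rows = defaultdict(int)
    totals = defaultdict(lambda: [0] * n_features)
    for row, label in zip(rows, labels):
        class_rows[label] += 1
        for i, value in enumerate(row):
            totals[label][i] += value
    model = {}
    for label, feature_totals in totals.items():
        denom = sum(feature_totals) + n_features
        log_prior = math.log(class_rows[label] / len(rows))
        log_likes = [math.log((t + 1) / denom) for t in feature_totals]
        model[label] = (log_prior, log_likes)
    return model


def predict_nb(model, row):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_likes = model[label]
        return log_prior + sum(v * ll for v, ll in zip(row, log_likes))
    return max(model, key=score)


def precision(actual, predicted, target):
    """Fraction of rows predicted as `target` that really are `target`."""
    hits = [a for a, p in zip(actual, predicted) if p == target]
    return sum(a == target for a in hits) / len(hits) if hits else 0.0


# Invented toy data; columns: loops, returns, comparisons, map calls.
train_rows = [[5, 1, 4, 0], [6, 2, 5, 0], [4, 1, 3, 1],
              [1, 3, 1, 4], [2, 4, 1, 5], [1, 2, 2, 3]]
train_labels = ["early", "early", "early", "later", "later", "later"]
test_rows = [[5, 1, 3, 0], [1, 3, 1, 4], [2, 2, 2, 3]]
test_labels = ["early", "later", "early"]

model = train_nb(train_rows, train_labels)
predicted = [predict_nb(model, row) for row in test_rows]
for target in ("early", "later"):
    print(target, precision(test_labels, predicted, target))
```

Reporting precision separately for each class, as the slide does, shows whether the classifier is more trustworthy when it says "early career" or when it says "later career".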
Current status
The good news
• Can auto-find groups of better programmers
• Can do that for very large data sets
– The ECL advantages
The other news
• Seeking larger data sets
• Talking to HackerRank
• Looking at ways to instrument the HPCC forums
– Matchmaker tools
– Affinity groups
Where is the weakest link?
We can make that link stronger
Acknowledgements:
Thanks to funding from LexisNexis
Analyzing Big Data's Weakest Link (hint: it might be you)
