My Flings with Data Analysis
Venkatesh-Prasad Ranganath
Kansas State University

Dagstuhl Software Development Analytics Seminar
June 25, 2014

The described efforts were carried out at Microsoft Research and Microsoft.
Observations
1. Involve the users early
2. Don’t fear the unconventional
3. Embrace dynamic artifacts
   • Logs, core dumps, usage profiles, etc.
4. Have a feedback loop involving the users
5. Domain knowledge is a power pellet
6. Questions and answers first, methods later
7. Relevant features and simple methods work really well
8. Quick trumps slow and utility trumps quick
9. Automate (whenever possible)
10. Presentation matters
11. Models are good, simple explanations are better
Is USB 3.0/Win8 ≈ USB 2.0/Win7?
[Figure: a USB 2.0 device attached to the Win8 USB 3.0 driver stack (XHCI) and a USB 2.0 device attached to the Win7 USB 2.0 driver stack (EHCI, OHCI, UHCI); boxes labeled Driver1 and Driver2 sit on the stacks]
When a USB 2.0 device is plugged into a USB 3.0 port on Win8, the USB 3.0 stack in Win8 should behave as the USB 2.0 stack in Win7 (along both software and hardware interfaces).
Collaborators: Randy Aull, Pankaj Gupta, Jane Lawrence, Pradip Vallathol, & Eliyas Yakub
One could ...
Patterns-based Compatibility Testing
[Figure: USB2 Log → Pattern Miner → USB2 Patterns; USB3 Log → Pattern Miner → USB3 Patterns; USB2 Patterns and USB3 Patterns → Structural and Temporal Pattern Diffing]
Example patterns:
DispatchIrp forward alternates with IrpCompletion && PreIoCompleteRequest when IOCTLType=IRP_MJ_PNP(0x1B),IRP_MN_START_DEVICE(0x00), irpID=SAME, and IrpSubmitDetails.irp.ioStackLocation.control=SAME
IOCTLType=URB_FUNCTION_BULK_OR_INTERRUPT_TRANSFER(0x09) && IoCallDriverReturn && IoCallDriverReturn.irql=2 && IoCallDriverReturn.status=0xC000000E
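The mine-then-diff idea on this slide can be sketched in a few lines of Python. This is only an illustration under strong assumptions: a log is reduced to a flat list of event names, the only pattern kind mined is "a alternates with b", and the function names (`alternates`, `mine_patterns`, `diff_patterns`) and the two toy logs are hypothetical, not the miner used in the described work.

```python
from itertools import permutations

def alternates(trace, a, b):
    """True if a and b strictly alternate (a b a b ...) in the trace,
    starting with a and ending with b, with at least one full pair."""
    expected, pairs = a, 0
    for event in trace:
        if event not in (a, b):
            continue  # ignore unrelated events
        if event != expected:
            return False
        if event == b:
            pairs += 1
        expected = b if event == a else a
    return pairs > 0 and expected == a

def mine_patterns(trace):
    """Collect every ordered event pair that alternates in the trace."""
    events = set(trace)
    return {(a, b) for a, b in permutations(events, 2) if alternates(trace, a, b)}

def diff_patterns(old, new):
    """Patterns seen only with the old stack, and only with the new stack."""
    return {"missing": old - new, "added": new - old}

# Toy logs (hypothetical): the second one breaks the alternation.
usb2_log = ["DispatchIrp", "IrpCompletion", "DispatchIrp", "IrpCompletion"]
usb3_log = ["DispatchIrp", "DispatchIrp", "IrpCompletion", "IrpCompletion"]

report = diff_patterns(mine_patterns(usb2_log), mine_patterns(usb3_log))
print(report)
```

A pattern listed under "missing" is a candidate compatibility difference; whether it is a real problem still needs a human with domain knowledge to judge.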
Results
Device Id  Detected  Reported  False +ve  Unique
1          9844      478       465        10
2          2545      15        11         2
3          743       4         1          1
4          1372      2         2          0
5          26118     55        55         0
6          26126     0         0          0
7          2320      0         0          0
8          27804     2         1          1
9          34985     115       98         4
10         51556     59        56         3
11         695       0         0          0
12         1372      0         0          0
13         3315      24        23         1
14         9299      3         0          2
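To read the table at a glance, the column totals and the share of reported differences marked as false positives can be computed directly. The sketch below only does arithmetic on the numbers shown above; the slides do not define the columns, so any reading beyond the header names (for example, that "Reported" is a filtered subset of "Detected") is an assumption.

```python
# Rows copied from the Results table:
# (device_id, detected, reported, false_positive, unique)
rows = [
    (1, 9844, 478, 465, 10), (2, 2545, 15, 11, 2), (3, 743, 4, 1, 1),
    (4, 1372, 2, 2, 0), (5, 26118, 55, 55, 0), (6, 26126, 0, 0, 0),
    (7, 2320, 0, 0, 0), (8, 27804, 2, 1, 1), (9, 34985, 115, 98, 4),
    (10, 51556, 59, 56, 3), (11, 695, 0, 0, 0), (12, 1372, 0, 0, 0),
    (13, 3315, 24, 23, 1), (14, 9299, 3, 0, 2),
]

total_detected = sum(r[1] for r in rows)
total_reported = sum(r[2] for r in rows)
total_false_pos = sum(r[3] for r in rows)
total_unique = sum(r[4] for r in rows)

# Fraction of reported differences that were marked as false positives.
false_pos_share = total_false_pos / total_reported

print(total_detected, total_reported, total_false_pos, total_unique)
print(f"false-positive share of reported: {false_pos_share:.1%}")
```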
Test Suite Reduction
Experience-based Selection
Collaborators: Naren Datha, Robbie Harris, Aravind Namasivayam, & Pradip Vallathol
Patterns-based Test Suite Reduction
Cluster-n-Select
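The slides name the approach but do not spell it out, so the following is only one plausible reading of "Cluster-n-Select": group tests whose mined pattern sets are similar, then keep one test per group. Jaccard similarity, the greedy single-pass clustering, the function names, and the toy data are all assumptions made for illustration; the threshold is the manually chosen knob.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster_and_select(test_patterns, threshold):
    """Greedily cluster tests by pattern-set similarity and return
    one representative (the first member) per cluster."""
    clusters = []  # each cluster is a list of test names
    for name in sorted(test_patterns):
        for cluster in clusters:
            representative = cluster[0]
            if jaccard(test_patterns[name], test_patterns[representative]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])  # no similar cluster found
    return [cluster[0] for cluster in clusters]

# Hypothetical tests mapped to the patterns mined from their traces.
tests = {
    "t1": {"p1", "p2", "p3"},
    "t2": {"p1", "p2", "p3", "p4"},
    "t3": {"p5", "p6"},
    "t4": {"p5", "p6"},
}

selected = cluster_and_select(tests, threshold=0.7)
print(selected)
```

Raising the threshold keeps more tests (smaller reduction, less risk of dropping a bug-revealing test); lowering it does the opposite.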
Results
• 50% reduction in test suite size
• 75-80% of bugs uncovered (vs. human-based baseline)
• Fully automated (except for threshold setting)
• Flying without a safety net
Structural & Temporal Patterns (w/ Data flow)
h=fopen .... fclose(h)
(h!=0 && h=fopen) .... fclose(h)
21 = fopen("passwd.txt", "r") .... fclose(21)
fopen .... fclose
21=fopen .... fclose(21)
21=fopen(, "r") .... fclose(21)
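The difference between a purely temporal pattern (`fopen .... fclose`) and one with data flow (`(h!=0 && h=fopen) .... fclose(h)`) can be shown on a small trace. In this sketch the trace format (tuples of function name, arguments, return value), the function names, and the sample events are hypothetical.

```python
def holds_without_dataflow(trace):
    """Temporal-only check: some fopen is followed later by some fclose."""
    names = [func for func, _args, _ret in trace]
    if "fopen" not in names:
        return False
    return "fclose" in names[names.index("fopen") + 1:]

def unclosed_handles(trace):
    """Data-flow check for (h!=0 && h=fopen) .... fclose(h):
    return the handles returned by a successful fopen that are
    never passed to a later fclose."""
    open_handles = set()
    for func, args, ret in trace:
        if func == "fopen" and ret != 0:
            open_handles.add(ret)
        elif func == "fclose" and args:
            open_handles.discard(args[0])
    return open_handles

# Hypothetical trace: (function, arguments, return value)
trace = [
    ("fopen", ("passwd.txt", "r"), 21),
    ("fread", (21,), 128),
    ("fopen", ("missing.txt", "r"), 0),   # failed open, h == 0
    ("fclose", (21,), 0),
    ("fopen", ("log.txt", "w"), 22),      # never closed
]

print(holds_without_dataflow(trace))
print(unclosed_handles(trace))
```

The temporal-only pattern is satisfied by this trace, while linking the handle value through the calls exposes the one that is never closed.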
Opportunities
• Combining dynamic and static artifacts/techniques
• Time to move beyond repositories
• Using data analysis to improve techniques
• Exploring/designing languages for
  • Query
  • Visualization
• Scaling
  • Distributed Computing
  • GPUs
Questions
