Development Emails Content Analyzer:
Intention Mining in Developer Discussions
Andrea Di Sorbo
Sebastiano Panichella
Corrado Visaggio
Massimiliano Di Penta
Gerardo Canfora
Harald Gall
Outline

Context:
Written Development Discussions

Case Study:
Development Mailing Lists of 2 Open Source Projects

Results:
Automatic Classification of Relevant Contents in Developers’ Communication
Open  Source  (OS)  and    
Industrial  Projects  	
3
Development Communication Means
Recommender systems:
- Bug Triaging [1]
- Suggest Mentors [2]
- Code re-documentation [3]
- Etc.
[1] Anvik et al. “Who should fix this bug?”.
[2] Canfora et al. “Who is going to mentor newcomers in open source projects?”.
[3] Panichella et al. “Mining source code descriptions from developer communications”.
Development Communication Means
[1] Bacchelli et al. “Content classification of development emails”.
[2] Cerulo et al. “A Hidden Markov Model to detect coded information islands in free text”.
Different Kinds of Data
Structured
Semi-Structured
Unstructured
A Considerable Effort for Developers
Many messages
Developers get lost in unnecessary details, missing potentially useful information…
Previous Work
Hana et al.:
“… ‘Lazy’ RTC occurs when a core developer posts a change to a mailing list and nobody responds; it is assumed that other developers reviewed the code…”
Approaches for:
- Generating summaries of emails.
    → Lam et al.
    → Rambow et al.
- Generating summaries of bug reports.
    → Rastkar et al.
Different  Purposes  
	
Feature  requests	
Bug  disclosures	
Project  Management	
14
DECA (Development Email Content Analyzer)
An approach to Classify Paragraphs According to Intentions
http://www.ifi.uzh.ch/seal/people/panichella/tools/DECA.html
Why use NLP for Classifying Paragraphs According to Intentions?
Example
i. We could use a leaky bucket algorithm to limit the bandwidth
ii. The leaky bucket algorithm fails in limiting the bandwidth
The two sentences:
- have a high percentage of words in common
- discuss the same topics
- have different intentions
“Techniques based on lexicon analysis, such as VSM [1], LSI [2], or LDA [3] would not be sufficient to classify paragraphs according to intentions”.
[1] Baeza-Yates et al. “Modern Information Retrieval”.
[2] de Marneffe et al., “The Stanford typed dependencies representation”.
[3] Blei et al., “Latent dirichlet allocation”.
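The lexical overlap between the two example sentences can be made concrete with a small sketch. It assumes a plain bag-of-words representation with raw term frequencies and cosine similarity, which is only one simple instance of a lexicon-based technique; the slides do not specify the weighting or preprocessing used, and the helper names below are illustrative.

```python
import math
import re
from collections import Counter


def term_frequencies(sentence: str) -> Counter:
    """Lowercase the sentence and count its word tokens."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# The two sentences from the example slide.
sentence_i = "We could use a leaky bucket algorithm to limit the bandwidth"
sentence_ii = "The leaky bucket algorithm fails in limiting the bandwidth"

tf_i = term_frequencies(sentence_i)
tf_ii = term_frequencies(sentence_ii)

similarity = cosine_similarity(tf_i, tf_ii)
shared_terms = sorted(tf_i.keys() & tf_ii.keys())

print(f"cosine similarity: {similarity:.3f}")
print("shared terms:", shared_terms)
```

Under these assumptions the two sentences score above 0.5 although one proposes a solution and the other reports a failure, which is the gap that word-level techniques leave open.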
Perspective  
	
22
Case Study
Goal: Understanding to what extent NL parsing could be used in recognizing informative text fragments in emails from a software maintenance and evolution perspective

Quality focus: Detection of text paragraphs in development discussions containing helpful information for developers.

Perspective: Guide developers in maintaining and evolving their products.
Research Questions
RQ1: Can an NLP approach (i.e. DECA) be effective in classifying writers’ intentions in development emails?
RQ2: Is DECA more effective than existing Machine Learning techniques in classifying development emails content?
Context
Qt
Ubuntu
STEPS:
1) Taxonomy Definition
2) Classification Based on DECA (NLP Analyzer)
Taxonomy Definition
27
Sampling  	
	
We  selected  100	
Of  the                            Project      	
28
Clustering
Clusters
Implementation
Technical Infrastructure
Project Status
Social Interactions
Usage
Discarded
Guzzi et al., MSR 2013
Clustering
Guzzi et al., ICSE 2012
The  final  taxonomy	
	
31
Differences with Guzzi et al.
	
32
Examples	
	
33
Natural Language
Parsing
DECA  
(Development  Email  Content  Analyzer)	
34
Recurrent Linguistic Patterns
	
35
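Why NL parsing?
Well defined predicate-argument structures

Sentence i (root: use)
  nsubj(use, we)
  aux(use, could)
  dobj(use, algorithm)
  det(algorithm, a)
  amod(algorithm, leaky)
  nn(algorithm, bucket)
  xcomp(use, limit)
  aux(limit, to)
  dobj(limit, bandwidth)
  det(bandwidth, the)

Sentence ii (root: fails)
  nsubj(fails, algorithm)
  det(algorithm, the)
  amod(algorithm, leaky)
  nn(algorithm, bucket)
  prep(fails, in)
  pcomp(in, limiting)
  dobj(limiting, bandwidth)
  det(bandwidth, the)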
NL parsing
Natural Language Templates
  [someone] could use [something]: nsubj(use, [someone]), aux(use, could), dobj(use, [something])
  [something] fails: nsubj(fails, [something])
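A template of this kind can be read as a set of typed dependencies in which some slots are wildcards. The sketch below is a minimal illustration of that idea, not DECA's implementation: the dependency triples are written by hand from the two example sentences (no parser is called), and the labels PROPOSAL and PROBLEM are hypothetical names chosen here, since the slide text does not name the categories.

```python
from typing import List, NamedTuple, Optional, Tuple


class Dep(NamedTuple):
    relation: str   # typed-dependency label, e.g. "nsubj"
    head: str       # governing word
    dependent: str  # dependent word


# Hand-written parses of the two example sentences (main clause only).
PARSE_I = [
    Dep("nsubj", "use", "we"),
    Dep("aux", "use", "could"),
    Dep("dobj", "use", "algorithm"),
    Dep("xcomp", "use", "limit"),
]
PARSE_II = [
    Dep("nsubj", "fails", "algorithm"),
    Dep("prep", "fails", "in"),
]

# A pattern lists required (relation, head, dependent) triples;
# None in the dependent position is a wildcard slot such as [someone].
Pattern = List[Tuple[str, str, Optional[str]]]

# Hypothetical labels, introduced only for this illustration.
TEMPLATES: List[Tuple[str, Pattern]] = [
    ("PROPOSAL", [("nsubj", "use", None), ("aux", "use", "could"), ("dobj", "use", None)]),
    ("PROBLEM", [("nsubj", "fails", None)]),
]


def matches(parse: List[Dep], pattern: Pattern) -> bool:
    """True if every triple required by the pattern occurs in the parse."""
    return all(
        any(
            d.relation == rel and d.head == head and (dep is None or d.dependent == dep)
            for d in parse
        )
        for rel, head, dep in pattern
    )


def classify(parse: List[Dep]) -> Optional[str]:
    """Return the label of the first matching template, or None."""
    for label, pattern in TEMPLATES:
        if matches(parse, pattern):
            return label
    return None


print(classify(PARSE_I))   # PROPOSAL
print(classify(PARSE_II))  # PROBLEM
```

Because the match is on grammatical roles rather than on shared vocabulary, the two sentences end up with different labels even though most of their words coincide.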
NLP  Heuristics	
	
40
NLP  Parser	
	
raw  text	
 NLP  parser	
 NLP  heuristics	
41
RQ1:  	
Is  DECA  effective  in  	
classifying  writers’  intentions  in  
development  emails?	
	
44
Experiment I
[Figure: training and test sets; values shown: 102, 87, 100; a "False Negative" label links to Experiment II]
Experiment II
[Figure: training and test sets; values shown: 100, 169, 100; a "False Negative" label links to Experiment III]
Experiment III
[Figure: training and test sets; values shown: 100, 231, 100]
RQ2:  	
Is  the  proposed  approach  more  
effective  than  existing  ML  in  classifying  
development  emails  content?	
	
55
ML for Email Classification
An Approach Based on ML for Email Content Classification
1) Text Features
2) Split training and test sets
3) Oracle building
4) Classification (training, prediction)
    → Antoniol et al., CASCON 2008
    → Zhou et al., ICSME 2014
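The four steps above can be sketched end to end on toy data. Everything specific in this sketch is an assumption made for illustration: the paragraphs and their labels are invented, the features are raw word counts, and the classifier is a small multinomial Naive Bayes with Laplace smoothing, which is not necessarily among the techniques evaluated in the study.

```python
import math
import re
from collections import Counter, defaultdict


def text_features(paragraph):
    """Step 1: bag-of-words term counts."""
    return Counter(re.findall(r"[a-z]+", paragraph.lower()))


# Step 3 (oracle): invented paragraphs with hand-assigned, hypothetical labels.
ORACLE = [
    ("we could add a cache to speed up the build", "proposal"),
    ("we should use a timeout for the request", "proposal"),
    ("maybe we could switch to the new api", "proposal"),
    ("the build fails on the latest commit", "problem"),
    ("the parser crashes with empty input", "problem"),
    ("login fails after the upgrade", "problem"),
    ("we could use a retry for the upload", "proposal"),
    ("the upload crashes and fails", "problem"),
]

# Step 2: split into training and test sets.
train, test = ORACLE[:6], ORACLE[6:]


def train_naive_bayes(examples):
    """Step 4a: collect per-class document and term counts."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        features = text_features(text)
        class_counts[label] += 1
        term_counts[label].update(features)
        vocabulary.update(features)
    return class_counts, term_counts, vocabulary


def predict(model, text):
    """Step 4b: pick the class with the highest log-posterior."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)
        for term, freq in text_features(text).items():
            if term not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            prob = (term_counts[label][term] + 1) / (total_terms + len(vocabulary))
            score += freq * math.log(prob)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(train)
predictions = [predict(model, text) for text, _ in test]
accuracy = sum(p == gold for p, (_, gold) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

A classifier of this kind sees only word counts, so it depends on vocabulary that happens to correlate with each label in the training set.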
Summary
• RQ1: the automatic classification performed by DECA achieves very good results in terms of precision, recall, and F-measure (over all the experiments).
• RQ2: DECA outperforms traditional ML techniques in terms of recall, precision, and F-measure when classifying e-mail content.
“…it took the MSR community more than 10 years to figure out that machine learning is not the best method for analyzing human-written text. Thank you for helping move the field forward…” [One of the ASE Reviewers]
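For reference, the three metrics named in the summary can be computed from true positives, false positives, and false negatives as in the sketch below. The counts in the usage lines are made up for illustration and are not results from the study; the F-measure here is the balanced F1 (harmonic mean of precision and recall), which is an assumption about the variant intended.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) for one category."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Illustrative counts only: 8 correct detections, 2 spurious, 4 missed.
precision, recall, f1 = precision_recall_f1(8, 2, 4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```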
Code re-documentation
→ Panichella et al., ICPC 2012
Extract methods’ descriptions from developers’ discussions
→ Vector Space Models
→ ad hoc heuristics
“… several are the discourse patterns that characterize false negative method descriptions…”
Conclusion	
	
81
Future work
1) DECA as preprocessing support to discard irrelevant sentences in summarization approaches
2) DECA in combination with topic models for mining contents with the same intentions and the same topics
