Supporting Program Comprehension with Source Code Summarization

Sonia Haiduc, Jairo Aponte, Andrian Marcus
Presented By: Mohammad Masudur Rahman

Contents
• Why Code Summarization?
• Thesis Statement
• Research Questions about Summary
• Research Questions about Tool
• Automatic Code Summarization
• Evaluation
• Experiments Conducted
• Pyramid Method
• Important Findings
• My Observations & Future Work

Why Code Summarization?
• Program comprehension accounts for 50% of all maintenance work
• Two extreme approaches: skimming the code and reading it thoroughly
• Skimming leads to misunderstanding
• Reading thoroughly is time-consuming
• An intermediate solution: source code entities paired with comprehensive textual descriptions

Thesis Statement
• New idea: code summarization to help with program comprehension (PC)
• Applying text retrieval (TR) methods such as Latent Semantic Indexing (LSI) to source code summarization
• Combining structural information with the retrieved code summary to make it effective for realistic purposes

Research Questions about Code Summarization
• The summary should be generated automatically
• Summaries should be generated at different granularity levels: class, method, package, etc.
• The summary should be shorter than the source code
• It should capture and preserve code semantics and structure, drawing on both the text and the structure of the code
• It should have a consistent structure, with the most important items first

Research Questions about Code Summarization
• The summary should reflect the developer's understanding of the code
• The tool should let the user change the summary and should remember the user's choices in future summaries
• The tool should rebuild the summary when the code changes or developers provide feedback

Research Questions about Summarizer Tool
• Which summarization technique works best for source code?
• What type of structural information is necessary in a summary?
• Will the summary differ for different types of maintenance tasks?
• How long should a summary be?
• How closely will it resemble an actual, human-written summary?
• How do developers generate summaries?

Automatic Code Summarization
• Generate an extractive summary: the most important information extracted from the document

Automatic Code Summarization
• Two types of information are extracted: lexical and structural
• Lexical information: identifiers and comments are extracted
• Common English words and programming language (PL) keywords are removed
• Identifiers are split into their constituent words, and stemming is performed

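The lexical preprocessing steps listed on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' tool: the stop lists are tiny hypothetical samples, `crude_stem` is a simple suffix stripper standing in for a real stemmer (the slides do not say which stemmer was used), and all function names are assumptions.

```python
import re

# Hypothetical, deliberately tiny stop lists; a real tool would use fuller ones.
ENGLISH_STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and", "in", "for"}
JAVA_KEYWORDS = {"public", "private", "static", "void", "int", "return",
                 "new", "class", "if", "else", "while"}


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    words = []
    for part in re.split(r"[_\W]+", identifier):
        # Either an acronym run (XML) or a capitalized/lowercase word (File, get).
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [w.lower() for w in words]


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(identifiers, comments=()):
    """Turn identifiers and comment text into a list of stemmed terms."""
    raw = []
    for identifier in identifiers:
        raw.extend(split_identifier(identifier))
    for comment in comments:
        raw.extend(re.findall(r"[a-z]+", comment.lower()))
    stop = ENGLISH_STOP_WORDS | JAVA_KEYWORDS
    return [crude_stem(w) for w in raw if w not in stop]


terms = extract_terms(["getPlayListName", "public"],
                      ["Returns the name of the playlist"])
print(terms)
```

The resulting term list is what would be fed into the text corpus; duplicates are kept on purpose so that term frequency remains available to the retrieval step.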
Automatic Code Summarization
• The extracted lexical information forms the text corpus of the code, to which TR methods (e.g. LSI) are applied to obtain the n most important words
• Once retrieved, the n words are combined with structural information such as class name, method name, package name, and parameter names and types
• How to apply structural information to the auto-generated summary is an important part of the problem

Automatic Code Summarization
• A method name describes what the method does
• If the TR method leaves the method name out, the tool can introduce it automatically
• Additional information, such as user tags, can be added

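One possible way to reintroduce a method name that the retrieval step left out is sketched below. The function names are hypothetical, stemming is omitted for brevity, and placing the method-name words first is an assumption based on the earlier requirement that important items come first.

```python
import re


def method_name_words(method_name):
    """Split a camelCase or snake_case method name into lowercase words."""
    pattern = r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+"
    return [w.lower() for w in re.findall(pattern, method_name)]


def ensure_method_name(summary_terms, method_name):
    """Prepend method-name words that are missing from the summary."""
    missing = [w for w in method_name_words(method_name)
               if w not in summary_terms]
    # Method-name words go first so the most important items lead the summary.
    return missing + list(summary_terms)


summary = ensure_method_name(["list", "file", "repository"], "getPlayList")
print(summary)  # ['get', 'play', 'list', 'file', 'repository']
```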
Evaluation
• Two types: intrinsic and extrinsic
• Intrinsic: content evaluation, i.e. how closely the summary depicts the document or how close it is to a manually generated summary
• Metrics: precision, recall, pyramid method
• Extrinsic: how much utility and usability the summary has in supporting SE tasks such as concept location, impact analysis, software reuse, and traceability link recovery

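As a small illustration of the intrinsic metrics, precision and recall of an automatic summary against a manually written one can be computed over term sets. The example terms are invented for illustration and are not taken from the study.

```python
def precision_recall(automatic_terms, reference_terms):
    """Precision and recall of automatic summary terms against a manual summary."""
    automatic, reference = set(automatic_terms), set(reference_terms)
    overlap = automatic & reference
    precision = len(overlap) / len(automatic) if automatic else 0.0
    recall = len(overlap) / len(reference) if reference else 0.0
    return precision, recall


# Invented example: 2 of the 5 automatic terms appear in the 4-term manual summary.
p, r = precision_recall(["play", "list", "file", "name", "get"],
                        ["play", "list", "add", "song"])
print(p, r)  # 0.4 0.5
```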
Experiments Conducted
• Pyramid method
• ATunes open-source (OS) project, 12 methods
• 6 developers from different geographic locations, undergraduate students, 3 years of Java programming experience
• Developers were given a list of terms and had to choose the 5 terms that best suit each method, with 60 minutes in total

Experiments Conducted
• The corpus contains the whole code vocabulary
• Each method is a separate document
• LSI indexes the corpus against the terms of each method
• The cosine measure is computed between each corpus word and the method, and the corpus words are ranked
• The top 5 words from the corpus are chosen

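The ranking procedure on this slide can be sketched with a truncated SVD. This is a minimal sketch under several assumptions: it uses NumPy, raw term counts as weights, and a hypothetical number of LSI dimensions `k`; the slides do not state the weighting scheme or the dimensionality used in the experiment, and the toy corpus is invented.

```python
import numpy as np


def lsi_top_terms(documents, doc_index, k=2, n=5):
    """Rank corpus terms by cosine similarity to one document in LSI space.

    documents: list of term lists, one per method (each method is a document).
    """
    vocab = sorted({t for doc in documents for t in doc})
    row = {t: i for i, t in enumerate(vocab)}

    # Term-by-document matrix of raw counts (the weighting is an assumption).
    A = np.zeros((len(vocab), len(documents)))
    for j, doc in enumerate(documents):
        for t in doc:
            A[row[t], j] += 1.0

    # Truncated SVD: keep only the k largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    term_vecs = U[:, :k] * s[:k]          # one row per term
    doc_vec = Vt[:k, doc_index] * s[:k]   # the chosen method

    # Cosine similarity between every term vector and the method vector.
    norms = np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(doc_vec)
    sims = (term_vecs @ doc_vec) / np.where(norms == 0, 1.0, norms)

    order = np.argsort(-sims)
    return [vocab[i] for i in order[:n]]


# Invented toy corpus: two "player" methods and two "file" methods.
docs = [
    ["play", "list", "song", "play"],
    ["play", "list", "song"],
    ["file", "save", "path"],
    ["file", "save"],
]
print(lsi_top_terms(docs, doc_index=0, k=2, n=3))
```

In the toy corpus the two groups of methods share no vocabulary, so the top three terms for the first method come from the player group and those for the third method from the file group.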
Pyramid Method
• Pyramid score of a summary A = (score achieved by A) / (maximum score A could achieve)

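A rough sketch of this ratio is given below. It assumes the usual pyramid-method weighting, in which each term's weight is the number of developers who selected it, and it takes the maximum as the sum of the highest weights for a summary of the same length; the slide itself states only the ratio. The developer choices are invented.

```python
from collections import Counter


def pyramid_score(automatic_terms, developer_choices):
    """Ratio of the weight an automatic summary achieves to the best achievable.

    developer_choices: one list of chosen terms per developer.
    A term's weight is the number of developers who chose it (an assumption).
    """
    weights = Counter(t for choice in developer_choices for t in set(choice))
    chosen = set(automatic_terms)
    achieved = sum(weights[t] for t in chosen)
    # Best possible total for a summary with the same number of terms.
    best = sum(sorted(weights.values(), reverse=True)[:len(chosen)])
    return achieved / best if best else 0.0


# Invented example with three developers.
choices = [
    ["play", "list", "add"],
    ["play", "list", "song"],
    ["play", "song", "file"],
]
# Weights: play=3, list=2, song=2, add=1, file=1.
print(pyramid_score(["play", "file", "name"], choices))  # (3 + 1 + 0) / (3 + 2 + 2)
```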
Pyramid Score
[Figure: pyramid scores; image not recoverable]

Important Findings
• Pyramid scores ranged from 0.1 to 0.5, which was regarded as encouraging
• Words chosen by developers: 98.7% in the method name, 88.9% in the class name, and 84.6% in parameter names
• Automatic summary terms: 20% in the method name, 12.9% in the class name, and 30.7% in parameter names
• Structural information should be properly taken into account in automatic summaries
• Comment text is not included in the summary

My Observations & Future Work
• The corpus construction technique is not well specified; nothing is said about how redundancy is handled
• LSI focuses on term frequency rather than structural information, which leads to poor scores
• During the cosine measurement, the structural role of a term in the method could be taken into account to get better results
• There should be some heuristic measure for structural information

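One way such a heuristic could look is sketched below: cosine scores are multiplied by a boost factor when a term also appears in a structural position. The boost values, role names, and function name are all hypothetical; the slides propose the idea but specify no concrete scheme.

```python
# Hypothetical boost factors; the slides do not specify any values.
STRUCTURAL_BOOST = {"method_name": 2.0, "class_name": 1.5, "parameter_name": 1.5}


def rerank_with_structure(term_scores, term_roles, n=5):
    """Re-rank terms after boosting those that occur in structural positions.

    term_scores: term -> cosine score from the retrieval step.
    term_roles:  term -> set of structural roles the term appears in.
    """
    boosted = {}
    for term, score in term_scores.items():
        roles = term_roles.get(term, ())
        factor = max((STRUCTURAL_BOOST.get(r, 1.0) for r in roles), default=1.0)
        boosted[term] = score * factor
    # Highest boosted score first; ties broken alphabetically.
    return sorted(boosted, key=lambda t: (-boosted[t], t))[:n]


scores = {"file": 0.9, "play": 0.5, "list": 0.3, "tmp": 0.6}
roles = {"play": {"method_name"}, "list": {"class_name"}}
print(rerank_with_structure(scores, roles, n=3))  # ['play', 'file', 'tmp']
```

Without the boost the top three would be file, tmp, play; with it, the method-name term moves to the front.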
Thank You
Questions?

