SUPPORTING SOURCE CODE SEARCH WITH
CONTEXT-AWARE AND SEMANTICS-DRIVEN
QUERY REFORMULATION
Masud Rahman, PhD Candidate
Department of Computer Science
University of Saskatchewan, Canada
Advisor: Dr. Chanchal Roy
@masud2336
TALK OUTLINE
Part 1: Research Problem
Part 2: PhD Thesis
Part 3: Bug Doctor Demo
Part 4: Q&A + Discussions
MasudRahman,PhDCandidate,UofS
Part 1: Research Problem
Story I
Software Bugs
Story II
Software Features
Story III
Software Bugs → Bug Resolution
Software Features → Feature Enhancement
20% 60%
AN EXAMPLE BUG REPORT &
BUG LOCALIZATION
Bug Localization
SEARCH QUERY REFORMULATION FOR
BUG LOCALIZATION
Bug report (127 words) → Keyword selection → Search query:
JDIValue, toString, execute, EvaluationThread, run, NullPointerException
able cast null
Rank of the correct result: 53 (baseline query) → 1 (reformulated query)
AN EXAMPLE CHANGE REQUEST &
CONCEPT LOCATION
Concept Location
SEARCH QUERY REFORMULATION FOR
CONCEPT LOCATION
Change request (56 words) → Keyword selection → Search query:
element IResource Level Tree Provider
Rank of the correct result: 14 → 1
RELEVANT CODE SEARCH ON THE WEB
Convert image to gray scale without losing transparency
BufferedImage Grayscale ImageEdit ColorConvertOp File
Transparency ColorSpace BufferedImageOp Graphics ImageEffects
PART 1: SUMMARY
(1) Bug Localization (2) Concept Location
(3) Code Search on the Web
Part 2: PhD Thesis
SYSTEMATIC LITERATURE REVIEW
ACM DL
CrossRef
DBLP
Mendeley
Google Scholar
IEEE Xplore
ProQuest
ScienceDirect
SpringerLink
Web of Science
Wiley Online Lib
Initial results: 2871
After impurity removal: 2317
After filtering by title: 562
After filtering by abstract: 195
After merging & duplicate removal: 93
After filtering by full texts: 56 primary studies
Query reformulation, query expansion, query reduction, query formulation,
query refinement, automated query expansion, AQE, query suggestion,
query recommendation, term selection, query replacement, query difficulty,
query quality, keyword selection, keyword extraction, search term
identification, search query, search term, and search keyword.
PHD THESIS OVERVIEW
(1) Concept Location: S1 (STRICT), S2 (ACER)
(2) Bug Localization: S3 (BLIZZARD), S4 (BLADER)
(3) Internet-scale Code Search: S5 (RACK), S6 (NLP2API)
BLIZZARD
Search Query Reformulation for Bug Localization
using Report Quality Dynamics & Graph-based
Term Weighting
[ESEC/FSE 2018]
BUG REPORT QUALITY: A CLOSER LOOK
5000+
Noisy Poor Rich
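The three report classes can be told apart with simple text patterns. The following is a minimal sketch, assuming that a noisy report is one that contains Java stack trace entries, a rich report is one that contains program elements (here approximated by camelCase identifiers or method calls) but no stack trace, and a poor report contains neither. Both regular expressions and the function name `classify_report` are illustrative assumptions, not the patterns used in the study.

```python
import re

# Assumed pattern for a Java stack trace entry, e.g. "at pkg.Class.method(File.java:42)".
STACK_TRACE = re.compile(r"^\s*at\s+[\w.$]+\.[\w$<>]+\(.*\)\s*$", re.MULTILINE)
# Assumed hints of program elements: camelCase identifiers or calls such as foo().
PROGRAM_ELEMENT = re.compile(r"\b[a-z]+[A-Z]\w*\b|\b\w+\(\)")


def classify_report(text: str) -> str:
    """Return 'noisy', 'rich' or 'poor' for a bug report text."""
    if STACK_TRACE.search(text):
        return "noisy"  # contains at least one stack trace entry
    if PROGRAM_ELEMENT.search(text):
        return "rich"   # contains program elements but no stack trace
    return "poor"       # plain natural language only
```

The class decides which reformulation strategy is applied afterwards, so in practice the patterns would need tuning against real reports.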
A NOISY BUG REPORT
Stack traces
500 keywords
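STEP I: TRACE GRAPH CONSTRUCTION
[Figure: trace graph built from stack trace entries i and j. Each entry contributes a package (Pi, Pj), a class (Ci, Cj) and a method (Mi, Mj). Static edges link the class and method of one entry; hierarchical edges link classes and methods across consecutive entries.]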
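A trace graph of this kind could be built roughly as follows. This is a sketch under stated assumptions: trace entries follow the usual Java format `at package.Class.method(File.java:line)`, a static edge runs from the class to the method of the same entry, and hierarchical edges run from the caller entry (the lower line) to the callee entry (the line above it), class to class and method to method. The exact edge scheme and the function names `parse_trace` and `build_trace_graph` are assumptions made for illustration.

```python
import re

# Assumed format of a Java stack trace entry: "at package.Class.method(File.java:line)".
ENTRY = re.compile(r"\bat\s+((?:[\w$]+\.)*)([\w$]+)\.([\w$<>]+)\(")


def parse_trace(trace: str):
    """Return one (package, class, method) triple per trace entry."""
    return [(m.group(1).rstrip("."), m.group(2), m.group(3))
            for m in ENTRY.finditer(trace)]


def build_trace_graph(entries):
    """Build a directed graph as {node: set(successors)} from trace entries."""
    graph = {}
    for _, cls, method in entries:
        graph.setdefault(cls, set()).add(method)  # static edge within one entry
        graph.setdefault(method, set())           # make sure the method node exists
    # Hierarchical edges: the lower entry is the caller of the entry above it.
    for (_, callee_cls, callee_m), (_, caller_cls, caller_m) in zip(entries, entries[1:]):
        if caller_cls != callee_cls:
            graph[caller_cls].add(callee_cls)
        if caller_m != callee_m:
            graph[caller_m].add(callee_m)
    return graph
```

Pointing edges from caller to callee means that frequently called classes and methods collect more incoming links, which matters once the nodes are ranked by connectivity.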
STEP II: SEARCH KEYWORD SELECTION FROM TRACE GRAPH
PageRank Algorithm:
$$S(v_i) = (1 - \phi) + \phi \sum_{j \in In(v_i)} \frac{S(v_j)}{|Out(v_j)|} \qquad (0 \le \phi \le 1)$$
[Figure: trace graph with nodes Ci, Cj, Mk, Mn and Cp]
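The recursive scoring can be illustrated with a small iterative PageRank over a graph stored as `{node: set(successors)}`. This is a minimal sketch, not the implementation behind the slides: the damping value of 0.85, the iteration limit, the tolerance, and the helper `top_keywords` are all assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=100, tol=1e-6):
    """Score each node v with S(v) = (1 - d) + d * sum(S(u) / |Out(u)|) over incoming u."""
    nodes = set(graph) | {v for succ in graph.values() for v in succ}
    if not nodes:
        return {}
    incoming = {v: [] for v in nodes}
    for u, succ in graph.items():
        for v in succ:
            incoming[v].append(u)
    scores = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        new = {
            v: (1 - damping)
            + damping * sum(scores[u] / len(graph[u]) for u in incoming[v])
            for v in nodes
        }
        converged = max(abs(new[v] - scores[v]) for v in nodes) < tol
        scores = new
        if converged:
            break
    return scores


def top_keywords(graph, k=5):
    """Return the k highest-scoring nodes; ties are broken alphabetically."""
    scores = pagerank(graph)
    return sorted(scores, key=lambda v: (-scores[v], v))[:k]
```

A node with no incoming edges settles at the base score `1 - damping`, while a node that receives links from well-connected nodes inherits part of their score, which is the "votes" intuition in the talk.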
SEARCH QUERY FROM A NOISY BUG REPORT
Bug 31637 – should be able to cast null
NullPointerException
Ci Cj Mk Mn Cp
Rank of the correct result: 53 (baseline query) → 1 (reformulated query)
A POOR BUG REPORT
15 keywords
SEARCH KEYWORDS FROM SOURCE CODE
resolveRuntimeClasspathEntry
Resolve Runtime Classpath Entry
$$S(v_i) = (1 - \phi) + \phi \sum_{j \in In(v_i)} \frac{S(v_j)}{|Out(v_j)|} \qquad (0 \le \phi \le 1)$$
[Figure: term graph with nodes launch, debug, resolve, required and classpath]
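Splitting identifiers and linking the resulting terms might look like the sketch below. It assumes camelCase identifiers and treats two terms as dependent when they occur in the same identifier; that linking rule, and the names `split_identifier` and `build_term_graph`, are illustrative assumptions rather than the exact dependency definition used in the thesis.

```python
import re
from itertools import combinations


def split_identifier(identifier: str):
    """Split a camelCase identifier into lower-case terms."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]


def build_term_graph(identifiers):
    """Link terms that occur in the same identifier, as {term: set(neighbours)}."""
    graph = {}
    for identifier in identifiers:
        terms = split_identifier(identifier)
        for term in terms:
            graph.setdefault(term, set())
        for a, b in combinations(set(terms), 2):
            graph[a].add(b)  # undirected: store the edge in both directions
            graph[b].add(a)
    return graph
```

For the identifier on the slide, `split_identifier("resolveRuntimeClasspathEntry")` yields the terms resolve, runtime, classpath and entry, and each of them becomes a neighbour of the other three in the term graph.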
SEARCH QUERY FOR A POOR BUG REPORT
compliance create preference add
configuration field dialog
annotation
Rank of the correct result: 30 (bug report texts only) → 1 (with keywords from relevant source code)
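Pseudo-relevance feedback can be sketched as follows: search with the original report terms, take the top-ranked source documents, and append their most prominent terms to the query. This is a simplified stand-in under stated assumptions. It ranks documents by raw term counts instead of a real IR engine, and it selects expansion terms by frequency rather than by the PageRank-style term weighting described in the talk. The corpus, file names and parameter values are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenizer that keeps alphabetic tokens only."""
    return [t for t in text.lower().split() if t.isalpha()]


def search(query_terms, corpus):
    """Rank document names by query-term occurrences (stand-in for a real IR engine)."""
    scores = {}
    for name, text in corpus.items():
        counts = Counter(tokenize(text))
        scores[name] = sum(counts[t] for t in set(query_terms))
    return sorted(corpus, key=lambda n: (-scores[n], n))


def expand_query(report_text, corpus, top_docs=2, extra_terms=3):
    """Append the most frequent new terms from the top-ranked documents."""
    query = tokenize(report_text)
    feedback = Counter()
    for name in search(query, corpus)[:top_docs]:
        feedback.update(t for t in tokenize(corpus[name]) if t not in query)
    ranked = sorted(feedback.items(), key=lambda kv: (-kv[1], kv[0]))
    return query + [term for term, _ in ranked[:extra_terms]]
```

The expanded query is then used for a second search, which is the step where a poor report gains the source code vocabulary it was missing.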
A RICH BUG REPORT
195 keywords
astvisitor post postvisit previsit pre file post pre astnode visitor
Rank of the correct result: 27 → 1
EXPERIMENT, DATASET & METRICS
5K+ Bug Reports, Version History, Ground Truth
1. Hit@K
2. MAP@K
3. MRR@K
4. QE
4 RQs
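The four metrics can be computed from ranked result lists and ground-truth sets. The sketch below makes a few assumptions: QE is read as query effectiveness, i.e. the rank of the first correct result; average precision is averaged over the relevant results found within the top K (other definitions divide by the total number of relevant results); and `results` is a hypothetical list of `(ranked_list, relevant_set)` pairs, one per query.

```python
def first_hit_rank(ranked, relevant):
    """QE (assumed): 1-based rank of the first relevant result, or None if absent."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return rank
    return None


def hit_at_k(results, k):
    """Fraction of queries with at least one relevant result in the top k."""
    hits = sum(1 for ranked, relevant in results
               if any(item in relevant for item in ranked[:k]))
    return hits / len(results)


def mrr_at_k(results, k):
    """Mean reciprocal rank of the first relevant result within the top k."""
    total = 0.0
    for ranked, relevant in results:
        rank = first_hit_rank(ranked[:k], relevant)
        total += 1.0 / rank if rank else 0.0
    return total / len(results)


def map_at_k(results, k):
    """Mean average precision over the relevant results found in the top k."""
    total = 0.0
    for ranked, relevant in results:
        precisions, found = [], 0
        for rank, item in enumerate(ranked[:k], start=1):
            if item in relevant:
                found += 1
                precisions.append(found / rank)
        total += sum(precisions) / len(precisions) if precisions else 0.0
    return total / len(results)
```

Hit@K only asks whether anything correct appears in the top K, MRR@K rewards finding the first correct result early, and MAP@K also accounts for the positions of the remaining correct results.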
COMPARISON WITH THE
STATE OF THE ART (MAP & HIT@K)
COMPARISON WITH THE STATE OF THE ART
(THEORETICAL)
PART 2: SUMMARY
(1) Bug Localization
(2) Concept Location
(3) Internet-scale Code Search
73%-88%
PART 2 SUMMARY: SEARCH QUERY FOR
BUG LOCALIZATION
Bug report (127 words) → Keyword selection → Search query:
JDIValue, toString, execute, EvaluationThread, run, NullPointerException
able cast null
Rank of the correct result: 53 (baseline query) → 1 (reformulated query)
Techniques: BLIZZARD, BLADER
PART 2 SUMMARY: SEARCH QUERY FOR
CONCEPT LOCATION
Change request (56 words) → Keyword selection → Search query:
element IResource Level Tree Provider
Rank of the correct result: 14 → 1
Techniques: STRICT, ACER
PART 2 SUMMARY: SEARCH QUERY FOR CODE
SEARCH ON WEB
Convert image to gray scale without losing transparency
BufferedImage Grayscale ImageEdit ColorConvertOp File
Transparency ColorSpace BufferedImageOp Graphics ImageEffects
RACK
NLP2API
Part 3: Bug Doctor Demo
http://www.usask.ca/~masud.rahman
https://github.com/masud-technope
Contact: masud.rahman@usask.ca
@masud2336
Masud Rahman
Part IV: Q&A
Editor's Notes

  • #2: Hello everyone! Good afternoon! Thanks for attending this meeting. My name is Masud Rahman. I am a PhD Candidate from the Software Research Lab, and I work with Dr. Chanchal K. Roy. Today, I will be talking about automated query reformulations for code search.
  • #3: Today, my talk will be divided into four sections. In the first section, I will discuss the research problem I am trying to solve in my PhD. In the second section, I will discuss my PhD thesis proposals to solve that research problem. In the third section, I will summarize my PhD contributions. Finally, we will have a Q&A session and interesting discussions.
  • #4: Part 1: Research Problem
  • #12: Now, I am not going to discuss those studies in detail, but here is a glimpse. Developers generally look for relevant code on the web using natural language queries. Please note that we are not talking about general web search, but about searching source code repositories such as GitHub. Now, GitHub provides this result. As you can see, it tries to match the query keywords with comments and identifiers. But we are dealing with source code, right? So, we need a source-code-friendly query for a better result. We therefore identify relevant API classes for this natural language query through extensive data mining and data analytics. And once again, Stack Overflow is our friend in this grand challenge.
  • #14: Now, we are done with the background concepts, Part 1. Next, we move into Part 2: PhD Thesis.
  • #15: So what did we do? We conducted a systematic literature survey using 56 primary studies on query reformulation for code search. During this study, we found 3 major issues in the literature.
  • #18: We did an empirical study with 5K+ bug reports in our ICSE poster, and we discovered that bug reports can be very different in terms of quality. There are different types of bug reports. A report can be noisy, containing stack traces (16%). It can be poor, containing no structured entities (30%). Or it can be rich, including source code, test cases and other items (54%).
  • #20: Now, each trace entry contains three pieces of information: a package name, a class name and a method name. We know that, within a single line, the method and class are statically connected. However, classes and methods are also hierarchically dependent across trace lines due to caller-callee relationships. We capture such static and hierarchical relationships from consecutive trace lines, and develop a trace graph like this.
  • #21: So, we have created two graphs, right? They are developed from the bug report along two different dimensions: word co-occurrence and syntactic dependence. Once we have the graphs, we apply the famous PageRank algorithm, which is the backbone of Google search. The algorithmic details are a bit complex, but I will try to provide an overview here. Why do you think this guy is laughing? Because he is getting the maximum votes. Similarly, in the graph, the node that is connected to the most nodes is the winner. That is, a term's importance is determined by its connectivity with other nodes. More importantly, since this is a recursive algorithm, the importance also depends on the weights of the connected nodes. Once the computation is done, we get a reformulation candidate from each graph. What is a reformulation candidate? It is a ranked list of keywords like this. So, we collect two candidates from the two graphs, apply machine learning, and suggest the best one as our query for the bug report.
  • #22: So, from a noisy report, we extract the report title, the encountered exception, and the most important keywords from the stack traces. Then we do the search with this newly constructed query. For example, the baseline noisy query returns the correct result at the 53rd position, whereas our query returns it at the topmost position.
  • #24: Now, once such items are extracted, we split them. As we can see, these single terms share semantics and together convey a broader meaning. That is, they complement each other in this context. We capture such semantic dependencies in the source code, and develop a term graph like this.
  • #25: In the case of a poor bug report, we also apply a similar PageRank approach, but we collect the keywords from the source code using pseudo-relevance feedback. The details can be found in the paper. So, the bug report texts are merged with the keywords from relevant source code. The bug report texts alone return the correct result at the 30th position. After feeding the poor report with appropriate keywords, the correct result is returned at the topmost position.
  • #27: Now, for the experiments, we chose 8 subject systems from Apache and Eclipse. We collected about 3000 bug reports and mapped them to the version control history at GitHub. Through such mapping, we extract the ground truth for the bug reports. This is a standard process followed in the existing literature.
  • #35: Thanks for your time and attention. Now, I am ready to take your questions.