PhD Comprehensive exam of Masud Rahman

A SYSTEMATIC LITERATURE REVIEW OF
AUTOMATED QUERY REFORMULATIONS IN
SOURCE CODE SEARCH
Masud Rahman
Department of Computer Science
University of Saskatchewan, Canada
Advisor: Dr. Chanchal K. Roy
@masud2336

MASUD RAHMAN: ACADEMICS
2019: PhD (In Progress), University of Saskatchewan (Award: Dr. Keith Geddes Award)
2014: MSc, University of Saskatchewan (Award: Best MSc Thesis Nomination)
2009: BSc, Khulna University, Bangladesh (Award: President Gold Medal)
MasudRahman,UofS

TALK OUTLINE
Part 1: Background Concepts
Part 2: Literature Review
Part 3: Future Opportunities
Part 4: Q & A

Part 1: Background Concepts
P1 P2 P3 P4

BACKGROUND CONCEPTS
A SYSTEMATIC LITERATURE REVIEW OF AUTOMATED QUERY REFORMULATIONS IN SOURCE CODE SEARCH
Source Code Search
Automated Query Reformulation

MCAS: A SOFTWARE BUG THAT KILLS
Boeing 737 MAX 8

A TALE OF SOURCE CODE SEARCH
Boeing Customer
MCAS Bug report
Boeing Developer
Code search
Query Suggestion
Query Reformulation
Boeing Codebase

QUERY REFORMULATION: 2 WORKING CONTEXTS
Local code search (e.g., bug localization): Boeing codebase
Internet-scale code search: GitHub

VOCABULARY MISMATCH PROBLEM
Both are correct and wrong!
Boeing Customer
Boeing Developer

QUERY REFORMULATION: 3 TYPES
Initial query: Fix MCAS Bug
Query Expansion: Fix MCAS Bug + Lion Air
Query Reduction: MCAS Bug
Query Replacement: Boeing 737 Max Software Issue
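A minimal sketch of the three reformulation types, using the MCAS example terms from this slide. The function names and the list-of-terms query representation are assumptions made for illustration; they do not come from any surveyed tool.

```python
def expand(query, extra_terms):
    """Query expansion: append new terms that are not already in the query."""
    return query + [t for t in extra_terms if t not in query]


def reduce_query(query, noisy_terms):
    """Query reduction: drop terms judged to be noisy."""
    return [t for t in query if t not in noisy_terms]


def replace_query(query, new_terms):
    """Query replacement: discard the original terms and use new ones."""
    return list(new_terms)


initial = ["fix", "mcas", "bug"]
expanded = expand(initial, ["lion", "air"])
reduced = reduce_query(initial, {"fix"})
replaced = replace_query(initial, ["boeing", "737", "max", "software", "issue"])

print(expanded)
print(reduced)
print(replaced)
```

How the extra, noisy, or replacement terms are chosen is the hard part, and that choice is what automated query reformulation techniques differ on.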

QUERY REFORMULATION: 3 STEPS
Initial Query → Code Search → Relevance Feedback → Feedback documents → Text mining → Candidate term weighting → Candidate term ranking
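The pipeline on this slide can be sketched as a toy pseudo-relevance feedback loop. Everything concrete here is an assumption for illustration: the three-file corpus, the overlap-based `search` function, and raw term frequency as the candidate weight. Real techniques use a proper retrieval engine and more careful weighting.

```python
from collections import Counter


def tokenize(text):
    return text.lower().split()


def search(query, corpus, top_k=2):
    """Toy code search: rank documents by how many query terms they contain."""
    scored = []
    for doc_id, text in corpus.items():
        tokens = set(tokenize(text))
        score = sum(1 for term in query if term in tokens)
        scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:top_k] if score > 0]


def reformulate(query, corpus, top_k=2, n_terms=2):
    # Collect feedback documents by running the initial query.
    feedback = search(query, corpus, top_k)
    # Mine the feedback documents and weight candidate terms by frequency.
    weights = Counter()
    for doc_id in feedback:
        weights.update(tokenize(corpus[doc_id]))
    # Rank the candidates and append the best ones to the query.
    candidates = [(w, t) for t, w in weights.items() if t not in query]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return query + [t for w, t in candidates[:n_terms]]


# Hypothetical corpus: file name -> bag of words extracted from the file.
corpus = {
    "FlightControl.java": "mcas trim stabilizer sensor trim",
    "SensorReader.java": "sensor angle attack mcas sensor",
    "CabinLights.java": "cabin light dimmer",
}

new_query = reformulate(["mcas"], corpus)
print(new_query)
```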

QUERY REFORMULATION: IMPLICATIONS
Automated Query Reformulation
Benefits
• 20% Improvement
• Redefine information needs
• Reduced cost & effort
Costs
• Hurts good queries
• Topic drifting
• Difficult queries

Part 2: Systematic Literature Review

SYSTEMATIC LITERATURE REVIEW: 6 STEPS
Research questions (7 RQs) → Search keywords → Literature search → Literature bulk → Noise filtration → Primary studies → In-depth investigation

SYSTEMATIC LITERATURE REVIEW: PRIMARY
STUDY SELECTION
ACM DL
CrossRef
DBLP
Mendeley
Google Scholar
IEEE Xplore
ProQuest
ScienceDirect
SpringerLink
Web of Science
Wiley Online Lib
Initial results: 2871
Impurity removal: 2317
Filter by Title: 562
Filter by Abstract: 195
Merging & Duplicate removal: 93
Filter by Full texts
Primary studies: 56
Information retrieval, IR, text retrieval, TR, bug localization, concept location, feature location, FLT, concern location, Internet-scale code search, code search engine, search engine, local code search, code search, source code search, and code search query.
+
Query reformulation, query expansion, query reduction, query formulation, query refinement, automated query expansion, AQE, query suggestion, query recommendation, term selection, query replacement, query difficulty, query quality, keyword selection, keyword extraction, search term identification, search query, search term, and search keyword.

OUR RESEARCH QUESTIONS
RQ1: Which methods, algorithms and data sources have been used for automated query reformulations targeting code search in the literature?
RQ2: Which methods, metrics or subject systems have been used to evaluate and validate the research on automated query reformulations?
RQ3: What are the major challenges of automated query reformulations intended for code search? How many of them have been solved to date by the literature?
RQ4: How much research activity on automated query reformulations has been performed to date? At which venues has this research been published?
RQ5: What are the differences and similarities between query reformulations for local code search and query reformulations for Internet-scale code search?
RQ6: Which is the most appropriate among term weighting, query-term co-occurrence and thesaurus-based approaches for query keyword selection?
RQ7: What is the scope for future work in the area of automated query reformulation targeting code search?

RQ1: WHICH METHODS & ALGORITHMS ARE
USED BY LITERATURE?
Grounded Theory
• Open coding
• Axial coding
• Selective coding

RQ1: WHICH ALGORITHMS & REFORMULATION TYPES
ARE USED BY LITERATURE?

RQ2: WHICH EVALUATION & VALIDATION
SETTINGS ARE EMPLOYED?

RQ3: WHAT ARE COMMON CHALLENGES &
LIMITATIONS OF EXISTING LITERATURE?
Grounded Theory
• Open coding
• Axial coding
• Selective coding

RQ3: WHAT ARE COMMON CHALLENGES & LIMITATIONS OF
EXISTING LITERATURE?

RQ4: PUBLICATION STATS & INTERESTS ON QUERY
REFORMULATION RESEARCH

RQ5: COMPARISON BETWEEN LOCAL CODE SEARCH &
INTERNET-SCALE CODE SEARCH
TW = Term Weighting, TQC = Term-Query Co-occurrence, TS =
Thesaurus, ON = Ontology, SLM = Search Log Mining, ML = Machine
Learning, HM = Heuristics & Miscellaneous

RQ5: COMPARISON BETWEEN LOCAL CODE SEARCH &
INTERNET-SCALE CODE SEARCH
CH1 = Vocabulary Mismatch Unsolved, CH2 = Extra Burden on
Developers, CH3 = Lack of Generalizability, CH4 = Lack of Practical
Use, CH5 = Inappropriate Use of Tools, CH6 = Human Bias + Weak
Evaluation, CH7 = Unverified Assumptions

RQ6: CHALLENGES WITH THREE KEYWORD
SELECTION METHODS
Method                     #Study     CH1   CH2   CH3   CH6
Term Weighting             22 (39%)   36%   18%   91%   50%
Term-Query Co-occurrence   11 (20%)    9%   27%   64%   91%
Thesaurus                  17 (30%)   12%   12%   47%   41%
CH1 = Vocabulary Mismatch Unsolved, CH2 = Extra Burden on
Developers, CH3 = Lack of Generalizability, CH4 = Lack of Practical
Use, CH5 = Inappropriate Use of Tools, CH6 = Human Bias + Weak
Evaluation, CH7 = Unverified Assumptions

Part 3: Future Opportunities

R1: KEYWORD SELECTION FROM BUG REPORT
Bug report: Title, Description
ID   Query                                                      QE
1.   Custom search results view iresource                       1331
2.   Custom search results search results view                  636
3.   element iresource provider level tree                      01
4.   Custom search results hierarchically java search results   570
Lower QE is better
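The slide does not define QE. The sketch below assumes it is the 1-based rank of the first relevant result returned for a query, which is consistent with "Lower QE is better" and with the best query scoring 01. The file names are hypothetical.

```python
def query_effectiveness(ranked_results, relevant):
    """Return the 1-based rank of the first relevant result, or None if none is found.

    Assumed definition of QE; the slide itself only says that lower is better.
    """
    for rank, doc in enumerate(ranked_results, start=1):
        if doc in relevant:
            return rank
    return None


# Hypothetical ranked list returned by a code search engine for one query.
results = ["SearchView.java", "TreeProvider.java", "ResourceNode.java"]
qe = query_effectiveness(results, relevant={"ResourceNode.java"})
print(qe)
```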

R2: TERM WEIGHTING FOR SOURCE CODE
$\text{TF-IDF}(t) = (1 + \log(tf(t, d))) \times \log\frac{D}{n_t}, \quad d \in D_{RF}$
• Different syntax
• Different semantics
• Different structures
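A minimal sketch of log-scaled TF-IDF term weighting, the kind of weighting this slide refers to. The exact variant used in the talk cannot be confirmed from the slide, so the formula below (log-damped term frequency times log inverse document frequency) and the tiny token corpus are assumptions.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """Weight `term` in one document against a corpus of tokenized documents."""
    tf = doc_tokens.count(term)
    if tf == 0:
        return 0.0
    # n_t: number of documents in the corpus that contain the term.
    n_t = sum(1 for doc in corpus if term in doc)
    if n_t == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(len(corpus) / n_t)


# Hypothetical corpus: each document is a list of tokens.
corpus = [
    ["open", "file", "open"],
    ["close", "file"],
    ["read", "buffer"],
]

w_open = tf_idf("open", corpus[0], corpus)
w_file = tf_idf("file", corpus[0], corpus)
print(w_open, w_file)
```

Such a weight treats source code as plain text, which is the concern the three bullets above raise: code differs from natural language in syntax, semantics and structure.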

R3: SOLVE VOCABULARY MISMATCH ISSUE
Customer
Developer
Past Developer
Bug Report
Codebase

SOLUTION: SEMANTIC HYPERSPACE
Word 1: P (1, 5, 6, 7, ..., N)
Word 2: P (2, 4, 6, 9, ..., N)
Cosine distance = Semantic relevance
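A minimal sketch of comparing two words by the cosine of the angle between their vectors. The four-dimensional vectors are made-up values for illustration; in practice they would come from a trained word embedding model. Cosine distance is taken here as 1 minus cosine similarity, so a smaller distance means closer in meaning.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    return 1.0 - cosine_similarity(u, v)


# Hypothetical word vectors (not taken from any real model).
word1 = [1.0, 5.0, 6.0, 7.0]
word2 = [2.0, 4.0, 6.0, 9.0]

print(cosine_similarity(word1, word2))
print(cosine_distance(word1, word2))
```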

R4: GENETIC ALGORITHM FOR QUERIES
Bug report: Title, Description
Method        Search Query                                          QE
Baseline      {title + description}                                 25
STRICT[140]   {tab classpath enabled buttons user entry}            86
TF-IDF        {button entry bootstrap enabled incorrectly moving}   177
GA            {open reflect tab bottom entry classpath}             01
Lower QE is better
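A rough sketch of how a genetic algorithm might search for a good keyword combination. This is not the method behind the GA row on this slide: the fitness function, the target term set, and all parameters are hypothetical. A real fitness function would run each candidate query through the search engine and score it by its result ranks.

```python
import random


def evolve_query(candidates, fitness, query_len=3, pop_size=20,
                 generations=30, seed=0):
    """Evolve a query of `query_len` distinct terms that maximizes `fitness`."""
    rng = random.Random(seed)
    population = [rng.sample(candidates, query_len) for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            # Crossover: pool both parents' terms and sample a child from it.
            pool = list(dict.fromkeys(a + b))
            child = rng.sample(pool, query_len)
            # Mutation: occasionally swap one term for an unused candidate.
            if rng.random() < 0.2:
                unused = [t for t in candidates if t not in child]
                if unused:
                    child[rng.randrange(query_len)] = rng.choice(unused)
            children.append(child)
        population = parents + children
        best = max(population + [best], key=fitness)
    return best


# Hypothetical setup: candidate terms borrowed from the slide's example queries.
candidates = ["tab", "classpath", "enabled", "buttons", "user",
              "entry", "open", "reflect", "bottom"]
target_terms = {"tab", "classpath", "entry"}  # assumed "good" terms


def fitness(query):
    # Stand-in fitness: overlap with the assumed good terms.
    return len(set(query) & target_terms)


best = evolve_query(candidates, fitness)
print(best, fitness(best))
```

Because the fitter half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next.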

R5: STACK OVERFLOW FOR QUERY
Convert image to gray scale without losing transparency
BufferedImage Grayscale ImageEdit ColorConvertOp File
Transparency ColorSpace BufferedImageOp Graphics ImageEffects

Part 4: Q & A
Masud Rahman
http://www.usask.ca/~masud.rahman
Contact: masud.rahman@usask.ca
@masud2336

DICE, ROCCHIO, RSV

PROBABILISTIC TERM WEIGHTING
KLD
