RACK: CODE SEARCH IN THE IDE USING CROWDSOURCED KNOWLEDGE
Mohammad Masudur Rahman, Chanchal K. Roy, and David Lo+
University of Saskatchewan, Canada; Singapore Management University+, Singapore
International Conference on Software Engineering (ICSE 2017), Buenos Aires, Argentina
CODE SEARCH ENGINE (BLACK DUCK)
• Natural language queries do not work well with code search engines (e.g., Black Duck, Krugle, GitHub search).
• These engines perform simple keyword matching, which is not sufficient.
Query: no results
CODE SEARCH ENGINE (GITHUB & KRUGLE)
• A query must be carefully prepared, which is not an easy task.
• A query should contain relevant API names.
• The developer needs up-front knowledge of the required APIs.
Keyword matching
NL query to be replaced by relevant APIs?
WEB SEARCH ENGINE (GOOGLE)
• Web search engines (e.g., Google) return thousands of results.
• They cannot guarantee relevant code examples.
• This means information overload for the developers.
Search Query: How to send email in Java?
RACK: Code Search in the IDE using Crowdsourced Knowledge
Research Problem: Code Example Search using Natural Language Queries
MOTIVATING EXAMPLE
Question title
Relevant API Item of our interest
CHALLENGES FOR RACK
RQ1: Do accepted answers refer to API names frequently?
RQ2: What percentage of standard APIs are referred to in the accepted answers?
RQ3: Do titles from Stack Overflow questions contain potential keywords for code search?
ANSWERING RQ1: API CLASSES/ANSWER
• Heavy-tailed distribution, i.e., Poisson
• Mean, λ=2.37 with 95% CI between 2.33 and 2.41
ANSWERING RQ2: CORE API CLASS COVERAGE IN STACK OVERFLOW
RQ3: SEARCH QUERY KEYWORDS IN SO QUESTION TITLES
Fig 6: Coverage of query keywords in Stack Overflow question titles
SUMMARY OF FINDINGS
• On average, at least 2 API classes per accepted answer (mean 2.37)
• 60% of core API classes covered by Stack Overflow
• 73% of real-life query terms matched with question title terms of Stack Overflow
SO is good for relevant API names for code example search
WORKING METHODOLOGY OF RACK
(1) Token-API mapping database construction
(2) Query reformulation or translation
(3) Code example search
STEP I: CONSTRUCTION OF TOKEN-API MAPPING DATABASE
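The construction step can be sketched roughly as follows: tokens from a question title are linked to API class names found in the accepted answer's code. This is only an illustrative sketch, not the authors' implementation. The stop-word list, the CamelCase regular expression (which captures only multi-word class names), the function names, and the sample thread are all assumptions.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the actual preprocessing is not given on the slide.
STOP_WORDS = {"how", "to", "in", "a", "an", "the", "do", "i", "java", "using"}

# Assumed pattern: multi-word CamelCase identifiers are treated as API class names.
CLASS_NAME = re.compile(r"\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]*)+\b")


def title_tokens(title):
    """Lower-case the title, split on non-alphanumerics, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if w not in STOP_WORDS}


def api_classes(code):
    """Collect candidate API class names from a code segment."""
    return set(CLASS_NAME.findall(code))


def build_mapping(threads):
    """threads: iterable of (question_title, accepted_answer_code) pairs.

    Returns token -> {api_class: co-occurrence count}.
    """
    mapping = defaultdict(lambda: defaultdict(int))
    for title, code in threads:
        for token in title_tokens(title):
            for api in api_classes(code):
                mapping[token][api] += 1
    return mapping


# Toy thread modeled on the MD5 motivating example.
threads = [("How to generate MD5 hash?",
            'MessageDigest md = MessageDigest.getInstance("MD5");')]
mapping = build_mapping(threads)
```

In this toy run, each of the title tokens `generate`, `md5`, and `hash` is linked once to `MessageDigest`.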
STEP II: NL QUERY REFORMULATION
STEP III: CODE SEARCH WITH GITHUB
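As a rough sketch of the search step, the reformulated query (a list of API class names) can be turned into a request URL for GitHub's REST code search endpoint, and the returned snippets ordered by a simple textual match. The helper names, the `language:java` qualifier, the example class names, and the scoring rule are assumptions for illustration. GitHub also expects authenticated requests for code search, which this sketch leaves out.

```python
from urllib.parse import urlencode

GITHUB_CODE_SEARCH = "https://api.github.com/search/code"


def build_search_url(api_classes, language="java", per_page=10):
    """Build a code search URL whose query is the list of API class names."""
    query = " ".join(api_classes) + f" language:{language}"
    return GITHUB_CODE_SEARCH + "?" + urlencode({"q": query, "per_page": per_page})


def rank_snippets(snippets, api_classes):
    """Order snippets by how many of the API names they mention (assumed scoring)."""
    def score(snippet):
        return sum(1 for api in api_classes if api in snippet)
    return sorted(snippets, key=score, reverse=True)


# Hypothetical reformulated query for "how to send email in Java".
apis = ["MimeMessage", "Transport"]
url = build_search_url(apis)
ranked = rank_snippets(
    ["Transport.send(msg);", "MimeMessage m = build(); Transport.send(m);"],
    apis,
)
```

The snippet that mentions both class names is ranked first. The AST parsing that extracts method bodies from the returned files is not shown.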
DATASET & EXPERIMENTAL SETUP
• Java2s
• 150 code search queries
• Validation (Thung et al., ASE 2013)
• Evaluation (4 metrics)
PERFORMANCE OF RACK: ANSWERING RQ4
Performance Metric   Top-3    Top-5    Top-10
Top-K Accuracy       49.33%   62.67%   78.67%
MRR@K                0.17     0.17     0.17
MAP@K                30.39%   33.36%   34.92%
MR@K                 23.71%   33.48%   45.02%
• Provides about 79% Top-10 accuracy
• Mean average precision is about 35%, mean recall is about 45% (Top-10)
• Main strength: exploiting the wide coverage of API packages and libraries from Stack Overflow.
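The four metrics can be illustrated per query as below; averaging the per-query values over all queries would give Top-K accuracy, MRR@K, MAP@K, and MR@K. This is a sketch using common textbook definitions, and the exact formulas used in the evaluation may differ. Average precision here is divided by the number of relevant items retrieved, which is an assumption, and the ranked list and gold set are made-up examples.

```python
def top_k_hit(ranked, gold, k):
    """True if at least one gold API appears in the top-k results."""
    return any(api in gold for api in ranked[:k])


def reciprocal_rank(ranked, gold, k):
    """1 / position of the first gold API within the top-k, else 0."""
    for pos, api in enumerate(ranked[:k], start=1):
        if api in gold:
            return 1.0 / pos
    return 0.0


def average_precision(ranked, gold, k):
    """Mean of precision values at each position where a gold API occurs."""
    hits, total = 0, 0.0
    for pos, api in enumerate(ranked[:k], start=1):
        if api in gold:
            hits += 1
            total += hits / pos
    return total / hits if hits else 0.0


def recall(ranked, gold, k):
    """Fraction of the gold APIs found in the top-k results."""
    return len(set(ranked[:k]) & set(gold)) / len(gold)


# Hypothetical recommendation list and gold set for one query.
ranked = ["Transport", "File", "MimeMessage", "Session"]
gold = {"MimeMessage", "Transport", "Session"}
ap_at_3 = average_precision(ranked, gold, 3)
recall_at_3 = recall(ranked, gold, 3)
```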
COMPARISON WITH EXISTING TECHNIQUES: ANSWERING RQ7
Fig: Comparison using box plots
RACK: PROPOSED CODE SEARCH TOOL
1. Alice needs the code example for sending an email
2. Alice submits an NL query to RACK
3. RACK returns a ranked list of relevant API classes
4. Alice selects API classes based on the metrics and her experience
5. Makes code search
6. Relevant code example returned
7. Checks Top-K examples
8. RACK returns Top-K relevant examples
9. Alice can copy/paste and continue her work in the IDE
RACK: TOOL DEMO
Click here to check the demo
TAKE-HOME MESSAGES
• Research problem: Effective code search using natural language queries.
• Neither existing code search engines nor web search engines are sufficient for this.
• First step: Reformulate the NL query into relevant API names, using crowdsourced knowledge from Stack Overflow.
• Second step: Code search using the GitHub search API in the IDE.
• Our solution is packaged as an Eclipse plugin and a web service.
• Please give it a try, and let us know!
THANK YOU! QUESTIONS?
RACK page: http://homepage.usask.ca/~masud.rahman/rack
Email us: chanchal.roy@usask.ca or masud.rahman@usask.ca
HEURISTICS: REFORMULATION OF NL QUERY
• Keyword-API Co-occurrence (KAC)
  • Considers co-occurrence between query keywords & API classes in Stack Overflow questions & answers.
  • Co-occurrence due to semantic relevance or by chance
  • Random co-occurrences discarded using threshold.
• Candidate API Selection
  $L[K_i] = \{ A_j \mid A_j \in A \wedge \mathrm{rank}_{freq}(K_i \rightarrow A_j) \le \text{threshold} \}$
  • $K_i$ denotes a keyword, $A_j$ denotes an API, and $K_i \rightarrow A_j$ denotes the association between them in SO Q & A
• KAC-Score Calculation
  $S_{KAC}(A_j, K_i) = 1 - \dfrac{\mathrm{rank}(A_j, \mathrm{sortByFreq}(L[K_i]))}{|L[K_i]|}$
  • Normalized API score using co-occurrence frequency
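The KAC heuristic can be sketched as follows: candidate APIs for a keyword are sorted by co-occurrence frequency, cut to a top list, and scored as one minus their normalized rank. The 0-based rank (so the most frequent API scores 1.0), the alphabetical tie-breaking, the `top` cutoff, and the toy frequencies are assumptions, not values from the study.

```python
def kac_scores(cooccurrence, keyword, top=5):
    """Score candidate APIs for one keyword by normalized frequency rank.

    cooccurrence: keyword -> {api_class: co-occurrence frequency}
    Returns api_class -> score in (0, 1], highest for the most frequent API.
    """
    freqs = cooccurrence.get(keyword, {})
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ranked = sorted(freqs, key=lambda api: (-freqs[api], api))[:top]
    n = len(ranked)
    # 0-based rank is assumed, so the first candidate gets 1 - 0/n = 1.0.
    return {api: 1.0 - pos / n for pos, api in enumerate(ranked)}


# Toy co-occurrence counts, invented for illustration only.
cooccurrence = {
    "email": {"MimeMessage": 40, "Transport": 25, "Session": 25, "File": 1},
}
scores = kac_scores(cooccurrence, "email", top=3)
```

With `top=3`, the rarely co-occurring `File` is discarded, and the remaining candidates score 1.0, 2/3, and 1/3 in rank order.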
HEURISTICS: REFORMULATION OF NL QUERY
• Keyword-Keyword Coherence (KKC)
  • Candidate APIs should be coherent with one another.
  • Two APIs are coherent if their co-occurred keywords are semantically relevant.
  • Keyword relevance based on contextual words in Stack Overflow question titles. Total key pairs = $^{n}C_{2}$
• Candidate API Selection
  $L_{coh}[K_i, K_j] = \{ L[K_i] \cap L[K_j] \mid \cos(C_i, C_j) > \gamma \}$
  • $C_i, C_j$ refer to contextual word list of $K_i, K_j$
• KKC-Score Calculation
  $S_{KKC}(A_j, K_i, K_j) = \cos(C_i, C_j) \mid (K_i \rightarrow A_j) \wedge (K_j \rightarrow A_j)$
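The coherence heuristic can be sketched as follows: for every pair of query keywords, the cosine similarity of their contextual word lists is computed, and APIs that are candidates for both keywords receive that similarity as their score. The bag-of-words cosine, the threshold name `gamma` and its default of 0, and the toy contexts and candidate lists are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(words_a, words_b):
    """Cosine similarity between two word lists treated as bags of words."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def kkc_scores(keywords, contexts, candidates, gamma=0.0):
    """Return {(api, k_i, k_j): coherence score} over all keyword pairs.

    contexts:   keyword -> contextual word list from question titles
    candidates: keyword -> candidate API list for that keyword
    """
    scores = {}
    for ki, kj in combinations(keywords, 2):  # all nC2 keyword pairs
        sim = cosine(contexts[ki], contexts[kj])
        if sim <= gamma:
            continue  # keywords not relevant enough to each other
        for api in set(candidates[ki]) & set(candidates[kj]):
            scores[(api, ki, kj)] = sim
    return scores


# Toy inputs, invented for illustration only.
contexts = {"send": ["email", "message", "smtp"],
            "email": ["send", "smtp", "message"]}
candidates = {"send": ["Transport", "Socket"],
              "email": ["Transport", "MimeMessage"]}
scores = kkc_scores(["send", "email"], contexts, candidates)
```

Only `Transport` is a candidate for both keywords, so it is the only API scored, with the pair's cosine similarity of 2/3.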

Editor's Notes

  • #2: Introduce yourself + introductory statements. Today, I am going to talk/demonstrate --- how code search can be performed using natural language queries where we used crowd sourced knowledge from Stack Overflow.
  • #3: Currently, we have several code search engines such as Black Duck, Krugle, and GitHub code search. But traditional code search engines generally do not perform well with natural language queries. For example, Black Duck does not return any results for this simple query: "how to send an email in Java". It's because these engines do simple keyword matching.
  • #4: Here, we see Krugle and GitHub return some results for the same query. But they also use simple keyword matching. These keywords can match code tokens or code comments, as we can see. Besides, there is a larger problem with keyword matching: the vocabulary mismatch problem. That is, if the query and the source code have different vocabularies, keyword matching will not work.
  • #5: Now, with the web search engines, the same query returns thousands of results. But the bigger problem is, you cannot guarantee that they will contain relevant code examples. Developers generally like to play with working code examples rather than dealing with lots of texts. So, there is information overload for the developers to deal with in this case.
  • #6: So, we are trying to solve this research problem: Code example search using natural language queries. One obvious solution is to translate the NL query into relevant API names, and then use them for the code search. Now, the challenge is how do you get the relevant API names for a particular code search?
  • #7: Now let's see how we can do that. In Stack Overflow, users often ask questions like this: "How to generate MD5 hash?" Then an expert answers the question, and the answer often contains the relevant API names. For example, here we see that MessageDigest is a relevant API for performing the task. More importantly, such an answer is validated by a large developer crowd who agree with it. We mine thousands of questions and answers like this, and then translate a natural language query into relevant API names.
  • #8: Now this idea comes with several challenges, and we need to solve them first before using information from Stack Overflow. For example, we have to answer these questions: First, do the accepted answers in Stack Overflow use API names frequently, as a part of the solutions? Or this is just occasional? Second, what percentage of the APIs are mentioned in those accepted answers from Stack Overflow? Third, do the titles from Stack Overflow questions overlap with code search keywords?
  • #9: These are the findings for our RQ1. We determine API class frequency per accepted answer, and analyze their distribution. The frequencies have a heavy-tailed distribution such as the Poisson distribution. The mean frequency per answer is 2.37 with a 95% confidence interval between 2.33 and 2.41. That means, on average, at least 2 API classes are used in each accepted answer of Stack Overflow.
  • #10: These figures show how core Java API packages are covered by SO answers. We consider 11 core Java packages from standard Java libraries. We see that on average, more than 60% of the API classes are covered, i.e., used by questions and answers of Stack Overflow. This is a significant amount. More granular details are in the main paper. These findings suggest that SO is a good source for collecting API-related information.
  • #11: RQ3 focuses on the overlap between SO question titles and the real life search queries collected for our study. Fig. 6 shows the percentage of title tokens that matched with real life query tokens for the last 8 years. Queries are taken from the Google search history of the first author. This is above 70% on average which is really interesting.
  • #12: So, this is a summary of our exploratory findings. You can read them out, then say Stack Overflow is a good source of relevant API names for our query reformulation.
  • #13: Now, we have these three steps in our methodology. (1) We first mine 344K questions and answers from Stack Overflow, and construct a Token-API mapping database. (2) Then we reformulate the NL query into relevant API names using two novel metrics. The details are in the paper. (3) Then we search for code examples in GitHub using those relevant API names.
  • #14: The first step is Token-API mapping database development. This database provides a mapping between natural language tokens and relevant API names with the help of crowdsourced knowledge. We first collect the accepted answers of Stack Overflow questions, and extract their code segments or code like elements. Then we parse them for API class names using regular expressions, and collect them. Then we also collect the titles from SO questions, and perform standard natural language processing on them to generate unique tokens. So, now we have a list of tokens and a list of API classes from each Q & A thread, and they co-occurred due to semantic relevance. This co-occurrence is derived from the knowledge/experience of the crowd. So, we link those tokens and APIs, and store them in a database, and we call it Token-API mapping database.
  • #15: We have the database. Now, the next step is pretty straightforward. We collect the tokens from the query, and collect candidate APIs from the Token-API mapping database using two co-occurrence-based heuristics. Please consult the paper for metric details. This returns a ranked list of relevant API names for the NL query.
  • #16: Now, we pass those relevant API names to the GitHub search API, which returns a list of relevant source code files. We then do AST parsing on them, extract the method implementations, and apply textual matching between the query and the code segments. Please note that the query now consists of relevant API names. Then, based on relevance, we return a list of code snippets.
  • #17: Experiment. We evaluate and validate the query reformulation part of our technique; the code search part basically leverages an existing search API. We conducted experiments using 150 code search queries collected from three programming tutorial websites: KodeJava, Java2s and Javadb.com. Relevant past studies also used the same sites. In particular, we select the questions related to programming, and consider the question title as the query. They are quite similar to real-life code search queries. Then we analyze the code example as well as the API classes from the corresponding answers, and consider the APIs as the gold set for the query. On average, we select 3-5 API classes as the gold set for each query.
  • #18: This is the performance of our technique. For Top-5 recommendation, it provides about 63% Top-K accuracy, and for Top-10, it provides about 79% accuracy, which is quite promising. The mean average precision is 35%, and the recall is 45%. Although the precision is a bit low, the technique's main strength is its capability to return APIs for a wide range of queries, because the question titles provide a very wide vocabulary from thousands of people, and more than 60% of the standard Java APIs are discussed in Stack Overflow. So, it would possibly always provide a better accuracy than other competitive techniques.
  • #19: We also compare with two baseline techniques adapted from Thung et al., the state of the art from ASE 2013. We see that our technique performs relatively better. It has a slightly larger variance, but its median performance is still higher than that of the two baseline techniques. The detailed evaluation can be found in the main paper.
  • #20: Now, this is the UI of our tool.
  • #21: Let's see the demo of our code search tool, RACK. Note: Adding the offline video misses the annotation feature of YouTube videos. Those annotations can only be seen from the YouTube domain. So, please show the demo directly from YouTube.
  • #22: So, these are take home messages. You can read them out.
  • #23: Thanks for your attention! Questions?
  • #24: We use two heuristics to select the candidate APIs for a search query. The first one is Keyword-API Co-occurrence (KAC). That means we consider how frequently a search keyword and an API co-occurred across various questions and answers from Stack Overflow. This co-occurrence can be either due to semantic relevance or by chance. We discard the random associations using a threshold frequency, but mostly such co-occurrence is due to some kind of relevance. So, for each keyword, we collect a candidate list. The candidates are then sorted and ranked. From that rank, we extract a heuristic score for each candidate API. We call this score the KAC score.
  • #25: The second heuristic considers another aspect: Keyword-Keyword Coherence. That is, two candidate APIs can be frequent with two different keywords, but these two APIs should be coherent with each other as well. If they are not coherent/compatible, then they cannot solve a technical problem together. So, we extract a coherent API list using the contextual word list for each pair of query tokens. We then determine the cosine similarity between the contexts of the two tokens, which can be considered as the semantic relevance between these tokens. That score is then considered as the coherence score for the selected API list.