Extracting Code Segments and Their Descriptions from Research Articles

Preetha Chatterjee, Benjamin Gause, Hunter Hedinger,
and Lori Pollock
Computer & Information Science

Bug Reports
Emails
Blog Posts
Q & A forums
Code Reviews
Documentation
E-books
Research Papers
Course Materials
Presentations
Public Chats
Benchmarks
Research Papers
DL            Domain                    # of articles
ACM DL        Computer Science          > 300,000
IEEE Xplore   Computer Science          > 3,500,000
DBLP          Mostly Computer Science   > 3,729,582
https://en.wikipedia.org/wiki/IEEE_Xplore
https://cacm.acm.org/magazines/2011/7/109905-acm-aggregates-publication-statistics-in-the-acm-digital-library/fulltext
http://dblp.uni-trier.de/

70% of the articles contain one or more code segments,
with an average of 3-4 code segments per article.

Excerpt from a research article
To understand the difficulty of fixing a memory leak, let
us take a look at an example program in Fig. 1. This is a
contrived example mimicking recurring leak patterns we
found in real C programs. Procedure check_records
checks whether there is any bad records in a large file,
and the caller could either check all records, or specify a
search condition to check only part of records. In this
example, both get_next and search_for_next will
allocate and return a heap structure, which is expected
to be freed at line 12. However, the execution may
break out the loop at line 10, causing a memory leak.
Information conveyed by the excerpt:
• Indication of a problem in the code
• The functionality of the code
• Functionalities of individual method calls within the code
• The type of data structure used by the program
• Cause of the code issue presented earlier
• The programming language of the source code

Why extract code segments and their descriptions?
• Code recommendation
• Automatic comment generation
• Extension of documentation
• Learning from & reusing code examples

Code segments
[Bacchelli et al. (ICPC’10), Tang et al. (KDD’05), Bettenburg et al. (MSR’08),
Subramanian et al. (MSR’13), Rigby et al. (ICSE’13)]
Natural language text describing each code segment

This Paper’s Contributions
• Automatically identifying & mapping text describing code
segments in research articles
• A prototype, CoDesNPub Miner, that outputs XML-based
representation associating code segments with their
descriptions
• Evaluation of the effectiveness of code description
identification techniques
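As a rough illustration of what an XML-based representation associating code segments with their descriptions could look like, here is a minimal sketch using Python's standard library. The element and attribute names (article, codeSegment, code, description, sentence) and the build_record helper are hypothetical; the slides do not show the schema that CoDesNPub Miner actually emits.

```python
import xml.etree.ElementTree as ET


def build_record(article_title, segments):
    """Build an XML tree linking each code segment to its description.

    All element and attribute names here are illustrative assumptions,
    not the schema used by the actual prototype.
    """
    root = ET.Element("article", {"title": article_title})
    for seg in segments:
        node = ET.SubElement(root, "codeSegment", {"id": seg["id"]})
        ET.SubElement(node, "code").text = seg["code"]
        desc = ET.SubElement(node, "description")
        for sentence in seg["description"]:
            ET.SubElement(desc, "sentence").text = sentence
    return root


# Hypothetical input: one code segment with two describing sentences.
record = build_record(
    "Example article",
    [
        {
            "id": "Fig.1",
            "code": "while (has_next()) { r = get_next(); }",
            "description": [
                "Procedure check_records checks for bad records.",
                "The execution may break out of the loop, causing a leak.",
            ],
        }
    ],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Keeping one sentence element per description sentence would let downstream tools filter or rank individual sentences, although a flat text field would work as well.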
• Seeds
• Neighbors

Identify Seeds: References Figure Containing Code
Fig.5 shows a typical test method of this
pattern. The method tests a set of basic
functionality of API class BasicAuthCache,
including the method put, get, remove and
clear. There are three test scenarios in the
method: line 4-5, line 6-7, line 8-10. They
share two data objects, cache and
authScheme. Their method invocation
sequences are not same and there is no
unified test target method. But there is a
common subsequence among three method
invocation sequences, i.e., the invocations of
get and HttpHost.

Identify Seeds: Located Immediately Before or
After Inlined Code
*****Code Segment appears here*****
Listing 9 shows an example of three
statements that were single statement blocks
after the first phases, but can be merged into
a single block because they have similar RHSs.

A major obstacle to extracting API examples
from test code is the multiple test scenarios in
a test method. Fig. 1 depicts such a test
method. Lines 2-4 are the declaration of some
data objects. Lines 5-13 depict a test scenario
that contains the usage of some API methods,
such as keySetByValue, put, and getKey. Lines
14-22 depict another test scenario, which
contains a similar usage to the previous one.
Such multiple test scenarios are quite
reasonable when aiming at covering testing
input domains. But they bring redundant code
for API users to read. In fact, there are actually
200+ code lines containing similar test
scenarios in the test method in Fig.1. It is
necessary to separate different test scenarios
from one test method and cluster the similar
usages to remove redundancy.

Identify Seeds: Contains Code Identifiers

Identify Seeds: References Code By Position
This code snippet obtains a user name (userName)
by invoking request.getParameter(“name”) and uses
it to construct a query to be passed to a database for
execution (con.execute(query)). This seemingly
innocent piece of code may allow an attacker to gain
access to unauthorized information: if an attacker
has full control of string userName obtained from an
HTTP request, he can for example set it to ' OR 1 = 1;--.
Two dashes are used to indicate comments in the
Oracle dialect of SQL, so the WHERE clause of the
query effectively becomes the tautology name = ' '
OR 1 = 1. This allows the attacker to circumvent the
name check and get access to all user records in the
database.

Identify Seeds: Putting It All Together
Fig.5 shows a typical test method of this pattern.
The method tests a set of basic functionality of API
class BasicAuthCache, including the method put,
get, remove and clear. There are three test
scenarios in the method: line 4-5, line 6-7, line 8-10.
They share two data objects, cache and
authScheme. Their method invocation sequences
are not same and there is no unified test target
method. But there is a common subsequence
among three method invocation sequences, i.e.,
the invocations of get and HttpHost.
Heuristic matches annotated on the excerpt:
ReferencesCodeFigure Score=3
ContainsCodeIdentifiers Score=2
ContainsCodeIdentifiers Score=2
ReferencesCodeByPosition Score=2
ContainsCodeIdentifiers Score=2
ReferencesCodeByPosition Score=2
TextBefore Score=1
• Scoring sentences
• Equal
• Accuracy-based
• Threshold Analysis
Heuristic                   Score
ReferencesCodeFigure        3
ContainsCodeIdentifiers     2
ReferencesCodeByPosition    2
TextBefore, TextAfter       1
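
The scoring idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expressions, the choice to sum the scores of all matching heuristics, and the seed threshold of 2 are hypothetical, since the slides give the per-heuristic scores but not the detection rules or the final threshold.

```python
import re

# Per-heuristic scores taken from the score table above.
SCORES = {
    "ReferencesCodeFigure": 3,
    "ContainsCodeIdentifiers": 2,
    "ReferencesCodeByPosition": 2,
    "TextBeforeOrAfter": 1,
}

# Hypothetical cue patterns; the actual detection rules are not given.
FIGURE_REF = re.compile(r"\b(Fig\.?|Figure|Listing)\s*(\d+)", re.IGNORECASE)
POSITION_REF = re.compile(r"\blines?\s+\d+(?:\s*-\s*\d+)?", re.IGNORECASE)
WORD = re.compile(r"[A-Za-z_]\w*")


def figure_refs(sentence):
    """Return normalized (kind, number) pairs for figure/listing mentions."""
    refs = set()
    for kind, number in FIGURE_REF.findall(sentence):
        kind = "listing" if kind.lower() == "listing" else "figure"
        refs.add((kind, int(number)))
    return refs


def score_sentence(sentence, code_figures, identifiers, adjacent_to_code=False):
    """Sum the scores of all heuristics that fire (summing is an assumption)."""
    score = 0
    if figure_refs(sentence) & code_figures:
        score += SCORES["ReferencesCodeFigure"]
    if set(WORD.findall(sentence)) & identifiers:
        score += SCORES["ContainsCodeIdentifiers"]
    if POSITION_REF.search(sentence):
        score += SCORES["ReferencesCodeByPosition"]
    if adjacent_to_code:
        score += SCORES["TextBeforeOrAfter"]
    return score


def is_seed(score, threshold=2):
    """Treat a sentence as a seed once its score reaches a chosen threshold."""
    return score >= threshold


# Usage on sentences from the Fig.5 excerpt; inputs below are assumed.
code_figures = {("figure", 5)}
identifiers = {"BasicAuthCache", "put", "get", "remove", "clear",
               "cache", "authScheme", "HttpHost"}
for text in [
    "Fig.5 shows a typical test method of this pattern.",
    "There are three test scenarios in the method: line 4-5, line 6-7, line 8-10.",
]:
    s = score_sentence(text, code_figures, identifiers)
    print(s, is_seed(s))
```

Note that short identifiers such as get also occur as ordinary English words, so a real implementation would likely need a stricter identifier check than this word-set intersection.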

Identifying Neighboring Code-related Text
• Heuristic 1: At least 1 sentence is a seed
• Heuristic 2: At least a minimum fraction (25%, 50%, or 75%) of the sentences in the
paragraph are seeds
Fig.5 shows a typical test method of this pattern. The method tests a set of basic functionality
of API class BasicAuthCache, including the method put, get, remove and clear. There are three
test scenarios in the method: line 4-5, line 6-7, line 8-10. They share two data objects, cache
and authScheme. Their method invocation sequences are not same and there is no unified test
target method. But there is a common subsequence among three method invocation
sequences, i.e., the invocations of get and HttpHost.
5 out of 6 sentences (83%) are seeds, which meets the 75% threshold
Heuristic 1 or Heuristic 2 satisfied: whole paragraph is a description
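
The two neighbor heuristics reduce to a simple check over per-sentence seed flags. The sketch below assumes seeds have already been identified; the function name and its parameters are hypothetical.

```python
def paragraph_is_description(seed_flags, min_seed_fraction=None):
    """Decide whether a whole paragraph counts as a code description.

    seed_flags: one boolean per sentence, True if the sentence is a seed.
    min_seed_fraction: None applies Heuristic 1 (at least one seed);
    a value such as 0.25, 0.50, or 0.75 applies Heuristic 2.
    """
    if not seed_flags:
        return False
    seeds = sum(seed_flags)
    if min_seed_fraction is None:
        return seeds >= 1
    return seeds / len(seed_flags) >= min_seed_fraction


# The example paragraph: 5 of its 6 sentences are seeds.
flags = [True, True, True, True, False, True]
print(paragraph_is_description(flags))        # Heuristic 1
print(paragraph_is_description(flags, 0.75))  # Heuristic 2 at 75%
```

Raising min_seed_fraction makes the check stricter, which is the trade-off between precision and recall that the threshold settings explore.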

Evaluation Methodology
• Research Question: How effective is our approach to automatically
identify code descriptions in natural language text of research
articles?
• Subjects: 100 code segments from ACM DL and IEEE Xplore journal
and conference software engineering papers
• Gold Set:
• 10 Human annotators (non-authors)
• Measures:
• Overall code description identification: Precision and recall
• Seed identification: Precision
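
For reference, precision and recall against a gold set can be computed as below. This is a generic sketch: treating each description as a set of sentence IDs is an assumption, since the slides do not state the unit of comparison used in the evaluation.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of predicted items against a gold set.

    predicted, gold: iterables of hashable IDs (here, hypothetical
    sentence IDs marked as describing a given code segment).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up IDs for illustration only; not data from the study.
p, r = precision_recall(predicted={1, 2, 3, 4}, gold={2, 3, 5, 6, 7, 8})
print(f"precision={p:.2f} recall={r:.2f}")
```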

Evaluation Results
Overall system effectiveness
Minimum # of Seeds   Precision (%)   Recall (%)
1-24%                39.05           70.20
>= 25%               53.41           50.33
>= 50%               66.04           28.45
>= 75%               68.30           20.53

Kinds of information in the code descriptions
[Figure; labels: High Clarity, Low Clarity]

What cues are most prominent?

Main Threats to Validity
• Unable to distinguish between pseudocode and code fragments
  • Evaluated papers contain no pseudocode; plan to extend the approach to identify both.
• Evaluation relies on human judges
  • Human judges had experience in programming and in reading research papers.
  • Each code segment was judged by at least two judges.
• Scaling to a more extensive evaluation set might lead to different results
  • Plan to expand the evaluation with more participants and with research papers
    containing more code segments.

Related Work
• Analyzing Collections of Research Articles:
Cruzes et al. (ESEM’07), Siegmund et al. (ICSE’15)
• Code Segment Extraction:
Bacchelli et al. (ICPC’10), Tang et al. (KDD’05), Bettenburg et al.
(MSR’08), Subramanian et al. (MSR’13), Rigby et al. (ICSE’13)
• Code Description Identification:
Bug Reports: Panichella et al. (ICPC’12),
Q&A: Vassallo et al. (ICPC’14), Wong et al. (ASE’13),
Rahman et al. (SCAM’15)

Summary
• Automatically identify & map text describing code
segments in research articles
• CoDesNPub Miner outputting XML-based representation
associating code segments with their descriptions
• Evaluation of the effectiveness of code description
identification techniques
• Precision = 68%
• Recall = 21%
Future Work
Improve recall and precision
Fully automate preprocessing
Expand the experiments
