Preetha Chatterjee,	Benjamin	Gause,	Hunter	Hedinger,	
and	Lori	Pollock
Computer	&	Information	Science
Extracting	Code	Segments	and	Their	
Descriptions	from	Research	Articles	
1
Code	is	everywhere!
2
Bug	Reports
Emails
Blog	Posts
Q	&	A	forums
Code	Reviews
Documentation
E-books
Research	Papers
Course	Materials
Presentations
Public	Chats
Benchmarks
Research	Papers
DL            Domain                    # of articles
ACM DL        Computer Science          > 300,000
IEEE Xplore   Computer Science          > 3,500,000
DBLP          Mostly Computer Science   > 3,729,582
https://en.wikipedia.org/wiki/IEEE_Xplore
https://cacm.acm.org/magazines/2011/7/109905-acm-aggregates-publication-statistics-in-the-acm-digital-library/fulltext
http://dblp.uni-trier.de/
3
70%	of	the	articles	contain	one	or	more	code	segments,	
with	an	average	of	3-4	code	segments	per	article.
4
Excerpt	from	a	research	article
Indication	of	a	problem	in	the	code
The	functionality	of	the	code
Functionalities	of	individual	method	
calls	within	the	code
The	type	of	data	structure	used	by	
the	program
Cause	of	the	code	issue	presented	
earlier	
To	understand	the	difficulty	of	fixing	a	memory	leak,	let	
us	take	a	look	at	an	example	program	in	Fig.	1.	This	is	a	
contrived	example	mimicking	recurring	leak	patterns	we	
found	in	real	C	programs.	Procedure	check_records
checks	whether	there	is	any	bad	records	in	a	large	file,	
and	the	caller	could	either	check	all	records,	or	specify	a	
search	condition	to	check	only	part	of	records.	In	this	
example,	both	get_next and	search_for_next will	
allocate	and	return	a	heap	structure,	which	is	expected	
to	be	freed	at	line	12.	However,	the	execution	may	
break	out	the	loop	at	line	10,	causing	a	memory	leak.
The	programming	language	of	the	
source	code	
5
Why	extract	code	segments	and	their	descriptions?
Code	recommendation
Automatic	comment	generation
Extension	of	documentation
Learning	from	&	reusing	code	examples
6
Excerpt	from	a	research	article
To	understand	the	difficulty	of	fixing	a	memory	leak,	let	
us	take	a	look	at	an	example	program	in	Fig.	1.	This	is	a	
contrived	example	mimicking	recurring	leak	patterns	
we	found	in	real	C	programs.	Procedure	check_records
checks	whether	there	is	any	bad	records	in	a	large	file,	
and	the	caller	could	either	check	all	records,	or	specify	
a	search	condition	to	check	only	part	of	records.	In	this	
example,	both	get_next and	search_for_next will	
allocate	and	return	a	heap	structure,	which	is	expected	
to	be	freed	at	line	12.	However,	the	execution	may	
break out the loop at line 10, causing a memory leak.
7
Code	segments
[Bacchelli et al. (ICPC’10),
Tang et al. (KDD’05),
Bettenburg et al. (MSR’08),
Subramanian et al. (MSR’13),
Rigby et al. (ICSE’13)]
Natural	language	text	describing	
each	code	segment
This	Paper’s	Contributions
Automatically	identifying	&	mapping	text	describing	code	
segments	in	research	articles
A	prototype,	CoDesNPub Miner, that	outputs	XML-based	
representation	associating	code	segments	with	their	
descriptions	
Evaluation	of	the	effectiveness	of	code	description	
identification	techniques	
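The slides do not show the XML schema, so the following is only a minimal sketch of what an XML record associating a code segment with its description could look like. The element names (article, codeSegment, code, description, sentence) and the function build_record are hypothetical, not CoDesNPub Miner's actual format.

```python
import xml.etree.ElementTree as ET

def build_record(article_id, code_text, description_sentences):
    """Build one XML record linking a code segment to its description.

    All element and attribute names here are assumptions for illustration.
    """
    root = ET.Element("article", attrib={"id": article_id})
    segment = ET.SubElement(root, "codeSegment", attrib={"id": "1"})
    code = ET.SubElement(segment, "code")
    code.text = code_text
    description = ET.SubElement(segment, "description")
    for sentence in description_sentences:
        ET.SubElement(description, "sentence").text = sentence
    return root

record = build_record(
    "example-article",
    "free(rec);",
    ["The heap structure is expected to be freed at line 12."],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```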
8
• Seeds	
• Neighbors
Overview	of	CoDesNPub Miner
9
Identify	Seeds:	References	Figure	Containing	Code	
Fig.5	shows	a	typical	test	method	of	this	
pattern.	The	method	tests	a	set	of	basic	
functionality	of	API	class	BasicAuthCache,	
including	the	method	put,	get,	remove	and	
clear.	There	are	three	test	scenarios	in	the	
method:	line	4-5,	line	6-7,	line	8-10.	They	
share	two	data	objects,	cache	and	
authScheme.	Their	method	invocation	
sequences	are	not	same	and	there	is	no	
unified	test	target	method.	But	there	is	a	
common	subsequence	among	three	method	
invocation	sequences,	i.e.,	the	invocations	of	
get	and	HttpHost.	
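A sentence such as "Fig.5 shows a typical test method" can be flagged with a simple pattern match. The sketch below assumes the set of figures and listings that contain code is already known; the regular expression and the name references_code_figure are illustrative, not the tool's actual implementation.

```python
import re

# Assumed pattern: matches "Fig. 5", "Fig.5", "Figure 5", "Listing 9".
FIGURE_REF = re.compile(r"\b(Fig\.|Figure|Listing)\s*(\d+)", re.IGNORECASE)

def references_code_figure(sentence, code_figures):
    """Return True if the sentence cites a figure or listing known to hold code.

    code_figures is a set of (kind, number) pairs, e.g. {("fig", 5)}.
    """
    for kind, number in FIGURE_REF.findall(sentence):
        key = "listing" if kind.lower() == "listing" else "fig"
        if (key, int(number)) in code_figures:
            return True
    return False

print(references_code_figure(
    "Fig.5 shows a typical test method of this pattern.", {("fig", 5)}))
```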
10
*****Code	Segment	appears	here*****
Listing	9	shows	an	example	of	three	
statements	that	were	single	statement	blocks	
after	the	first	phases,	but	can	be	merged	into	
a	single	block	because	they	have	similar	RHSs.	
*****Code	Segment	appears	here*****
Identify	Seeds:	Located	Immediately	Before	or	
After	Inlined Code	
Fig.5	shows	a	typical	test	method	of	this	
pattern.	The	method	tests	a	set	of	basic	
functionality	of	API	class	BasicAuthCache,	
including	the	method	put,	get,	remove	and	
clear.	There	are	three	test	scenarios	in	the	
method:	line	4-5,	line	6-7,	line	8-10.	They	
share	two	data	objects,	cache	and	
authScheme.	Their	method	invocation	
sequences	are	not	same	and	there	is	no	
unified	test	target	method.	But	there	is	a	
common	subsequence	among	three	method	
invocation	sequences,	i.e.,	the	invocations	of	
get	and	HttpHost.	
11
*****Code	Segment	appears	here*****
A	major	obstacle	to	extracting	API	examples	
from	test	code	is	the	multiple	test	scenarios	in	
a	test	method.	Fig.	1	depicts	such	a	test	
method.	Lines	2-4	are	the	declaration	of	some	
data	objects.	Lines	5-13	depict	a	test	scenario	
that	contains	the	usage	of	some	API	methods,	
such	as	keySetByValue,	put,	and	getKey.	Lines	
14-22	depict	another	test	scenario,	which	
contains	a	similar	usage	to	the	previous	one.	
Such	multiple	test	scenarios	are	quite	
reasonable	when	aiming	at	covering	testing	
input	domains.	But	they	bring	redundant	code	
for	API	users	to	read.	In	fact,	there	are	actually	
200+	code	lines	containing	similar	test	
scenarios	in	the	test	method	in	Fig.1.	It	is	
necessary	to	separate	different	test	scenarios	
from	one	test	method	and	cluster	the	similar	
usages	to	remove	redundancy.
*****Code	Segment	appears	here*****
Identify	Seeds:	Contains	Code	Identifiers	
Fig.5	shows	a	typical	test	method	of	this	
pattern.	The	method	tests	a	set	of	basic	
functionality	of	API	class	BasicAuthCache,	
including	the	method	put,	get,	remove	and	
clear.	There	are	three	test	scenarios	in	the	
method:	line	4-5,	line	6-7,	line	8-10.	They	
share	two	data	objects,	cache	and	
authScheme.	Their	method	invocation	
sequences	are	not	same	and	there	is	no	
unified	test	target	method.	But	there	is	a	
common	subsequence	among	three	method	
invocation	sequences,	i.e.,	the	invocations	of	
get	and	HttpHost.	
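One way to check whether a sentence contains code identifiers is to intersect the identifier-like tokens of the code with the words of the sentence. This is a rough sketch: the keyword list, the function names, and the sample code line are assumptions for illustration (the code line is not taken from the cited paper), and short identifiers such as get will also match ordinary English words.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Assumed keyword list; a real tool would use the keywords of the code's language.
KEYWORDS = {"if", "else", "for", "while", "return", "new", "void", "int", "public", "class"}

def code_identifiers(code):
    """Collect identifier-like tokens from the code, minus language keywords."""
    return {tok for tok in IDENTIFIER.findall(code) if tok not in KEYWORDS}

def contains_code_identifiers(sentence, code):
    """Return the identifiers from the code that also appear as words in the sentence."""
    words = set(IDENTIFIER.findall(sentence))
    return code_identifiers(code) & words

# Hypothetical code line, loosely modeled on the names the excerpt mentions.
code = "BasicAuthCache cache = new BasicAuthCache(); cache.put(host, authScheme);"
sentence = "They share two data objects, cache and authScheme."
print(sorted(contains_code_identifiers(sentence, code)))  # ['authScheme', 'cache']
```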
12
*****Code	Segment	appears	here*****
Identify	Seeds:	References	Code	By	Position	
13
Fig.5	shows	a	typical	test	method	of	this	
pattern.	The	method	tests	a	set	of	basic	
functionality	of	API	class	BasicAuthCache,	
including	the	method	put,	get,	remove	
and	clear.	There	are	three	test	scenarios	in	
the	method:	line	4-5,	line	6-7,	line	8-10.	
They	share	two	data	objects,	cache	and	
authScheme.	Their	method	invocation	
sequences	are	not	same	and	there	is	no	
unified	test	target	method.	But	there	is	a	
common	subsequence	among	three	
method	invocation	sequences,	i.e.,	the	
invocations	of	get	and	HttpHost.	
This code snippet obtains a user name (userName)
by invoking request.getParameter(“name”) and uses
it to construct a query to be passed to a database for
execution (con.execute(query)). This seemingly
innocent piece of code may allow an attacker to gain
access to unauthorized information: if an attacker
has full control of string userName obtained from an
HTTP request, he can for example set it to 'OR 1 = 1;--.
Two dashes are used to indicate comments in the
Oracle dialect of SQL, so the WHERE clause of the
query effectively becomes the tautology name = ' '
OR 1 = 1. This allows the attacker to circumvent the
name check and get access to all user records in the
database.
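References to code by position, such as "line 4-5" or "line 10", can be picked out with a pattern match. The sketch below is one possible approach; the regular expression and the name referenced_lines are assumptions, not the tool's actual rule.

```python
import re

# Assumed pattern: "line 12", "lines 5-13", "line 4-5" (hyphen or en dash).
LINE_REF = re.compile(r"\blines?\s+(\d+)(?:\s*[-\u2013]\s*(\d+))?", re.IGNORECASE)

def referenced_lines(sentence):
    """Return the (start, end) line ranges mentioned in the sentence."""
    ranges = []
    for start, end in LINE_REF.findall(sentence):
        ranges.append((int(start), int(end) if end else int(start)))
    return ranges

print(referenced_lines(
    "There are three test scenarios in the method: line 4-5, line 6-7, line 8-10."))
# [(4, 5), (6, 7), (8, 10)]
```

A sentence counts as referencing code by position under this sketch when the returned list is non-empty.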
Fig.5	shows	a	typical	test	method	of	this	pattern.	
The	method	tests	a	set	of	basic	functionality	of	API	
class	BasicAuthCache,	including	the	method	put,	
get,	remove	and	clear.	There	are	three	test	
scenarios	in	the	method:	line	4-5,	line	6-7,	line	8-
10.	They	share	two	data	objects,	cache	and	
authScheme.	Their	method	invocation	sequences	
are	not	same	and	there	is	no	unified	test	target	
method.	But	there	is	a	common	subsequence	
among	three	method	invocation	sequences,	i.e.,	
the	invocations	of	get	and	HttpHost.	
ReferencesCodeFigure Score=3
ContainsCodeIdentifiers Score=2
ContainsCodeIdentifiers Score=2
ReferencesCodeByPosition Score=2
ContainsCodeIdentifiers,							Score=2
ReferencesCodeByPosition Score=2	
TextBefore Score=1
14
Identify	Seeds:	Putting	It	All	Together	
*****Code	Segment	appears	here*****
• Scoring	sentences
• Equal
• Accuracy-based
• Threshold	Analysis
Heuristic                     Score
References Code Figure        3
Contains Code Identifiers     2
References Code By Position   2
Text Before, Text After       1
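The per-heuristic scores can be combined into a seed decision along the following lines. The score values come from the slide; how scores combine when a sentence matches several heuristics (here, the maximum) and the threshold value are assumptions, since the slide only names "Threshold Analysis" without giving the rule.

```python
# Scores taken from the slide's table.
HEURISTIC_SCORES = {
    "ReferencesCodeFigure": 3,
    "ContainsCodeIdentifiers": 2,
    "ReferencesCodeByPosition": 2,
    "TextBefore": 1,
    "TextAfter": 1,
}

def sentence_score(matched_heuristics):
    """Assumed combination rule: a sentence gets its highest matching score."""
    return max((HEURISTIC_SCORES[h] for h in matched_heuristics), default=0)

def is_seed(matched_heuristics, threshold=1):
    """Assumed decision rule: a sentence is a seed if its score reaches the threshold."""
    return sentence_score(matched_heuristics) >= threshold

print(sentence_score(["ContainsCodeIdentifiers", "ReferencesCodeByPosition"]))  # 2
print(is_seed(["TextBefore"], threshold=2))  # False
```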
Identifying	Neighboring	Code-related	Text	
• Heuristic	1:	At	least	1	sentence	is	a	seed
• Heuristic 2: At least 25%, 50%, or 75% of the sentences in the
paragraph are seeds (three threshold settings)
Fig.5	shows	a	typical	test	method	of	this	pattern.	The	method	tests	a	set	of	basic	functionality	
of	API	class	BasicAuthCache,	including	the	method	put,	get,	remove	and	clear.	There	are	three	
test	scenarios	in	the	method:	line	4-5,	line	6-7,	line	8-10.	They	share	two	data	objects,	cache	
and	authScheme.	Their	method	invocation	sequences	are	not	same	and	there	is	no	unified	test	
target	method.	But	there	is	a	common	subsequence	among	three	method	invocation	
sequences,	i.e.,	the	invocations	of	get	and	HttpHost.	
5 out of 6 sentences (83%) are seeds, which meets the 75% threshold
15
Heuristic	1
Heuristic	2
whole	paragraph	is	a	description
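The two paragraph-level heuristics can be sketched as a single check over per-sentence seed flags. The function name and the convention of passing None to select Heuristic 1 are illustrative choices, not part of the original tool.

```python
def paragraph_is_description(seed_flags, min_fraction=None):
    """Decide whether a whole paragraph counts as a code description.

    seed_flags: one boolean per sentence, True if that sentence is a seed.
    min_fraction=None applies Heuristic 1 (at least one seed);
    otherwise Heuristic 2 with the given fraction (0.25, 0.50, or 0.75).
    """
    if not seed_flags:
        return False
    seeds = sum(seed_flags)
    if min_fraction is None:
        return seeds >= 1
    return seeds / len(seed_flags) >= min_fraction

flags = [True, True, True, True, True, False]  # 5 of 6 sentences are seeds
print(paragraph_is_description(flags))        # Heuristic 1
print(paragraph_is_description(flags, 0.75))  # Heuristic 2 at the 75% setting
```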
Evaluation	Methodology
• Research	Question:	How	effective	is	our	approach	to	automatically	
identify	code	descriptions	in	natural	language	text	of	research	
articles?	
• Subjects:	100	code	segments	from	ACM	DL	and	IEEE	Xplore journal	
and	conference	software	engineering	papers
• Gold	Set:
• 10	Human	annotators	(non-authors)
• Measures:	
• Overall	code	description	identification:	Precision	and	recall
• Seed	identification:	Precision
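Precision and recall against the gold set can be computed as below. This is a generic sketch that treats the tool's output and the annotators' gold set as sets of sentence identifiers; the unit of comparison used in the actual evaluation is not stated on the slide, so that choice is an assumption.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall over sets of sentence identifiers."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Made-up identifiers: the tool marks 4 sentences, annotators marked 3, and 2 overlap.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 5})
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```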
16
Evaluation	Results
Minimum # of Seeds   Precision (%)   Recall (%)
1-24%                39.05           70.20
>= 25%               53.41           50.33
>= 50%               66.04           28.45
>= 75%               68.30           20.53
Overall	system	effectiveness
17
18
High	Clarity
Low	Clarity
Kinds	of	information	in	the	code	descriptions
What	cues	are	most	prominent?
19
Main	Threats	to	Validity
• Unable to distinguish between pseudocode and code fragments
• Evaluated on papers with no pseudocode; plan to extend the approach to identify both.
• Evaluation relies on human judges
• Human judges had experience in programming and research paper reading.
• Each code segment was judged by at least two judges.
• Scaling to a more extensive evaluation set might lead to different results
• Plan to expand the evaluation with more participants and with research papers
containing more code segments.
20
Related	Work
• Analyzing	Collections	of	Research	Articles:
Cruzes	et	al.	(ESEM’07),	Siegmund et	al.	(ICSE’15)
• Code	Segment	Extraction:	
Bacchelli et	al.	(ICPC’10),	Tang	et	al.	(KDD’05),	Bettenburg et	al.	
(MSR’08),	Subramanian	et	al.	(MSR’13),	Rigby	et	al.	(ICSE’13)
• Code	Description	Identification:
Bug	Reports:	Panichella et	al.	(ICPC’12),
Q&A:	Vassallo et	al.	(ICPC’14),	Wong	et	al.	(ASE’13),	
Rahman	et	al.	(SCAM’15)
21
Summary
• Automatically	identify	&	map	text	describing	code	
segments	in	research	articles
• CoDesNPub Miner outputting	XML-based	representation	
associating	code	segments	with	their	descriptions
• Evaluation	of	the	effectiveness	of	code	description	
identification	techniques
• Precision	=	68%
• Recall	=	21%
22
Future	Work
Improve	recall	and	precision
Fully	automate	preprocessing
Expand	the	experiments
