A Primer on High-Quality Identifier Naming

A Primer on High-Quality
Identiﬁer Naming
Anthony Peruma
Ph.D. Candidate - Rochester Institute of Technology, USA
Incoming Assistant Professor - University of Hawai‘i at Mānoa, USA
International Conference on Software and Systems Reuse (ICSR), 15-17 June 2022

About Anthony…
Experience/Qualiﬁcations
Assistant Professor - University of Hawai‘i at
Mānoa, USA (Starting August 2022)
Ph.D. Candidate - Rochester Institute of
Technology, USA (Expected June 2022)
Masters in Software Engineering - Rochester
Institute of Technology, USA
10+ years of industry experience
Research Interests
Program Comprehension - Identiﬁer Naming
Software Quality - Test Smells
Software Refactoring
Software Maintenance & Evolution
Empirical Software Engineering
https://www.peruma.me
https://twitter.com/ShehanPeruma

Agenda
● Introduction
○ What are identifiers and why are their names important?
● Linguistic Anti-Patterns
○ Introduction to the types of identifier naming violations
● Grammar Patterns
○ Common semantic structures for identifier names
● Tools
○ Some tools that can help developers and researchers with identifier naming
● Conclusion
○ Summary and additional resources

Software Maintenance
● Consumes 60% - 80% of
organization resources [1,2]
● Poor maintenance → Low
quality software
● Includes:
○ Fixing bugs
○ Incorporating new or
updating features
○ Improving the internal
quality of the system
[1]
Lientz, B. P., Swanson, E. B., & Tompkins, G. E. (1978). Characteristics of application software maintenance. Communications
of the ACM, 21(6), 466-471.
[2]
R.S. Pressman. Software engineering: a practitioner's approach. McGraw-Hill higher education. McGraw-Hill Education, 2010.

Program Comprehension
● Developers need to understand
the code before applying
changes or debugging
● 58% of developers time is spent
on comprehension activities [1]
● Poor code readability impacts
time and quality
● Application growth →
○ More classes/files →
■ More lines of code
[1]
Xia, X., Bao, L., Lo, D., Xing, Z., Hassan, A. E., & Li, S. (2017). Measuring program comprehension: A large-scale field study
with professionals. IEEE Transactions on Software Engineering, 44(10), 951-976.

Identiﬁers
● Lexical tokens that uniquely identify
elements in the source code
○ Classes, Methods, etc.
● Everywhere in source code – significant
part in code comprehension
○ Account for 70% of characters in the
code base [1]
● Must be read to understand behaviour and
before any other coding activity
● Automated techniques user identifier data
[1]
Deissenboeck, F., & Pizka, M. (2006). Concise and consistent naming. Software Quality Journal, 14(3), 261-282.

Lexical tokens that uniquely identify entities

What is a good name?
Crafting names can be challenging – probability of two developer picking the same name is 7% [1]
A strong/high-quality name must reflect its intended behavior
Name should concisely summarize the role of its correlating entity
Good names are important – Hence, many organizations emphasis the use of best practices and
coding standards by development teams
High quality identifiers improves comprehension time by 19+% [2]
Low quality names can lead to bugs and poor code quality (i.e., more-complex, less-readable and
less-maintainable) [3]
[1]
Feitelson, D., Mizrahi, A., Noy, N., Shabat, A. B., Eliyahu, O., & Sheffer, R. (2020). How developers choose names. IEEE Transactions on Software Engineering.
[2]
Hofmeister, J., Siegmund, J., & Holt, D. V. (2017, February). Shorter identifier names take longer to comprehend. In 2017 IEEE 24th International conference on software analysis, evolution and reengineering (SANER) (pp. 217-227). IEEE.
[3]
Butler, S., Wermelinger, M., Yu, Y., & Sharp, H. (2010, March). Exploring the influence of identifier names on code quality: An empirical study. In 2010 14th European Conference on Software Maintenance and Reengineering (pp. 156-165). IEEE.

Some poor-quality names are easy to spot…

… others are not so straightforward!

Challenges with determining the quality of names
● A readable name does not always mean its a high quality name
○ context is important!
● Words are diverse and subjective; for example, a single word can have multiple
meanings (homonyms)
○ Example: Bank can mean rivier bank or financial institution
● In English prose, the context is provided in natural language, this is not the case
with identifier names – context is part of the behaviour of the code
○ Example: the method: “doForward()” can either refer to an HTTP redirect operation
or to move an image on screen
● Challenge: understanding how to map the meaning of natural language phrases to the
behavior of the code

A Primer on High-Quality Identifier Naming

Challenges with renames
A “rename chain” - multiple instances
of developers renaming an identifier

Challenges with renaming models
● Models only provide name recommendations
● They do not provide details as to why the
proposed name is a good replacement
● Does not indicate what parts of the code are
influencing the model’s recommendation
● Developer will continually make the same
naming mistake

Challenges with renaming models
Source code can differ between different
environments; a model built and evaluated in one
environment will perform badly in another

Smells
● Smells are specific structures in the code that deviate from fundamental programming
practices
● Smells make code harder to understand and make it more prone to bugs and changes
● Smells are a surface indication that usually corresponds to a deeper problem in the
software system
● Types of smells:
○ Code Smells (e.g., Long Method, Large Class, Dead Code, etc.)
○ Test Smells (e.g., Assertion Roulette, Eager Test, Lazy Test, etc.)
○ Database Smells (e.g., Multi-purpose column, Tables with many columns, etc.)
○ Linguistic Smells
○ ….

Linguistic Anti-Patterns
Represent deviations from well-established lexical naming practices in source code
Act as indicators of poor naming quality
Typically take the form of an identifier name that incorrectly describes the behavior of the entity
that it represents OR an entity that betrays the behavior conveyed linguistically by its
corresponding identifier
Leads to code misinterpretation by developers increasing cognitive load [1]
First conceptualized by Arnaoudova et al. [2]
Catalog of 15+ anti-patterns
[1]
Fakhoury, S., Ma, Y., Arnaoudova, V., & Adesope, O. (2018, May). The effect of poor source code lexicon and readability on developers' cognitive load. In 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC) (pp. 286-28610). IEEE.
[2]
Arnaoudova, V., Di Penta, M., Antoniol, G., & Guéhéneuc, Y. G. (2013, March). A new family of software anti-patterns: Linguistic anti-patterns. In 2013 17th European Conference on Software Maintenance and Reengineering (pp. 187-196). IEEE.

Categories of linguistic anti-patterns
Methods:
1. Do more than they say
2. Say more than they do
3. Do the opposite than they say
4. The entity contains more than what it says
Attributes:
5. The name says more than the entity contains
6. The name says the opposite than the entity contains

Catalog of linguistic anti-patterns
“Get” more than accessor
“Is” returns more than a Boolean
“Set” method returns
Expecting but not getting single instance
Not implemented condition
Validation method does not conﬁrm
“Get” method does not return
Not answered question
Transform method does not return
Expecting but not getting a collection
Method name and return type are opposite
Method signature and comment are opposite
Says one but contains many
Name suggests boolean but type is not
Says many but contains one
Attribute name and type are opposite
Attribute signature and comment are opposite

Get more than accessor
A getter that performs actions other than returning the corresponding attribute
Example: method getImageData which always returns a new object
How to resolve:
1. The method name should change so that it is not a getter or
2. the implementation should be corrected to conform to standard get-method behavior

Is returns more than a Boolean
The name of a method is a predicate suggesting a true/false value in return. However the return type is not Boolean but rather a more
complex type thus allowing a wider range of values without documenting them
Example: method isValid with return type int
How to resolve:
1. The type should be changed to boolean to reflect the function's behavior as a binary predicate.
2. Consider changing the name such that it does not imply a yes/no question and provides some indication of n-ary return values.
3. Carefully document the meaning of each value that can be returned. Thoroughly test each value.

Set method returns
A set method having a return type different than void without proper documentation of the return type/values
Example: method setBreadth has a non-void return type
How to resolve:
1. The word set, when used in this manner, has a specific definition in the programming domain. Consider using a different term, such as
change.
2. Correct the implementation such that it works like a stereotypical set method (i.e., void return, mutates a class attribute)
3. Carefully document the reasoning behind using set while also returning a value

Expecting but not getting single instance
The name of a method indicates that a single object is returned but the return type is a collection
Example: method getExpansion, which ends with a head-noun that is singular, but returns a List object
How to resolve:
1. Correct the method name so that it is plural-- getExpansions()

Not implemented condition
The comments of a method suggest a conditional behavior that is not implemented in the code. When the implementation is default
this should be documented.
Example: method getChildren has a comment which indicates there should be a conditional within its body.
How to resolve:
1. Complete implementation of the method
2. Document (i.e., update the comment) that the method is incomplete and does not implement the behavior indicated in its comment

Validation method does not conﬁrm
A validation method (e.g., name starting with "validate", "check", "ensure") does not confirm the validation, i.e., the method neither
provides a return value informing whether the validation was successful, nor documents how to proceed to understand
Example: method checkCollision returns void despite indicating that it is designed to perform validation
How to resolve:
1. Change method to return confirmation (i.e., true or false)
2. Consider changing the name to avoid implication of validation behavior (i.e., avoid terms like check and is)
3. If the previous options are not available then thoroughly document method behavior, consider highlighting irregular validation behavior

Get method does not return
The name suggests that the method returns something (e.g., name starts with "get" or "return") but the return type is void. The
documentation should explain where the resulting data is stored and how to obtain it
Example: method getMethodBodies has a void return type but its name indicates that it is a getter method
How to resolve:
1. Change method to return correct entity
2. Consider changing the name to avoid the word get
3. If the previous options are not available then thoroughly document method behavior, consider highlighting irregular getter behavior

Not answered question
The name of a method is in the form of predicate whereas the return type is not Boolean
Example: method isValid with a void return type
How to resolve:
2. Consider changing the name to avoid the word get
3. If the previous options are not available then thoroughly document method behavior, consider highlighting irregular getter behavior

Transform method does not return
The name of a method suggests the transformation of an object but there is no return value and it is not clear from the documentation
where the result is stored.
Example: method javaToNative has a void return type but indicates that it performs a transformation (i.e., type conversion).
How to resolve:
2. If the previous option is not available then thoroughly document method behavior, consider highlighting irregular transformation
behavior

Expecting but not getting a collection
The name of a method suggests that a collection should be returned but a single object or nothing is returned
Example: method getStats with a Boolean return type; making it difficult to understand the reason behind the plurality of the method name.
How to resolve:
1. Change the name of the method (and any related identifier names) so that it is singular instead of plural

Method name and return type are opposite
The intent of the method suggested by its name is in contradiction with what it returns
Example: method disable with return type ControlEnableState. The words "disable" and "enable" having opposite meanings.
How to resolve:
1. Change method name so that it aligns better with the return type (i.e., change disable to enable)
2. Change type name to align better with method name (i.e., to ControlDisableState)

Method signature and comment are opposite
The documentation of a method is in contradiction with its declaration
Example: method isNavigateForwardEnabled is in contradiction with its comment documenting "a back navigation", as "forward" and "back" are
antonyms
How to resolve:
1. Change the comment to specify that this method is for forward navigation

Says one but contains many
The name of an attribute suggests a single instance, while its type suggests that the attribute stores a collection of objects
Example: attribute _target that is of type Vector. It is unclear whether a change aspects one or multiple instances in the collection.
How to resolve:
1. Change the identifier name to reflect plurality of its type (i.e., _target -> _targets)

Name suggests boolean but type is not
The name of an attribute suggests that its value is true or false, but its declaring type is not Boolean
Example: attribute isReached that is of type int[] where the declared type and values are not documented.
How to resolve:
1. Change the name of the identifier to be more descriptive with respect to what kind of array it represents.
2. Consider removing the word is and using a different term unless the array represents a sequence of appropriate (i.e., boolean-like)
values
3. If appropriate, consider using a boolean array
4. Carefully document the data represented by the array, including the reasoning for its integer type and whether different integer values
have different meanings

Says many but contains one
The name of an attribute suggests multiple instances, but its type suggests a single one
Example: attribute _stats that is of type Boolean. Documenting such inconsistencies avoids additional comprehension effort to understand the
purpose of the attribute.
How to resolve:
1. Change identifier name to singular instead of plural

Attribute name and type are opposite
The name of an attribute is in contradiction with its type as they contain antonyms
Example: attribute start that is of type MAssociationEnd. The use of antonyms can induce wrong assumptions.
How to resolve:
1. Change identifier name to align with type name (i.e., change start to end)

Attribute signature and comment are opposite
The declaration of an attribute is in contradiction with its documentation
Example: attribute INCLUDE_NAME_DEFAULT whose comment documents an "exclude pattern". Whether the pattern is included or excluded is
thus unclear
How to resolve:
1. Change identifier name to align with comment (i.e., include -> exclude)
2. Change comment to align with method name (i.e., exclude -> include)

Challenges with determining the quality of names
One challenge to studying identiﬁers is the difﬁculty in understanding how to map the meaning of natural
language phrases to the behavior of the code.
A second challenge lies in the natural language analysis techniques themselves, many of which are not
trained to be applied to software

Part-of-Speech
Part-of-speech is a category to which a word is assigned in accordance with its syntactic
functions
In English, the main parts of speech are noun, pronoun, adjective, determiner, verb,
adverb, preposition, conjunction, and interjection
Can help us reason about the meaning of words and code behavior

Grammar Patterns
Identifier Phase Structure != Human Language Phrase Structure
A grammar pattern is the sequence of part-of-speech tags assigned to individual words
within an identifier
Grammar patterns allow a more efficient analysis by broadly categorizing words into their
corresponding part-of-speech

Common identiﬁer naming patterns

Common identifier naming patterns
Noun Phrase - common naming pattern for identifiers that are not function names; A good identifier will include only
enough noun-modifiers to concisely define the concept represented by the head-noun
Plural noun phrase - Identifiers that follow this pattern are usually not function names; these identifiers are more likely to
have a collection data type
Verb Phrase - typically either function identifiers or identifiers with a boolean type; for non-boolean the verb is an action,
otherwise its a predicate
Prepositional Phrase - used in many types of identifiers; The preposition typically explains how the entity represented by
the accompanying noun or verb-phrase are related
Noun phrase with leading determiner - used in many types of identifiers; determiner tells us how much of the population,
which is specified by the noun-phrase, is represented, or acted on, by the identifier
Verb Pattern - typically function names or identifiers with a boolean type; missing a noun phrase; the noun phrase is
implied by the program context or it is present in the function parameters.

Tools to analyze/transform identiﬁers
Naming Violation Detection
● Detects 19 types of linguistic anti-patterns
● Provides an explanation of the violation
● Analyzes C# & Java source code
● Supports project-speciﬁc customizations
● Average precision: 75.27%
● Open-source
● https://github.com/SCANL/ProjectSunshine/blob/
master/documentaion/IDEAL/SetupAndUse.md
Ensemble Part-of-Speech Tagger
● Tagger uses machine-learning and the
output from multiple part-of-speech
taggers to annotate natural language text
● The ensemble uses three state-of-the-art
part-of-speech taggers: SWUM, POSSE,
and Stanford
● Accuracy of 86%; Outperforms Stanford by
51%
● Open-source
Peruma, A., Arnaoudova, V., & Newman, C. D. (2021, September). Ideal: An
open-source identifier name appraisal tool. In 2021 IEEE International Conference
on Software Maintenance and Evolution (ICSME) (pp. 599-603). IEEE.
Newman, C. D., Decker, M. J., Alsuhaibani, R., Peruma, A., Mkaouer, M., Mohapatra, S.,
... & Hill, E. (2021). An ensemble approach for annotating source code identifiers with
part-of-speech tags. IEEE Transactions on Software Engineering.

Tools to analyze/transform identiﬁers
These are just some of the identifier related tools that are available for the developer and research community
Rename recommendation models
G. Li et al., “A Survey on Renamings of Software Entities”, in ACM Comput. Surv.
Code readability models
S. Scalabrino et al., “A comprehensive model for code readability.” in Journal of Software: Evolution and Process
LAPD: linguistic anti-pattern detector
V. Arnaoudova, et al., "Linguistic antipatterns: What they are and how developers perceive them," in Empirical Software Engineering.
Spiral: splitters for identifiers in source code files
M. Hucka, “Spiral: splitters for identifiers in source code files,” in Journal of Open Source Software.
Nominal: Java library to test compliance of identifier names with naming conventions
S. Butler et al., "Investigating naming convention adherence in Java references," 2015 IEEE International Conference on Software Maintenance and Evolution.

Demo
● The Ensemble Tagger
● IDEAL

Summary
● Naming identifiers is one of the most challenging tasks for developers
● A high-quality name should reflect its intended behavior
● Names are diverse and subjective – this makes it challenging to
automatically determine their quality
● Linguistic anti-patterns – deviations from lexical naming practices
● Grammar patterns – allow a more efficient analysis of names
● Availability of tools to assist developers with crafting and maintaining
names, but they are not a complete one-stop solution

Additional sources
● Identifier Naming Structure Catalogue
○ https://github.com/SCANL/identifier_name_structure_catalogue

Thanks!
Anthony Peruma
https://www.peruma.me
https://www.scanl.org

A Primer on High-Quality Identifier Naming

More Related Content

Similar to A Primer on High-Quality Identifier Naming (20)

More from University of Hawai‘i at Mānoa (20)

Recently uploaded (20)

A Primer on High-Quality Identifier Naming