A Case Study of a Reusable Component Collection
William B. Frakes
Computer Science Department
Virginia Tech, Falls Church
wfrakes@vt.edu
Abstract
This paper reports on practical issues in the
development, distribution, use, and evolution of a
reusable component collection in the domain of
information retrieval.
1. Introduction
Software reuse is the use of existing software
knowledge or artifacts to build new software. There are
many types of software reuse [9]. The reuse described in
this paper is ad hoc, black box, compositional code
reuse. Ad hoc means that the reuse is not part of a
repeatable, mandated organizational process. Ad hoc reuse
is far more common than systematic reuse, though the
latter is thought to be more powerful. Black box reuse is
reuse of a software item without modification.
Compositional reuse means that the software system was
built by a human programmer out of components, as
opposed to generating a system automatically from
specifications. The reuse described in this paper is
primarily vertical rather than horizontal since it is focused
in the domain of information retrieval, though some of
the components such as string searching might also be
considered horizontal.
One source of reusable software is the code that is
developed to accompany books. This paper concerns code
from a book on data structures and algorithms for
information retrieval (IR) systems [6]. Information
retrieval systems retrieve textual documents from
a database in response to queries submitted to
the system by users.
IR systems can be defined more formally using set and
function notation as follows.
D = set of textual documents
D’ = subset of D
Q = set of queries
M = matching function
Systems in the domain of information retrieval can
now be specified as follows.
{S : S computes D’ = M(D, Q)}
That is, all systems S such that S returns a subset of
documents D’ of D that match the set of queries Q are IR
systems.
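The definition can be made concrete with a minimal C sketch. It assumes a deliberately naive matching function M, in which a document matches when it contains any one of the query strings as a substring. The names `matches` and `ir_retrieve`, and the substring criterion itself, are illustrative assumptions and are not taken from the book's code.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical matching criterion: a document matches a query when the
 * query text occurs in it as a substring. Real IR systems use far richer
 * matching; this stands in only to show the shape D' = M(D, Q).
 */
static int matches(const char *doc, const char *query)
{
    return strstr(doc, query) != NULL;
}

/*
 * Compute D' = M(D, Q). The indices of the documents in docs[0..ndocs-1]
 * that match at least one query in queries[0..nqueries-1] are written to
 * result, which must have room for ndocs entries. Returns the size of D'.
 */
size_t ir_retrieve(const char *const docs[], size_t ndocs,
                   const char *const queries[], size_t nqueries,
                   size_t result[])
{
    size_t found = 0;

    for (size_t i = 0; i < ndocs; i++) {
        for (size_t j = 0; j < nqueries; j++) {
            if (matches(docs[i], queries[j])) {
                result[found++] = i;
                break;              /* record each document at most once */
            }
        }
    }
    return found;
}
```

Any system that exposes this behavior, whatever its internal matching function, falls under the definition above.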
One of the goals for the book was development of
reusable IR code. Authors were asked to develop software
components for their chapters in C following industrial
coding guidelines. This was partly successful, and with
some rework, the following components were developed
and tested:
• Lexical Analysis and Stop List operations - this code
breaks text into words and removes words considered
unimportant for indexing.
• Stemmer Code - implements the Porter stemming
algorithm. A stemmer conflates words by finding a
common root form of the words.
• Thesaurus Construction - supports the automatic
construction of thesauri from source text.
• Boolean Operations - implements standard Boolean
operations (AND, OR, NOT) on sets of documents.
• Hashing Algorithms - including an algorithm for
minimal perfect hashing.
• String searching - implementations for basic
algorithms for finding patterns in text strings.
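To give a feel for the first component in the list, the following is a minimal sketch of lexical analysis combined with a stop list. It is not the book's code: the function names, the six-word stop list, and the `MAX_WORD` limit are all assumptions made for illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORD 32   /* assumed limit; longer words are truncated */

/* Hypothetical, deliberately tiny stop list; a real one is much longer. */
static const char *const stop_words[] = { "a", "an", "and", "of", "the", "to" };

static int is_stop_word(const char *word)
{
    size_t n = sizeof stop_words / sizeof stop_words[0];

    for (size_t i = 0; i < n; i++)
        if (strcmp(word, stop_words[i]) == 0)
            return 1;
    return 0;
}

/*
 * Break text into lowercase alphabetic words, skip stop words, and copy up
 * to max_terms of the remaining words into terms. Returns the number of
 * terms stored.
 */
size_t index_terms(const char *text, char terms[][MAX_WORD], size_t max_terms)
{
    size_t count = 0;
    const char *p = text;

    while (*p != '\0' && count < max_terms) {
        char word[MAX_WORD];
        size_t len = 0;

        while (*p != '\0' && !isalpha((unsigned char)*p))
            p++;                                    /* skip separators */
        while (isalpha((unsigned char)*p)) {
            if (len < MAX_WORD - 1)
                word[len++] = (char)tolower((unsigned char)*p);
            p++;
        }
        word[len] = '\0';

        if (len > 0 && !is_stop_word(word))
            strcpy(terms[count++], word);
    }
    return count;
}
```

A linear scan of the stop list is enough for a sketch. A production component would more likely use a hashed lookup, which is one place where the hashing component in the list could be reused.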
This paper is about practical issues encountered in the
creation, distribution, and use of the components. These
issues are not particular to the domain of information
retrieval, nor particular to C functions. They may well
arise in any domain and for any type of reusable asset.
0-7695-0559-7/00 $10.00 © 2000 IEEE
2. What is a component?
What is a component? The term is ambiguous. A
component can be any lifecycle object or part thereof.
Usually a code component is a subroutine (function or
subprogram), or an object or class, but it could also be
many other things like macros, header files, subsystems,
processes, or patterns. This paper discusses collections of
C functions. This simplifies things a little since this is a
kind of reuse familiar to many. Even this kind of reuse,
however, can still be complicated.
The 3 C’s model of reuse design [12] says that there
are three aspects of a reusable component--the concept, the
content, and the context. The concept corresponds to the
abstract functionality of a component such as might be
specified in an abstract data type or a formal algorithm
specification. The purpose of such abstractions is to focus
on the essence of the component, whatever that might be,
and ignore other details, usually implementation details.
The chapters in the book provide the specifications for the
concepts of the components.
The content is the implementation of the component.
This involves selection of a programming language and a
design. The implementations of the components in C are
the content. The transition from concept to content
involves moving from the problem, or domain, space to
the solution space. The problem space is only concerned
with the concepts and operations of the domain in
question--in this case information retrieval. The solution
space involves the concepts and operations of the
implementation environment--in this case the C language.
The context is the environment needed to use the
components. Context for code components might be the
required machine, operating systems, compiler version,
and so on. The code for the IR components was
developed for and tested on a Unix system and certain
assumptions were made regarding implementation.
Porting the code to DOS, for example, required changes
to make filenames have the required length of no more
than eight characters.
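Context assumptions of this kind can be checked mechanically. The sketch below assumes that the eight-character limit applies to the base part of a file name, as in the DOS 8.3 naming convention. The function name `fits_dos_base_name` is hypothetical, and the check ignores the extension length and the set of legal characters.

```c
#include <stddef.h>
#include <string.h>

#define DOS_BASE_MAX 8   /* assumed limit on the part of the name before the dot */

/*
 * Return 1 if the base part of filename (the text before the last '.', or
 * the whole name if there is no '.') is non-empty and at most DOS_BASE_MAX
 * characters long; return 0 otherwise.
 */
int fits_dos_base_name(const char *filename)
{
    const char *dot = strrchr(filename, '.');
    size_t base_len = dot ? (size_t)(dot - filename) : strlen(filename);

    return base_len > 0 && base_len <= DOS_BASE_MAX;
}
```

Running such a check over a distribution before porting would flag names such as a hypothetical `thesaurus.c`, whose base part has nine characters.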
3. Language
Software reuse is now generally regarded to be a good
thing, and most modern languages make some claim for
their support of reuse. The C language, for example, was
designed for extensive reuse in the sense that it is a small
language extensively augmented by reusable function
libraries. Newer languages like C++ provide reuse of
higher level programming constructs such as objects,
classes, and templates, and directly support type
polymorphism via function overloading. A summary of
the reuse aspects of C++, for example, can be found in
[14].
I selected C as the component implementation
language because C was and is a widely known and used
language in both industry and academia. It is also the
programming language I know best, and the one I’ve used
to develop industrial software. There are also many good
free software engineering support tools for C, including
free compilers. Was C the best choice? This of course
opens the door to language lawyering. Let me just say
that the components got developed.
Some of the components have been rewritten in other
languages, sometimes with attribution of the source,
sometimes not. Versions of the stemmer, for example,
have appeared on the web in Perl and Java.
4. Source or Binary?
The argument is sometimes made that only the
executable code for reusable components should be
distributed, not the source code. The reasoning here is
that distributing source code means that it will be
modified, which will break the design abstraction, thus
losing much of the reuse benefit. Executable distribution
could be done in C by making and distributing archive
files containing object code for the functions. This
assumes that all of the users will have an environment
where the archives can be used.
Distributing only executable code may be a good idea
if the user of the components can be assured that someone
is available to fix problems and make enhancements as
needed. With software such as the IR components there
was no readily available maintenance organization, so we
distributed the source.
5. Testing and Optimization
The quality assurance of software is important to its
reuse. Code that does not meet the software quality
standards of an organization will not be reused by the
organization. Inside an organization, thorough testing and
optimization of components can be justified since the
higher costs for these activities can be amortized across
the multiple reuses.
Before release, the code was inspected for conformance
to programming standards, such as the use of standard
headers on code modules, run through lint, and tested
to 90% branch coverage. Code portability was checked
by moving the code to another environment.
A rule of thumb sometimes used by designers of
reusable components is that if a reusable component is
more than 25% slower than an equivalent single-use
component, it will not be reused. Optimization of code
components can, therefore, be important. Optimizations
must be done carefully, since increased optimization often
decreases code readability and maintainability. Bentley
provides a good summary of proper techniques [3]. For the
IR components, however, no systematic optimization was
done, nor have there been any requests for it from users.
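The 25% rule of thumb mentioned above reduces to a one-line comparison. The sketch below is only an illustration of that arithmetic. The function name and the choice of integer time units are assumptions, and the timings themselves would have to come from measurements that this paper does not report.

```c
/*
 * Return 1 if the reusable component's run time exceeds the single-use
 * component's run time by more than max_percent percent, else 0.
 * Both times must use the same unit. The comparison
 *     reusable > single_use * (1 + max_percent / 100)
 * is kept in integer arithmetic, so very large inputs could overflow.
 */
int too_slow_for_reuse(unsigned long reusable_time,
                       unsigned long single_use_time,
                       unsigned long max_percent)
{
    return reusable_time * 100UL > single_use_time * (100UL + max_percent);
}
```

With a threshold of 25, a reusable component that takes 125 time units against 100 for the single-use version is still acceptable, while one that takes 126 is not.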
6. Delivery Methods
A key question with a component collection is how to
make the components available. The first plan for the IR
components was for a disk to be included with the book,
but for various logistical reasons that didn’t work. So,
plan two was to make the code available via ftp. I
originally put the code for each chapter in a separate
directory at a Virginia Tech ftp site (ftp.vt.edu). I then
started getting requests to put the code in a single file to
make downloading easier. I did that by creating a compressed
tar file and putting that on the web site. Then I started
getting email from people outside the U.S. saying that
they couldn’t get into the ftp site. I referred them to the
ftp site technician. I think that people usually got the
code they wanted, but the problem persisted. I decided to
put the components into software repositories as well.
In the 1990s the U.S. government started supporting
research and development of reuse repositories. Two such
repositories were Asset and Mountainnet. I submitted the IR
components to both libraries. Submission of components
to the library required that I fill out a template describing
the components. The components were available in these
repositories for several years. Government funding for the
repositories was stopped in 1998, and the repositories are
now no longer available.
In 1994, Prentice-Hall licensed the book to Dr. Dobbs,
who created a CD-ROM containing the IR book and
several other algorithms books [5]. The text of the book,
code included, was put into a hypertext format and a
search engine was included.
Many other web sites now either reference the IR code
ftp site, or keep a copy of the code. There is, however, no
mechanism for keeping consistency among web sites
offering the code. This is a version control problem (see
maintenance section below).
GNU (GNU’s Not Unix) is a collection of software
managed by the Free Software Foundation [10]. While
examining the holdings of the GNU library, I saw they
had nothing on IR. I contacted the Free Software
Foundation offering the IR components. After several
email exchanges, the following facts emerged.
1. GNU would like to have the code.
2. Some rework of the code would be necessary to put
the code into the GNU format.
3. Having the code in GNU would require a
commitment to long term maintenance (see
discussion of maintenance below).
4. Putting the code in GNU would require that the code
meet the GNU standards for free software. This
requires, among other things, that the code in the
GNU library make no reference to the book, and that
the code be freely available for modification by any
user. This raised many copyright and other legal
issues that have not yet been resolved.
7. Legal Ownership
Legal ownership of components is concerned with
three types of legal claims: copyright, patents, and trade
secrets [11]. A copyright protects the expression of an
idea. Copyright has traditionally been used to protect
books and other print material, and music. Current
copyright law allows copying of software for backup and
archival purposes. Copyright protection is relatively
inexpensive and easy to obtain. Copyright claims need
not be formally filed, though failure to do so may limit
legal claims.
There has been some work on assuring versions of
software using encryption methods [13]. In this scheme,
each component would be assigned a unique identifier.
Once published, the component could not be changed
even by the author without changing the identifier. This
method might also be used to protect copyrighted
software components. Collberg and Thomborson describe
a method called watermarking for embedding a secret tag
in a component that can be used to uniquely identify the
component, and therefore to tell if it has been stolen [4].
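The idea of an identifier that changes whenever the component changes can be sketched with a content hash. The code below is not the scheme described in [13]. It substitutes the 32-bit FNV-1a hash, which is not cryptographic, purely to show that the identifier is a function of the component's bytes. A real scheme would need a cryptographic digest, and the function names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Derive an identifier from a component's bytes using 32-bit FNV-1a.
 * FNV-1a is NOT a cryptographic hash; it is an assumption of this sketch,
 * chosen only because it fits in a few lines.
 */
uint32_t component_id(const unsigned char *data, size_t len)
{
    uint32_t hash = 2166136261u;        /* FNV offset basis */

    for (size_t i = 0; i < len; i++) {
        hash ^= data[i];
        hash *= 16777619u;              /* FNV prime */
    }
    return hash;
}

/* Return 1 if the bytes still produce the identifier they were published under. */
int component_unchanged(const unsigned char *data, size_t len,
                        uint32_t published_id)
{
    return component_id(data, len) == published_id;
}
```

A user who downloads a copy from any mirror could recompute the identifier and compare it with the published one to learn whether the copy matches the released version.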
A patent protects an idea, rather than the expression of
the idea. Current patent law restricts others from using the
patented idea for seventeen years after the patent is
granted. Software, algorithms, and processes are typically
patented rather than copyrighted. Obtaining a patent is a
long, expensive process, involving an extensive search to
determine whether the idea is original. Patents are granted by
government agencies such as the U.S. Patent Office. Over
20,000 software patents were issued from 1994-96 [1].
Twenty-nine of ninety-two respondents to a survey on
software reuse agreed at least somewhat that they
were inhibited from reusing software by legal issues [7].
Legal issues, unfortunately, are likely to grow in
importance as reuse crosses organizational boundaries and
moves into the open marketplace. Our experience with
GNU and with the user who wanted a legal document
giving him the right to use the IR components (see usage
section below) reinforces this point.
8. Maintenance and Configuration Management
Perhaps the most difficult problems with the
components concern maintenance and configuration
management. Maintenance is expensive. Maintenance
costs can easily exceed half of the total costs for a
software project, and numbers for reusable component
collections are probably similar. Code contributors
usually do not want to be responsible for maintenance, so
component collections like the IR components usually do
not have adequate maintenance support. In this section,
the main issues of maintenance are briefly reviewed.
Software configuration management is about how to
monitor and control changes to software, in this case
reusable software assets. Versions of reusable assets must
also be coordinated with other software lifecycle items to
produce correct and consistent product releases.
Configuration management has three major activities:
• Version control. Reusable software components, like
any software product, will have versions because of error
fixes and enhancements. To build a system using these
assets, one will need to know which version to use. Old
versions of assets must be recoverable for reference, and
so they can be used to make corrections and
enhancements. As software assets change, they form
successive versions. Version control is the activity of
keeping track of these versions. To handle this problem,
the IR components were put under change control using
SCCS (Source Code Control System). Since the code
appears in various places (ftp site, CD-ROM, and various
other web sites), keeping these versions current and
coordinated is a very hard problem. One solution to knowing for sure
which version of a component you have is to use
encryption techniques on the component [13].
• Change control. Change control is the procedure for
requesting changes, deciding what changes to make,
making changes, and recording and verifying changes.
Changes to reusable assets in a library should not be
made haphazardly, but must be made under a controlled
process, though this is often not the case. Change
requests for the IR code generally come via email. I put
reports of known bugs at the ftp site, but reports of
the same bug keep coming in, in part because the code
appears in so many places.
• Build control. Keeping track of which versions of
work products go together to form a release, and
generating derived assets and systems correctly, is called
build control. Build control for reuse has two aspects.
One is the general specification of which versions of
assets to use in a system build. The other aspect is that
reusable assets may themselves be composites of other
items, so specifications of how to build assets may also
be required. Build control for the IR components is
handled with Make.
9. Searching and Understanding
Much early work on reuse focused on the building of
reuse libraries and methods for indexing components and
searching for them. Many researchers began to feel that
this aspect of reuse was sufficiently understood, and that
too much attention was given to it. The focus of reuse
research moved to design of reusable components, domain
analysis, and so on.
The internet is probably the main source now
consulted by software engineers looking for reusable
software outside their own development environment. The
main types of indexing used on the web are free text
keyword searching, and to a lesser degree enumerated
classifications. Searching on the web is made difficult by
the size and dynamics of the database, and by the fact that
different search engines will find different web pages
given the same query.
In teaching reuse courses to graduate students at
Virginia Tech, I found that they had difficulty finding
existing components on the web. For example, in one
course students needed to find stemmers on the web. I
had searched myself and knew that several different ones
could be found. Typical of their input was the following
email I received from the student who eventually received
the highest grade in the class.
"I'm still a little confused about what we should produce
for the code analysis part of the project. I know we will
try to come up with a generic architecture by looking for
similarities in the code. I think this will be hard,
considering the fact that I have only found code for one
algorithm (Porter). Are we supposed to compare different
implementations of the same algorithm?"
I found in working with the students that they did not
know how to formulate good search queries.
Another problem is helping users understand reusable
software components. This is important because if
software engineers cannot understand components, they
will not be able to reuse them. Current methods for
representing reusable components are inadequate. A study
of four common representation methods for reusable
software components showed that none of the methods
worked very well for helping users understand the
components [8].
We are currently doing research on visualization
techniques, such as hypertrees, hierarchical trees, and
tables, for helping users understand reusable software
components [2]. We are using the IR components as a
testbed for this research. Our visualizations are grounded
in reuse design principles, such as the 3 C's model, and
in general principles of information design such as those
of Tufte. We use an extension of XML as a modeling
language for the components.
10. Usage
Because of the different venues used to distribute the
IR components, usage data and user feedback comes from
various sources. One source is email from users typically
asking where they can find copies of the code, reporting a
bug in the code, or occasionally asking if the code can be
included in a commercial application such as the
following one received recently.
What is the status of your stemming code
(implementation of the Porter algorithm) located in
ftp://sunsite.dcc.uchile.cl/pub/users/rbaeza/irbook/stemmer/,
is it public domain or copyrighted? The reason that
I ask is I want to know if it is okay to use it in a
search engine I am creating for my commercial
website.
I typically pass these requests on to the editor of the
book at Prentice-Hall, who approves them and asks that
the source of the component be referenced in the code and
documentation of the system in which it will be used.
This time I got a followup message,
"The email address I used was from the read.me file
that came with the stemming code - the address is
frakes@sarvis.cs.vt.edu. My lawyer wants me to have
sign something confirming the info below - is your
address at Virginia Polytechnic Institute and
State University still valid?"
This message points to two problems--how to keep
information associated with the code, in this case my
email address, up to date, and how to handle legal
problems. I sent the message on to the editor at Prentice-
Hall.
The proliferation of the code on various websites is
also an indicator of usage, as are references to the code in
various web pages. Some of the web pages are papers that
reference the book or code from the book, some are
syllabi for courses, others contain variants of some of the
components written in different languages. Another source
of feedback from users can be found in reviews of the
book at websites like amazon.com.
11. Current Status and A Proposal
I am currently working with available personnel (i.e. a
graduate student) to address some of the problems
identified above. Specifically the student is doing a
semester project to:
• place the code, which now has two versions, under
change and version control using RCS.
• place the code on at least two ftp servers
• convert the code to the GNU coding and “free
software” standards.
• create a web page for the code that provides
information and pointers to the distribution sites.
• create documentation that will allow continuity
in the maintenance of the software.
Experience with the IR code collection shows that
current methods of development, maintenance, and
distribution work, but need improvement. Some
recommendations follow.
There is much inefficiency in the development of
components that accompany texts. For example, there are
many books that provide code that implements the basic
data structures and algorithms of computer science such as
sorting, searching, lists, stacks, queues and so on. A
standard way of cataloging these data structures and
algorithms could be quite helpful. For example, each
unique algorithm or data structure specification might be
assigned a product number similar to an ISBN number for
a book. Implementations of these specifications might
also be assigned a number that references the number of
the implemented specification. Such components might
also include information on quality assurance, indexing
terms, repository locations, and so on.
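One way to picture such a catalog is as a pair of record types, one for specifications and one for implementations that point back to them. The sketch below is an assumption about how this might look in C. Every field name is hypothetical, and no numbering scheme of this kind is known to exist.

```c
#include <stddef.h>
#include <string.h>

/* All field names and number formats below are hypothetical. */
struct spec_record {
    const char *spec_number;     /* catalog number of the algorithm or data structure */
    const char *name;            /* human-readable name of the specification */
    const char *index_terms;     /* indexing terms used for searching */
};

struct impl_record {
    const char *impl_number;     /* catalog number of this implementation */
    const char *spec_number;     /* number of the specification it implements */
    const char *language;        /* implementation language */
    const char *repository;      /* where the code can be obtained */
    const char *qa_notes;        /* quality assurance information */
};

/*
 * Count the implementations in impls[0..n-1] that reference the given
 * specification number.
 */
size_t count_implementations(const struct impl_record impls[], size_t n,
                             const char *spec_number)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (strcmp(impls[i].spec_number, spec_number) == 0)
            count++;
    return count;
}
```

Keeping the specification number in every implementation record is what would let a searcher find all known implementations of one algorithm, which is the problem the students in Section 9 ran into.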
This will only happen if it makes legal and financial
sense, and the legal and financial issues are far from
solved. The case of GNU, for example, shows the
complexity of issues related to “free software”. The recent
trend towards patenting software algorithms also adds to
the difficulty of freely sharing and reusing software.
There is also the continuing question of who will provide
resources for long term maintenance tasks. These
important problems must be solved if we are to make
better use of existing reusable software sources.
References
[1] Aharonian, G. (1995). 1995 US Patent Statistics.
http://www.baker.com/grandunificationtheory/archive/199601/19960121.html.
[2] Alonso, O., & Frakes, W. B. (2000). Visualization of
Reusable Software Assets. In W. B. Frakes (Ed.), ICSR6 Sixth
International Conference on Software Reuse. Vienna,
Austria: Springer-Verlag.
[3] Bentley, J. (1982). Writing Efficient Programs.
Englewood Cliffs, NJ: Prentice-Hall.
[4] Collberg, C., & Thomborson, C. (1999). Software
watermarking: Models and dynamic embeddings. In
POPL’99, 26th Annual SIGPLAN–SIGACT Symposium on
Principles of Programming Languages (pp. 311–324).
[5] Dr.Dobbs Essential Books on Algorithms and Data
Structures, 1999
[6] Frakes, W., & Baeza-Yates, R. (Eds.). (1992). Information
Retrieval: Data Structures and Algorithms. Englewood Cliffs,
N.J.: Prentice-Hall.
[7] Frakes, W. B., & Fox, C. J. (1995). Sixteen
Questions about Software Reuse. CACM, 38(6), 75-87.
[8] Frakes, W., & Pole, T. (1994). An Empirical Study of
Representation Methods for Reusable Software Components.
IEEE Transactions on Software Engineering, 20(8),
617-630.
[9] Frakes, W., & Terry, C. (1996). Software Reuse and
Reusability Models and Metrics. ACM Computing Surveys,
28(2), 415-435.
[10] GNU Coding Standards Copyright 1998 Free Software
Foundation, Inc.
[11] Huber, T. (1994). Reducing Business and Legal Risks in
Software Reuse Libraries. In ICSR-3. Rio de Janeiro:
IEEE CS Press.
[12] Latour, L., Wheeler, T., & Frakes, B. (1991). Descriptive
and Prescriptive Aspects of the 3 C's Model: SETA1 Working
Group Summary. Ada Letters, XI(3), 9-17.
[13] Moore, J. W. (1994). The Use of Encryption to Ensure the
Integrity of Reusable Software Components. Third
International Conference on Software Reuse, (pp. 118-125).
Rio de Janeiro: IEEE CS Press.
[14] Stroustrup, B. (1996). Language-technical Aspects of
Reuse. In Fourth International Conference on Software Reuse,
(pp. 11-19). Orlando, FL: IEEE CS Press.

More Related Content

PDF
L035478083
PPT
Component Search and Retrieval
PDF
Reuse Software Components (IMS 2006)
PPTX
Software component reuse repository
PDF
An Introduction to Software Architecture - Summary
PDF
An efficient tool for reusable software
PPT
Ch19
DOC
term paper for cbd models
L035478083
Component Search and Retrieval
Reuse Software Components (IMS 2006)
Software component reuse repository
An Introduction to Software Architecture - Summary
An efficient tool for reusable software
Ch19
term paper for cbd models

Similar to A Case Study Of A Reusable Component Collection (20)

PDF
A FRAMEWORK STUDIO FOR COMPONENT REUSABILITY
PPT
Design Issue(Reuse) in Software Engineering SE14
PDF
Developing reusable software components for distributed embedded systems
PPT
PDF
A Methodology To Manage Victim Components Using Cbo Measure
DOCX
Learning activity 4
PPT
Advanced Software Engineering.ppt
PPTX
Component level design
PPTX
A glimpse-of-cbd
PPTX
Reuse and cloud in component based development
PPTX
Ch17-Software Engineering 9
PPTX
9 Component Based SE.pptx9 Component Based SE.pptx9 Component Based SE.pptx
PPT
Ch18
PDF
Solution Manual for Software Engineering, 9/E 9th Edition Ian Sommerville 202...
PDF
Download full ebook of instant download pdf
PPT
Software Reuse.ppt
PDF
Solution Manual for Software Engineering, 9/E 9th Edition Ian Sommerville dow...
PPTX
Software engineering: design for reuse
A FRAMEWORK STUDIO FOR COMPONENT REUSABILITY
Design Issue(Reuse) in Software Engineering SE14
Developing reusable software components for distributed embedded systems
A Methodology To Manage Victim Components Using Cbo Measure
Learning activity 4
Advanced Software Engineering.ppt
Component level design
A glimpse-of-cbd
Reuse and cloud in component based development
Ch17-Software Engineering 9
9 Component Based SE.pptx9 Component Based SE.pptx9 Component Based SE.pptx
Ch18
Solution Manual for Software Engineering, 9/E 9th Edition Ian Sommerville 202...
Download full ebook of instant download pdf
Software Reuse.ppt
Solution Manual for Software Engineering, 9/E 9th Edition Ian Sommerville dow...
Software engineering: design for reuse
Ad

More from Jennifer Strong (20)

PDF
Step By Step How To Write A
PDF
Scholarship Personal Statement What To Includ
PDF
Essay Purposes, Types And Examples Examples
PDF
Someone To Write My Essay For Me - College Homewor
PDF
Effective Persuasive Writing. Persuasive Essay Topics F
PDF
002 Essay Example About Plagiarism C
PDF
15 Rhetorical Analysis Questions To Ask Your Stu
PDF
Basildon Bond Watermarked Pe
PDF
Admission Essay How To Write A Good Introductory Paragraph For An Essay
PDF
Fluid Lucky Behalf Ielts Writing Linking Words Business Analyst Evolve
PDF
Academic Conclusion. Conclusion Paragraphs. 20
PDF
The Best Argumentative Essay Topics. 100 Argumentative Essay Topics
PDF
Writing A Thesis Statement For Resea
PDF
The Reason Seeking Transfer AdmissionApplication Essay
PDF
Pmi Charleston Scholarship Essay
PDF
Printable Writing Paper (75) By Aimee-Valentine-Art.De
PDF
Descriptive Essay Topics
PDF
Paper Writers For Hire By Ac89Pen
PDF
Literary Narrative Essay Telegraph
PDF
Greatest Free Essay HttpsFreeessays.Page
Step By Step How To Write A
Scholarship Personal Statement What To Includ
Essay Purposes, Types And Examples Examples
Someone To Write My Essay For Me - College Homewor
Effective Persuasive Writing. Persuasive Essay Topics F
002 Essay Example About Plagiarism C
15 Rhetorical Analysis Questions To Ask Your Stu
Basildon Bond Watermarked Pe
Admission Essay How To Write A Good Introductory Paragraph For An Essay
Fluid Lucky Behalf Ielts Writing Linking Words Business Analyst Evolve
Academic Conclusion. Conclusion Paragraphs. 20
The Best Argumentative Essay Topics. 100 Argumentative Essay Topics
Writing A Thesis Statement For Resea
The Reason Seeking Transfer AdmissionApplication Essay
Pmi Charleston Scholarship Essay
Printable Writing Paper (75) By Aimee-Valentine-Art.De
Descriptive Essay Topics
Paper Writers For Hire By Ac89Pen
Literary Narrative Essay Telegraph
Greatest Free Essay HttpsFreeessays.Page
Ad

Recently uploaded (20)

PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PDF
Classroom Observation Tools for Teachers
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PPTX
Institutional Correction lecture only . . .
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
Pre independence Education in Inndia.pdf
PDF
Insiders guide to clinical Medicine.pdf
PDF
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Abdominal Access Techniques with Prof. Dr. R K Mishra
102 student loan defaulters named and shamed – Is someone you know on the list?
Pharmacology of Heart Failure /Pharmacotherapy of CHF
Classroom Observation Tools for Teachers
Renaissance Architecture: A Journey from Faith to Humanism
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
Institutional Correction lecture only . . .
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
Supply Chain Operations Speaking Notes -ICLT Program
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
Module 4: Burden of Disease Tutorial Slides S2 2025
Microbial disease of the cardiovascular and lymphatic systems
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Pre independence Education in Inndia.pdf
Insiders guide to clinical Medicine.pdf
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
O7-L3 Supply Chain Operations - ICLT Program
STATICS OF THE RIGID BODIES Hibbelers.pdf

A Case Study Of A Reusable Component Collection

  • 1. A Case Study of a Reusable Component Collection William B. Frakes Computer Science Department Virginia Tech, Falls Church wfrakes@vt.edu Abstract This paper reports on practical issues in the development, distribution, use, and evolution of a reusable component collection in the domain of information retrieval. 1. Introduction Software reuse is the use of existing software knowledge or artifacts to build new software. There are many types of software reuse [9]. The reuse described in this paper is ad-hoc, black box, compositional, code reuse. Ad hoc means that the reuse is not part of a repeatable mandated organizational process. Ad hoc reuse is by far more common than systematic reuse, though the latter is thought to be more powerful. Black box reuse is reuse of a software item without modification. Compositional reuse means that the software system was built by a human programmer out of components, as opposed to generating a system automatically from specifications. The reuse described in this paper is primarily vertical rather than horizontal since it is focused in the domain of information retrieval, though some of the components such as string searching might also be considered horizontal. One source of reusable software is the code that is developed to accompany books. This paper concerns code from a book on data structures and algorithms for information retrieval (IR) systems [6]. Information retrieval systems retrieve textual documents from a database in response to queries submitted to the system by users. IR systems can be defined more formally using set and function notation as follows. D = set of textual documents D’=subset of D Q=set of queries M=matching function Systems in the domain of information retrieval can now be specified as follows. S : S computes D’=M(D,Q) That is, all systems S such that S returns a subset of documents D’ of D that match the set of queries Q are IR systems. 
One of the goals for the book was development of reusable IR code. Authors were asked to develop software components for their chapters in C following industrial coding guidelines. This was partly successful, and with some rework, the following components were developed and tested:
• Lexical Analysis and Stop List operations - this code breaks text into words and removes words considered unimportant for indexing.
• Stemmer Code - implements the Porter stemming algorithm. A stemmer conflates words by finding a common root form of the words.
• Thesaurus Construction - supports the automatic construction of thesauri from source text.
• Boolean Operations - implements standard Boolean operations (AND, OR, NOT) on sets of documents.
• Hashing Algorithms - including an algorithm for minimal perfect hashing.
• String searching - implementations for basic algorithms for finding patterns in text strings.
This paper is about practical issues encountered in the creation, distribution, and use of the components. These issues are not particular to the domain of information retrieval, nor particular to C functions. They may well arise in any domain and for any type of reusable asset.
0-7695-0559-7/00 $10.00 © 2000 IEEE
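As an illustration of the first component in the list above, the following is a minimal C sketch of lexical analysis with stop-word removal. It is not the code from the book: the function name `extract_terms`, the `MAX_TERM_LEN` limit, and the six-word stop list are assumptions made for this example, and words are assumed to be runs of ASCII letters.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TERM_LEN 32 /* assumed limit, including the terminating '\0' */

/* Tiny illustrative stop list; a production list would be much longer. */
static const char *const STOP_WORDS[] = {"a", "an", "and", "of", "the", "to"};

static int is_stop_word(const char *word)
{
    size_t n = sizeof STOP_WORDS / sizeof STOP_WORDS[0];

    for (size_t i = 0; i < n; i++) {
        if (strcmp(word, STOP_WORDS[i]) == 0)
            return 1;
    }
    return 0;
}

/* Splits text into lowercase alphabetic words, drops stop words, and
   copies the remaining words into terms. Words longer than
   MAX_TERM_LEN - 1 characters are truncated. Returns the number of
   terms stored (at most max_terms). */
size_t extract_terms(const char *text, char terms[][MAX_TERM_LEN],
                     size_t max_terms)
{
    size_t count = 0;
    const char *p = text;

    while (*p != '\0' && count < max_terms) {
        /* Skip characters that cannot start a word. */
        while (*p != '\0' && !isalpha((unsigned char)*p))
            p++;

        /* Collect one word, folding it to lower case. */
        char word[MAX_TERM_LEN];
        size_t len = 0;
        while (isalpha((unsigned char)*p)) {
            if (len < MAX_TERM_LEN - 1)
                word[len++] = (char)tolower((unsigned char)*p);
            p++;
        }
        word[len] = '\0';

        if (len > 0 && !is_stop_word(word))
            strcpy(terms[count++], word);
    }
    return count;
}
```

The surviving terms are what a stemmer and an indexer would process next. The linear stop-list scan is chosen here for brevity; a hash table, such as one built with the minimal perfect hashing mentioned in the list, would be the natural replacement for a long stop list.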
  • 2. 2. What is a component?
What is a component? The term is ambiguous. A component can be any lifecycle object or part thereof. Usually a code component is a subroutine (function or subprogram), or an object or class, but it could also be many other things like macros, header files, subsystems, processes, or patterns. This paper discusses collections of C functions. This simplifies things a little since this is a kind of reuse familiar to many. Even this kind of reuse, however, can still be complicated.
The 3 C’s model of reuse design [12] says that there are three aspects of a reusable component--the concept, the content, and the context. The concept corresponds to the abstract functionality of a component such as might be specified in an abstract data type or a formal algorithm specification. The purpose of such abstractions is to focus on the essence of the component, whatever that might be, and ignore other details, usually implementation details. The chapters in the book provide the specifications for the concepts of the components.
The content is the implementation of the component. This involves selection of a programming language and a design. The implementations of the components in C are the content. The transition from concept to content involves moving from the problem, or domain, space to the solution space. The problem space is only concerned with the concepts and operations of the domain in question--in this case information retrieval. The solution space involves the concepts and operations of the implementation environment--in this case the C language.
The context is the environment needed to use the components. Context for code components might be the required machine, operating systems, compiler version, and so on. The code for the IR components was developed for and tested on a Unix system and certain assumptions were made regarding implementation.
Porting the code to DOS, for example, required changing filenames to fit the limit of no more than eight characters.
3. Language
Software reuse is now generally regarded as a good thing, and most modern languages make some claim for their support of reuse. The C language, for example, was designed for extensive reuse in the sense that it is a small language extensively augmented by reusable function libraries. Newer languages like C++ provide reuse of higher-level programming constructs such as objects, classes, and templates, and directly support type polymorphism via function overloading. A summary of the reuse aspects of C++, for example, can be found in [14]. I selected C as the component implementation language because C was and is a widely known and used language in both industry and academia. It is also the programming language I know best, and the one I’ve used to develop industrial software. There are also many good free software engineering support tools for C, including free compilers. Was C the best choice? This of course opens the door to language lawyering. Let me just say that the components got developed. Some of the components have been rewritten in other languages, sometimes with attribution of the source, sometimes not. Versions of the stemmer, for example, have appeared on the web in Perl and Java.
4. Source or Binary?
The argument is sometimes made that only the executable code for reusable components should be distributed, not the source code. The reasoning here is that distributing source code means that it will be modified, which will break the design abstraction, thus losing much of the reuse benefit. Executable distribution could be done in C by making and distributing archive files containing object code for the functions. This assumes that all of the users will have an environment where the archives can be used.
Distributing only executable code may be a good idea if the user of the components can be assured that someone is available to fix problems and make enhancements as needed. With software such as the IR components there was no readily available maintenance organization, so we distributed the source.
5. Testing and Optimization
The quality assurance of software is important to its reuse. Code that does not meet the software quality standards of an organization will not be reused by the organization. Inside an organization, thorough testing and optimization of components can be justified since the higher costs for these activities can be amortized across the multiple reuses. Before release, the code was inspected for conformance to programming standards (such as the use of standard headers on code modules), run through lint, and analyzed for test coverage, reaching 90% branch coverage. Code portability was checked by moving the code to another environment.
  • 3. A rule of thumb sometimes used by designers of reusable components is that if a reusable component is more than 25% slower than an equivalent one-use component, it will not be reused. Optimization of code components can, therefore, be important. Optimizations must be done carefully, since increased optimization often decreases code readability and maintainability. Bentley provides a good summary of proper techniques [3]. For the IR components, however, no systematic optimization was done, nor have there been any requests for it from users.
6. Delivery Methods
A key question with a component collection is how to make the components available. The first plan for the IR components was for a disk to be included with the book, but for various logistical reasons that didn’t work. So, plan two was to make the code available via ftp. I put the code for each chapter in a separate directory at a Virginia Tech ftp site (ftp.vt.edu). I then started getting requests to put the code in a single file to make downloading easier. I did that by creating a compressed tar file and putting that on the web site. Then I started getting email from people outside the U.S. saying that they couldn’t get into the ftp site. I referred them to the ftp site technician. I think that people usually got the code they wanted, but the problem persisted.
I decided to put the components into software repositories as well. In the 1990s the U.S. government started supporting research and development of reuse repositories. Two such repositories were Asset and Mountainnet. I submitted the IR components to both libraries. Submission of components to the library required that I fill out a template describing the components. The components were available in these repositories for several years. Government funding for the repositories was stopped in 1998, and the repositories are now no longer available.
In 1994, Prentice-Hall licensed the book to Dr. Dobb’s, which created a CD-ROM containing the IR book and several other algorithms books [5]. The text of the book, code included, was put into a hypertext format and a search engine was included. Many other web sites now either reference the IR code ftp site, or keep a copy of the code. There is, however, no mechanism for keeping consistency among web sites offering the code. This is a version control problem (see the maintenance section below).
GNU (GNU’s Not Unix) is a collection of software managed by the Free Software Foundation [10]. While examining the holdings of the GNU library, I saw they had nothing on IR. I contacted the Free Software Foundation offering the IR components. After several email exchanges, the following facts emerged.
1. GNU would like to have the code.
2. Some rework of the code would be necessary to put the code into the GNU format.
3. Having the code in GNU would require a commitment to long-term maintenance (see discussion of maintenance below).
4. Putting the code in GNU would require that the code meet the GNU standards for free software. This requires, among other things, that the code in the GNU library make no reference to the book, and that the code be freely available for modification by any user. This raised many copyright and other legal issues that have not yet been resolved.
7. Legal Ownership
Legal ownership of components is concerned with three types of legal claims: copyright, patents, and trade secrets [11]. A copyright protects the expression of an idea. Copyright has traditionally been used to protect books and other print material, and music. Current copyright law allows copying of software for backup and archival purposes. Copyright protection is relatively inexpensive and easy to obtain. Copyright claims need not be formally filed, though failure to do so may limit legal claims. There has been some work on assuring versions of software using encryption methods [13].
In this scheme, each component would be assigned a unique identifier. Once published, the component could not be changed even by the author without changing the identifier. This method might also be used to protect copyrighted software components. Collberg and Thomborson describe a method called watermarking for embedding a secret tag in a component that can be used to uniquely identify the component, and therefore to tell if it has been stolen [4]. A patent protects an idea, rather than the expression of the idea. Current patent law restricts others from using the patented idea for seventeen years after the patent is granted. Software, algorithms, and processes are typically patented rather than copyrighted. Obtaining a patent is a long, expensive process, involving an extensive search to determine whether the invention is original. Patents are granted by government agencies such as the U.S. Patent Office. Over 20,000 software patents were issued from 1994 to 1996 [1]. Twenty-nine of ninety-two respondents to a survey on software reuse agreed at least somewhat that they were inhibited from reusing software by legal issues [7]. Legal issues, unfortunately, are likely to grow in importance as reuse crosses organizational boundaries and
  • 4. moves into the open marketplace. Our experience with GNU and with the user who wanted a legal document giving him the right to use the IR components (see usage section below) reinforces this point.
8. Maintenance and Configuration Management
Perhaps the most difficult problems with the components concern maintenance and configuration management. Maintenance is expensive. Maintenance costs can easily exceed half of the total costs for a software project, and numbers for reusable component collections are probably similar. Code contributors usually do not want to be responsible for maintenance, so component collections like the IR components usually do not have adequate maintenance support. In this section, the main issues of maintenance are briefly reviewed.
Software configuration management is about how to monitor and control changes to software, in this case reusable software assets. Versions of reusable assets must also be coordinated with other software lifecycle items to produce correct and consistent product releases. Configuration management has three major activities:
• Version control. Reusable software components, like any software product, will have versions because of error fixes and enhancements. To build a system using these assets, one will need to know which version to use. Old versions of assets must be recoverable for reference, and so they can be used to make corrections and enhancements. As software assets change, they form successive versions. Version control is the activity of keeping track of these versions. To handle this problem, the IR components were put under change control using SCCS (source code control system). Since the code appears in various places (the ftp site, the CD-ROM, and various other web sites), keeping these versions current and coordinated is a very hard problem. One solution to knowing for sure which version of a component you have is to use encryption techniques on the component [13].
• Change control.
Change control is the procedure for requesting changes, deciding what changes to make, making changes, and recording and verifying changes. Changes to reusable assets in a library should not be made haphazardly, but must be made under a controlled process, though this is often not the case. Change requests for the IR code generally come via email. I put reports of known bugs at the ftp site, but reports of the same bug keep coming in, in part because the code appears in so many places.
• Build control. Keeping track of which versions of work products go together to form a release, and generating derived assets and systems correctly, is called build control. Build control for reuse has two aspects. One is the general specification of which versions of assets to use in a system build. The other aspect is that reusable assets may themselves be composites of other items, so specifications of how to build assets may also be required. Build control for the IR components is handled with Make.
9. Searching and Understanding
Much early work on reuse focused on the building of reuse libraries and methods for indexing components and searching for them. Many researchers began to feel that this aspect of reuse was sufficiently understood, and that too much attention was given to it. The focus of reuse research moved to design of reusable components, domain analysis, and so on. The internet is probably the main source now consulted by software engineers looking for reusable software outside their own development environment. The main types of indexing used on the web are free text keyword searching, and to a lesser degree enumerated classifications. Searching on the web is made difficult by the size and dynamics of the database, and by the fact that different search engines will find different web pages given the same query. In teaching reuse courses to graduate students at Virginia Tech, I found that they had difficulty finding existing components on the web.
For example, in one course students needed to find stemmers on the web. I had searched myself and knew that several different ones could be found. Typical of their input was the following email I received from the student who eventually received the highest grade in the class. "I'm still a little confused about what we should produce for the code analysis part of the project. I know we will try to come up with a generic architecture by looking for similarities in the code. I think this will be hard, considering the fact that I have only found code for one algorithm (Porter). Are we supposed to compare different implementations of the same algorithm?" I found in working with the students that they did not know how to formulate good search queries. Another problem is helping users understand reusable software components. This is important because if software engineers cannot understand components, they will not be able to reuse them. Current methods for representing reusable components are inadequate. A study
  • 5. of four common representation methods for reusable software components showed that none of the methods worked very well for helping users understand the components [8]. We are currently doing research on visualization techniques, such as hypertrees, hierarchical trees, and tables, for helping users understand reusable software components [2]. We are using the IR components as a testbed for this research. Our visualizations are grounded in reuse design principles, such as the 3 C's model, and in general principles of information design such as those of Tufte. We use an extension of XML as a modeling language for the components.
10. Usage
Because of the different venues used to distribute the IR components, usage data and user feedback come from various sources. One source is email from users, typically asking where they can find copies of the code, reporting a bug in the code, or occasionally asking if the code can be included in a commercial application, as in the following message received recently.
"What is the status of your stemming code (implementation of the Porter algorithm) located in ftp://sunsite.dcc.uchile.cl/pub/users/rbaeza/irbook/stemmer/, is it public domain or copyrighted? The reason that I ask is I want to know if it is okay to use it in a search engine I am creating for my commercial website."
I typically pass these requests on to the editor of the book at Prentice-Hall, who approves them and asks that the source of the component be referenced in the code and documentation of the system in which it will be used. This time I got a followup message, "The email address I used was from the read.me file that came with the stemming code - the address is frakes@sarvis.cs.vt.edu. My lawyer wants me to have sign something confirming the info below - is your address at Virginia Polytechnic Institute and State University still valid?"
This message points to two problems: how to keep information associated with the code (in this case my email address) up to date, and how to handle legal problems. I sent the message on to the editor at Prentice-Hall. The proliferation of the code on various websites is also an indicator of usage, as are references to the code in various web pages. Some of the web pages are papers that reference the book or code from the book, some are syllabi for courses, others contain variants of some of the components written in different languages. Another source of feedback from users can be found in reviews of the book at websites like amazon.com.
11. Current Status and A Proposal
I am currently working with available personnel (i.e. a graduate student) to address some of the problems identified above. Specifically the student is doing a semester project to:
• place the code, which now has two versions, under change and version control using RCS.
• place the code on at least two ftp servers.
• convert the code to the GNU coding and “free software” standards.
• create a web page for the code that provides information and pointers to the distribution sites.
• create documentation that will allow continuity in the maintenance of the software.
Experience with the IR code collection shows that current methods of development, maintenance, and distribution work, but need improvement. Some recommendations follow. There is much inefficiency in the development of components that accompany texts. For example, there are many books that provide code that implements the basic data structures and algorithms of computer science such as sorting, searching, lists, stacks, queues and so on. A standard way of cataloging these data structures and algorithms could be quite helpful. For example, each unique algorithm or data structure specification might be assigned a product number similar to the ISBN of a book.
Implementations of these specifications might also be assigned a number that references the number of the implemented specification. Such components might also include information on quality assurance, indexing terms, repository locations, and so on. This will only happen if it makes legal and financial sense, and the legal and financial issues are far from solved. The case of GNU, for example, shows the complexity of issues related to “free software”. The recent trend towards patenting software algorithms also adds to the difficulty of freely sharing and reusing software. There is also the continuing question of who will provide resources for long term maintenance tasks. These
  • 6. important problems must be solved if we are to make better use of existing reusable software sources.
References
[1] Aharonian, G., 1995 US Patent Statistics. 1995, http://www.baker.com/grandunificationtheory/archive/199601/19960121.html.
[2] Alonso, O., & Frakes, W. B. (2000). Visualization of Reusable Software Assets. In W. B. Frakes (Ed.), ICSR6: Sixth International Conference on Software Reuse. Vienna, Austria: Springer-Verlag.
[3] Bentley, J. (1982). Writing Efficient Programs. Englewood Cliffs, NJ: Prentice-Hall.
[4] Collberg, C., & Thomborson, C. (1999). Software watermarking: Models and dynamic embeddings. In POPL’99, 26th Annual SIGPLAN–SIGACT Symposium on Principles of Programming Languages, (pp. 311–324).
[5] Dr. Dobb’s Essential Books on Algorithms and Data Structures, 1999.
[6] Frakes, W., & Baeza-Yates, R. (Eds.). (1992). Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice-Hall.
[7] Frakes, W. B., & Fox, C. J. (1995). Sixteen Questions about Software Reuse. CACM, 38(6), 75-87.
[8] Frakes, W., & Pole, T. (1994). An Empirical Study of Representation Methods for Reusable Software Components. IEEE Transactions on Software Engineering, 20(8), 617-630.
[9] Frakes, W., & Terry, C. (1996). Software Reuse and Reusability Models and Metrics. ACM Computing Surveys, 28(2), 415-435.
[10] GNU Coding Standards. Copyright 1998 Free Software Foundation, Inc.
[11] Huber, T. (1994). Reducing Business and Legal Risks in Software Reuse Libraries. In ICSR-3. Rio de Janeiro: IEEE CS Press.
[12] Latour, L., Wheeler, T., & Frakes, B. (1991). Descriptive and Prescriptive Aspects of the 3 C's Model: SETA1 Working Group Summary. Ada Letters, XI(3), 9-17.
[13] Moore, J. W. (1994). The Use of Encryption to Ensure the Integrity of Reusable Software Components. Third International Conference on Software Reuse, (pp. 118-125). Rio de Janeiro: IEEE CS Press.
[14] Stroustrup, B. (1996). Language-technical Aspects of Reuse. In Fourth International Conference on Software Reuse, (pp. 11-19). Orlando, FL: IEEE CS Press.