Unicode Explained Includes Index 1st Ed Korpela Jukka K
Unicode Explained
Jukka K Korpela
Beijing • Cambridge • Farnham • Köln • Sebastopol • Taipei • Tokyo
Preface
Characters often seem simple on the surface, but they are at the heart of a wide variety
of data communications and data processing problems, including text processing,
typesetting, styling text, text databases, and the transmission of textual information.
Computers were invented just for computing. For quite some time, they were so ex-
pensive that their use was limited to the most important numerical calculations that
would have been impossible otherwise. Text was used mainly to add legends and
headings to numeric output, often using a very limited character repertoire, maybe even
lacking lowercase letters. As the cost of computing has dropped, computers have be-
come extensively used for human communication in text format. Most people think of
computers as communicators rather than calculators. People want to communicate in
different languages, and we also use notation systems that may require rich repertoires
of characters.
Unicode was developed to help make this both possible and smooth. Unicode was first
defined in the early 1990s, but its use has progressed fairly slowly. Modern computers
often use Unicode internally, but applications and users still tend to work with older
character codes, which are often very limited. It has been rather complicated to work
with Unicode in text processing, for example. At long last, however, these problems
are becoming easier to solve. Information technology is becoming really multinational,
supporting different languages, writing systems, and conventions. IT products need to
be at least potentially suitable for use in different cultural environments, or “localiza-
ble.” Unicode itself is just part of the technical basis for all this, but it is an indispensable
part.
The technological basis of using Unicode, though still imperfect, is much better than
most people’s capabilities for making use of it. Even computer professionals often don’t
know how to work with large repertoires of characters. The bottleneck is a lack of basic
knowledge and skills, not a lack of hardware or software.
The concept of a character is one of the most difficult basic concepts in information
technology, yet fundamental to text processing, databases, the Web, XML-based mark-
up, internationalization, and other areas. People who encounter Unicode when study-
ing such topics often run into serious difficulties. They mostly find material that as-
sumes that the reader already knows what Unicode is. It might be even worse: it is very
easy to find incorrect or seriously confusing information about Unicode and characters,
even in new books. People find themselves in a maze of twisty little passages of char-
acters, fonts, encodings, and related concepts.
This book guides you through the Unicode and character world. It explains how to
identify and classify characters—whether common, uncommon, or exotic—and to
type them, to use their properties, and to process character data in a robust manner. It
helps you to live in a world with several character encodings.
Audience
Readers of this book are expected to be familiar with computers and how computers
work, broadly speaking. They are not expected to know computer programming,
though many readers will use the contents in system design and programming.
This book is intended for people with different backgrounds and needs, including:
• An end user of multilingual or specialized text-related applications. For example,
anyone who works with texts containing mathematical or special symbols, or uses
a multilingual database. These readers should probably explore Chapters 1
through 3 first, practice with that content, and then read Chapters 7 and 8.
• An IT professional who needs to understand Unicode and work with it. The need
might arise from text data conversion tasks, from creating internationalized soft-
ware or web sites, or from system design or programming in an environment that
uses Unicode.
• An IT teacher who needs a better understanding of character code issues, both to
understand the subject area better and to disseminate correct information. There
is rather little about character codes in curricula, and this is largely a chicken-and-
egg problem: there are no good textbooks, and teachers themselves don’t know
the topic well enough. The first three chapters of the book could provide the foun-
dation for a course, optionally coupled with other chapters relevant to a particular
curriculum.
• An IT student, hobbyist, or professional who keeps hearing about Unicode and
needs to work with technologies that use Unicode, such as XML.
Assumptions and Approach
Previous knowledge about character codes is not assumed. If you already know about
them, you may need to change your mental model a bit.
This book starts at the ordinary computer user’s level. Thus, it unavoidably contains
explanations that look trivial to some readers. However, these discussions might help
in explaining things to others when needed. The book also contains practical instruc-
tions on actually working with “special” characters, and an IT professional might find
this irrelevant. However, studying such issues and practicing with them will help a lot
in creating a background for more technical work with the infrastructures of character
usage.
In explaining practical ways of doing things, this book often uses Microsoft Windows
and Microsoft Office programs as examples. This is because so many people use such
software and need to know how to use Unicode in them. Moreover, even if you per-
sonally prefer other software, odds are good that you need to work with Windows and
Office at times. Information on using Unicode in some other environments can be
found in the following:
• Markus Kuhn: “UTF-8 and Unicode FAQ for Unix/Linux,” which is available at
http://www.cl.cam.ac.uk/~mgk25/unicode.html
• Tom Gewecke: “Unleash Your Multilingual Mac,” which is available at
http://hometown.aol.com/tg3907/mlingos9.html
After the first three chapters, this book gets more technical, and many of the issues
discussed are abstract and even formal. Therefore, understanding most of the material
in the initial chapters is essential for the rest. Most people find it very difficult to read
about abstract things without a concrete background that lets them map the abstract
concepts and rules to specific practice.
This book explores Unicode processing generally, but cannot go into great detail on all
parts of the Unicode character space. For much more information on ideographic
characters and processing of East Asian languages, see Ken Lunde’s CJKV Information
Processing (O’Reilly).
Except for the last chapter (Chapter 11), this book does not assume that the reader
knows about computer programming. However, some references to programming are
made throughout the book.
Contents of This Book
The book has three parts:
Part I
Chapters 1 through 3 provide a self-contained tutorial presentation of Unicode and
character data. It is aimed at anyone who has a basic understanding of computing,
and introduces characters in information technology, with some historical back-
ground. Although much of this part is well-known to many IT professionals, it
provides a consistent terminology that could give professionals (and especially
teachers) a model for talking to laymen about characters.
Part II
Chapters 4 through 6 give detailed information about using Unicode and other
character codes. These chapters are especially aimed at computer science students
and teachers, information technology professionals, and people involved in linguistic
data processing and databases containing string data. Together with the
first part, this covers what every IT professional should know about characters. It
explains the principles and methods of defining character codes, describes some
of the widely used codes, presents code conversion techniques, and takes a detailed
look at Unicode. This includes properties and classification of characters, collation
and sorting, line breaking rules, and Unicode encodings.
Part III
Chapters 7 through 11 discuss relatively independent topics, to be read according
to each reader’s specific needs. They are topics that are important and even crucial
to many, but not necessary to all. For example, if you need to author or administer
multilingual web sites, you should read the section on characters in HTML and
XHTML. To be honest, I would suggest that most people need to read it at least
twice. Character code problems are intrinsically difficult, and very widely misun-
derstood. It takes time to digest the concepts and principles before you can really
start working with the algorithms and tools.
The chapters can be characterized as follows:
Chapter 1, Characters as Data
This chapter describes, at a general level but exemplified by simple and typical
cases, how computers represent and process characters. It defines fundamental
concepts like character set, code position, encoding, glyph, and font. At this point,
Unicode is the only character set discussed, to avoid confusion. To make the dis-
cussion more concrete and motivating, some features of writing systems are de-
scribed. The historical development of character codes is presented to the extent
that is necessary for understanding why even apparently simple characters, such
as dashes and é, still cause problems. The use of different encodings is illustrated
by examples of viewing email messages and web pages, using commands to select
the encoding if needed. The basic methods for finding, installing, and selecting
fonts are described.
Chapter 2, Writing Characters
This is a practical presentation of some common methods of entering characters,
including keyboard variation, special keys, changing keyboard settings, virtual
keyboards, character maps, automatic “correction” of character sequences, pro-
gram commands, and different escape notations. It is largely a collection of recipes,
useful, for example, to people who work daily with texts containing “difficult”
characters. For this reason, some quick reference tables for very commonly needed
characters are presented. However, it is also relevant to IT specialists who need to
understand the possible input methods when designing applications and systems.
The examples used are mostly from MS Windows and MS Office environments
but various alternatives (such as “Unicode editors”) are also discussed. HTML and
XML character reference and entity reference techniques are presented as well. The
chapter ends with an exercise for writing some specialized texts using some of the
techniques presented.
Chapter 3, Character Sets and Encodings
This chapter describes some very widely used character codes and encodings,
mainly ASCII, ISO-8859-1 and other ISO-8859 standards, Windows Latin 1 and
relatives, and UTF-8. (However, the semantics of characters are described in Chap-
ter 8.) Some less common encodings such as DOS code pages are described in order
to give some basics for working with legacy data and legacy systems. A few widely
used multibyte encodings for East Asian languages are briefly described, too. The
section describes how conversions between the encodings can be performed, either
with the functions of commonly used programs or separate converters. It also dis-
cusses practical feasibility of the character sets in different contexts, such as email,
Internet discussion forums, and document interchange. MIME is presented to the
extent needed for dealing with the charset issue.
Chapter 4, The Structure of Unicode
An in-depth presentation of the fundamentals of Unicode, including design prin-
ciples, coding space, and special terminology. The nature of Unicode as an um-
brella standard based on a large number of older standards is explained, as well as
its relationship to ISO 10646. The unification principle as well as criticism of it is
described.
Chapter 5, Properties of Characters
This chapter describes the various properties defined for characters in the Unicode
standard and their relationship with some programming concepts. This is, in part,
a companion to the much more formal definitions in the standard itself. In partic-
ular, compatibility, decompositions, collation, sorting, directionality, and line-
breaking properties as well as Unicode normalization forms are described.
Chapter 6, Unicode Encodings
This chapter describes UTF-8 and other Unicode encodings in detail, including the
algorithmic descriptions and the practical considerations on choosing an encoding.
Chapter 7, Characters and Languages
The chapter describes some IT-related requirements of different languages and
writing systems, such as how to deal with right-to-left writing. This includes con-
versions between writing systems (transliteration or transcription). The interaction
between encoding, language, and font settings is described. Moreover, language
codes, language metadata, and language markup are described, illustrated with
XML examples.
Chapter 8, Character Usage
This chapter consists of sections devoted to different character blocks and collec-
tions that are practically important especially in the Western world. The first sec-
tion is more generic and discusses the relationship of character standards, orthog-
raphy, and typography. (Even in purely English-language text, typographically
correct punctuation requires characters beyond ASCII.) The chapter contains de-
tailed information about the semantics and usage of individual characters, al-
though the level of detail depends greatly on the importance of the character. All
the major blocks are briefly characterized to give an overview, but the emphasis is
on ASCII, different Latin supplements, general punctuation, and mathematical and
technical symbols.
Chapter 9, The Character Level and Above
Characters form but one “protocol level,” above which there are higher levels such
as markup level, record structure level, and application level. This chapter provides
guidelines for the coding of information at different levels when there is choice,
such as using markup versus character difference (largely still an open problem
despite the efforts of the W3C and the Unicode Consortium). This is particularly
important for processing of legacy data and for avoiding overly fine distinctions at
the character level. The chapter ends with a section on media types for text and
the difference between plain text, other subtypes of text, and application types such
as text-processing formats.
Chapter 10, Characters in Internet Protocols
This chapter describes how character encoding information is transmitted using
Internet protocols, including MIME and HTTP, and how content negotiation
works on the Web (for the purposes of negotiating on character encoding). This
constitutes a basis for a presentation of some fundamentals of multilingual web
authoring at the technical level. Moreover, the use of characters in the protocols
themselves, such as Internet message headers and URLs, is described, with focus
on the partial shift from pure ASCII to Unicode. In particular, the technical basis
of Internationalized Domain Names and Internationalized URLs is described.
Chapter 11, Characters in Programming
This chapter presents a number of ways to represent character and string data in
different programming languages, such as FORTRAN, C, C#, Perl, ECMAScript,
and Java™, as well as other computer languages such as XML and CSS. It empha-
sizes both the differences and similarities, which are illustrated with sample pro-
grams to perform simple manipulation of string data. The chapter is especially
intended for people who teach programming but also for people who study or
practice programming in an environment where character data is essential. Pro-
grams that cannot distinguish, for example, between an empty string, a space
character, the NUL character, and the digit zero will have large problems in a
Unicode environment. The chapter also examines requirements for modern pro-
cessing of character data, including the principle of being prepared to handle a
large character repertoire and that of separating internal encoding from input and
output encodings. The International Components for Unicode (ICU) activity and
its results are described. The chapter also contains a section on Common Locale
Data Repository (CLDR) and its future use in disciplined programming. This
largely goes beyond the character concept but is motivated by the use of Unicode
in CLDR and by the organizational connection with the Unicode Consortium.
Appendix, Tables for Writing Characters
The Appendix provides some commonly needed information useful for entering
characters. It includes tables of key sequences, as well as a mapping chart from the
Symbol font to Unicode.
Self-Assessment Test
To estimate your progress in knowledge about Unicode, you can perform the following
self-assessment test. Read the following statements and comment on each of them with
one of the following alternatives (using whatever symbols you find convenient, such as
those in parentheses): “I do not understand what the statement says” (??), “I know what
it says but I do not know whether it is true” (?), “true” (+), and “false” (–). Moreover, for
any “true” or “false” answer, consider what you would present as an argument in a
discussion in which someone says you’re wrong.
At any point in reading the book, and especially when you think you have learned
enough, reread the statements and perform the test again. You might regard the fol-
lowing as a spoiler, so it has been written backward so that you can hopefully ignore
it at this point if you like. It reveals what the test is about: .elpoep ot siht nialpxe ot
deen thgim uoy dna ,gnorw era yeht yhw wonk ot laitnesse si ti ecnis ,hguoht ,siht
gniwonk htiw deifsitas eb ton dluohs uoY .eslaf lla era yeht tub ,skoob ecnerefer ni neve
edam ylnommoc era stnemetats ehT
1. Unicode is a 16-bit character code.
2. Unicode contains all the characters used in the languages of the world.
3. Unicode is meant to replace all the other character codes.
4. Unicode cannot be used in real applications now; it is just a future plan.
5. Using Unicode, the size of a text file gets doubled.
6. We don’t need Unicode if we write only in English.
7. Unicode consists of 256 code pages.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Indicates computer code in a broad sense. This includes commands, options,
switches, variables, attributes, keys, functions, types, classes, namespaces, meth-
ods, modules, properties (does not include Unicode “properties”), parameters,
values, objects, events, event handlers, XML tags, HTML tags, macros, the contents
of files, and the output from commands.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values deter-
mined by context.
This icon signifies a tip, suggestion, or general note.
This icon indicates a warning or caution.
The following special notations are used in this book to refer to characters:
“x”
Refers to character x by showing it within double quotation marks. For clarity,
characters that might be confused with other characters in the text—i.e., letters a–
z, A–Z, and some common punctuation, such as hyphens (-), commas (,), and
periods (.)—are enclosed in quotation marks.
U+nnnn
Refers to a character (or a code point) by its Unicode number. The number nnnn is
written in hexadecimal notation, usually in four digits using leading zeros if needed.
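The U+nnnn notation can be produced mechanically from a character's code number. The following Python lines are a minimal sketch and not part of the original text; the function name u_notation is made up for this illustration.

```python
def u_notation(ch: str) -> str:
    """Return the U+nnnn notation for a single character.

    u_notation is a hypothetical helper name used only in this sketch.
    """
    # Hexadecimal, at least four digits, padded with leading zeros.
    return f"U+{ord(ch):04X}"


for ch in ["A", "\u00e9", "\u03b1", "\u20ac"]:
    print(ch, u_notation(ch))

# Characters with larger code numbers simply get more than four digits.
print(u_notation("\U0001F600"))
```

The padding applies only when the number has fewer than four hexadecimal digits, which matches the "usually in four digits" wording above.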
Web sites and pages are mentioned in this book to help the reader locate online infor-
mation that might be useful. Normally both the address (URL) and the name (title,
heading) of a page are mentioned. Some addresses are relatively complicated, but you
can probably locate the pages easily by using your favorite search engine to find a page
by its name, typically by typing it inside quotation marks. This may also help if the page
cannot be found by its address; it may have moved elsewhere, so the name may work.
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in
this book in your programs and documentation. You do not need to contact us for
permission unless you’re reproducing a significant portion of the code. For example,
writing a program that uses several chunks of code from this book does not require
permission. Selling or distributing a CD-ROM of examples from O’Reilly books does
require permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of example code
from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Unicode Explained by Jukka K. Korpela.
Copyright 2006 O’Reilly Media, Inc., 0-596-10121-X.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at permissions@oreilly.com.
Safari® Enabled
When you see a Safari® Enabled icon on the cover of your favorite tech-
nology book, that means the book is available online through the O'Reilly
Network Safari Bookshelf.
Safari offers a solution that's better than e-books. It's a virtual library that lets you easily
search thousands of top tech books, cut and paste code samples, download chapters,
and find quick answers when you need the most accurate, current information. Try it
for free at http://safari.oreilly.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707 829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at:
http://www.oreilly.com/catalog/unicode
To comment or ask technical questions about this book, send email to:
bookquestions@oreilly.com
For more information about our books, conferences, Resource Centers, and the O’Re-
illy Network, see our web site at:
http://www.oreilly.com
Acknowledgments
The presentation of problems, solutions, and ideas owes much to people with whom
I have been in contact in character-related matters through the years, such as (roughly in
chronological order by their influence) Timo Kiravuo, Alan J. Flavell, Arjun Ray, Ro-
man Czyborra, Bob Bemer, and Erkki I. Kolehmainen.
The reviewers, Andreas Prilop, John Cowan, and Jori Mäntysalo, gave a very substantial
amount of valuable input, both on content and on presentation. Simon St.Laurent has
had an active and supportive role through the entire process as an editor.
PART I
Working with Characters
This part describes the fundamentals of representing character data in computers, in-
cluding Unicode and other important character codes. It also discusses several practical
ways of writing Unicode characters.
CHAPTER 1
Characters as Data
Computers were originally built to process numbers. Over the last few decades, they've
become increasingly better at handling text as well, but the transition from human
scribbling and beautiful typography to bits and bytes has been complicated. Going from
a paper document to a computerized representation of that document means learning
about how the computer handles text, and requires learning about characters, character
codes, fonts, and encodings. Unicode provides a set of solutions for some of these
problems, while retaining presentation flexibility for making text look as we feel it
should.
Introduction to Characters and Unicode
Computer programs use two basic data types in most of their processing: characters
and numbers. These basic types are combined in various ways to create strings, arrays,
records, and other data structures. (Inside the computer, characters are numbers, but
the way these numbers are handled is very different from the handling of numbers
meant for calculation.)
Early computers were largely oriented toward numerical computation. However, char-
acters were used early on in administrative data processing, where names, addresses,
and other data needed to be stored and printed as strings. Text processing on computers
became more common much later, when computers had become so affordable that
they replaced typewriters. At present, most text documents are produced and processed
using computers.
Originally, character data on computers had limited types and uses. For economic and
technical reasons, the repertoire of characters was very small, not much more than the
letters, digits, and basic punctuation used in normal English. This constitutes but a tiny
fraction of the different characters used in the world’s writing systems—about 100
characters out of literally myriads (tens of thousands) of characters. Thus, there was a
growing need to present and handle a large character repertoire
on computers; Unicode is the fundamental answer to that.
Why Unicode?
Since you are reading this book, I assume you already have sufficient motivation to
learn about Unicode. Nevertheless, a short presentation follows that explains the ben-
efits of Unicode.
Computers internally work on numbers. This means that characters need to be coded
as numbers. A typical arrangement is to use numbers from 0 to 255, because that range
fits into a basic unit of data storage and transfer, called an 8-bit byte, or octet.
When you define how those numbers correspond to characters, you define a character
code. There are quite a number of character codes defined and used in the world. Most
of them have the same assignments for numbers 0 to 127, used for characters that
appear in English as well as in many other languages: the letters a–z plus their uppercase
equivalents, the digits 0–9, and a few punctuation marks. Many of the code numbers
in this so-called ASCII set of characters are used for various technical purposes.
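To make the idea of characters as numbers concrete, here is a small Python sketch. It is an illustration added to the text, relying only on Python's built-in ord() and chr() functions; the sample string is arbitrary.

```python
# ord() gives the code number of a character; chr() goes the other way.
for ch in "aZ7?":
    print(repr(ch), ord(ch))

# Every character of this plain English text falls in the ASCII range 0-127.
text = "Hello, world!"
codes = [ord(ch) for ch in text]
print(codes)
print(all(code <= 127 for code in codes))

# Some ASCII code numbers serve technical purposes rather than
# standing for visible characters; number 10 is the line feed.
print(repr(chr(10)))
```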
For French texts, for example, you need additional characters such as accented letters
(é, ô, etc.). These can be provided by using code numbers in the range 128–255 in
addition to the ASCII range, and this gives room for letters used in most other Western
European languages as well. Thus, you can use a single character code, called Latin 1,
even for a text containing a mixture of English, French, Spanish, and German, because
these languages all use the Latin characters with relatively few additions.
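The following Python sketch shows how Latin 1 uses the range 128–255 for accented letters while leaving the ASCII range untouched. It is only an illustration; the sample text is made up, and "latin-1" is the name Python uses for this character code.

```python
# Sample text mixing French, Spanish, and German words (made up for this sketch).
text = "caf\u00e9, se\u00f1or, Stra\u00dfe"

# In Latin 1, every character is stored as exactly one byte.
data = text.encode("latin-1")
print(len(text), len(data))

# Accented letters get code numbers in the range 128-255;
# everything else keeps its ASCII number.
for ch, byte in zip(text, data):
    if byte > 127:
        print(ch, byte)
```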
However, you quickly run out of numbers if you try to cover too many languages within
256 characters. For this reason, different character codes were developed. For example,
Latin 1 is for Western European languages, Latin 2 for several languages spoken in
Central and Eastern Europe, and additional character codes exist for Greek, Cyrillic,
Arabic, etc. When only one language is used, you can usually pick up a suitable char-
acter code and use it. In fact, someone probably did that for you when designing the
particular computer system (including software) that you use. You may have used a
particular character code for years without knowing anything about it.
Character codes that use only the code numbers from 0 to 255 are called 8-bit codes,
since such code numbers can be represented using 8 bits.
Things change when you need to combine languages in one document and the lan-
guages are fundamentally different in their use of characters. In an English-German or
French-Spanish glossary, for example, you can use Latin 1. In English-Greek data, you
can use one of the character codes developed for Greek, since these codes contain the
ASCII characters. But what about French-Greek? That’s not possible the same way,
since the character codes discussed above do not support such a combination. A code
either has Latin accented letters in the “upper half” (the range of 128–255), or it has
Greek letters (α, β, γ, etc.) there. It would be impractical, and often impossible, to
define 256-character codes for all the possible language combinations.
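The conflict in the "upper half" can be demonstrated with a short Python sketch. This is an illustration under stated assumptions: ISO 8859-7 is picked here as one example of a character code developed for Greek, the sample strings are made up, and can_encode is a hypothetical helper name.

```python
# The same byte means different things in different 8-bit codes.
raw = bytes([0xE9])
print(raw.decode("latin-1"))     # an accented Latin letter
print(raw.decode("iso8859_7"))   # a Greek letter


def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of text exists in the given code."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


english_greek = "tea = \u03c4\u03c3\u03ac\u03b9"
french_greek = "th\u00e9 = \u03c4\u03c3\u03ac\u03b9"

print(can_encode(english_greek, "iso8859_7"))  # ASCII plus Greek fits
print(can_encode(french_greek, "iso8859_7"))   # no accented Latin letter there
print(can_encode(french_greek, "latin-1"))     # no Greek letters there
```

Neither 8-bit code can hold the French-Greek string, which is exactly the situation described above.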
As you probably know, the number of characters needed for Chinese and Japanese is
very large. They just would not fit into a set with only 256 characters. Therefore, different
strategies are used. For example, 2 bytes (octets) instead of one might be used
for one character. This would give 65,536 possible numbers for a character. On the
other hand, the character codes developed for the needs of East Asian languages do not
contain all the characters used in the world.
The solution to such problems, and many other problems in the world of growing
information exchange, is the introduction of a character code that gives every character
of every language a unique number. This number does not depend on the language
used in the text, the font used to display the character, the software, the operating
system, or the device. It is universal and kept unchanged. The range of possible numbers
is set sufficiently high to cover all the current and future needs of all languages.
The solution is called Unicode, and it gives anyone the opportunity to say, “I want this
character displayed and the number is…” and have herself understood by all systems
that support Unicode. This does not always guarantee success in displaying the character,
due to lack of a suitable font, but such technical problems are manageable.
Much widely used software, including Microsoft Windows, Mac OS X, and Linux, has
supported Unicode for years. However, to use Unicode, all the relevant components
must be “Unicode enabled.” For example, although Windows “knows Unicode,” an
application program used on a Windows system might not. Moreover, the display or
printing of characters often fails since fonts (software for drawing characters) are still
incomplete in covering the set of Unicode characters. This is changing as more complete
fonts become available and as programs become more clever in their ability to use
characters from different fonts.
Unicode Can Be Easy
Unicode is both very easy and very complicated. The fundamental principles are simple
and natural, as the explanation above hopefully illustrated. The actual typing and
viewing of Unicode characters can also be easy, when modern tools are used. As we get
to complicated issues like sorting Unicode strings or controlling line breaking, you will
find some challenges. But this book starts from simple principles and usage.
For example, an average PC running the Windows XP system has a universal tool for
typing any Unicode character, assuming that it is contained in some font installed on
the system. The tool is called the Character Map, or CharMap for short. Figure 1-1
shows the user interface of this program. The program can be launched from the Start
menu, although you may need to look for it among “System tools” or something like
that. You can select a collection of characters from a menu, and then click on a character
to select it. The selected characters can be copied onto the clipboard with a single click,
and you can then paste them (e.g., with Ctrl-V) where you like.
There are many other similar tools, often with advanced character search features.
There are also ways to configure your keyboard on the fly so that keys and key com-
binations produce characters that you need frequently.
What’s in a Character?
We use characters daily: we type them, and we read them on screen or on paper. We
use text-processing programs routinely, much like people used to use typewriters, pens,
or other writing tools. How could characters create problems?
Why Do We Need to Know About Characters?
If English is your native language, you are accustomed to using a small set of characters,
consisting of the letters A–Z and a–z, digits 0–9, and a few punctuation characters.
Most novels, newspaper articles, and memos contain no other characters. Since you
seem to be able to type these characters directly on a keyboard, why should you learn
more about characters and get confused? To be honest, character issues are confusing.
Suppose you use a computer only to write and edit texts in English, perhaps as a sec-
retary or a technical editor. You still have reasons to know about characters:
Figure 1-1. Character Map, part of Windows XP, lets you type any Unicode character
• Computer technology has caused a decline in typography, and you can make a
positive impression by using correct punctuation instead of typewriter-style punc-
tuation. If you use a text-processing program, it probably takes care of using
“smart” quotation marks instead of "straight" quotes, but you need to learn how
to produce dashes—like this—and how to prevent bad line breaks.
• Normal English texts may contain special characters occasionally. Someone may
spell Caesar as Cæsar, or use a word like fiancé, rôle, or garçon the French way, or
use the per mille sign ‰ or the euro sign €. Michael Everson writes: “Despite
unfounded but widespread belief to the contrary (based doubtless on the preva-
lence of ASCII), diacritics (usually French ones) are often found in naturalized
English words. Examples are: à la carte, abbé, Ægean, archæology, belovèd, café,
décor, détente, éclair, façade, fête, naïve, naïvety (but cf. non-naturalized naïveté),
noël, œsophagus, résumé, vicuña” (http://www.evertype.com/alphabets/eng
lish.pdf). You may regard some of these spellings as foreign or obsolete, but people
may still use them in English. There are often good reasons to change the spelling
to something simpler, but not knowing how to produce the characters is not a good
reason.
• Your text may contain foreign names with some strange characters. Although it is
common to simplify the spelling, you can stand out positively by doing things
correctly. Suppose that someone’s surname is Hämäläinen and she works in an
important international position. She is probably accustomed to seeing her name
written as Hamalainen or Haemaelaeinen. But wouldn’t she be delighted if some-
one were polite enough and competent enough to spell her name right, just for a
change? However, she might not like it if someone tried to do so and failed, pro-
ducing Hmlinen or H{m{l{inen.
• You might even be asked to include quotations in a foreign language. You might
even need to work with a document in a foreign language, because someone has
to do that and this is your day for being that someone. In that case, you may need
to use foreign punctuation as well and to find a way to enter foreign characters
efficiently, in addition to just knowing a universal clumsy way of entering any
character.
• Texts increasingly contain technical and scientific special notations. Even casual
memos and messages may need to mention µm (micrometer) or to use the almost
equals sign ≈ or the male sign ♂. In scientific or technical texts, mathematical
formulas are often quite crucial and need to be exactly right, down to the choice
of each special symbol. The world is getting more technical and symbolic. Even
nontechnical texts like bridge columns contain special symbols, such as ♠.
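The two failed spellings of Hämäläinen mentioned in the list above are typical of real conversion errors. The following Python sketch shows how they can arise. The first failure simply discards characters that a 7-bit code lacks. The second assumes, for illustration only, an old 7-bit national code that stores ä under the number ASCII uses for “{”; the one-letter mapping table is hypothetical.

```python
name = "H\u00e4m\u00e4l\u00e4inen"  # Hämäläinen

# Failure 1: characters outside ASCII are silently discarded.
dropped = name.encode("ascii", errors="ignore").decode("ascii")

# Failure 2: data written in a 7-bit national code is read as ASCII.
# Assumed mapping for illustration: that code stores the letter under
# the number that ASCII assigns to "{".
national_seen_as_ascii = str.maketrans({"\u00e4": "{"})
misread = name.translate(national_seen_as_ascii)

# A common deliberate simplification: drop the diacritic marks.
simplified = name.replace("\u00e4", "a")

print(dropped)     # Hmlinen
print(misread)     # H{m{l{inen
print(simplified)  # Hamalainen
```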
In multilingual applications, characters and their codes are a major issue. Even a web
site with two or more languages or a bilingual dictionary can be regarded as a multilingual
application, and it creates the problem of representing the characters of both or all
languages. For example, people using French and people using Russian on computers
probably work with their own tools, settings, and conventions, but if you need to create
a document that is bilingual in French and Russian, you need to make sure you can
work with both Latin letters with diacritic marks and Cyrillic letters. In effect, you
would need to use Unicode, one way or another.
If you are a computer professional, you need to be prepared to handle data-processing
problems that may involve characters of any kind. Someday someone will ask you to
work with a system for processing data in a strange language or with strange symbols
in it, perhaps even in a writing system where text runs right to left. It will be very difficult
if you have no background in working with such issues. Most people need quite some
time to digest character problems and techniques. You may find that, with something
you thought you knew for years, you have completely misunderstood some basics.
Even if you process only “normal” text, character code standards and specifications are
more important than they used to be. Modularity of software requires that you isolate
character-level processing from other levels. You should not test for a character varia-
ble’s value being equal to 32 to test whether it is a space character. Often, even a more
sensible test, against the character constant ' ', is suboptimal, and using a built-in func-
tion like isspace is better, since it takes care of other space-like characters as well. Tools
developed for such operations are increasingly based on general specifications in character
standards, especially the Unicode standard. They are supposed to define, in a
systematic and all-encompassing way, the fundamental properties of characters, like being
space-like, or being a letter, or allowing a line break before or after a character. To
use such definitions and software modules that implement them, you don’t need to
know every detail, but you need to know the principles and the ways to get at the details
when needed.
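The three ways of testing for a space can be compared in a short Python sketch. It uses Python's str.isspace method as a rough counterpart of the isspace function mentioned above; the three helper functions are hypothetical names introduced only for this comparison.

```python
def is_space_by_number(ch):
    """Fragile test: compares against a hard-coded code number."""
    return ord(ch) == 32


def is_space_by_constant(ch):
    """More readable, but still recognizes only one character."""
    return ch == " "


def is_space_by_property(ch):
    """Delegates to the character properties that str.isspace consults."""
    return ch.isspace()


# SPACE, NO-BREAK SPACE, EM SPACE, IDEOGRAPHIC SPACE, and a letter.
samples = ["\u0020", "\u00a0", "\u2003", "\u3000", "a"]
for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        is_space_by_number(ch),
        is_space_by_constant(ch),
        is_space_by_property(ch),
    )
```

Only the property-based test recognizes the no-break space, the em space, and the ideographic space as space-like.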
In addition, if you design or develop programs, databases, or systems, you will find that
it is extremely difficult to adapt them to processing different character sets, if they were
not designed to work that way. If the software is full of code that relies on using 1 byte
(octet, 8-bit entity) for one character, it may need an almost complete rewrite if it needs
to be modified to process Chinese text as well.
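A small Python sketch makes the danger of the one-byte-per-character assumption concrete. The sample text and the deliberately buggy helper first_character_by_bytes are invented for this illustration, and the three encodings are just examples.

```python
text = "\u4e2d\u6587 text"  # two Chinese characters, a space, four Latin letters

# The number of characters is fixed; the number of bytes depends on the encoding.
for encoding in ("utf-8", "utf-16-le", "gb2312"):
    data = text.encode(encoding)
    print(encoding, len(text), "characters,", len(data), "bytes")


def first_character_by_bytes(data):
    """Buggy on purpose: assumes the first character occupies exactly one byte."""
    return data[:1].decode("utf-8", errors="replace")


# Cutting after one byte splits a multibyte character and yields U+FFFD.
broken = first_character_by_bytes(text.encode("utf-8"))
print(f"U+{ord(broken):04X}")   # U+FFFD
print(f"U+{ord(text[0]):04X}")  # U+4E2D
```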
Characters as Units of Text
A character is a basic (or “atomic”) unit of written text. A piece of text is a sequence
of characters, also called a string. This does not necessarily mean that text is always
displayed so that its characters appear linearly one after another, although this is what
happens for English text, if we ignore the issue of division into lines. In other writing
systems, consecutive characters may be combined into one glyph in complex ways.
However, the text is still logically a sequence of characters.
Characters as abstractions
To store, process, and transfer data in digital form, we need an abstract concept of a
character. It would not be feasible to store the specific appearance of each written
character. Instead, we store information that tells which character it is, independent of
Variation of appearance or different characters?
Problems arise when the concept of an abstract character has to be applied to concrete
situations. We know what the letter “A” is, but is it the same as the lowercase letter
“a”? That is, is the difference between them just variation in appearance, the same way
as the letter “A” in the Times font differs from the letter “A” in the Arial font? In fact,
the lowercase letters are a medieval invention, created by people who wrote text by
hand and needed forms that are more convenient for that.
We could have defined “A” and “a” as just visual variants of the same abstract character,
but we didn’t. Quite early in the history of computers, this decision was made. It has
far-reaching implications. If you wish to process input data so that upper- and lowercase
letters are equivalent, to make things easier to people who type the data, you need to
do something special to take care of that.
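In Python, for instance, the "something special" might look like the following sketch. The function name same_ignoring_case is made up here, and str.casefold is only one of several possible ways to ignore case.

```python
def same_ignoring_case(a, b):
    """Compare two strings so that upper- and lowercase letters are equivalent."""
    return a.casefold() == b.casefold()


print("Unicode" == "UNICODE")                    # False
print(same_ignoring_case("Unicode", "UNICODE"))  # True
```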
To take things a bit further, consider the Latin letter “A” and its relationship to the
corresponding Cyrillic letter and the corresponding Greek letter, capital alpha. All three
letters look the same in most fonts, and they share a common origin. Yet they belong
to different alphabets: the Latin alphabet A, B, C, D…, which we use in English and
many other languages, the Cyrillic alphabet А, Б, В, Г…, which is used in Russian and
many Eastern European languages, and the Greek alphabet Α, Β, Γ, Δ… (alpha, beta,
gamma, delta…).
It would have been possible to identify the Latin “A” and its Cyrillic and Greek coun-
terparts. However, it was decided to keep them separate. Generally, Unicode (and
character standards in general) do not unify characters across writing system bounda-
ries. We might take this just as a fact of life and live with it. But we might also look at
its reasonableness. Consider the operation of converting text from upper- to lowercase.
The Latin letter “A” should become “a,” whereas the Greek letter alpha “Α” should
become α. It would be impossible to do this automatically if it were impossible to tell,
from the internal digital representation, whether the original data contains the Latin
“A” or the Greek “Α.”
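The point can be checked with a few lines of Python, given here as an illustrative sketch. It relies on the standard unicodedata module and on str.lower; the variable names are arbitrary.

```python
import unicodedata

# Latin A, Greek capital alpha, Cyrillic A: similar shapes, different characters.
look_alikes = ["\u0041", "\u0391", "\u0410"]

lowered = {}
for ch in look_alikes:
    lower = ch.lower()
    lowered[ch] = lower
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}",
        "->",
        f"U+{ord(lower):04X} {unicodedata.name(lower)}",
    )
```

Because each letter has its own number, each one gets the lowercase form of its own alphabet.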
Writing systems were invented by people, and characters are creations of mankind, not
nature. Thus, the identity of abstract characters is in a sense just a decision made by
some people. However, it is usually an informed decision.
Variation in shape turned into a character difference
In many cases, stylistic variation in drawing or printing a character has been “frozen”
so that a variant obtains a specific shape and meaning. The ancient Romans used the
letter “V” both as a consonant and as a vowel. Later, it appeared in different variants,
such as a rounded one, like our “U.” People started using the original version and
different curved variants in different contexts. As such usage became systematic, con-
sistent, and common, the letter “U” was born.
Therefore, we now have the independent characters “V” and “U.” They are, in turn,
written with stylistic variation, though now the general idea is that the variation should
not obscure the difference between these two characters. Yet, you might still see “V”
used for “U” for stylistic reasons, especially to imitate ancient inscriptions (SENATVS
POPVLVSQVE ROMANVS).
The letters “U” and “V” later gave rise to new characters that were originally
formed as their typographic variants, as well as to the letter “W,” originally a digraph
(VV). Special forms of this letter have been recognized as separate characters, such as
the modifier letter small w, ʷ. The story goes on. In different areas that need new sym-
bols, characters are created as variants or modifications of old characters. This seems
to suit the human mind better than the invention of new character shapes from scratch.
Characters and “abstract characters”
The Unicode standard defines different meanings for the term character. The first one
is: “The smallest component of written language that has semantic value; refers to the
abstract meaning and/or shape, rather than a specific shape (see also glyph), though in
code tables some form of visual representation is essential for the reader’s understand-
ing.” The second meaning is that “character” is a synonym for “abstract character,”
which is defined as “a unit of information used for the organization, control, or repre-
sentation of textual data.”
Thus, the difference seems to be that an abstract character may have a control purpose
only. Control purposes include line breaks, for example. In more common terminology,
“character” in Unicode often means a printable (graphic) character, whereas “abstract
character” means what is commonly called just “character,” which includes printable
and control characters.
On the other hand, the Unicode standard also uses the expression “abstract character”
to refer to a symbol that may be perceived by users as a character (“user character”),
although it cannot be represented as a single Unicode character (also known as encoded
character or coded character). In particular, a symbol with special marks (diacritic
marks) on it, such as ó, cannot always be represented as one character in Unicode but
may be a sequence of two or more characters.
The expression “semantic value” is somewhat misleading in this context. A character
such as a letter can hardly be described as having a meaning (semantic value) in itself.
It would be better to say that a character has a recognized identity and it may be some-
times used as meaningful in itself (as a symbol or as a one-letter word) but more often
as a component of a string that has a meaning. Moreover, the “smallest component”
part is somewhat vague. A character such as ú (letter u with an acute accent), which
belongs to Unicode, can often be regarded as consisting of smaller components: a letter
and a diacritic (acute accent). In fact, in Unicode, the character ú may be regarded
either as a character on its own or as a combination: as two successive characters, letter
“u” and a combining acute accent.
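The two representations of ú can be compared in Python. This is a minimal sketch using the standard unicodedata module, in which the normalization forms NFC and NFD convert between the one-character and the two-character representation.

```python
import unicodedata

precomposed = "\u00fa"   # the letter as a single character
combining = "u\u0301"    # letter u followed by COMBINING ACUTE ACCENT

print(len(precomposed), len(combining))  # 1 2
print(precomposed == combining)          # False

# Normalization converts between the two representations.
print(unicodedata.normalize("NFD", precomposed) == combining)  # True
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
```

A program that compares strings containing such characters may therefore need to normalize both sides first.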
The intuitive concept of character varies by language and cultural background. If you
know the letter ä mainly from J. R. R. Tolkien’s books, you might regard it just as letter
“a” with a special mark that indicates that it is to be pronounced separately. You might
even regard the two dots just as optional decoration, as in “naïve” if spelled in the
French way. If your native language were Finnish, you would certainly treat ä as a
completely separate character, and you would have learned at school that it has its own
position in alphabetic order (a, b, c,…x, y, z, å, ä, ö). Similarly, in Swedish, the words
“här” (“here”), “har” (“has”), and “hår” (“hair”) must be kept clearly separate. To a
German, ä is different from “a,” but it is treated as primarily equivalent to “a” in al-
phabetic order and is in a sense a variant of “a” (“a Umlaut”).
Unicode, aiming at universality, generally recognizes written forms as separate char-
acters, if at least one language or commonly used notation system makes a difference.
Thus, “a” and ä are treated as distinct. If you wish to handle them as equivalent, you
need to write code that treats them that way.
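One possible way to write such code is sketched below in Python. The helper name strip_marks is hypothetical. The approach (decompose, then drop combining marks) is only one option, and, as the Swedish example shows, it is appropriate only where the distinction really does not matter.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks after decomposition, so that marked letters match plain ones."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


words = ["h\u00e4r", "har", "h\u00e5r"]  # three distinct Swedish words
print([strip_marks(word) for word in words])  # ['har', 'har', 'har']
```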
Characters and other units of text
Although a character is a natural “atom” of text in data processing, it does not always
correspond to people’s intuitive idea of the basic constituents of text. Looking at text
in English, we might occasionally ask ourselves whether the ligature ﬁ is two characters
or one. In other writing systems, similar questions arise more often. Unicode takes a
liberal approach to identifying a complex character in many cases. You can represent
ﬁ as one character or (more often) as two characters, “f” and “i.” As mentioned above,
similar principles apply to letters with diacritic marks.
People who speak languages with many diacritic marks or ligatures may regard a symbol
like ﬁ or ú as a single character, even though they are often coded as sequences of
characters. In some cases, it would not even be possible to code the symbol as a single
character in Unicode, since Unicode does not contain all the combinations and ligatures
that can be formed.
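The practical effect of coding a ligature as one character can be seen in a brief Python sketch. The sample word is made up, and the compatibility normalization form NFKC is used here as one way of folding the ligature back into separate letters.

```python
import unicodedata

ligature = "\ufb01"             # LATIN SMALL LIGATURE FI, a single character
word = "de" + ligature + "ne"   # looks like "define" when displayed

print(len(word))      # 5
print("fi" in word)   # False: a search for the two letters misses the ligature

# Compatibility normalization replaces the ligature with the letters f and i.
folded = unicodedata.normalize("NFKC", word)
print(folded, len(folded))  # define 6
```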
Moreover, although characters might be written separately, as in “ch,” their combina-
tion might be understood as a single entity by some people. In English, “ch” denotes a
particular sound and has thus some identity of its own. Some other languages treat the
combination as an inseparable unit even in alphabetic order: in a dictionary, words
would appear in an order like car, czar, char. Such treatment has become less common,
though, since it is somewhat more difficult to implement in automated processing.
Unicode treats “ch” as two characters but recognizes that it might constitute a unit in
ordering.
Partly for such reasons, the ordering of characters is rather complex. Unicode does not
prescribe a single ordering of characters and strings. Rather, it defines a basic (default)
ordering that can be used as a basis for defining language-dependent and even
application-specific orderings.
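To illustrate why ordering cannot simply follow code numbers, here is a simplified Python sketch of a tailored ordering. It is not the Unicode default ordering. The function make_sort_key and both alphabets are hypothetical constructions for this example, and real collation involves several further levels of comparison.

```python
def make_sort_key(alphabet):
    """Return a key function that ranks words by a custom list of ordering units."""
    rank = {unit: i for i, unit in enumerate(alphabet)}
    longest = max(len(unit) for unit in alphabet)

    def key(word):
        word = word.lower()
        ranks = []
        pos = 0
        while pos < len(word):
            # Prefer the longest unit that matches here (e.g. "ch" over "c").
            for size in range(longest, 0, -1):
                unit = word[pos:pos + size]
                if unit in rank:
                    ranks.append(rank[unit])
                    pos += len(unit)
                    break
            else:
                # Unknown characters sort after the alphabet, by code number.
                ranks.append(len(alphabet) + ord(word[pos]))
                pos += 1
        return ranks

    return key


latin = list("abcdefghijklmnopqrstuvwxyz")
# Assumed tailoring: "ch" is a unit of its own that sorts after all other "c" words.
ch_after_c = latin[:3] + ["ch"] + latin[3:]
# Swedish/Finnish-style tailoring: three extra letters follow z in a fixed order.
nordic = latin + ["\u00e5", "\u00e4", "\u00f6"]

ch_sorted = sorted(["char", "czar", "car"], key=make_sort_key(ch_after_c))
print(ch_sorted)  # ['car', 'czar', 'char']

nordic_words = ["\u00f6l", "\u00e4rta", "zon", "\u00e5sna"]
by_code_number = sorted(nordic_words)
by_alphabet = sorted(nordic_words, key=make_sort_key(nordic))
print(by_code_number == by_alphabet)  # False
```

Sorting by code numbers alone places ä before å, which contradicts the alphabetic order a reader of Swedish or Finnish expects.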
Characters Versus Images
Characters are normally represented in graphic form, as something that can be called
an image. However, there is a fundamental difference between an image and a charac-
ter. An image can be a particular rendering of a character, much like a spoken word is
a particular presentation of an element of a language. Moreover, most images are not
renderings of characters at all.
Character code standards mostly identify a symbol as a character only if it is actually
used in texts—e.g., in books, magazines, newspapers, and electronic documents. Char-
acters that are normally used only in product labels and other specialized contexts are
often borderline cases. However, they are often identified as characters if they are used
in conjunction with symbols that are undeniably characters.
A typical example is the estimated symbol ℮, a stylized variant of the letter “e.” It is not
used in normal texts, but only in European packaging to claim conformance to certain
standards in specifying a quantity. However, it is identified as a character, partly be-
cause it is used in packages in relation to text characters—e.g., in “℮ 200 g” (indicating
that the mass of the product is 200 grams, within tolerances defined in specific regu-
lation).
On the other hand, logos and identifying symbols are not treated as characters, even
though they might be accompanied by texts. By its nature, a logo consists of a name or
abbreviation in a particular graphic style. Hence, it would be unnatural to encode it as
a character or sequence of characters, although we might use a string of characters as
a replacement for a logo (e.g., when a document containing a logo needs to be converted
to plain text and the logo conveys essential information).
Similarly, most of the various political, ideological, or religious symbols are treated as
graphic symbols that are not characters. They are not normally used in texts. Their
shape may vary, but not as part of font variation. However, for various reasons, some
graphic symbols have been defined as characters in some character codes, contrary to
these principles. Unicode therefore contains them as characters, so that existing texts
using such characters can be encoded.
Generally, a graphic symbol is encoded as a character in Unicode, if
there is need for exchanging it in digital form in plain text. Decisions on
this are sometimes difficult and may be affected by tradition.
The distinction between a character and an image is often a practical decision to be
made by the author or editor of a document. In many cases, you have a choice between
a character and an image. For example, suppose that you are designing a user interface
for a document, program, or web page and you need graphic symbols for “Next” and
“Previous.” It may often be best to use words, but let us assume that you want to use
arrows pointing to the left and to the right. Beware that even at this fairly abstract level,
the decision is not culturally neutral: it implies left-to-right writing direction.
In Unicode, there is a largish block of arrow characters. Among them, a few like ← and →
are widely available in commonly used fonts. However, they are not very prominent
graphically, even if shown in bold, in large font, and in color. Their graphic design is
character-like, not iconic. Some other characters in the Arrows block of Unicode look
more solid, but they are not as common in fonts. For buttons or links, specially designed
images may thus work better. On the other hand, in running texts, the arrow characters
often work well. If you wish to make references to other entries in an encyclopedia by
using arrows, then “→foobar” works better than a word preceded by a distinctive
graphic.
Generally, when deciding between the use of characters and the use of an image for
presenting a graphic symbol, the following items should be considered:
• Are there some Unicode characters that could be used, and are they suitable both
by their defined semantics and by their typical graphic appearance?
• Is it possible that the document will be rendered so that images are not displayed?
If yes, is it possible to specify a textual alternative to the image (such as the alt
attribute in HTML markup)?
• How safely would the character work, given all the possible problems with encod-
ings, fonts, etc.?
• Is it acceptable, and perhaps desirable, that the symbol changes size, shape, or color
when text font size, face, or color is changed?
• Is it possible that the data will be processed as a character string—e.g., stored in a
database or used in a search string?
For example, suppose we write about music and wish to refer to F-sharp and B-flat
using the conventional musical symbols: F♯, B♭. The Unicode approach would use the
special characters: music sharp sign ♯ and music flat sign ♭. However, these characters,
although part of Unicode since Version 1.1, are poorly supported in fonts. Even though
you could find them in some fonts at your disposal, their appearance might not fit into
your typographic design. You might end up using the number sign # and the letter “b”
as replacements. In web authoring, for example, you might decide that although
B&#x266D; would be technically quite correct (using a so-called character reference to
include the flat sign), it is safer to create a small image, say flat.gif, and embed it with
markup like B<img src="flat.gif" alt="‑flat">. This means that the flat symbol re-
mains in constant size if the text size is changed, but this is usually tolerable.
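A character reference identifies a character by its Unicode number, written in decimal or hexadecimal. As a rough illustration, the following Python sketch resolves two such references with the standard html.unescape function; the markup string is an invented example.

```python
import html
import unicodedata

# One hexadecimal and one decimal character reference.
markup = "B&#x266D; and F&#9839;"
text = html.unescape(markup)

special = [ch for ch in text if ord(ch) > 127]
for ch in special:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```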
Sometimes character-looking symbols are not characters. Microsoft Word by default
changes the three-character sequence “-->” into a kind of arrow symbol (à). However,
this arrow is different from any Unicode character: it is just a glyph in the Wingdings
font. It is therefore something between a character and an image; like so many compromises,
it combines the drawbacks of the alternatives.
Processing of Characters
The previous discussion mentioned that characters can be processed and used in many
ways that are not possible (or practical), if information is represented as images, sounds,
or in another nontext format. This includes:
• Searching for occurrences of a word or other fragment of text, using either a simple
search string or a text pattern
• Performing automatic replacements, such as substituting a string for another in all
occurrences
• Indexing the data for efficiency of searching and for creating an alphabetic index
or concordance (list of occurrences of words)
• Sorting text data—e.g., for presentation in alphabetic order
• Copying text from an application or data format to another, often via a clipboard
• Modifying text as in a text editor or text-processing application, by deleting, in-
serting, and replacing characters
• Selecting parts of text by user actions, such as painting or keyboard commands
• Recognizing constructs like words, syllables, morphemes (components of a word
with a meaning), and sentences
• Computing statistics on the use of characters, words, phrases, etc.
• Spelling and grammar checks
• Automatic or computer-aided translation
• Presenting texts in audible form, via speech synthesis, which is more natural these
days than you might expect from many science fiction films
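A few of the operations listed above take only a line or two each once the data is text. The following Python sketch, with an invented sample sentence, shows searching, replacing, indexing, and counting; none of these would be practical on a scanned image of the same sentence.

```python
import re
from collections import Counter

text = "the cat saw the other cat"

# Searching for occurrences of a word with a text pattern.
positions = [match.start() for match in re.finditer(r"\bcat\b", text)]
# Automatic replacement of one string by another in all occurrences.
replaced = text.replace("cat", "dog")
# An alphabetic index of the distinct words.
words = text.split()
index = sorted(set(words))
# Statistics on the use of words.
frequencies = Counter(words)

print(positions)           # [4, 22]
print(replaced)            # the dog saw the other dog
print(index)               # ['cat', 'other', 'saw', 'the']
print(frequencies["the"])  # 2
```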
Even the display of characters on screen or paper involves processing:
• Choice of font, which can be a complex process
• Application of bolding, italics, and other features, if requested
• Selection of contextual forms for characters
• Recognition of character sequences that should or could be rendered using liga-
tures or other special methods
• Formation of characters with diacritic marks, often requiring complex algorithms
• Adjusting spacing between characters and words, perhaps for justification of lines
• Breaking text into lines, perhaps using hyphenation
In particular, suppose that some document exists on paper only, or as a scanned image
only. The above lists of possibilities can be consulted when estimating whether the text
should be converted into text format. The conversion may require quite a bit of work,
including the identification of special characters occurring in the documents.
• Assigning a unique number to it; this number will never be changed
• Assigning a unique Unicode name for it; this will never be changed either, even if
it is found misleading or originally mistyped, and it is best to regard it as a mnemonic
identifier rather than a name in a normal sense
• Specifying a set of properties for it in a rigorous, formalized manner; they describe,
for example, the general class (letter, digit, punctuation, etc.) of the character, its
uppercase equivalent when applicable, etc.
• Making annotations—i.e., prose descriptions that clarify the meaning, often com-
paring the character with other characters, presenting alternate names for it, and
sometimes even describing possible variation in the visual appearance
For example, the plus sign is defined in Unicode as follows:
• The representative glyph looks much like +.
• The number is 2B, often written as 002B for uniformity, in hexadecimal (base 16)
notation, which means 43 in decimal (base 10).
• The name is PLUS SIGN.
• The general category is “Sm,” which is short for “Symbol, Math.” Line breaking is
permitted after the character. There are several other formalized properties as well;
we will discuss the various properties in detail in Chapter 5.
• There are no annotations for this character.
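Most of the items in this definition can be queried from a copy of the Unicode character database. The following sketch uses Python's standard unicodedata module, which exposes the name and the general category; it does not give access to the annotations or the representative glyph.

```python
import unicodedata

plus = "+"
number = ord(plus)

print(f"{number:04X}", number)       # 002B 43
print(unicodedata.name(plus))        # PLUS SIGN
print(unicodedata.category(plus))    # Sm
```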
Annotations used to emphasize differences
The plus sign is not easily confused with any other character, and it has no widely used
alternate names in English. Therefore, no annotations were deemed necessary. For the
comma character “,” character number 002C, for example, there is an annotation that
says that the character has the alternative name “decimal separator.” This does not
mean that the decimal separator should be a comma (although most languages in fact
use a comma for that). It just means that in some contexts some people call the comma
“decimal separator.” This effectively identifies a comma used as a decimal separator
with the character number 002C, as opposed to treating it as a separate though similar
character. On the other hand, the annotations related to the comma character also
contain notes that refer to “Arabic comma,” “single low-9 quotation mark,” and “ide-
ographic comma” as separate characters. This can be read as a warning against con-
fusing the comma with those visually similar characters. For example, some languages
use a single low-9 quotation mark as an opening quote in some contexts (e.g., in Ger-
man: ‚gut‘); without a warning, you might be inclined to think that it’s just a special
use for the comma.
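The comma and the similar characters mentioned in the annotations can be told apart by their numbers and names, even when they look alike on screen. A short Python sketch, assuming the standard unicodedata module:

```python
import unicodedata

# The comma and three characters that are easily confused with it.
numbers = (0x002C, 0x060C, 0x201A, 0x3001)
names = {number: unicodedata.name(chr(number)) for number in numbers}

for number, name in names.items():
    print(f"U+{number:04X} {name}")
```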
The representative glyphs
The definitions of characters in Unicode are logical and do not imply any particular
presentation of a character, either internally (in digital form, as bits) or visibly on paper
or screen. However, a representative glyph is given to clarify the identity of a character.
The Unicode standard explicitly says that the representative glyph is not a prescriptive
form of the character, but it lets a “knowledgeable user” recognize the character.
The glyphs used in Unicode code charts tend to be neutral and generic rather than
typographically well-designed. They typically lack artistic ambitions, and they have
been designed so that differences with other characters have been emphasized. That is,
glyphs for characters that are often rather similar in practice, especially if we consider
variation across fonts, have usually been designed to be sufficiently different from each
other.
The number and the Unicode name as identifiers
The number assigned can be regarded as identification only, although in practice, it is
used as a basis for the digital representation. The Unicode name is an alternative, more
mnemonic identifier. As a mental exercise, consider the possibility of sending infor-
mation by telephone so that you utter the names of Unicode characters, in order to
express something complicated like a foreign word or a formula. If both participants
have access to information about Unicode characters, the communication can be com-
pletely successful even though no visible characters are sent and no digital encoding is
used.
Thus, when characters are represented in digital form, each character is internally a
number, an integer. Numbers in turn are represented as sequences of bits, but this is a
different level. When a file contains the string “Hello” (without the quotation marks),
it really contains five numbers corresponding to the characters. In most character codes,
this is the sequence 72, 101, 108, 108, 111.
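The claim is easy to check in any language that exposes code numbers. The following is a minimal sketch in Python (our choice for illustration; the variable names are ours):

```python
# Each character of a string is internally an integer, its code number.
text = "Hello"
code_numbers = [ord(ch) for ch in text]
print(code_numbers)  # [72, 101, 108, 108, 111]

# The mapping works in both directions: numbers back to characters.
restored = "".join(chr(n) for n in code_numbers)
print(restored)
```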
A character code can assign numbers to characters arbitrarily, but once assigned in a
specification, they should not be changed. In practice, the assignments have been made
in a partly systematic way, so that related characters often have consecutive numbers.
Many modern standards, specifications, and instructions identify characters by their
Unicode numbers to achieve unambiguity. Previously, documents on matters like
mathematical or technical notations or transliteration of texts used to specify the sym-
bols to be used just by showing them as visual forms, as ink on paper. This turned out
to be particularly problematic in the computer era, when different people interpreted
such signs differently, resulting in incompatible encoding of data.
Suppose that you specify, for example, that in some notation, the double prime char-
acter (″), with Unicode number 2033 in hexadecimal, be used (say, to denote seconds
as a subdivision of a degree when expressing angles). Actually, the Unicode number
alone would suffice, but mentioning the name makes the specification more readable.
In principle, you do not even need to write the character itself, though usually it helps.
By identifying the Unicode number, you have achieved several things:
• You have unambiguously identified the character you mean. People may still decide
to use some similar character instead, if they have difficulty typing the right char-
acter. Yet, it is clear which is the right character; others are various replacements.
• You have given a number that can be used as an index to large collections of in-
formation about the character, such as varying visual shapes for it, its defined
properties, fonts containing it, definitions of meaning, and comments on scope of
usage.
• The number can be used for typing the character by anyone who knows a general
input method for Unicode characters in a particular environment. Typical word
processors have at least one mechanism that produces a specific character, if you
just specify its Unicode number.
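As an illustration of using the number as an index to information, the sketch below looks up the double prime in the character database that ships with Python (the standard unicodedata module). It is only one possible tool, and the variable names are ours:

```python
import unicodedata

# Produce the character from its Unicode number (2033 in hexadecimal).
double_prime = chr(0x2033)

# Use the number as an index into the character database: get the name.
name = unicodedata.name(double_prime)

# The name identifies the same character in the other direction.
by_name = unicodedata.lookup("DOUBLE PRIME")

print(f"U+{ord(double_prime):04X}", name)
```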
Thus, anyone who participates in creating or clarifying notational specifications should
know the principles of Unicode and should promote the use of Unicode numbers for
characters. You should probably expect resistance, since it is not quite easy to see the
benefits.
Unicode is more explicit
Older character standards, such as ASCII and the ISO 8859 family of standards, contain
substantially less information about characters. They rely on the names of characters
and the representative glyphs—and intuitive understanding related to the traditions of
using characters. The same applies to the ISO 10646 standard, which is the official
international standard that corresponds to Unicode. This means that we have two
standards that are fully in accordance, ISO 10646 and the Unicode standard, but the
latter contains a lot of additional information. Moreover, the Unicode standard is freely
available on the World Wide Web, which is why people speak about Unicode and not
ISO 10646, except in official standards and related documents.
The collection of all Unicode (or ISO 10646) characters is sometimes called the Uni-
versal Character Set (UCS). This expression is used especially in formal contexts, when
one needs to refer to ISO 10646 and does not want to mention Unicode. In normal
prose, we usually refer just to Unicode characters.
Spelling of names and the U+nnnn convention
The Unicode names of characters are written in all uppercase in the Unicode standard,
but this is just a convention. In fact, the standard itself spells the names in all lowercase
in some contexts. Uppercasing is often used to indicate (or hint) that a character is
referred to by its Unicode name. However, in this book, we use normal (mixed) case
for the names, except in some quotations.
We will use the conventional style of mentioning a Unicode character by its code number
in hexadecimal (base 16) and prefixed with U+, e.g., U+002B. We could use just
the number, but then you might not always know whether we use a number for such
identification or just as a number.
This notation is used with at least four hexadecimal digits, so there are often leading
zeros. All characters in the so-called Basic Multilingual Plane (BMP) can be expressed
in four digits, but some newer characters need more.
We will normally mention first the Unicode name, then the code, often with a glyph
between them. Thus, while you might see a Unicode character mentioned as U+002B
PLUS SIGN in many sources, we will mostly say: the plus sign + U+002B.
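The notation is simple to generate mechanically. Here is a small Python sketch; the helper names u_plus and book_style are hypothetical, introduced only for this illustration:

```python
import unicodedata

def u_plus(ch: str) -> str:
    """Return the U+nnnn notation: hexadecimal, at least four digits."""
    return f"U+{ord(ch):04X}"

def book_style(ch: str) -> str:
    """Mention a character as: name in lowercase, glyph, then the code."""
    return f"the {unicodedata.name(ch).lower()} {ch} {u_plus(ch)}"

bmp_example = u_plus("+")            # four digits, with leading zeros
beyond_bmp = u_plus(chr(0x1D11E))    # a character outside the BMP needs five
mention = book_style("+")
print(bmp_example, beyond_bmp, mention)
```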
Unicode Definitions of Characters
The definition of a character in Unicode is given partly in code charts, partly in the
Unicode Database, which contains large tables of data on characters, by property, to
be discussed in Chapter 5. Here we concentrate on the information in the code charts,
which are available via http://www.unicode.org/charts/. Each code chart begins with a
table of glyphs, followed by notes on each character. The notes vary greatly in length
and nature, but they should always be consulted when in doubt about the identity of a
character. Note that the code charts have been divided into two major groups, “Scripts”
(which contains letters, ideographs, and other characters to write different human lan-
guages) and “Symbols and Punctuation.” There is some overlap, since some blocks of
characters belong to both groups.
The description of a character in a code chart consists of the following, where the first
three items are given for every character (on one line), and others may or may not be
present:
• Unicode number
• Representative glyph (in normal text size)
• Unicode name, in uppercase; this name is fixed
• Old (Unicode 1.0) name, in uppercase on a line of its own
• Other name(s), preceded by an equals sign = and written in lowercase; these names
may be changed
• Comment(s) on usage, preceded by a bullet •
• Cross reference(s) to other characters, preceded by an arrow →; these references
often warn against confusing a character with another, similar-looking character
• Information that specifies the character as a decomposable character, using a no-
tation that begins with the symbol ≡ (indicating so-called canonical equivalence)
or with the symbol ≈ (indicating weaker correspondence)
Figure 1-2 shows the description of the full stop (period) character in a code chart.
Definitions of Characters Elsewhere
Characters were defined and used long before Unicode. Even in our times, characters
are often used without identifying them with a reference to any character code stand-
ards. This creates ambiguity and potential diversity when text data is represented in
computer-readable form.
For example, the standards that define the SI, the International System of Units (an
extension of the metric system), use several special characters such as µ, ×, and Ω. The
authoritative formats of the standards are printed documents, and since they do not
specify code numbers or Unicode names for the characters, we are left in some uncer-
tainty. Some characters can be identified rather unambiguously, but it is unclear what
the “raised dot” character is, for example. This character, used in notations like N·m
(for newton meter), is usually interpreted as the middle dot U+00B7, but it can be
argued that a more appropriate interpretation is the dot operator U+22C5.
Similarly, the International Phonetic Alphabet (IPA) was originally defined about a
century ago. When it later became relevant to use it on computers, the characters had
to be identified as Unicode characters. This was far from trivial, since many IPA char-
acters can be regarded as normal Latin letters, or treated as separate symbols.
Even relatively new standards on transliteration or transcription—i.e., on conversions
between writing systems—fail to identify all characters unambiguously. For example,
many standards and tables for writing Russian words in Latin letters specify that the
so-called hard sign, ъ, is to be transliterated using a special character, but this character is
just shown as a glyph on paper. This is subject to different interpretations including
the ASCII quotation mark ", the right double quotation mark ”, and the double prime ″
(U+2033). The Unicode standard makes, in a code chart, the following note about the
modifier letter double prime ʺ (U+02BA): “transliteration of tverdyj znak (Cyrillic hard
sign: no palatalization).” This might seem to resolve the issue in principle, but in practice,
that character is not present in most fonts, and we can also ask whether the Unicode
standard is authoritative in transliteration issues. Problems similar to this also exist for
some apostrophe-like characters in transliteration systems for Arabic, for example.
Figure 1-2. Sample description of a character in a Unicode code chart
What’s in a Name?
The names of characters in character standards are assigned identifiers rather than
definitions. This is particularly true for Unicode, which now has an absolute principle
of name stability. A Unicode name will not be changed even if proved wrong.
Typically, the names are selected so that they contain only letters A–Z, spaces, and
hyphens; often the uppercase variant is the reference spelling of a character name.
The same character may have different names in different definitions of character rep-
ertoires. Generally, the name is intended to suggest a generic meaning and scope of use.
However, the Unicode standard warns (mentioning full stop “.” as an example of a
character with varying usage):
A character may have a broader range of use than the most literal interpretation of its
name might indicate; the coded representation, name, and representative glyph need to
be taken in context when establishing the semantics of a character.
Although the Unicode names can be misleading—a price that we pay for their absolute
stability—most of them aren’t. The great majority of Unicode names describe the char-
acter, and the name is often the only description that the Unicode standard gives about
a character individually. Thus, the name should be taken as describing the character,
unless there is an annotation that says otherwise.
The Unicode name is in English, in a sense. In many cases, it is normal English, but
often the name contains elements from other languages, such as the name in another
language but as (somehow) adapted to English spelling.
For many purposes, it would be desirable to refer to characters by some widely under-
stood names, in different languages. There will probably be a registry of such names,
though mostly only for those characters that are widely used in each language. It will
naturally contain English names as well, partly different for U.S. English and British
English. They will of course have much similarity to the Unicode names. The naming
is expected to take place in the context of Common Locale Data Repository (CLDR),
discussed in Chapter 11.
Names of characters vary a lot, even within a language. This applies
particularly to characters that are widely used in modern notations, but
without much tradition, such as the tilde ~ or the commercial at @. Do
not assume that people know from the name alone what you mean, even
if you speak the same language.
The Unicode standard mentions some colloquial names for characters, even in lan-
guages other than English. For the @ character, it mentions that the “common, hu-
morous German slang name” is “Klammeraffe,” which means “clinging monkey.” Un-
doubtedly, in some environments, the character might be better known under that
name than under any official name. However, you need to be careful in using the alternate
names mentioned in the standard. It is better to look for information on actual
usage in a language and a subculture. Slang, by its nature, varies by time and people.
When you need to refer to a character and cannot just show it, try to mention commonly
known synonyms for it. It is not constructive to say just “use the reverse solidus.”
Instead, you can say “use the forward slash (that is, solidus), not the backslash (reverse
solidus).” Unicode names alone are often rather useless in difficult situations for iden-
tifying characters to people who are not familiar with Unicode. The same applies even
more to Unicode numbers.
Thus, you are not supposed to use the Unicode names for all characters in all contexts.
If you are used to calling the “.” character “period,” you need not start calling it “full
stop.” You need not spell out “Latin capital letter A” every time you mention capital
(uppercase) “A.” However, the Unicode names appear in many contexts, like in char-
acter selection menus in editors, so you need to know the idea.
You may wonder why Unicode assigns two immutable identifiers for a character: a
number and a name. If each of them is unique and guaranteed to remain unchanged,
what do you need the other one for? The short answer is that numbers are the basic
identifiers but names are needed too, since they have been used in programs and data
to uniquely identify characters. Although it might not be wise to write code that oper-
ates on character names that way, it would be unwise to intentionally break all such
code now.
Originally, names of characters were meant to act as identifiers across character codes.
Different codes may assign different numbers to the character ±, but they can be expected
to assign the same name, “plus-minus sign,” to it, or at least use names that can
be recognized as essentially the same. However, this idea never worked well, since the
names were in practice not always the same, or even essentially the same. Moreover,
Unicode has made the original idea unnecessary, since nowadays the Unicode numbers
are widely used to refer to characters across character codes, even when Unicode is not
otherwise used for representing characters.
Should We Be Strict About the Meanings of Characters?
People tend to use characters on the basis of their visual appearance. You see a character
like ß in some repertoire, and you start using it for the Greek letter beta, if you need it.
You see the character ø and you take it as the diameter sign, so you use it in a technical
context like “ø = 0.12 m” (saying that the diameter of something is 0.12 meters).
Unicode has strengthened such tendencies. People browse tables or menus of Unicode
characters and pick up the first one that looks right for the purpose they have in their
mind. Since Unicode has so many more characters than most old standards, there are
far more opportunities for getting lost: it is easy to find a Unicode character that more
or less looks like the one you need.
Then comes a purist and says that ß is a letter (sharp s) used in German, not any Greek
letter, and that ø is a vowel used in some Nordic languages, not a mathematical symbol.
Should we care?
Although you might realize the importance of using the right character, not just a right-
looking character, you may need to explain the issue to others. Moreover, we often
need to make compromises, and then it becomes essential to consider their impact.
Reasons for using the right character translate into risks that you need to prepare for,
when you cannot use the right character. So here are some basic reasons for being strict:
Some people see the difference
Although the character looks right to you, a specialist may well see a difference
between ß and β (sharp s versus small beta) or between ø and ⌀ (letter “o” with
stroke versus diameter sign). When you write a foreign word, anyone who speaks
that language as her native language is a specialist compared to you.
Font changes make differences noticeable
When the font is changed, the difference can become clearly visible. A typical ex-
ample is that the difference between degree sign ° (as in “50 °F” or “10 °C”) and
masculine ordinal indicator º (superscript letter “o,” used in Spanish) is very small
or nonexistent in many fonts, but very clear in many other fonts (e.g., ° versus º).
Your text might be rendered in different fonts even though you have carefully selected
a particular font. This is particularly true in web authoring and in cooperative
authoring.
Conversions operate on characters, not appearance
Automated editing of text is based on defined properties of characters, not on their
appearance. For example, text-editing commands that operate on words will (or
at least should) treat ø as a letter, not as a technical symbol. Converting text to
uppercase would turn “ß-carotene” into “SS-CAROTENE,” since “SS” is the de-
fined uppercase version of ß.
Searching looks for characters, not appearance
A search function in a program, as well as a database search, works on characters.
When asked to find the string “β-carotene” (with beta), they will not find “ß-
carotene” (with sharp s). The same applies to pattern matching and replace func-
tions. Search routines may use some heuristics in their attempt to help users with
common errors in using wrong characters, just as they may help with misspellings
—as Google might say “did you mean pseudonym?” when you have typed “psue-
donym.” But don’t rely on such features.
Automated processing generally ignores appearance
For example, automatic speech synthesis and automatic translation work on
characters as abstract entities, not on their visual appearance. If your text contains
“1º”, meant to mean “one degree” but incorrectly uses a masculine ordinal indi-
cator, it might be spelled out as “primero” (Spanish word for “first” in masculine
gender). Similarly, it might be translated incorrectly.
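Several of these points can be observed directly. The sketch below uses Python's built-in string operations and the standard unicodedata module. The sample sentence and variable names are ours, and other software may behave differently in detail:

```python
import unicodedata

sharp_s, beta = "\u00DF", "\u03B2"          # ß (sharp s) and β (small beta)
o_stroke, diameter = "\u00F8", "\u2300"     # ø (letter) and ⌀ (diameter sign)

# Conversions operate on characters: the uppercase of ß is defined as SS.
upper = (sharp_s + "-carotene").upper()

# Searching looks for characters: beta is not found where sharp s was typed.
document = "Eat " + sharp_s + "-carotene daily"
found = (beta + "-carotene") in document

# Defined properties differ even when the glyphs look alike:
# "Ll" is a lowercase letter, "So" is a symbol (other).
categories = (unicodedata.category(o_stroke), unicodedata.category(diameter))

print(upper, found, categories)
```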
Sometimes these considerations do not matter, or (more often) they need to be suppressed
in favor of other needs. If you only aim at producing a document to be distributed
on paper and you have full control up to and including the print operation, then
the appearance is all that matters. But more often than not, documents are stored and
sent in digital form. Then you may need to take precautions against wrong processing,
perhaps document what you have done, and check things after various conversions and
other operations.
Characters differ in the definiteness of their meaning. Some well-known characters like
the hyphen - (known formally as hyphen-minus in Unicode) have a wide range of uses,
and you may need to use them liberally. Computer programs need to be prepared for
handling them accordingly. But other characters have specific semantics. The letter ø
and the technical symbol ⌀ have limited uses. They should not be confused with each
other or used for other purposes without careful consideration.
Ambiguity Among Characters
The identity of characters is defined by the definition of a character repertoire. Thus,
it is not an absolute concept but relative to the repertoire; some repertoire might contain
a character with mixed usage while another defines distinct characters for the different
uses. For instance, the ASCII repertoire has a character called “hyphen.” It is also used
as a minus sign, as well as a substitute for a dash, since ASCII contains no dashes. Thus,
that ASCII character is a generic, multipurpose character, and one can say that in ASCII,
hyphen and minus are identical. But in Unicode, there are distinct characters named
“hyphen” and “minus sign” (as well as different dash characters). For compatibility,
the old ASCII character is preserved in Unicode, too (in the old code position, with the
name hyphen-minus).
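To see the distinct characters side by side, one can list their numbers, names, and general categories. This is a sketch using Python's standard unicodedata module; the selection of code numbers is ours:

```python
import unicodedata

# The old ASCII character and some of its more specific Unicode relatives.
candidates = [0x002D, 0x2010, 0x2212, 0x2013, 0x2014]

table = {
    f"U+{cp:04X}": (unicodedata.name(chr(cp)), unicodedata.category(chr(cp)))
    for cp in candidates
}

for code, (name, category) in table.items():
    print(code, name, category)
```

Note that the minus sign is classified as a mathematical symbol (Sm), whereas the hyphens and dashes are dash punctuation (Pd).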
Similarly, as a matter of definition, Unicode defines characters for micro sign, n-ary
product, etc., as distinct from the Greek letters (small mu, capital pi, etc.) from which
they originate. This is a logical distinction and does not necessarily imply that different
glyphs are used. The distinction is important, for example, when textual data in digital
form is processed by a program (which “sees” the code values, through some encoding,
and not the glyphs at all). Note that Unicode does not make any distinction, for ex-
ample, between the Greek small letter pi (π), and the mathematical symbol pi denoting
the well-known constant 3.14159… (i.e., there is no separate symbol for the latter). For
the ohm sign (Ω), there is a specific character (in the Symbols Area), but it is defined
as being canonically equivalent to Greek capital letter omega (Ω); that is, there are two
separate characters but they are equivalent. On the other hand, Unicode makes a dis-
tinction between Greek capital letter pi (Π) and the mathematical symbol n-ary product
(∏), so that they are not equivalent.
If you think this doesn’t sound quite logical, you are not the only one to think so. The
point is that for symbols resembling Greek letters and used in various contexts, there
are three possibilities in Unicode:
• The symbol is regarded as identical to the Greek letter (just as its particular usage).
• The symbol is included as a separate character, but it is defined as equivalent to
the Greek letter. There are two kinds of equivalence: canonical and compatibility.
• The symbol is regarded as a completely separate character.
You need to check the Unicode references for information about each individual sym-
bol. As a rough rule of thumb about symbols looking like Greek letters, mathematical
operators (like summation) exist as independent characters whereas symbols of quan-
tities and units (like pi and ohm) are identical to Greek letters or equivalent to them.
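The three possibilities can be told apart by normalizing the characters. The sketch below uses Python's standard unicodedata module, in which NFC applies canonical equivalences only and NFKC applies compatibility equivalences as well. The variable names are ours:

```python
import unicodedata

ohm, omega = "\u2126", "\u03A9"            # ohm sign, Greek capital omega
micro, mu = "\u00B5", "\u03BC"             # micro sign, Greek small mu
n_ary_product, cap_pi = "\u220F", "\u03A0" # n-ary product, Greek capital pi

# Canonical equivalence: the ohm sign normalizes to omega even under NFC.
ohm_nfc = unicodedata.normalize("NFC", ohm)

# Compatibility equivalence: the micro sign survives NFC but not NFKC.
micro_nfc = unicodedata.normalize("NFC", micro)
micro_nfkc = unicodedata.normalize("NFKC", micro)

# Completely separate: the n-ary product is never mapped to capital pi.
product_nfkc = unicodedata.normalize("NFKC", n_ary_product)

print(ohm_nfc == omega, micro_nfc == micro, micro_nfkc == mu,
      product_nfkc == cap_pi)
```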
How Do I Find My Character?
Suppose you have been requested to convert some printed or handwritten text into a
digital format. (At the end of this chapter, we have such an exercise.) For English text
with no special characters, you might be able to use a scanner. But what would you do
with characters that the scanner does not recognize reliably?
Such problems are fairly common. For example, you might need to check the spelling
of a foreign name from a printed reference book, or you might need to quote some
printed material. Even standards on various notations often fail to specify the characters
unambiguously: the authoritative format of a standard is usually a printed publication,
and all you have got there is ink on paper, glyphs.
The recognition of a character from its glyph can be quite difficult, and it may require
both factual and cultural knowledge about the subject area and the text. You also need
technical information on character standards, since you ultimately need to identify
glyphs as appearances of characters defined in the standards.
Looking for characters through lists or code charts is a rather hopeless task. The number
of characters is huge, and many characters look very similar to each other. For example,
how can you know whether a glyph on paper is letter “a” with a caron (ǎ) or letter “a”
with a breve (ă)? Thus, you first need some information or guess on the nature of a
character. If you know or suspect that the character appears in a Romanian name, you
have a good starting point, since the character repertoire used in Romanian can be
found in a suitable reference. Similarly, if you know that a glyph like ₣ is a currency
symbol, you have almost identified it.
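Once you have a candidate in digital form (for example, picked from a character menu), its name and decomposition confirm what it is. The following is a sketch with Python's standard unicodedata module; the variable names are ours:

```python
import unicodedata

a_caron, a_breve, franc = "\u01CE", "\u0103", "\u20A3"   # ǎ, ă, ₣

def describe(ch: str) -> str:
    """Return code number and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

descriptions = [describe(ch) for ch in (a_caron, a_breve, franc)]

# Decomposing (NFD) shows which diacritic is attached to the base letter.
caron_mark = unicodedata.name(unicodedata.normalize("NFD", a_caron)[1])
breve_mark = unicodedata.name(unicodedata.normalize("NFD", a_breve)[1])

# The general category "Sc" means currency symbol.
franc_category = unicodedata.category(franc)

print(descriptions, caron_mark, breve_mark, franc_category)
```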
The following list suggests some general online resources for identifying characters:
“Where is my Character?” (http://www.unicode.org/standard/where/)
An explanatory document by the Unicode Consortium. It explains some problems
caused by the variation of shapes of characters.
Unicode Code Charts (http://www.unicode.org/charts/)
This is official information and covers all Unicode characters. It is organized first
by division into “Scripts” (writing systems for human languages, containing letters,
syllables, and word signs) and “Symbols and Punctuation.” These parts are further
divided into large categories such as “European Alphabets.” Figure 1-3 illustrates
the appearance of the main page of the Code Charts.
Fileformat.info, section Unicode (http://www.fileformat.info/info/unicode/)
This contains data taken from the Unicode site and organized for viewing in dif-
ferent ways. It also contains information on Unicode support in different fonts. As
you get down to information on individual characters, their properties are dis-
played in a compact format, which is great when you are ready to use it.
Database of characters at the EKI (http://www.eki.ee/letter/)
Although not as exhaustive in character repertoire as the above, this database lets
you search for characters in a few ways and shows some essential extra information
on usage: it lists languages that use a character and character encodings (charsets)
that contain it. Although these lists are not complete, they are often helpful. For
example, they tell that letter “a” with a caron (ǎ, U+01CE) is used in Yoruba and
in Romanization of Bulgarian and Chinese, whereas the letter “a” with a breve
(ă, U+0103) is used in Romanian and Vietnamese and Romanization of Khmer, as
shown in Figure 1-4. However, the information is not always completely reliable;
in particular, the character used when writing Bulgarian as Romanized—i.e., in
Latin letters—is not “a” with a caron but “a” with a breve, according to standards.
Which Characters Does Each Language Use?
For details on the use of characters in different languages, you need to consult grammar
guides and textbooks on the languages themselves. However, there is an extensive
compilation of basic information in The World’s Writing Systems by Peter T. Daniels
and William Bright (Oxford University Press). There is a brief description of character
usage in a few languages in The Chicago Manual of Style, 15th Edition (The University
of Chicago Press). Online, you can find “The Alphabets of Europe,” by Michael Everson,
at http://www.evertype.com/alphabets/. It is extensive and based on detailed research,
although it partly applies different criteria to different languages: for some languages,
it includes only the basic modern alphabet; for others, it lists historical characters
and other characters that are not used in normal writing. The CLDR database,
discussed in Chapter 11, contains information on the use of letters in different languages.
Figure 1-3. Part of the interface to online Unicode code charts
Variation of Writing Systems
The most widely used writing systems, or scripts, can be classified as follows:
Alphabetic scripts
Denote sounds with letters, though usually not in a strict one-to-one manner. Ex-
amples: Latin, Greek, and Cyrillic scripts, each of which exists in different versions.
Consonant scripts, or abjads
Basically denote consonants, leaving vowels to be inferred; however, consonant
scripts may have letters for long vowels, and in some situations even short vowels
are written using small signs attached to consonants. Examples: Hebrew and Ara-
bic scripts.
Figure 1-4. Sample information on a character in the eki.ee database
Abugida scripts
These use consonant letters that imply a particular vowel after the consonant, when
used in the base form. Alternatives with other vowels or without any vowel are
indicated by additional marks. Many South and Southeast Asian scripts belong to
this category—e.g., the Devanagari script used for many Indic languages.
Syllabic scripts
Use basically one character for each syllable. Examples: the Hiragana and Katakana
scripts, used for Japanese.
Ideographic scripts
Use basically one character for one (short) word. The most widely known ideo-
graphic script is Han, often known as Chinese script, though it is also used (in part)
for other languages as well, especially Japanese and Korean, and therefore often
called “CJK.”
Consonantal writing may sound impossible, because it introduces so much ambiguity.
However, although an individual written form of a word is often ambiguous, the am-
biguities are usually resolved easily from the context by a person who understands the
language well. Moreover, languages written with a consonantal script typically have a
structure that makes this easier than for English, for example. When vowels are mainly
used to express variations of a common theme expressed by a word root, consisting of
a pattern described by a combination of consonants, the vowels can usually be inferred
from the grammatical context.
The word “script” is often used in character code contexts instead of “writing system.”
It is important to distinguish it from the use of the word “script” to denote a program-
ming concept—a certain type of a computer program, such as a Perl script.
Some scripts, such as the Latin script, are written with spaces between words, and a
space is normally a permissible line break point. Hyphenation may introduce other
break points. Other scripts may permit line breaks more freely.
The Latin script and many other scripts are written left to right, with lines proceeding
from top to bottom. These are not universal properties of human writing, and even the
Latin script is historically based on a script that was written right to left. Unicode ad-
dresses the problem of left-to-right versus right-to-left writing in two ways: by defining
inherent directionality for characters and by defining control characters for affecting
writing direction. For example, Hebrew and Arabic letters have inherent right-to-left
directionality. Special methods are needed when text in such letters contains names or
quotations that have the opposite directionality, or vice versa.
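Inherent directionality is a defined property of each character and can be queried. The sketch below uses Python's standard unicodedata module; the sample characters and labels are ours. In the returned class codes, L is left-to-right, R is right-to-left, and AL is right-to-left Arabic letter:

```python
import unicodedata

samples = {
    "Latin small a": "a",
    "Hebrew alef": "\u05D0",
    "Arabic beh": "\u0628",
    "right-to-left mark (a control character)": "\u200F",
}

# Map each label to the bidirectional class of its character.
directionality = {
    label: unicodedata.bidirectional(ch) for label, ch in samples.items()
}

for label, bidi_class in directionality.items():
    print(f"{label}: {bidi_class}")
```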
In Latin scripts, each character is normally displayed as a separate image on screen or
paper, though the spacing between characters may vary. In other scripts, the formatting
of texts for visual presentation can be essentially more difficult: the shape of a character
may depend on context; adjacent characters can be written together (using a ligature
or using cursive writing where characters join smoothly); and a character might be
displayed as an auxiliary symbol above, below, before, or behind another character.
Glyphs and Fonts
It is important to distinguish the character concept from the glyph concept. A glyph
is a presentation of a particular shape a character may have when rendered or displayed.
It has even been said that any character is an abstract idea, whereas glyphs for the
character are its different visible manifestations.
Each character we use in English normally has the same basic shape, and glyphs for it
differ in typographic design only. It is obvious that “T” in the Times font represents
the same character as “T” in the Arial font, for example. However, the letter “a” has
two rather different shapes (compare “a” in normal Times font and “a” in Times italic).
When you write literally by hand, you may draw characters differently in different
positions of a word. For example, a word-final “s” may be quite different than a word-
initial “s.” In typewritten or typeset text, or in text displayed or printed on computers,
such distinctions are not made, even in so-called handwriting-style fonts.
In Greek writing, a word-final sigma (ς) is rather different from a normal small sigma
(σ), although they are logically the same character. The first and last letter of the word
σοφός (sophos, “wise”) are the same but are written differently. However, since this
is a special case, character codes usually solve this by encoding them as two separate
characters, and Unicode follows suit, even without defining any equivalence between
them.
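The point about the two sigmas can be checked with Python's standard unicodedata module. This is a minimal sketch, not part of the original text; the variable names are illustrative only. It shows that the two forms are separate code points, that neither has a decomposition mapping to the other, and that compatibility normalization (NFKC) does not unify them, although both share the same uppercase letter.

```python
import unicodedata

NORMAL_SIGMA = "\u03C3"  # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03C2"   # GREEK SMALL LETTER FINAL SIGMA

# The word "sophos" spelled with explicit code points:
# sigma, omicron, phi, omicron with tonos, final sigma.
word = "\u03C3\u03BF\u03C6\u03CC\u03C2"
first, last = word[0], word[-1]

for ch in (first, last):
    # An empty decomposition string means no equivalence is defined.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          repr(unicodedata.decomposition(ch)))

# Distinct code points, and normalization does not merge them.
same_code_point = first == last
same_after_nfkc = (unicodedata.normalize("NFKC", first)
                   == unicodedata.normalize("NFKC", last))

# Both letters map to the same capital sigma, however.
same_uppercase = first.upper() == last.upper()
print(same_code_point, same_after_nfkc, same_uppercase)
```

Comparing Greek text while ignoring this distinction therefore needs explicit handling (for example, case folding), since normalization alone will not do it.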
In other writing systems, the variation can be much bigger, especially if the writing
systems imitate handwriting. In Arabic, letters have two or four contextual forms, which
can be quite different from each other. Figure 1-5 shows the four forms of an Arabic
letter, usually called “ba” or more exactly bāʾ, though the Unicode name is Arabic letter
beh (U+0628). The forms are (from right to left!) for use as isolated, at the start of a
word, in the middle of a word, and at the end of a word. As you can see, for example,
the word-final form (on the left) has a part that helps in joining the character with the
previous character. Each of these forms, in turn, can appear differently in different fonts.
In the ISO-8859-6 character code (Latin/Arabic), for example, each Arabic letter has
one code position only. This leaves it to rendering engines to determine the context
(position within a word) and to use the correct contextual form. Unicode, on the other
hand, contains both such characters (effectively, taken from ISO-8859-6) and each of
the contextual forms as a separately coded character. This lets you write Arabic so that
the rendering process can be very simple, at the cost of extra work in writing. However,
even using Unicode, you are normally supposed to use the more abstract Arabic letters.
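As a rough illustration of the relation between the abstract letter and its separately coded contextual forms, the following Python sketch uses the standard unicodedata module. It assumes the abstract letter beh is U+0628 and that its four contextual forms are the code points U+FE8F to U+FE92 in the Arabic Presentation Forms-B block; the helper name to_abstract is hypothetical. Compatibility normalization (NFKC) maps each contextual form back to the abstract letter.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract letter

# Separately coded contextual forms of beh (Arabic Presentation Forms-B).
PRESENTATION_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def to_abstract(text):
    """Replace contextual (presentation) forms with abstract letters via NFKC."""
    return unicodedata.normalize("NFKC", text)


for position, ch in PRESENTATION_FORMS.items():
    abstract = to_abstract(ch)
    print(f"{position:9} U+{ord(ch):04X} {unicodedata.name(ch)}"
          f" -> U+{ord(abstract):04X}")
```

Text stored with the abstract letters leaves the choice of contextual form to the rendering engine, which is the normally recommended approach described above.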
It is ultimately a matter of definition whether two graphic presentations are glyphs for
the same character or distinct characters. However, it is normally not an individual’s
Figure 1-5. The four contextual forms of the Arabic letter “ba”
This ebook is for the use of anyone anywhere in the United States and most other parts of the
world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use
it under the terms of the Project Gutenberg License included with this ebook or online at
www.gutenberg.org. If you are not located in the United States, you will have to check the laws of
the country where you are located before using this eBook.
Title: American Weasels
Author: E. Raymond Hall
Release date: July 21, 2013 [eBook #43272]
Most recently updated: October 23, 2024
Language: English
Credits: Produced by Chris Curnow, Richard Tonsing, Joseph Cooper
and the Online Distributed Proofreading Team at
http://www.pgdp.net
*** START OF THE PROJECT GUTENBERG EBOOK AMERICAN WEASELS ***
University of Kansas Publications
Museum of Natural History
Vol. 4, pp. 1-466, plates 1-41, 31 figures in text
December 27, 1951
AMERICAN WEASELS
BY
E. RAYMOND HALL
University of Kansas
Lawrence
1951
University of Kansas Publications, Museum of Natural History
Editors: E. Raymond Hall, Chairman, A. Byron Leonard, Edward H. Taylor, Robert W. Wilson
University of Kansas
Lawrence, Kansas
PRINTED BY
FERD VOILAND, JR., STATE PRINTER
TOPEKA, KANSAS
1951
23-3758
Plate 1. Coloration of head and foreparts in ten subspecies of long-tailed weasel, Mustela frenata.
All figures are of males, approximately × 1/2.
In regions of heavy rainfall (see figs. 2 and 3) there is an increase in pigmentation and extent of
blackish color backward over the neck and a decrease in extent of the white facial markings. In
regions progressively more arid (see figs. 3 to 7) there is a decrease in pigmentation and extent of
blackish color and an increase in extent of the white facial markings. As shown by rearing mammals
from humid regions in arid regions, and vice versa, the color is not visibly altered in one or a few
generations; the color is an hereditary character. Beginning with the southernmost subspecies
(fig. 1) and continuing northward to the northern subspecies (fig. 10) there is a darkening, next a
lightening, and finally a darkening closely conforming to amounts of precipitation in the geographic
regions concerned. A fuller discussion of this correlation is given on page 51.
Fig. 1. Map showing localities of capture of
specimens depicted in plate 1.
CONTENTS
PAGE
Introduction 7
Paleontological History 10
Skeleton and Dentition 12
Disparity in Numbers of Males and Females 19
Materials, Acknowledgments and Methods 21
Variation 24
Variation with Age 24
Secondary Sexual Variation 26
Individual Variation 28
Seasonal Variation 30
Variation in Coloration and Molt 30
Variations of Taxonomic Worth 44
Distribution and Speciation 54
History of Classification 69
Chronological List (annotated) of Specific and Subspecific Names Applied to American Weasels 71
Check-List of American Species and Subspecies of the Genus Mustela 81
Artificial Key to American Species of the Genus Mustela 83
Diagnosis of the Genus 83
Explanation of Systematic Treatment 84
Systematic Accounts of Species and Subspecies 87
Mustela erminea 87
Mustela rixosa 168
Mustela frenata 193
Mustela africana 406
Explanation of Cranial Measurements 417
Table of Cranial Measurements 418
Literature Cited 442
Index 461
INTRODUCTION
The weasel's agility and speed take it in and out of retreats, over obstacles and across open places in
amazingly rapid fashion and are responsible for the animal's actions being described as "quick as a
flash." The common long-tailed weasel of the United States measures approximately a foot and a half in
length, of which the tail comprises a third; but the round, slender body is scarcely more than an inch
and a half in diameter. Brown above and whitish below in summer dress, the animal is sleek as well as
lithe and graceful. It is easy to understand, therefore, why the Bavarian name Schönthierlein (pretty
little creature) and the Italian name donnola (little lady) were bestowed upon it. The Spanish name is
comadreja (godmother).
In the winter, in temperate and northern regions, the coat becomes pure white except for the black tail-
tip. In this dress the correct name for the animal is ermine, a mammal whose fur is known to all and
justly esteemed, especially for its luster in artificial light, where it is scarcely excelled in enhancing the
beauty of gems and their feminine wearers.
In relation to its weight, the weasel is thought to be unsurpassed, and perhaps it is unequalled among
mammals, in the effectiveness with which it exercises its carnivorous heritage; it kills with speed and
strength a wide variety of animals including many much larger than itself; and it has been known to
attack even man himself when he stood between the weasel and its intended prey. In structure and
temperament it is so highly specialized for offense that, when opportunity affords, it sometimes kills, for
storage in its larder, far more than enough to meet its immediate needs. After speaking of this
tendency, Elliott Coues (1877:129) has said:
"A glance at the physiognomy of the weasels would suffice to betray their character. The teeth are
almost of the highest known raptorial character; the jaws are worked by enormous masses of muscles
covering all the side of the skull. The forehead is low and the nose is sharp; the eyes are small,
penetrating, cunning, and glitter with an angry green light. There is something peculiar, moreover, in
the way that this fierce face surmounts a body extraordinarily wiry, lithe, and muscular. It ends in a
remarkable long and slender neck in such a way that it may be held at right angle with the axis of the
latter. When the creature is glancing around, with the neck stretched up, and flat triangular head bent
forward, swaying from one side to the other, we catch the likeness in a moment—it is the image of a
serpent." Although Coues' colorful description more closely links the weasel with the symbol of evil than
pleases me, his description does emphasize the raptorial character of the weasel.
Even though most weasels are intractable as pets, they have a value to man, as, for instance, when he
is plagued by mice. In a field where mice and other small rodents are so abundant as to damage
cultivated crops, the weasel is the farmer's best friend. A weasel may inhabit one den until the rodents
thereabouts are almost exterminated in an area two or three hundred yards across; in this way the
weasel acts as a control, locally, as well as a check more widely, on the increase in size of populations of
kinds of rodents upon which it preys. The smaller species are mousers of remarkable efficiency and can,
if necessary, follow a mouse to the end of the mouse's burrow. The slender body allows the weasel to
pass through any burrow or hole into which it can thrust its head. This ability in an organism as highly
specialized for killing other animals as is the weasel, has earned for it a bad name in connection with
poultry yards. Authentic instances are recorded in which a weasel, gaining entrance through a knot-hole
to a coop of young chickens, killed several dozen of the fowls. In other instances, however, weasels
have lived under buildings close by a poultry yard without even molesting the birds in the slightest; in
the latter instances the weasels probably were present because there was an abundant supply of rats
and mice. At least three poultry raisers (see page 214) have encouraged weasels to live in their poultry
yards feeling that the good they do by destroying rats outweighs the damage caused by the occasional
weasel which turns to the fowls; the idea is that the individual weasel can be eliminated if he becomes
destructive.
Although tending to be nocturnal, weasels are almost as active by day as by night. Their young,
numbering 4 to 9, are born in a nest in a burrow and as with other members of the Order Carnivora,
are blind, and incapable of looking after themselves at the time of birth. In Mustela frenata of Montana,
breeding occurs in July and August, and the young are born in the following April and May. Wright
(1948A:342) showed that the gestation period could not have been less than 337 days in one individual
and that it averaged 279 (205-337) days in 18 instances. Findings of the same author (1942B:109)
showed that the embryos are implanted only 21 to 28 days before the young are born. In the preceding
part of the "long gestation period, the embryos lie dormant in the uterus as un-implanted blastocysts.
The young female weasel [of M. frenata] mates when 3 or 4 months old." Consequently, in the spring,
all females of this species may produce young (Wright, 1942A:348). The circumboreal species Mustela
erminea likewise has been shown to have a delayed implantation of the ova. Each of these two species,
M. frenata and M. erminea, has only one litter per year; but the weasel, Mustela nivalis, of the Old
World seems to lack the delayed implantation, in this respect resembling the ferret (subgenus Putorius)
as it does also in its ability to have more than one litter per year (see Deanesly, 1944). The manner of
reproduction in the South American species M. africana and the circumboreal species M. rixosa at this
writing is unknown.
The genus Mustela includes the true weasels, the ferrets and minks. The ferrets commonly are treated
as a subgenus, Putorius, along with the Old World polecat. The minks usually are accorded subgeneric
distinction under the name Lutreola, and the true weasels comprise the subgenus Mustela. These three
subgenera, along with some other subgenera which are mostly monotypic, make up the
genus Mustela. Considered in this way, the group of true weasels, subgenus Mustela, has a geographic
range roughly coextensive with that of the genus Mustela. This range includes Asia and Europe,
Northern Africa, North America and northern South America. Java has its weasel. Australia and nearly all
the oceanic islands lack weasels, and the animals are absent from roughly the southern half of Africa
and the southern half of South America. Other small mustelids, weasellike in shape and with
corresponding habits and dentition, take the place of true Mustela in the southern half of Africa and in
the corresponding part of South America.
In America the subgenus Mustela occurs from the northernmost land in Arctic America southward to
Lake Titicaca in the Andes of South America, a distance of approximately 6900 miles. Felis, I think, is
the only other genus of land mammals in the western hemisphere that has a geographic range as
extensive from north to south. Felis does not range so far north but does range farther south. The one
species, Mustela frenata, ranges from Lake Titicaca northward to about 57° N in British Columbia or for
approximately 5000 miles in a north to south direction and from within the Alpine Arctic Life-zone
through the Tropical Life-zone. In North America, weasels occur in almost every type of habitat, being
absent only in the extremely desert terrain of western Arizona and western Sonora and in adjoining
parts of California and Baja California. Even this area, along the Colorado River, may support some
weasels; evidence suggesting that it does so is given in the account of Mustela frenata neomexicana.
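The two north-to-south distances quoted above can be roughly checked from latitudes alone. The sketch below is an illustration added here, not the author's calculation. It assumes about 69 statute miles per degree of latitude, a latitude of about 16° S for Lake Titicaca, and about 83° N for the northernmost land in Arctic America; only the 57° N limit for Mustela frenata comes from the text.

```python
# Approximate statute miles per degree of latitude (assumed round figure).
MILES_PER_DEGREE_LATITUDE = 69.0

# Latitudes in degrees; south is negative.
LAKE_TITICACA = -16.0             # assumption: roughly 16 degrees S
NORTHERNMOST_ARCTIC_LAND = 83.0   # assumption: roughly 83 degrees N
FRENATA_NORTHERN_LIMIT = 57.0     # from the text: about 57 degrees N


def north_south_miles(north_lat_deg, south_lat_deg):
    """Distance along a meridian between two latitudes, in statute miles."""
    return (north_lat_deg - south_lat_deg) * MILES_PER_DEGREE_LATITUDE


subgenus_span = north_south_miles(NORTHERNMOST_ARCTIC_LAND, LAKE_TITICACA)
frenata_span = north_south_miles(FRENATA_NORTHERN_LIMIT, LAKE_TITICACA)

print(f"subgenus Mustela: about {subgenus_span:.0f} miles")
print(f"Mustela frenata:  about {frenata_span:.0f} miles")
```

Under these assumptions the spans come out within about a hundred miles of the approximate figures of 6900 and 5000 miles given in the text.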
PALEONTOLOGICAL HISTORY
The paleontological record fails to show the precise ancestry of Mustela. The genus has been found in
deposits of Pleistocene age, but, so far as I can ascertain, not in deposits of earlier times. The
Pleistocene remains are not specifically distinct from Recent (living) species, and in only a few instances
(see M. f. latirostra and M. e. angustidens) are they even subspecifically distinct from the Recent weasel
living in the same area today. It is true that fossil remains from deposits of several stages of the Tertiary
beds have in the past been identified in the literature as Mustela, but most of these identifications were
made many years ago when the generic name Mustela was used in a far broader and more inclusive
sense than it is today and much of the fossil material was so fragmentary that the generic identity could
not be ascertained, at least at that time. Because the generic identity could not be ascertained, the
fossil material was tentatively assigned to the genus Mustela, the "typical" genus of the family
Mustelidae instead of to some other more specialized or less well-known genus of the family. To satisfy
my curiosity about these species of "Mustela" of a geological age earlier than the Pleistocene I have
personally studied nearly all of the original specimens from North America and have found each to be of
some genus other than Mustela. Also, such study as I have been able to make of the Old World fossils
themselves that have been referred to the genus Mustela up to 1938, and my study of the illustrations
and descriptions of the others from there lead to the same conclusion; that is to say, none that is true
Mustela is known up to now from deposits older than the Pleistocene.
When, in 1930 (pp. 146-147), I wrote about the taxonomic position of three American genera of fossils
(known only from lower jaws), each of which had been previously referred to the genus Mustela, I said
that they pertained "to that section of the weasel family (Mustelidae) which comprises the polecats, true
weasels, ferrets, minks and martens. The fossil specimens . . . are smaller than any other later Tertiary
members of the group yet described, and are more primitive than any of the above mentioned Recent
relatives. Of the three extinct genera . . . Miomustela [Lower Pliocene or Upper Miocene of the Lower
Madison Valley, Montana] is the most primitive and Martinogale [Pliocene, 18 mi. SE Goodland, Sherman
County, Kansas] is the most advanced. This view rests largely on the character of M1 which in
Miomustela has a deeply basined, short, narrow talonid with a thick, high metaconid situated partly
posterior to the protoconid. In Martinogale the talonid is incipiently trenchant, long, broad, and it has a
lesser developed metaconid which is situated more anterior[ly]. Pliogale [Lower Pliocene, Humboldt
County, Nevada] is intermediate in this respect.
"These three forms are of special interest as possible ancestors of the subgenus Mustela, true weasels.
No members of this subgenus, nor related forms which can with any degree of certainty be regarded as
directly ancestral to them, have yet been described from Miocene or Pliocene deposits. Palaeogale of
the Old World and Bunaelurus of North America, each of Oligocene age, have been placed by Schlosser
(1888, p. 116) and Matthew (1902, p. 137) as members of the primitive group of mustelids ancestral to
Mustela. This course seems logical; and with no truly intermediate links between these forms of the
Oligocene on the one hand, and Mustela which first appears in the Pleistocene, on the other, more
definite statements about ancestral positions of the small Oligocene forms can hardly be made. The
deciding considerations for authors who placed Palaeogale and Bunaelurus as ancestral to Mustela were
the absence of a metaconid on M1 and the trenchant talonid of that tooth. These characters are found
also in Mustela. On the other hand certain structures in the basicranial region of Palaeogale and more
especially of Bunaelurus indicate that these genera possibly are not close to the ancestral form of
Mustela . . . Martinogale may stand near the ancestral form of Mustela and . . . Pliogale may be
ancestral to Martinogale. Pliogale, in turn, may have had an ancestor similar to Miomustela. If this
should prove to be the case, Palaeogale and Bunaelurus might be regarded as an independent branch
which displays merely a parallelism to Mustela in the loss of the metaconid on M1 and the development
of a trenchant talonid on that tooth. The writer would make it clear that he does not hold such to be the
case. The ancestral relation of Martinogale to Mustela is presented merely to show the possibility, and
not the special probability, of such an origin for Mustela. Knowledge of the tympanic bullae and other
structures of the basicranial region would go far toward answering the question and until these
structures are known [in mustelids of the Later Tertiary,] some uncertainty will remain."
At the present writing I can add to the above statement only a few facts. The discovery of better
material of Bunaelurus than was available to previous workers led Simpson (1946), correctly I think, to
synonymize Bunaelurus with Palaeogale. Simpson figures the cranial foramina in Palaeogale. The
differences, between Palaeogale and Mustela, in cranial foramina, possibly are only the result of the
elongation of the tympanic bullae. The bullae of the subgenus Mustela are seen to be much elongated
posteriorly if comparison is made with the bullae of earlier mustelids. Consequently, it might be
concluded that there is nothing in the arrangement of the cranial foramina which would preclude the
derivation of Mustela from Palaeogale. However, the anterior situation of the carotid foramen—well
forward along the medial margin of the tympanic bulla—is a character typical of other mustelids and the
posterior location of this foramen in Palaeogale might indicate that it was not ancestral to Mustela.
SKELETON AND DENTITION
The outstanding features of a weasel's skeleton are its length and slenderness. Whereas the length of
the vertebral column measured from the atlas (the first cervical vertebra) to the last sacral vertebra is
175 per cent of the length of the hind leg (as measured from the head of the femur to the tip of the
longest claw), the corresponding percentage is only 116 in the raccoon. Stated in another way, the
vertebral column and the hind leg are of approximately equal length in a raccoon, but in a weasel the
vertebral column is one and three-fourths times as long as the hind leg.
VERTEBRAE
The vertebral column consists of 7 cervicals, and ordinarily 14 thoracics, 6 lumbars, 3 sacrals and,
depending on the species, 11 to 23 caudals. For the three species of which skeletons were examined,
variations from the normal number of vertebrae are noted in the following table:
Table 1
Data on vertebrae in three species of the subgenus Mustela
(Numerals in parentheses indicate number of specimens)
Species columns, left to right: Mustela erminea, Mustela rixosa, Mustela frenata
Number of cervical vertebrae: 7(75), 7(12), 7(65)
Number of thoracic vertebrae: 14(71), 14(12), 14(54); 15(4), 15(13)
The dorsal vertebra constituting the anticlinal: 11th(18), 11th(12), 11th(40); 12th(7), 12th(27)
Number of lumbar vertebrae: 5(2), 5(11); 6(73), 6(12), 6(54)
Number of sacral vertebrae: 2(9), 2(3); 3(65), 3(10), 3(67); 4(1), 4(2)
Number of pseudosacral vertebrae: 0(73), 0(12), 0(57); 1(2), 1(6)
Number of caudal vertebrae: 11(1); 14(3); 15(2), 15(7); 16(3), 16(1); 17(9); 18(28); 19(11), 19(6); 20(14); 21(14); 22(7); 23(1)
Variation according to the species is evident in the number of caudal vertebrae, but in the other
categories of vertebrae no consistent difference in number according to species was found in the
material examined. Apparently there is also some geographic variation in the number of caudal
vertebrae within a species. For example, the one skeleton seen of Mustela rixosa eskimo (no. 219036,
U. S. Nat. Mus., from St. Michaels, Alaska) has only 11 caudal vertebrae, whereas in the 11 Mustela
rixosa rixosa from Roseau County, Minnesota, the usual number is 15 with extremes of 14 and 16.
Similarly specimens of Mustela frenata from Idaho and California almost always have 1 or 2 more caudal
vertebrae than do individuals of the shorter-tailed subspecies of the same species from eastern Kansas.
Of the vertebrae, only the cervicals, of which there are 7, were found to be constant in number. In M.
erminea, two of the seven individuals in which the anticlinal vertebra was the 12th (instead of the 11th)
had 15 instead of the customary 14 thoracic vertebrae. In M. frenata, seven of the twenty-seven
individuals in which the anticlinal vertebra was the 12th (instead of the 11th) had 15 instead of 14
thoracic vertebrae. The one M. erminea with a pseudosacral vertebra had only two instead of the
customary 3 sacral vertebrae but the same individual had 15 thoracic vertebrae. Of the six M. frenata
with a pseudosacral vertebra, two animals had only two instead of three sacral vertebrae. Conceivably,
therefore, the pseudosacral vertebra in each of the three instances mentioned may represent merely an
unfused sacral vertebra, instead of a true pseudosacral as occurs in four individuals of M. frenata.
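The "number(specimens)" notation of Table 1 lends itself to a simple tally. The following Python sketch is illustrative only: the function name parse_tally is hypothetical, and the three entries 14(3), 15(7) and 16(1) are assumed to be the caudal counts for the 11 Mustela rixosa rixosa from Roseau County, an inference from the statement that the usual number is 15 with extremes of 14 and 16.

```python
import re
from collections import Counter


def parse_tally(cells):
    """Turn entries such as '15(7)' into a Counter mapping 15 -> 7."""
    tally = Counter()
    for cell in cells:
        match = re.fullmatch(r"(\d+)\((\d+)\)", cell.strip())
        if match is None:
            raise ValueError(f"unrecognized entry: {cell!r}")
        # First number is the vertebra count, second the number of specimens.
        tally[int(match.group(1))] += int(match.group(2))
    return tally


# Assumed caudal entries for Mustela rixosa rixosa (see lead-in).
rixosa_rixosa_caudals = parse_tally(["14(3)", "15(7)", "16(1)"])

specimens = sum(rixosa_rixosa_caudals.values())
usual = rixosa_rixosa_caudals.most_common(1)[0][0]
extremes = (min(rixosa_rixosa_caudals), max(rixosa_rixosa_caudals))

print(specimens, usual, extremes)
```

With these entries the tally gives 11 specimens, a usual count of 15, and extremes of 14 and 16, in agreement with the figures stated in the text.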
TEETH
In American weasels, for example in Mustela frenata, the permanent dentition normally is
I 3/3, C 1/1, P 3/3, M 1/2
or 34 teeth in all. In most respects the dentition is typical for post-Tertiary mustelids but in several parts
is highly specialized for a diet of flesh, the degree of this specialization being second only to that of the
cats, family Felidae. The outstanding specialization is in the first lower molar, in which, as in the cats,
the internal cusp (metaconid) is completely suppressed and the heel (talonid) forms an elevated blade
for cutting food rather than a basin for crushing it. In one sense the tooth is simplified since it owes its
distinctive form to a reduction in number of parts; nevertheless, the distinctive form of the lower molar
clearly is correlated with a diet of flesh, and the tooth is correctly to be thought of as the lower blade of
a pair of shears; the upper blade is the fourth upper premolar. The reduction in size of the second (last)
lower molar and small size of the inner lobe of the one remaining upper molar probably are additional
modifications for a diet of flesh.
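The total of 34 teeth follows directly from the dental formula, which lists the teeth of one side of the mouth as upper over lower: incisors 3/3, canines 1/1, premolars 3/3, molars 1/2. A small Python sketch of that arithmetic is given below; the dictionary layout and function name are illustrative choices, not anything from the original.

```python
# Teeth on ONE side of the mouth, as (upper, lower) counts per tooth class.
DENTAL_FORMULA = {
    "incisors": (3, 3),
    "canines": (1, 1),
    "premolars": (3, 3),
    "molars": (1, 2),
}


def total_teeth(formula):
    """Sum upper and lower counts for one side, then double for both sides."""
    per_side = sum(upper + lower for upper, lower in formula.values())
    return 2 * per_side


total = total_teeth(DENTAL_FORMULA)
print(total)
```

One side carries 8 upper and 9 lower teeth, or 17, and doubling gives the 34 stated in the text.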
The absence of the last two upper molars and last molar in the lower jaw would be expected in any
mammal as highly specialized for a diet of flesh as is the weasel, but these teeth are absent also in
other Quaternary members of the family Mustelidae, many of which are substantially less specialized for
a diet of flesh than is the weasel. Therefore, in the weasel, it is reasonable to regard the absence of
these teeth more as a heritage than as an indication of a special adaptation. The absence of a first
premolar above and below, as in the weasel, is to be expected in any carnivore that has the first lower
molar and fourth upper premolar highly specialized for shearing, but the loss of these premolars and the
small size of the second premolars may be as much the result of a slight shortening of the face as it is a
result of a lengthening of the third and especially the fourth premolars. The lengthening of these more
posteriorly-situated teeth would appear to be an adaptation to a diet of flesh. The cause of the
lengthening of the mentioned teeth and the reason for the absence of the first premolars probably will
be unknown until the fossil record is more complete.
The teeth of American species vary little except in size. The absence of P2 in Mustela africana is the
only difference of a qualitative (presence or absence) nature that was detected. Also, the Central
American subspecies of Mustela frenata exhibit a tendency to early loss of P2 and thus foreshadow the
condition typical of M. africana.
As a whole the dentition of the weasel exhibits a high degree of specialization for a diet of flesh and this
specialization is fully as evident in the deciduous dentition as in the permanent dentition.
The deciduous, or milk, dentition, of Mustela frenata, as known from immature specimens of Mustela
frenata noveboracensis and Mustela frenata frenata available for this study, is comprised of canines, one
on each side above and below, and 3 cheek teeth on each side above and below. See figures 2-9. The
upper cheek teeth from anterior to posterior are: a minute peglike tooth in general similar to the first
premolar of the permanent dentition; a shearing tooth in general similar to P4 of the permanent
dentition; and an anteroposteriorly compressed tooth in general similar to M1 of the permanent
dentition. In the lower jaw, behind the canine, there is first a minute peglike tooth, second a two-rooted
tooth similar in general outline to a permanent third premolar, and finally a shearing tooth
corresponding in function to m1 of the permanent dentition.
No postnatal specimens which show deciduous incisors have been examined.
Selected, outstanding differences between the permanent teeth and the deciduous teeth are as follows:
In the deciduous teeth the canine above has on the posterior face a well-defined ridge extending from
the tip to the cingulum. This ridge is absent or at most faintly indicated in the permanent tooth. The
lower deciduous canine, in cross section is seen to have a marked indentation on the anteromedial
border in the region of the cingulum; this indentation is lacking in the permanent tooth. The anterior
one of the deciduous cheek teeth, both above and below, is single rooted and its crown-surface is only
about one-fifteenth as much as that of the anterior premolar of the permanent dentition. The second
deciduous cheek tooth below has two roots, usually fused, and differs from p4 of the permanent
dentition in having the tip of the principal cusp more recurved, in having the anterior basal cusp better
developed and the posterior heel less well developed.
The second deciduous cheek tooth above corresponds in function and general plan of construction to P4
of the permanent dentition but differs from that tooth in the more pronounced protostyle, longer
tritocone, more posteriorly located deuterocone and as noted by Leche (1915:322) separation of the
protocone and tritocone by a notch. The third upper deciduous tooth has a single cusp internally and
two cusps laterally. Thus it reverses the relation of parts seen in M1 where the internal moiety is larger
than the lateral or buccal moiety. The third deciduous tooth below differs from m1 in very much shorter
talonid and separation of the paraconid from the protoconid by a deeper notch.
All the features in which the last two deciduous teeth, both above and below, are described as differing
from their functional counterparts in the permanent dentition, are features found in the permanent
teeth of primitive fossil mustelids and certain fossil and Recent viverrids. Even so, taking into account
Leche's (1915) work, which shows that the milk teeth of some carnivores have structures lacking in the
corresponding permanent teeth of the same individual animal and also in the teeth of genera that seem
to be ancestral, a person suspects that some of the structural features mentioned above are not
inheritances of ancestral conditions but rather specializations of the milk dentition.
Figs. 2-9. Views of permanent and deciduous teeth of Mustela frenata nigriauris. Incisors not shown.
In each instance teeth are of the left side.
Permanent dentition × 3. No. 32421, Mus. Vert. Zoöl., ♂, adult; Berkeley, Alameda County, California;
obtained October 4, 1921, by D. D. McLean.
Deciduous dentition × 5. No. 132158, U. S. Nat. Mus., ♂, juvenile; Stanford University, Santa Clara
County, California; obtained May 7, 1898, by W. K. Fisher.
Figs. 2-3. Lateral views of upper teeth, of adult and juvenile respectively.
Figs. 4-5. Occlusolingual views of upper teeth of adult and juvenile respectively.
Figs. 6-7. Lateral views of lower teeth of adult and juvenile respectively.
Figs. 8-9. Occlusolingual views of lower teeth of adult and juvenile respectively.
In other deciduous teeth there is clearer evidence of more specialization for a diet of flesh in the
deciduous teeth than in the permanent teeth. For example, the upper carnassial of the milk dentition is
even more highly sectorial than is the permanent tooth and strikingly like that of some of the cats. The
lower tooth that is effective in the shearing action bears no more trace of the metaconid than does the
permanent first lower molar. These features of the deciduous dentition suggest that it is more
specialized for a diet of flesh than is the permanent dentition. If this be the fact, it may seem especially
remarkable because the commonly employed term "milk teeth" suggests that the animal makes but little
or no use of these teeth in the short time that they are in place. Accordingly, the student may credit the
form of these teeth more to some indirect effects of inheritance than to natural selection acting directly
upon the teeth. But, after all, natural selection probably is responsible for the form of these teeth as is
indicated by the observations of Hamilton (1933:318-325). He found that these milk teeth are used for
eating solid food as soon as the principal shearing teeth are in place. This is three weeks after birth and
before all of the deciduous teeth have broken through the gums. These shearing teeth are used for
almost two months before being replaced by the permanent teeth and it is, therefore, evident that
natural selection could operate to fully as great a degree in determining the form of the deciduous teeth
as it may with the permanent teeth.
Hamilton (1933:325-326) found that the permanent dentition was complete at 75 days after birth in
captive specimens of Mustela frenata noveboracensis. In the same subspecies, he noted 28 days after
birth that the canines and carnassial teeth [second deciduous cheek tooth above and third below] had
erupted through the gums. Animals 45 days old, Hamilton found, were losing the milk dentition, and
had the gums broken through by several of the permanent cheek teeth.
Study of the cleaned skulls available of juveniles indicates that the deciduous teeth which persist
longest are, on each side of the mouth, the second cheek tooth above and the third cheek tooth below.
These teeth persist until after the permanent P4 and m1 have come into use. These permanent teeth
are situated immediately behind their functional counterparts of the milk dentition. P3 and p4 are the
teeth of the permanent dentition which ultimately push out the last milk teeth to be lost. Accordingly, in
the permanent dentition, P4 and M1 appear before P3 does, and m1 and m2 make their appearance
before p4.
DISPARITY IN NUMBERS OF MALES AND FEMALES (IN
ZOOLOGICAL COLLECTIONS)
The question has frequently been asked why twice as many male as female weasels are captured. This
is the proportion in research collections, as may be seen from table no. 2, and I am convinced that the
specimens in these collections are saved in approximately the same proportion as that in which they are
caught. Although it might be assumed, upon first consideration, that there are twice as many males as
females in nature, selective factors enter into the catch. For example, because a male weasel is
approximately twice as heavy as a female, it may be necessary for him, in a given length of time, to
travel twice as far as the female to obtain the required amount of food with the result that a given
number of traps or snares will catch twice as many males as females. Indeed, Glover (1943B:8) shows
that, on the average, in Mustela frenata noveboracensis in Pennsylvania, the male actually does travel
slightly more than twice as far as the female (704 feet versus 346 feet). From table no. 2, it may be
seen that in most winter months the ratio is 3 males to one female. This ratio is reasonable enough, in
view of what has been said, if it is considered also that the lighter weight of the female permits her
safely to step on the pans of traps that would be sprung by heavier males.
If in the breeding season, which is April through August in M. frenata, the female is passive and if the
male is restlessly searching for her, he may thus increase still more his chances of being caught in traps
set for weasels.
My own studies of live weasels in nature indicate that in the season when females are attending young
which are half grown, or larger, the adult male weasels live singly in dens of their own, separate and
apart from the females and their young (Hamilton, 1933:328, records adult males living with the female
and her young, but possibly this was when the young were less than half grown). Perhaps these males
at that time travel no farther than is necessary to obtain food for themselves. Females, at this time,
forage not only to meet their own needs, but for food to supply their young as well. At this time, in May
and June, as may be seen from table no. 2, almost as many adult females as adult males are caught.
The reason why only relatively more females than in other months, instead of actually more females
than males, are caught at this time probably is that the adult males also are extraordinarily active at this
time because they are in breeding condition. Perhaps the explanation in part is to be found in the lesser
weight of the female (approximately half of the male's weight) which, as indicated above, permits her to
step on the pan of a steel trap without springing it whereas the heavier male does spring the trap and
as a consequence is caught. Hamilton (1933:299-300), who mentions this selective factor, found an
equal number of males and females in the three newly born litters that came under his observation.
Table 2
Specimens of Mustela frenata (north of the range of M. f. frenata) arranged by sex and under each sex by age

           Male                                                                         Female
Month      adult      ♂ ad.,     subadult   young      juvenal    total      ♂, % of    adult      ♀ ad.,     subadult   young      juvenal    total      ♀, % of
           ♂          % of total ♂          ♂          ♂          number     total      ♀          % of total ♀          ♀          ♀          number     total
                      adults                                      of ♂                             adults                                      of ♀
May        29         55         4          14         7          54         59         24         45         1          9          3          37
June       42         53         14         40         8          97         59         38         47         4          25         2          69
July       59         70         18         55         2          130        59         25         30         5          58         2          90
August     40         77         23         55         ..         113        74         12         23         2          25         ..         39
September  15         79         25         12         1          51         75         4          21         4          9          ..         17
October    11         58         46         7          ..         43         66         8          42         13         1          ..         22
November   41         70         48         1          ..         88         73         18         30         12         2          1          33
December   59         69         43         1          ..         108        73         26         31         15         ...        ..         41
January    80         69         32         2          1          126        72         36         31         14         ...        ..         50
February   45         66         19         5          ..         82         73         23         34         4          3          ..         30
March      38         72         2          ...        ..         57         70         15         28         8          1          ..         24
April      30         67         2          4          3          39         67         15         33         ..         2          2          19
Totals     489        67         281        196        22         988        68         244        33         82         135        10         471
I suppose that in nature there are approximately equal numbers of male and female weasels and
further suppose that the selective factors which cause more males than females to be caught are the
greater distances traveled by the males and their greater weight.
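The proportions quoted from Table 2 can be checked with simple arithmetic. The following Python sketch recomputes the share of males from the totals row of the table (988 males and 471 females of all ages; 489 adult males and 244 adult females). The function name percent_share is hypothetical and serves only to illustrate the calculation.

```python
def percent_share(part: int, other: int) -> float:
    """Return `part` as a percentage of `part + other`."""
    return 100.0 * part / (part + other)


# Totals row of Table 2: all ages combined.
males_total, females_total = 988, 471
male_share = percent_share(males_total, females_total)

# Totals row of Table 2: adults only.
adult_male_share = percent_share(489, 244)

print(f"males, all ages: {male_share:.1f} per cent")
print(f"males, adults only: {adult_male_share:.1f} per cent")
```

Rounded to whole numbers, both results agree with the percentages given in the totals row, that is, roughly two males to each female.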
MATERIALS, ACKNOWLEDGMENTS AND METHODS
At a late stage in the preparation of this manuscript a total of 5,457 specimens had been examined. For
the most part these were conventional study-specimens; that is to say, they were stuffed skins with the
skulls separate and each was accompanied by the customary data as to locality of capture, date of
capture, name of collector, external measurements and sex recorded on the labels by the collectors.
Skulls unaccompanied by skins, nevertheless, comprised a large share of the total and a small
proportion was made up of skins unaccompanied by skulls, mounted specimens, skeletons, and entire
animals preserved in liquid.
It was the recognition of this need for specimens from extensive areas from which no specimens
previously had been collected that influenced me, approximately a year after the study was begun,
to allot for it a long span of time. The procedure adopted, in general, was to study the weasels of
one species from a given geographic area in so far as the material warranted, then lay this aside
until additional critical material could be obtained, and finally, some months or a year later,
complete the account. In this fashion the manuscript of the American weasels received my
attention in each of the past twenty-five years (September, 1926 to date of publication). This is a
confession of fact rather than a recommendation of procedure. This type of procedure unduly
delays the diffusion of knowledge and for a variety of reasons justifiably annoys other students of
the subject. Nevertheless, many gaps have been filled that otherwise would have remained open.
Although specimens to solve several problems still remain to be collected and studied, it seems that
a point of diminishing returns has now been reached, which, in fairness to all concerned, calls for
publication of the results so far obtained.
For assistance in the entire undertaking, I am more indebted to Miss Annie M. Alexander than to
any other one person; she provided the means by which specimens from critical areas were
obtained, made it possible to examine the European collections, and assisted in other ways. The
late Professor Joseph Grinnell and Mr. Charles D. Bunker, among others, gave truly valuable
encouragement and assistance.
Collections containing weasels which were examined in the study here reported upon were as
follows:
Acad. Nat. Sciences of Philadelphia
American Mus. Nat. History
Baylor University
Berlin Zoological Museum
Boston Society of Natural History
Brigham Young University
British Museum of Natural History
California Academy of Sciences
Carnegie Museum
Charleston Museum
Coe College
Collection of J. Arnold
Collection of Stanley C. Arthur
Collection of Rollin H. Baker
Collection of William Bebb
Collection of R. H. Coleman
Collection of Ian McTaggart-Cowan
Collection of Stuart Criddle
Collection of John Cushing
Collection of Walter W. Dalquest
Collection of William B. Davis
Collection of J. M. Edson
Collection of Ralph Ellis
Collection of John Fitzgerald, Jr.
Collection of Mr. Green
Collection of Ross Hardy
Collection of Donald V. Hemphill
Collection of L. M. Huey
Collection of R. W. Jackson
Collection of Stanley G. Jewett
Collection of E. J. Koestner
Collection of J. E. Law
Collection of A. H. Miller
Collection of Lloye H. Miller
Collection of R. D. Moore
Collection of J. A. Munro
Collection of O. J. Murie
Collection of Robert T. Orr
Collection of Arthur Peake
Collection of Kenneth Racey
Collection of William B. Richardson
Collection Rocky Mt. Spotted Fever Lab.
Collection of Victor B. Scheffer
Collection of William T. Shaw
Collection of O. P. Silliman
Collection of W. E. Snyder
Collection of Frank Stephens
Collection of T. C. Stephens
Collection of D. D. Stone
Collection of Myron H. Swenk
Collection of Joe and Dean Thiriot
Collection of John Tyler
Collection of Jack C vonBloeker
Collection of Alex Walker
Collection of Edward R. Warren
Colorado Museum of Natural History
Charles R. Conner Museum
Cornell University
Donald R. Dickey Collection
Field Museum of Natural History
Florida State Museum
Fresno State Junior College
Humboldt State Teachers College
Illinois Natural History Survey
Iowa State College
Iowa Wesleyan College
Kansas State Agric. College
Leland Stanford Junior University
Leningrad Academy of Science
Los Angeles Mus. Hist. Art and Sci.
Louisiana State University
Mt. Rainier Nat'l Park Collection
Museum of Comparative Zoölogy
Mus. Polonais d'Hist. Nat., Warsaw
Mus. Vert. Zoöl., Univ. California
Museum of Zoölogy, Univ. Michigan
National Museum of Canada
Naturhistoriska Ricksmuseum, Sweden
Neuchatel University Museum
New York State Museum
Ohio State Museum
Oklahoma Agric. and Mech. College
Ottawa University, Kansas
Paris Museum
Provincial Museum of British Columbia
Royal Ontario Museum of Zoölogy
San Diego Society of Natural History
State Hist. and Nat. Hist. Soc. Colo.
State Normal School, Cheney, Wash.
Texas Cooperative Research Collection
United States National Museum
University of Arkansas
Univ. California Mus. Palaeo.
University of Idaho
Univ. Kansas Mus. Nat. History
University of Minnesota
University of Notre Dame
University of Oklahoma
University of Oregon
University of South Dakota
University of Utah
Univ. Washington Museum of Zoölogy
University of Wisconsin
Univ. Zool. Mus., Copenhagen
The largest single collection is in the United States National Museum, where the specimens of the
National Museum proper and the United States Biological Surveys Collection, together, provide
essential materials including a large share of the holotypes. Specimens in all of the North American
collections including Canada and México have been made available, by loan, and in 1937 materials
were examined in the principal collections of northern and central Europe. After the materials in
North American collections were assembled, special effort, with considerable success, was made in
each of several winters, to obtain specimens from areas not previously represented in collections.
To the many persons who were in charge of the collections consulted, to those who at my request
sought critical specimens, and to those who assisted in various stages of assembling data and in
preparation of the manuscript, I am grateful indeed. Likewise, I am deeply appreciative of the
grants-in-aid received from the Carnegie Institution of Washington, the University of California
Chapter of Sigma Xi, the John Simon Guggenheim Memorial Foundation and the Kansas University
Endowment Association. I am mindful also of an obligation to those who appropriated funds, by
legislative action, for research use by The University of California and The University of Kansas.
For assistance with the illustrations I am indebted to the late Major Allan Brooks for Plate 1, to
Mrs. Mary Blos for figures 25-31, to Miss Ann Murray for figures 11-13, to Mr. W. C. Matthews for all
the photographs, to Mrs. Freda L. Abernathy for figures 2-9, 18-22, 24, and for retouching all the
photographs except the following which were retouched by Mrs. Virginia Unruh: figs. d of plates 2,
3, 4, 9, 10, 11, 16, 17; figs. i of plates 5, 6, 7; figs. h, j, k of plate 7; figs. f and g of plates 12 and
13; and figs. c and d of plate 14. To Mrs. Unruh I am further indebted for figures 1, 16, 17 and 23
and for much terminal assistance with preparing most of the illustrations for the engraver.
The methods of study, after specimens were assembled, included first comparisons of specimens of like
age and sex from each of several localities to ascertain the constant features by which full species were
distinguishable, one from the other. For example, it was found that in every individual from Trout Lake,
Washington, of the species here designated Mustela erminea, the postglenoidal length of the skull
amounted to more than 47 per cent of the condylobasal length whereas it was less than 47 per cent in
all individuals here designated as Mustela frenata, from the same locality. Testing of specimens from
other localities by means of this and other selected characters permitted the outlining of the geographic
ranges of the full "species-groups." By comparing specimens of other nominal species and by examining
specimens from localities geographically intermediate between the nominal species, I found
intergradation and therefore arranged the nominal species as subspecies of a single species.
Intergradation here is understood to be the result of crossbreeding in nature between two kinds of
animals in the area where the geographic ranges of the two kinds meet. Presence of intergradation
between two kinds of weasels was basis for according them subspecific rank. Absence of intergradation
in nature at every place where the geographic ranges of two kinds met or overlapped, and absence of
intergradation by way of some other kind, or chain of kinds, was basis for according each of the two
kinds full specific rank. By thus applying the test of intergradation, or lack of it, I found that there were
four full species of weasels, of the subgenus Mustela, in all of the Americas.
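The Trout Lake comparison described above amounts to a simple ratio test. As a minimal sketch, assuming that the two skull measurements are taken in millimeters and that the 47 per cent threshold applies only to specimens from that locality, the test could be written in Python as follows. The function names and the sample measurements are invented for illustration and are not taken from the specimens studied.

```python
def postglenoidal_percent(postglenoidal_mm: float, condylobasal_mm: float) -> float:
    """Postglenoidal length expressed as a percentage of condylobasal length."""
    return 100.0 * postglenoidal_mm / condylobasal_mm


def classify_trout_lake(postglenoidal_mm: float, condylobasal_mm: float) -> str:
    """Apply the 47 per cent criterion reported for Trout Lake, Washington."""
    pct = postglenoidal_percent(postglenoidal_mm, condylobasal_mm)
    if pct > 47.0:
        return "Mustela erminea"
    if pct < 47.0:
        return "Mustela frenata"
    # The text gives no rule for a value of exactly 47 per cent.
    return "undetermined"


# Invented measurements, for illustration only.
print(classify_trout_lake(16.0, 33.0))  # ratio above 47 per cent
print(classify_trout_lake(21.0, 46.0))  # ratio below 47 per cent
```

A single ratio of this kind separates the two species at one locality; as the text notes, other selected characters were tested at other localities before the ranges were outlined.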
Next, the specimens of one species were arranged in trays in a geographic sequence. The specimens
from any one locality were segregated by sex and under one sex from one place were arranged from
oldest to youngest, that is to say by age. The four series with the largest numbers of individuals of a
given age were selected. Seventeen cranial measurements and three external measurements were
recorded for each individual of each of these four series. For each measurement, the coefficient of
variation, standard deviation and probable error were computed. The four samples subjected to such
analysis were a series of adult males, one of adult females, one of subadult males and one of subadult
females. Also, studies of each sex were made to ascertain seasonal changes in pelage. After data were
obtained on ontogenetic (age) variation, secondary sexual variation, seasonal variation, and degree of
individual variation by studying specimens in the manner described above, tests were made for
subspecific (geographic) variation by comparing series of specimens of like sex, age and season, from
different localities. For each one of several geographically variable features noted, a map was prepared
for animals of each sex. When all the data thus obtained were codified, subspecific ranges were, in a
sense automatically, obtained. On the resulting map showing geographic ranges of subspecies for a
species, a type locality was accurately plotted for each name that had been applied to the species, and
names then were applied in accordance with the international rules of zoölogical nomenclature.
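The three statistics named above can be illustrated with a short Python sketch. It assumes the sample standard deviation (n - 1 in the denominator) and the classical definition of probable error as 0.6745 times the standard deviation; the text does not say whether the probable error was computed for a single observation or for the mean, so both are shown. The function name and the measurements are hypothetical.

```python
import math
import statistics

# Half of a normal distribution lies within 0.6745 standard deviations of the mean.
PROBABLE_ERROR_FACTOR = 0.6745


def summarize(measurements: list[float]) -> dict[str, float]:
    """Return mean, standard deviation, coefficient of variation and probable error."""
    n = len(measurements)
    mean = statistics.mean(measurements)
    sd = statistics.stdev(measurements)  # sample standard deviation (n - 1)
    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "cv_percent": 100.0 * sd / mean,
        "pe_single": PROBABLE_ERROR_FACTOR * sd,
        "pe_mean": PROBABLE_ERROR_FACTOR * sd / math.sqrt(n),
    }


# Invented condylobasal lengths (mm) for one series, for illustration only.
series = [44.1, 45.0, 45.6, 46.2, 44.8, 45.9]
for name, value in summarize(series).items():
    print(f"{name}: {value:.3f}")
```

Because the coefficient of variation is the standard deviation expressed as a percentage of the mean, it allows the variability of measurements of different absolute size to be compared directly.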
VARIATION
Variation with Age
The kind of variation which results from increasing age has been dealt with extensively for the skull (of
the Old World Mustela erminea) by Hensel (1881) and for the external features and to some extent for
the skull by Hamilton (1933) in the North American forms M. erminea cicognanii and M. frenata
noveboracensis.
The young of both erminea and frenata are hairless and blind at birth. In M. frenata noveboracensis,
the eyes open on approximately the 37th day. In animals 2 to 4 months old, the tail is pointed at the tip.
This is because the terminal hair of the tail, including the black tip, is short and lies flat on the tail. In
subadults and adults the hair on the terminal part of the tail is as long as that on the basal part, and the
tail appears to be of uniform diameter all the way out to the end.
In the western subspecies of M. frenata, and in its tropical subspecies, animals so young as to have
pointed tails commonly have the underparts of the body more intensely colored than do adults. The
young may have salmon-colored instead of yellowish fur on the underparts.
Otherwise, in animals that have attained approximately adult proportions—which appears to be at
approximately 6 months of age in males—there are no variations which are ascribable to increasing age
in the color-pattern or pelage that cause the systematist to confuse species or subspecies.
Of the several parts of the skull in juvenal animals, the braincase and width of the posterior part of the
palate are most nearly of the size attained in the adult, the facial part of the skull at birth is the least
developed, and the interorbital region is, in relation to its ultimate adult size, intermediate in stage of
development. The permanent teeth are acquired when the animal is approximately eleven weeks old.
Four age groups, based on characters of the dentition and skull, have been recognized. They are:
Juvenile.—One or more deciduous (milk) teeth present. Birth to three months of age.
Young.—Sutures widely open between the maxillae and nasals and between the premaxillae and
nasals. Three to seven and a half months of age.
Subadult.—Sutures between maxillae and nasals visible but indistinct. Seven and a half to ten
months of age.
Adult.—Bones of rostrum coalesced with no traces of sutures visible to the naked eye. More than
ten months old.
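The approximate age limits attached to the four groups can be summarized as a lookup. The Python sketch below is only an illustration: in practice the groups are recognized from the teeth and sutures, not from a known age, and the assignment of the boundary ages (exactly three, seven and a half, and ten months) to the younger group is an assumption made here, since the text does not settle it.

```python
def age_group(months: float) -> str:
    """Map an age in months to one of the four age groups defined above."""
    if months < 0:
        raise ValueError("age cannot be negative")
    if months <= 3.0:
        return "juvenile"   # one or more deciduous teeth present
    if months <= 7.5:
        return "young"      # sutures widely open
    if months <= 10.0:
        return "subadult"   # sutures visible but indistinct
    return "adult"          # rostral bones coalesced


for months in (1, 5, 9, 12):
    print(months, age_group(months))
```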
The skull as a whole increases in size until the animal is two-thirds of the way through the stage
designated as young. After this time the width of the rostrum, as measured across the hamular
processes of the lacrimals, increases until approximately a third of the way through adulthood. The
interorbital breadth decreases from late subadulthood to adulthood and even in adults there appears to
be a slight decrease in this part of the skull with increasing age.
The average zoölogist will readily distinguish skulls of juveniles and young from adults but usually fails
to distinguish subadults from adults. Nevertheless, subadults must be distinguished from adults if
geographic variation is to be measured accurately. The reason for this is that such differences in the
form (not size) of the skull as result from increasing age equal and often exceed the differences of a
geographic sort which serve for distinguishing subspecies that have adjoining geographic ranges. All
sutures in the skull, except those between the tympanic bulla and the braincase, and those on the
dorsal face of the rostrum, are obliterated while the animal is a subadult. Most kinds of mammals retain
sutures throughout life or until the animals are well into adulthood. Therefore, skulls of weasels offer
fewer features for estimating age than do those of most mammals and the skulls of weasels that are
subadults or older are more difficult to classify accurately as to age than are the skulls of most other
mammals. More reliance on shape of entire skull and less reliance on extent and shape of any individual
bone is necessary in estimating the age of a weasel. Wright (1947:344) shows that the weight of the
baculum (os penis) is a certain means of differentiating adults from males of lesser age. When
approximately eleven months old, Mustela frenata oribasus of western Montana molts from the white
winter coat into the brown summer coat. At that time spermatogenesis starts for the first time and the
weight of the baculum increases from less than 30 milligrams to more than 52 milligrams.
In the autumn and early winter, most of the specimens are subadults. Ordinarily the few adults obtained
in these seasons can easily be segregated from the subadults because ontogenetic development in the
twelve additional months of life of each of the older animals has obliterated the sutures on the rostrum,
heightened (vertically) and lengthened (anteriorly) the sagittal crest, widened the rostrum, and
produced still other changes in form that are revealed by direct comparison of specimens of the two
ages.
Secondary Sexual Variation
The secondary sexual variation that has been detected is in size of the animal, relative length of the
tail and shape of the skull. The female is the smaller. In the small Mustela rixosa and apparently in
Mustela africana the secondary sexual difference in size is relatively slight. In Mustela frenata and
Mustela erminea, males are approximately twice as heavy as females, the degree of difference very
definitely depending upon the subspecies. For example, in M. e. richardsonii the recorded weights are
175 and 69 grams as opposed to 81 and 54 grams in M. e. cicognanii. In general, within one species the
greatest difference in size of males and females is in those subspecies in which the animals are of large
size. The secondary sexual variation in size is much more than the individual variation in either sex. The
same is not true of secondary sexual difference in length of the tail (relative to the length of the head
and body), which in eighteen subspecies of M. erminea is from 1 to 7 per cent longer in males than in
females. In two subspecies, M. e. haidarum and M. e. olympica, the tail is a fraction of a per cent the
longer in females if we may rely upon the few specimens for which collectors' measurements are
available.
In both M. erminea and M. frenata the skull of the female is approximately 45 per cent lighter than that
of the male, or put in the opposite way, the skull of the male is 83 per cent heavier than the skull of the
female. The difference in this respect varies greatly depending on the subspecies. For example, the skull
of the male is 127 per cent heavier than that of the female in M. e. richardsonii but only 33 per cent
heavier in M. e. anguinae. In Mustela frenata, the subspecies noveboracensis shows most sexual
dimorphism in weight of skull (3.6 and 1.7 grams) and olivacea the least (5.3 and 3.8 grams). In
general, the difference in this respect is less in subspecies the individuals of which are of small size.
Therefore, as might be expected, the secondary sexual variation in weight of the skull is less in M.
rixosa, individuals of which are of small size, than in M. erminea or than in M. frenata, in general of
larger size. Nevertheless, in M. africana, in which the individuals are of large size, there appears to be
less sexual dimorphism in weight of the skull than in M. frenata or than in M. erminea, although it
should be remarked that there are too few data for M. africana to allow of forming a trustworthy
conclusion concerning the amount of secondary sexual variation in that species.
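The two ways of stating the difference in weight of the skull ("per cent lighter" and "per cent heavier") use different bases, which is why 45 and 83 describe the same difference. A minimal Python sketch of the conversion follows; the function names are hypothetical, and the skull weights are those quoted above for M. f. noveboracensis and M. f. olivacea.

```python
def percent_heavier(male_g: float, female_g: float) -> float:
    """Excess of the male over the female, as a percentage of the female's weight."""
    return 100.0 * (male_g - female_g) / female_g


def percent_lighter(male_g: float, female_g: float) -> float:
    """Deficit of the female from the male, as a percentage of the male's weight."""
    return 100.0 * (male_g - female_g) / male_g


def lighter_to_heavier(lighter_pct: float) -> float:
    """Convert 'female is x per cent lighter' into 'male is y per cent heavier'."""
    return 100.0 * lighter_pct / (100.0 - lighter_pct)


print(f"{lighter_to_heavier(45.0):.1f}")       # about 82, close to the 83 quoted
print(f"{percent_heavier(3.6, 1.7):.1f}")      # noveboracensis
print(f"{percent_heavier(5.3, 3.8):.1f}")      # olivacea
```

An exact 45 per cent gives about 82 per cent; the quoted 83 per cent is consistent with a figure slightly above 45 that has been rounded.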
The secondary sexual variation in shape of the skull consists of a slenderness in the female. In relation
to the basilar length the spread of the zygomatic arches is more in males and, except in the one
subspecies M. f. altifrontalis, the rostrum is broader. Also the interorbital region is relatively broader in
males of most subspecies. In most subspecies of both M. frenata and M. erminea the tympanic bullae
are relatively (to the basilar length) longer in females. The maximum sexual dimorphism occurs in M.
erminea arctica and the minimum dimorphism in M. e. haidarum, M. e. anguinae and M. e. muricus.
Taking into account all of the subspecies of each of the North American species, the shape of the skull
differs most in M. erminea and least in M. frenata. In the latter species the greatest difference in shape
of the skull, as was true also of its weight, is in the subspecies M. f. noveboracensis. In these two
subspecies, M. f. noveboracensis and M. e. arctica, in addition to the secondary sexual variation already
mentioned in the skull, females have the braincase smoother and more rounded, the postorbital-,
mastoid-, and lacrimal-processes relatively smaller, and the ventral face of the tympanic bulla at its
anterior margin more nearly flush with the floor of the braincase.
In the weasels, subgenus Mustela, the disparity in size of the two sexes is almost or quite as much as in
any other fissiped carnivore. It is because of this large degree of difference that the skulls of the two
sexes are described separately in the following systematic accounts. The need for such treatment was
recognized by Reinhold Hensel (1881:127) more than sixty years ago when he wrote in the introduction
to his "Craniologische Studien," of Mustela, as follows: ". . . die Geschlechtsdifferenzen am Schädel
vieler Säugethiere . . . so gross sind, dass man diese wie Schädel verschiedener species behandeln
muss, während in anderen Ordnungen (Rosores, Edentaten) die Schädel solche Unterschiede
nicht zeigen." In the past, failure to appreciate the large amount of secondary sexual variation has
resulted in erroneous deductions as regards characters of certain geographic races and has been the
cause of some nomenclatural confusion, as for example, in Mustela frenata macrura, where the female
was named as a separate species (Mustela jelskii).
Individual Variation
Individual variation is here considered to be the variation in one species which can occur between
offspring of a single pair of parents, after variation ascribable to differences in age, sex, and season is
excluded. Individual variation, therefore, is a term here used in a composite sense; it includes variations
which probably represent different genetic strains within certain populations and variations induced
within one generation by environmental factors.
In skulls of weasels, the individual variation in size is more than it is in relative proportions. Hensel (op.
cit.) has stressed that weasels, like other carnivores, produce "dwarfed" individuals more than do
herbivorous mammals. I cannot vouch for the accuracy of this view, but can say that individual variation
is not greater than in some other fissiped carnivores. Impressions to the contrary probably result largely
from failure to recognize age-variation. When skulls of a large series from any one locality are arranged
first by sex, and under each sex according to probable age on the basis of extension anteriorly of the
sagittal crest and of degree of postorbital constriction, individual variation is seen to be less than a
cursory examination, even of only one sex, would suggest.
Study of a large series of one age of one sex of one species from one locality shows that some parts, of
the skull for example, vary more than other parts. In illustration, among 22 male topotypes of Mustela
frenata washingtoni the least interorbital breadth varied 25 per cent (9.0 mm. to 12 mm.) whereas the
length of the tooth-rows varied only 13.3 per cent (15.6 mm. to 18.0 mm.). In color the individual
variation definitely is more in areas of intergradation between subspecies than in other areas. Details of
one such instance of intergradation are given in the account of Mustela frenata spadix.
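The percentages quoted for the topotypes of M. f. washingtoni are reproduced if the range (maximum minus minimum) is expressed as a percentage of the maximum. That reading is an inference from the figures, not a formula stated in the text; the Python sketch below, with a hypothetical function name, shows the arithmetic.

```python
def range_as_percent_of_max(minimum: float, maximum: float) -> float:
    """Range of a measurement expressed as a percentage of its largest value."""
    return 100.0 * (maximum - minimum) / maximum


# Least interorbital breadth, 9.0 mm to 12 mm.
print(f"{range_as_percent_of_max(9.0, 12.0):.1f}")

# Length of tooth-rows, 15.6 mm to 18.0 mm.
print(f"{range_as_percent_of_max(15.6, 18.0):.1f}")
```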
Statements to the effect that there is much individual variation in the color of weasels, were made
mostly fifty years or so ago by writers who had but few specimens from widely separated localities.
Where marked climatic differences exist between localities only a few miles apart, marked differences
occur in coloration of the weasels from the different localities. Much of what formerly was mistaken for
individual variation now proves to be geographic variation. Individual variation actually is of slight
amount in comparison with that in mammals generally. Differences in size and relative proportions of
parts usually are correlated with geographic differences in color. The color does fade slightly in the
period between molts. Also as a result of the seasonal color change, in autumn along the upper margin
of the Austral Life-zone, some individuals become white whereas others become white on only the
underparts, the upper parts changing only to lighter brown. Probably it would be correct to say that this
variation was a combination of seasonal and individual variation rather than either one alone.
As might be supposed, individual variation is not the same in all species or subspecies. For example, p2
is always absent in Mustela africana and always present in certain subspecies of M. frenata. In some
other subspecies of M. frenata, p2 is absent approximately as often as present. In the writer's
experience, when only a few specimens are available for comparison, individual variation is more
difficult to distinguish from specific and subspecific (geographic) variation than is age-variation or
secondary sexual variation.
Among the larger series of specimens examined, only one instance of what might be called a mutation
in the old sense of a large, sudden change, was detected. That was the loss of the second lower molar
in many (less than a third) of the specimens from Newfoundland. The six instances of abnormal
coloration described on pages 41 to 43, might be regarded as mutations of large magnitude but no
evidence was found of repetition of an abnormality in any one population. Otherwise, in every instance
where plotted, the manifestations of a variation arranged themselves about the mean in such a way as
to form a smooth, unimodal curve.
Seasonal Variation
When subspecific and specific variations are the objectives of study, seasonal variation must be
understood, in order to be excluded from consideration, in the same way that variations ascribable to
age, sex and individualism must be understood in order to be excluded from consideration. In weasels,
change in color of the pelage is the seasonal variation most important for the systematist to
understand. Other seasonal variations in the pelage are hairiness versus nakedness of the pads of the
feet, length of the pelage on the body, and possibly the density of the pelage on the body. In the
northern half of North America, roughly speaking, seasonal change in color is so pronounced (white in
winter and brown in summer) as to be easily recognized. South of this area, in the Austral and Sonoran
life-zones, the color of the winter pelage differs only slightly from that of the summer pelage. In these
more southern latitudes the winter pelage in almost all subspecies is of lighter color than the summer
pelage and has a smoky suffusion. With material of the two seasons in hand for comparison, close
attention to the variation will permit the systematist to recognize the difference in shade of brown as
seasonal variation and not geographic or specific variation. Farther south still, in the Tropical Life-zone,
seasonal difference in color was not detected in the material studied. Seasonal change in color is
discussed in the section immediately following.
Variation in Coloration and Molt
In all American weasels (subgenus Mustela) the color, at least in summer, is brown with more or less
white or whitish on the underparts. In one species, Mustela africana, there is a longitudinal stripe of
brown on the middle of the light-colored underparts; this stripe is absent in each of the other three
American species. Two species, M. erminea and M. frenata, always have a black tip on the tail. Of the
other two species, M. africana lacks the black tip and M. rixosa may or may not have a few black hairs
in the tip of its tail. White or light yellowish facial markings occur in subspecies of M. frenata from the
southwestern United States to Central America. Subspecies having the most extensive light-colored
facial markings have the remainder of the upper part of the head black. In weasels without light facial
markings the upper parts of the head all are brown. In the two species, M. erminea and M. frenata, the
extent to which the light color of the underparts extends down the insides of the legs and out on the
underside of the tail, or the absence of light color on these parts, is a matter of geographic variation.
The same can be said for M. rixosa except that, first, its tail is unicolored and, second, individual variation
as well as geographic variation accounts for the color pattern on the underparts and legs in animals
from the southeastern part of the range of the species.
The most remarkable feature of the coloration of weasels is the winter whitening. This occurs in the
northern part of North America in each of the three species of weasels found on that continent. The
black tip of the tail in M. erminea and M. frenata remains black in winter. If an individual of M. rixosa
has black hairs on the tip of its tail in summer, there are thought to be black hairs there also in winter.
Otherwise the winter pelage is all white in northern areas in each of the three species. In this white
winter coat the animal is known as ermine.
The underlying cause seems to be protective coloration. At any rate, weasels are always white in winter
if they are from areas where snow lies on the ground all winter, every winter, or almost every winter;
and they are always brown if from areas where there is never, or rarely, snow in winter. The changes in
color are effected by molt, one in autumn and one in spring. Animals that are brown in winter undergo
the same two molts as do those that are white in winter. The capacity to acquire a white coat or a
brown coat in winter is an hereditary matter just as one man grows red hair and another grows black
hair. In the weasels, however, all individuals in the north turn white in winter and if one that was born
there is kept through successive winters in the warmer south where there is no snow, he will still turn
white each winter. A weasel born in a southern area, where all are brown in winter, molts into a brown
(not white) winter coat even when kept in a cold, snowy, northern area where native weasels of the
same species all turn white. Obviously, therefore, neither snow nor temperature is an immediate cause
and, as we have said, the color in winter is a matter of heredity. The time of the molt, we now know, is
determined by the amount of light. When nights grow longer and days shorter, a point is reached at
which the lesser light received through the eyes causes the pituitary gland to cease producing a
gonadotropic hormone. Directly or indirectly, the lack of this hormone stimulates molt and, probably,
enzyme action, or the lack of it, causes the melanoblasts of the cells in the hair follicle to be without
pigment. Hence the hair grown from a follicle under such conditions lacks pigment (melanin) and is
white. In spring, as the days grow longer and the nights shorter, the increasing amount of light received
day by day through the eyes stimulates the pituitary gland to produce the gonadotropic hormone, which,
directly or indirectly, stimulates molt. Probably by enzyme action, melanoblasts are caused to be
present in cells of the hair follicle, and these melanoblasts provide granules of melanin pigment which are
incorporated in cells of the growing hair. These granules of pigment give the hair its color.
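The two parts of this hypothesis can be kept apart in a small sketch: light decides when a molt happens and in which direction, while heredity decides what color the winter coat is. The following Python is only a toy illustration under that reading. The class `Weasel`, its field `hereditary_winter_color`, and the function `coat_after_molt` are hypothetical names introduced here, and the sketch ignores day-length thresholds, hormones, and the intermediate belt where both colors occur.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weasel:
    # Winter color inherited from the animal's native stock: "white" or "brown".
    # (Hypothetical field name, introduced only for this illustration.)
    hereditary_winter_color: str


def coat_after_molt(weasel: Weasel, days_shortening: bool) -> str:
    """Return the coat color produced by the molt that the light change sets off.

    Shortening days start the autumn molt, which yields the hereditary winter
    color; lengthening days start the spring molt, which yields brown.
    Snow and temperature are deliberately absent from the arguments.
    """
    if days_shortening:
        return weasel.hereditary_winter_color
    return "brown"


northern = Weasel(hereditary_winter_color="white")
southern = Weasel(hereditary_winter_color="brown")

print(coat_after_molt(northern, days_shortening=True))   # autumn molt, northern stock
print(coat_after_molt(southern, days_shortening=True))   # autumn molt, southern stock
print(coat_after_molt(northern, days_shortening=False))  # spring molt
```

Because the function takes no climate argument, moving an animal to another locality cannot change its result, which is the prediction that distinguishes heredity from snow or temperature as the cause.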
Evidence in support of this hypothesis is given below.
Along the Pacific Coast from British Columbia southward, M. erminea (see fig. 25 on page 95) is brown
in winter. This is an area where snow rarely falls and the temperature in winter ordinarily is above
freezing. In the remaining part of the American range of this species the temperature in winter is below
freezing much of the time and snow remains throughout the winter or for long periods. In this colder
part of the animal's range, only white coats occur in winter. M. frenata likewise has a white coat in
winter in the part of its geographic range where snow and freezing temperatures prevail throughout
most of the winter and a brown coat in warmer, snowless areas to the southward and along the Pacific
Coast. The third species, M. rixosa, exhibits a corresponding correlation between coat color and climate.
On the Asiatic continent, several species, including M. erminea, provide parallel correlations and
nowhere are there any exceptions for the subgenus Mustela. These data are an important part of the
material on which we have based the induction that the underlying cause of seasonal change in color is
a need for protective coloration.
As regards molt, most naturalists who have written upon the subject regard it as responsible for the
change from the white winter coat to the brown summer coat. However, the change from brown
summer coat to white winter coat has been thought by several writers to be effected by change in
coloration of the individual hairs. Among those holding this opinion there may be cited Bell (1874:197)
in reference to Mustela erminea, and Coues (1877:123) in reference to American specimens to which he
applied the same name. More lately Hadwen (1929) has taken this same view, and Gunn (1932) also
discusses the possibility of the hairs changing color. Bachman (1839:228-232), Macgillivary (1843?:158),
Audubon and Bachman (1851 (vol. 2):62), Schwalbe (1893:538), Pearson et al. (1913:447), Miller
(1930, 1931A), Hamilton (1933:300) and Rothschild (1942), among others, have been inclined to the
opinion, or positively affirm, that the color change in autumn is the result of a molt. The papers cited
above contain, in turn, references to many other printed accounts dealing with this question.
To my mind, it has not so far been demonstrated that the change in color of weasels in autumn is
accomplished without a molt. Also so far as I am aware, no explanation has been given of how the
pigment may disappear from the hair of weasels. Metchnikoff's (1901:156) idea that the senile
whitening of the hair in man is accomplished by phagocytes which remove the pigment granules would
hardly seem to explain the relatively sudden and complete autumnal change occurring in weasels.
Anyhow, Danforth (1925:108), and some other students have thought that the action of these
phagocytes was at most a factor of slight importance in the whitening of hair. Whatever be the
complete answer to the question of how the weasel changes color in autumn, at least one specimen of
long-tailed weasel, which is in process of color change in autumn, presents clear evidence of molt of the
overhairs. This specimen of M. f. longicauda is no. 188408, U. S. Nat. Mus., taken on November 12,
1897, at Rapid City, South Dakota. Other specimens of M. erminea which were taken in autumn similarly
show molt to be in progress. For these and other reasons, I am inclined to the opinion that the
autumnal change in color, like the one in spring, is effected by molt. During the period of the autumnal
color change, Noback (1935:27) had a captive M. f. noveboracensis and, each morning, found clumps of
brown hair on the floor of its cage; this was strong indication that molt was responsible for the color
change in this instance.
However, I freely admit that the evidence does not prove that the change from brown to white can be
accomplished only by molt; in the present state of knowledge it would be unscientific to deny that the
change were possible of accomplishment by other means. Also, it is true that the fifteen specimens
before me of Mustela frenata, subspecies included, in process of change from brown to white, with the
exception of the one from Rapid City, South Dakota, if taken individually, do not, in macroscopic
examination, show definite molt lines or other absolutely convincing evidence of molt. However, these
same specimens, insofar as examined microscopically, do show overhairs all white, or overhairs
pigmented throughout. The lighter color of the proximal parts of the overhairs in itself should not be
accepted as evidence of color change, for in the fresh summer pelage, the same condition exists. Also,
careful macroscopic examination suffices to show that in the transitional pelage of autumn, the brown
overhairs generally are longer than the intermixed white overhairs.
Whether the underfur behaves in exactly the same way as the overhair, I have not myself definitely
ascertained, but I assume that the underfur is molted twice each year, at least in the northern
populations of Mustela frenata and in the other species of more northern distribution. Schwalbe's (1893)
work, including sectioning of the skin and study of the hair follicles, led him to conclude that the
underfur was molted twice each year in Mustela erminea.
In Mustela frenata noveboracensis, M. f. nevadensis, and M. f. nigriauris, measurements taken on adult
males show the overhairs to be longer in the winter pelage than in the summer pelage of specimens
from the same locality. For example, in M. f. nigriauris from Berkeley, California, the overhairs of the
summer coat (July and August) average 8 millimeters in length on the hinder back and 7 mm. on the
belly, but average 9.5 mm. and 8 mm. respectively in January-taken specimens possessing the full
winter coat. At Ann Arbor, Michigan, in the summer coat, the longest hairs on the hinder back average
approximately 12 mm., and those on the belly, 9.5 mm., against 13 mm. and 9.5 mm. respectively in
winter. Although general observations initially led me to believe that the black, terminal hairs of the tip
of the tail are longer in the winter pelage than in the summer pelage, actual measurements fail to show
a difference in length.
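The seasonal differences implied by these averages can be tabulated with a few lines of Python. This is a minimal sketch that uses only the figures quoted above; the dictionary layout and the name `overhair_mm` are assumptions made for illustration, and the Ann Arbor values are the approximate averages for the longest hairs rather than for all overhairs.

```python
# Average overhair lengths in millimeters, as quoted in the text: (summer, winter).
overhair_mm = {
    ("Berkeley, California", "hinder back"): (8.0, 9.5),
    ("Berkeley, California", "belly"): (7.0, 8.0),
    ("Ann Arbor, Michigan", "hinder back"): (12.0, 13.0),
    ("Ann Arbor, Michigan", "belly"): (9.5, 9.5),
}

# Winter length minus summer length for each locality and body region.
differences = {
    key: winter - summer for key, (summer, winter) in overhair_mm.items()
}

for (locality, region), diff in differences.items():
    print(f"{locality}, {region}: {diff:+.1f} mm")
```

On these figures the winter coat is longer on the hinder back at both localities and on the belly at Berkeley, while the Ann Arbor belly measurements are equal in the two seasons.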
The change from one coat to the other in the long-tailed weasel has been described among others by
Miller (1930, 1931A), Hamilton (1933) and Glover (1942) on the basis of captive specimens. In a
general way, the progress of the molt in their specimens agrees with that which I have been able to
make out from examination of skins taken in the wild. There is, however, this difference: Their
specimens show a more spotted pattern when in process of hair-change than do specimens taken in the
wild. Probably the more or less unnatural conditions under which these captive animals lived modified
the normal progress of molt.
In wild-taken specimens of the species Mustela frenata, subspecies included, the spring molt begins on
the mid-dorsal line and proceeds laterally, producing, at almost any given time, a relatively sharp molt
line separating the white winter hair from the incoming brown summer coat. However, in autumn the
change takes place first on the belly, then on the sides, and finally makes its appearance over all the
upper parts at about the same time, with the result that the upper parts have a salt-and-pepper
appearance without at this time any sharply defined molt lines. In general, the molt pattern can be said
to be reversed in the two seasons; in spring, it begins on the back and in autumn, on the belly. The
difference in spring and autumn color pattern is better illustrated on plate 39 than by additional
description. Swanson and Fryklund (1935:123) have observed that the "spring molt proceeds
differently" than the fall one in Mustela rixosa, and Barrett-Hamilton (1903:309) in commenting on the
European hare (and the stoat?) remarks, "In spring the moult, and with it the brown colour, progresses
in exactly the opposite order . . ." as compared with the white color of autumn, which that particular
writer thought resulted from removal of pigment from the hairs rather than from molt.
The tail, excepting the black tip, lags in the molt in many instances, with the result that, especially in
spring, it may retain a few white hairs as late as does the belly. In autumn it is less tardy and so far as I
have observed, becomes white at about the same time that the general area of the back changes color.
On the tail, the black tip itself, as clearly shown in more than a score of specimens, is molted at
approximately the same time in autumn as is the pelage of the body. However, the long black hairs,
which appear in, say, November, appear to increase in length until January. In spring, the long black
hairs of the tip of the tail seem not to be shed at the same time as the rest of the winter pelage, but
remain approximately six weeks longer and then are replaced by long black hairs of the summer coat.
At any rate, this is the picture presented by a half dozen specimens of M. f. nevadensis and M. f.
longicauda which do show a spring molt to be in progress on the black tip of the tail. Schwalbe similarly
(1893:536-537) has suggested that the black tip of the tail in Mustela erminea in spring is not molted
until about two months after the pelage on the rest of the body is changed. Schwalbe (loc. cit.) thinks
also that in M. erminea studied by him, the black tip of the tail in autumn is replaced approximately one
month in advance of the pelage on the rest of the body. As indicated above, my specimens of Mustela
frenata, subspecies longicauda and nevadensis, do not show this discrepancy in autumn. I have
considered the possibility that the black tip of the tail, in some species of Mustela, is molted only once
while the remainder of the coat was undergoing two molts. My inconclusive data lend but little support
to this possibility.
The difference in pattern of color between specimens taken in autumn and spring is known to some fur-
trappers of my acquaintance who have suggested that molt occurs in spring, whereas the individual
hairs change color in autumn. Reference to plate 39 will show how gross comparisons might lead one to
this erroneous explanation of the color change.
As to time of molt: In eight subspecies of Mustela frenata, namely, noveboracensis, occisor, primulina,
spadix, longicauda, arizonensis, nevadensis and effera, material is available to indicate that the
autumnal molt begins in October and is completed in November, and that the spring molt occurs in
March or April. A condensed list of specimens providing basis for this statement is as follows:
M. f. noveboracensis: 26 specimens in transitional pelage taken in autumn and 14 taken in spring;
M. f. occisor: One topotype has acquired one-fifth of the winter pelage on October 22, 1896;
M. f. primulina: 2 in November, one in March, and 2 in April are in process of change;
M. f. spadix: 6 autumnal specimens and one in April show pelage change;
M. f. longicauda: 7 autumnal specimens and one in April show pelage change;
M. f. arizonensis: 12 specimens in autumn and 3 in spring are in process of molt;
M. f. effera: One November-taken male has acquired four-fifths of the winter coat and another taken on April 21 at Fort Rock, Oregon, is half finished with the spring molt.
It may be added that no marked difference in time of either autumnal or spring molt is apparent as
between the more northern and more southern localities from which the mentioned specimens come.
With more complete material I would expect to find a difference in this regard.
The material of the other, more southern, subspecies of Mustela frenata has not been adequate to show
the time of molting or the number of molts which occur in one year.
Animals in the northern part of the range of Mustela frenata acquire a white winter coat, whereas those
in the southern part acquire a brown winter coat, and in an intervening area the winter coat may be
either brown or white. By plotting on a map the localities of capture of all specimens examined in the
winter coat, it was possible to outline this intervening area as shown in figure 10 on page 37. However,
Dearborn (1932:36) shows that in Michigan some animals have a brown coat in winter at places farther
north than figure 10 shows to be the case. Hamilton's (1933:306) map for New York shows the same to
be true in that state. Accordingly, the boundaries of the area shown in figure 10, in which both brown
and white long-tailed weasels occur in winter, are known to be only approximate; with full information
available the belt would be represented as wider.
Fig. 10. Map showing the region (in black) where both the brown and white winter pelage is found in the long-tailed weasel, Mustela frenata.
Hamilton (1933:302) has pointed out that "Where half of the weasels remain brown, these brown
winter specimens are always males." The results of my own examination of specimens not studied by
Hamilton provide, in a general way, confirmatory data. More exactly, my examination reveals that at the
most northern localities where brown specimens occur, only males are in this coat. In explanation, it
may be said that in plotting on a map localities of capture of specimens in the winter coat, thirteen
places were found where both sexes were represented and where both brown and white winter coats
were found. With the two sexes, it is theoretically possible to have nine different combinations of coat
color. With males all brown, there might occur females (1) all brown, (2) all white, or (3) some brown
and some white. In addition to these three combinations, we might have three more by finding the
mentioned types of female coat color repeated where all males are white, and three more, or nine in all,
by substituting a population of males some of which were brown and some of which were white. Seven
of these possible combinations actually were found. The two combinations not found were all white
males with all brown females, and all white males with females both brown and white. In the three
instances where the males all were brown and the females all were white, the localities of capture were
in the northern part of the variable area. This indicates that where the brown winter coat occurs at
northern localities, the brown individuals are all males. Farther south, of course, the females, too,
acquire the brown winter coat.
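The count of nine combinations, and the pattern in the two that were not found, can be checked by enumeration. The sketch below is a minimal illustration in Python; the state labels and the helper `has_brown` are assumptions introduced here, and the final check expresses one reading of the result rather than a statement made in those words in the text.

```python
from itertools import product

# The three possible conditions of winter coat color for one sex at a locality.
states = ("all brown", "all white", "some brown and some white")

# Every pairing of a male condition with a female condition: (males, females).
combinations = list(product(states, repeat=2))

# The two combinations reported as not found among the thirteen localities.
not_found = {
    ("all white", "all brown"),
    ("all white", "some brown and some white"),
}
found = [combo for combo in combinations if combo not in not_found]


def has_brown(state: str) -> bool:
    """True when at least some animals in this condition are brown."""
    return state != "all white"


print(len(combinations), "possible;", len(found), "found")

# Wherever some females are brown, some males are brown as well.
females_brown_only_with_males_brown = all(
    has_brown(males) for males, females in found if has_brown(females)
)
print(females_brown_only_with_males_brown)
```

Both missing combinations are those in which females would be brown at a place where every male is white, which agrees with the observation that the brown coat appears in males farther north than in females.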
Stated in another way, there is a broad belt across North America from the Atlantic to the Pacific in
which males of Mustela frenata at any one locality may be either brown or white in winter. Inside this
broad belt there is a narrower one, approximately half as wide, in which females at any one locality may
be either brown or white.
The following evidence, derived from my transplanting specimens of Mustela frenata, supports the idea
that color of the winter pelage is an hereditary matter and is not dependent on temperature or
snowfall.
A male captured on June 24, 1937, in the brown summer coat in Salt Lake City, Utah, was received by
me at Berkeley, California, five days later and kept in captivity almost six months. On November 17,
1937, half the pelage was white and on December 27, 1937, when next examined, the animal was in the
full, white, winter coat as it was on January 25, 1938, when it died. Native weasels all turn white in
winter in Salt Lake City, but in Berkeley native weasels always are brown in winter.
A juvenile or young animal, a male, captured in May, 1936, at Lafayette, Contra Costa County,
California, was kept there until August 13, 1936, when transferred to Calneva at the north end of Lake
Tahoe, California. The weasel was kept at Calneva until its death on December 23, 1937. In both the
winter of 1936-'37 and in that of 1937-'38, the winter coat was brown as in animals from its place of
origin (Contra Costa County) and unlike weasels of the Tahoe region nearly all of which turn white in
winter.
Two females, each approximately two months old, captured on May 1, 1936, at James Landing, 4 miles
northwest of San Pablo, Contra Costa County, California, were kept in Berkeley, California, until August
13, 1936, when they were transferred to the mouth of Blackwood Creek, on the west side of Lake
Tahoe, California. On October 25, 1936, both weasels escaped. On December 25, 1936, the headless
body of one of these was found approximately 300 yards south of the mouth of Blackwood Creek. The
animal had been dead at most a few days when found and was in the brown winter coat. At the place
of its origin all weasels are brown in winter but at the mouth of Blackwood Creek only 2 of 60 weasels
caught there in the winter coat were brown; the other 58 were white. The headless weasel was
identified, as one of the two formerly in captivity, by means of certain short toes, the ends of which had
been clipped off when the animal was a captive. No trace of the second female was found.
A female of unknown age, in white winter pelage, captured 4 miles southeast of Tahoe City, California,
and kept there until April 3, 1937, on which date it was brought to Berkeley, California, molted to brown
in the spring. The first signs of the brown coat were noted on April 14. On May 24 or 25 she gave birth
to 4 young which lived less than ten days. In the following winter this animal acquired a white coat. As
previously noted, weasels native to the Berkeley area, where this female was kept, have brown coats in
winter.
The weasels were in every instance kept in cages out-of-doors. The sides of the cages were open to the
elements. A nest box in each cage provided shelter. All were of the species Mustela frenata.
The significant results, it seemed to me, were that the winter coat was the kind found in the area where
the weasel originated instead of the kind found in weasels native to the areas in which the specimens
were held in captivity.
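The four transplant cases can be laid out as records and the conclusion tested against them. The Python below is a sketch under stated simplifications: the type `Transplant` and its field names are hypothetical, the Lake Tahoe localities are recorded as "white" although the text says only that nearly all weasels there turn white (58 of 60 at Blackwood Creek), the native color for the Tahoe City female is taken from her being captured in white winter pelage, and the escaped female from James Landing is counted at the place to which she was moved.

```python
from typing import NamedTuple


class Transplant(NamedTuple):
    animal: str
    origin: str
    moved_to: str
    usual_winter_color_at_origin: str
    usual_winter_color_where_moved: str
    observed_winter_color: str


records = [
    Transplant("male", "Salt Lake City, Utah", "Berkeley, California",
               "white", "brown", "white"),
    Transplant("young male", "Lafayette, Contra Costa County, California",
               "Calneva, Lake Tahoe, California", "brown", "white", "brown"),
    Transplant("female (recovered after escape)",
               "James Landing, Contra Costa County, California",
               "Blackwood Creek, Lake Tahoe, California",
               "brown", "white", "brown"),
    Transplant("female", "4 miles southeast of Tahoe City, California",
               "Berkeley, California", "white", "brown", "white"),
]

# Does the observed winter coat follow the place of origin or the new locality?
follows_origin = [
    r.observed_winter_color == r.usual_winter_color_at_origin for r in records
]
follows_new_locality = [
    r.observed_winter_color == r.usual_winter_color_where_moved for r in records
]

print("matches origin:", sum(follows_origin), "of", len(records))
print("matches new locality:", sum(follows_new_locality), "of", len(records))
```

In every record the winter coat agrees with the place of origin and in none with the new locality, although four animals are a small sample.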
That the time of molt is determined by the amount of light has clearly been shown by Bissonnette
(1944:223) for American weasels of the two species Mustela erminea and M. frenata. In his words (op.
cit.:246) "Reducing the daily periods of light induced molting and regrowth of new fur. . . . In the
Bonaparte weasels [Mustela erminea], white replaced brown. . . . Increasing daily light-periods caused
molting and change to dark brown. . . . Incomplete molts in both directions (toward white or toward
brown) were produced as a result of early reversal of increase or decrease of daily light-time. . . . That
this stimulus is received through the eyes and acts through the anterior pituitary gland is indicated by
Bissonnette's [1935:159] studies on ferrets, a nearly related animal. That the thyroids and sex-glands
are not essential is at least suggested . . . by Lyman's (1942) study on the varying hare [Lepus
americanus]." It can be added that Lyman (1943:451) demonstrated in Lepus americanus that the
effect of light is received through the eyes. He demonstrated this by masking the animals. To Wright
(1942B:109) who studied the two American weasels, M. erminea and M. frenata, it seemed likely that
the pituitary produced or released gonadotropic hormone at about the time of the spring molt and that
this molt and the spring changes in the reproductive tracts of the weasels might be caused by a
stimulus from a common source. Later, Wright (1950:130) injected a gonadotropic hormone into long-