Graphics of Large Data Sets: Visualizing a Million, 1st Edition, Antony Unwin
Statistics and Computing
Brusco/Stahl: Branch-and-Bound Applications in Combinatorial Data Analysis.
Dalgaard: Introductory Statistics with R.
Gentle: Elements of Computational Statistics.
Gentle: Numerical Linear Algebra for Applications in Statistics.
Gentle: Random Number Generation and Monte Carlo Methods, 2nd Edition.
Härdle/Klinke/Turlach: XploRe: An Interactive Statistical Computing Environment.
Hoermann/Leydold/Derflinger: Automatic Nonuniform Random Variate Generation.
Krause/Olson: The Basics of S-PLUS, 4th Edition.
Lange: Numerical Analysis for Statisticians.
Lemmon/Schafer: Developing Statistical Software in Fortran 95.
Loader: Local Regression and Likelihood.
Ó Ruanaidh/Fitzgerald: Numerical Bayesian Methods Applied to Signal Processing.
Pannatier: VARIOWIN: Software for Spatial Data Analysis in 2D.
Pinheiro/Bates: Mixed-Effects Models in S and S-PLUS.
Venables/Ripley: Modern Applied Statistics with S, 4th Edition.
Venables/Ripley: S Programming.
Wilkinson: The Grammar of Graphics, 2nd Edition.
Preface
Analysing data is fun. It is fascinating to find out about different topics,
and each dataset brings new challenges. Whether you are looking at times
of goals scored in Bundesliga soccer games, ultrasound measurements of
babies in the womb, or company bankruptcy data, there is always some-
thing new to be learnt. Graphic displays play a major role in data anal-
ysis, and all the main authors of this book have spent a lot of research
time studying graphics and ways to improve and enhance them. We have
also spent a lot of time actually analysing data. The two go together.
One of the major problems over the years has been representing data
from large datasets. Initially, computers could barely read in a large
dataset, so whether you could display it sensibly or not was not rele-
vant. Gradually, computers have been able to cope with larger and larger
datasets, and some weaknesses of standard data graphics have become
apparent. In tackling these problems, we have become aware that there
is knowledge around as to how to display large datasets, but it is not
readily available, certainly not in one place. We hope that our book will
help others interested in visualizing large datasets to find out more eas-
ily what has been done and to contribute themselves. More especially, we
hope it will help data analysts in analysing their data.
This book grew out of discussions at a visualization workshop held
in Augsburg in 2002. The main authors of each guest chapter (Dianne
Cook, Ed Wegman, Graham Wills, Simon Urbanek, Steve Marron) were
all there, and we are grateful to them both for agreeing to contribute to
our book and for many insightful discussions at the workshop and on
other occasions.
Many people have contributed to our understanding and knowledge of
graphics and data analysis. Discussions at conferences, debates via email,
and, especially, debates about how to analyse particular datasets have all
left lasting impressions. Experimenting with software, our own and that
of many others, has also taught us a great deal. We would like to thank
in no particular order Günther Sawitzki, Fred Eicker, Paul Velleman,
Luke Tierney, Lee Wilkinson, Peter Dirschedl, Rick Becker, Allan Wilks,
Andreas Buja, Debbie Swayne, John Sall, Bob Rodriguez, Rick Wicklin,
Sandy Weisberg, Bill Eddy, Wolfgang Härdle, Junji Nakano, Moon Yul
Huh, JJ Lee, Adi Wilhelm, Daniella DiBenedetto, Paul Campbell, Carlo
Lauro, Roberta Siciliano, Al Inselberg, Peter Huber, Daryl Pregibon,
Steve Eick, Audris Mockus, Michael Friendly, Sigbert Klinke, Matthias
Nagel, Rüdiger Ostermann, Axel Benner, Friedrich Pukelsheim, Chris-
tian Röttger, Thomas Klein, Annerose Zeis, Marion Jackson, Enda Treacy,
David Harmon, Robert Spence, Berwin Turlach, Bill Venables, Brian
Ripley, and too many people associated with R to name individually.
Thanks are also due to former colleagues in the Statistics Department
at Trinity College Dublin for provocative exchanges both over coffee and
long into the night at the Irish Statistics conferences (graphical discus-
sions, though not necessarily about graphics).
Some people read parts (or all!) of the book and made pertinent, help-
ful comments, but we also benefitted from encouraging and constructive
criticism from Springer’s anonymous referees. For help with the proof-
reading we are indebted to Lindy Wolsleben, Estelle Feldman, Veronika
Schuster, and Sandra Schulz. Being students, Veronika and Sandra were
properly respectful and careful. Lindy and Estelle called a spade a spade,
especially when we had called it a spode. Our thanks to all of them; au-
thors need both kinds of help. Needless to say (but we will say it anyway),
any remaining errors are our fault. John Kimmel has been a supportive
editor and it is always a pleasure to chat with him at the Springer stand
at meetings. (We hope that the sales figures for our book will be good
enough that he will still be prepared to talk to us in future.) We would
also like to thank the six anonymous reviewers who gave significant in-
put at various stages of the project. It takes a good editor like John to find
these people.
Graphics research depends on good, if not exceptional, software to
turn visualization ideas into practice. In 1987, one of us (AU) got a small
grant from Apple Computers to write graphics software for teaching. (It
is hard to believe, but I actually wrote some code myself initially. When
I saw how good the students employed on the project were, I vowed
never to write any code again.) In Dublin, it was the students Michael
Lloyd, Graham Wills, and Eoin McDonnell who led the way. In Augsburg,
where the Impressionists’ software packages have been developed, we
have benefitted from the skills of George Hawkins, Stefan Lauer, Heike
Hofmann, Bernd Siegl, Christian Ordelt, Sylvia Winkler, Simon Urbanek,
Klaus Bernt, Claus-Michael Müller, René Keller, Markus Helbig, Tobias
Wichtrey, Alex Gouberman, Sergei Potapov, and Alex Gribov.
In 1999, John Chambers generously donated his ACM award money to
the creation of a special prize for the development of statistical software
by students. Recognition of the important contribution software makes to
progress in research was long overdue, and we are delighted that three of
our students have already won the prize.
Data analysis is a practical science and much of our knowledge and
many of our ideas stem from project collaborations with colleagues in universities
and in firms. It would be impossible to talk about the problems arising
from large datasets, without actually working on problems like these. We
are grateful to our project partners for their cooperation and for actually
sharing their data with us. It is all too easy to forget just how much work
and effort is involved in collecting the data in a dataset.
Finally, as academics, we are grateful to our universities for sup-
porting and encouraging our research, to the Deutsche Forschungsge-
meinschaft for some project support, and, especially, to the Volkswagen
Stiftung, whose initial funding led to the founding of the Department
of Computer Oriented Statistics and Data Analysis in Augsburg, which
brought us all together.
Apple is a registered trademark of Apple Computer, Inc., AT&T is a registered
trademark of AT&T Corp., DataDesk is a registered trademark of Data Descrip-
tion, Inc., IBM is a registered trademark of International Business Machines
Corporation, Inxight TableLens is a registered trademark of Inxight Software,
Inc., JMP, SAS and SAS-Insight are registered trademarks of SAS Institute Inc.
and/or its affiliates, OpenGL is a trademark of Silicon Graphics, Inc., PostScript
is a registered trademark of Adobe Systems Incorporated, S-PLUS is a registered
trademark of the Insightful Corporation, SPSS is a registered trademark of SPSS
Inc., UNIX is a registered trademark of The Open Group, Windows is a registered
trademark of Microsoft Corporation. Other third-party trademarks belong
to their respective owners.
1
Introduction
Antony Unwin
Permit me to add a word upon the meaning of a million, being a
number so enormous as to be difficult to conceive.
Francis Galton, Hereditary Genius p.11
1.1 Introduction
Large datasets are here to stay. The automatic recording of information,
whether supermarket purchase data, internet traffic, weather data or
other scientific observations, has meant that there is no limit to the size
of datasets to be analysed. You might ask, if a dataset is so large, why
not just take a big sample? But samples will not pick out outliers, local
structures, or systematic errors in the data. They will not be large enough
for multivariate categorical analyses. And you have to be careful how you
sample or what subset you use. In Yates’s 1966 Fisher Memorial lecture
(Yates; 1966), he remarked that
Instead, an analysis, often involving elaborate theory and requir-
ing much numerical computation, was made on a particular batch
of data, and conclusions were published. Only when similar anal-
yses were performed on other batches of data was it found that se-
rious contradictions existed between the different batches and that
the original conclusions had to be considerably modified.
Being Yates, he did not shrink from citing an example of a study by
Fisher where this happened.
Of course, it is not always necessary or sensible to analyse all the data
available. A sample of telephone calls would give a good estimate of ex-
pected revenue, whereas records of all telephone calls would be needed
to spot unusual, but rare, patterns (and possible fraud). However, as
datasets grow, so do subsets of datasets. There are increasingly many
applications where methods are needed for coping with larger volumes of
data.
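The telephone example can be made concrete with a little arithmetic. The
following is a minimal sketch, assuming a simple random sample drawn without
replacement; the function name and the figures (a million calls, ten of
them fraudulent) are hypothetical and chosen only for illustration. It
computes the chance that a sample contains none of the rare cases at all.

```python
def prob_sample_misses_all(population, rare, sample):
    """Probability that a simple random sample (without replacement) of size
    `sample` contains none of the `rare` special cases in the population.
    This is the hypergeometric probability of observing zero rare cases."""
    if sample > population - rare:
        return 0.0  # the sample is too large to avoid every rare case
    p = 1.0
    for i in range(rare):
        p *= (population - sample - i) / (population - i)
    return p

# Hypothetical numbers: 1,000,000 calls, 10 of them fraudulent.
for sample_size in (1_000, 10_000, 100_000, 500_000):
    p = prob_sample_misses_all(1_000_000, 10, sample_size)
    print(f"sample of {sample_size:>7,}: P(no rare case seen) = {p:.4f}")
```

Under these assumptions even a large sample has a real chance of
containing no rare case, and when it does contain one there are too few to
show a pattern, although the same sample would estimate average revenue
well.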
The aim of our book is to look at ways of visualizing large datasets,
whether large in numbers of cases or large in numbers of variables or
large in both. Visualization is valuable for exploring data and for pre-
senting information in data, whether global summaries or local patterns
(to use the terminology of Bolton et al.; 2004). Data analysts, statisticians,
computer scientists — indeed anyone who has to explore a large dataset
of their own — should benefit from reading this book.
Why is visualizing data important? Data visualization is good for data
cleaning, for exploring data — as John Tukey put it (Tukey and Tukey;
1985): “There is nothing better than a picture for making you think
of questions you had forgotten to ask (even mentally)” — for identify-
ing trends and clusters, for spotting local patterns, for evaluating mod-
elling output and for presenting results. Visualization is essential for Ex-
ploratory Data Analysis.
Visualization is an important complement to statistical approaches
and is essential for data exploration and understanding, and as Ripley
(2005) says: “Finding ways to visualize datasets can be as important as
ways to analyse them.” There are plenty of graphical displays that work
well for small datasets and that can be found in the commonly available
software packages, but they do not automatically scale up. Dotplots, scat-
terplots, and parallel coordinate plots all suffer from overplotting with
large datasets; just think of drawing a scatterplot of a million points.
The number “a million” is a useful symbolic target, because visualizing
a million cases already raises interesting problems and because, despite
what the Galton quotation introducing the chapter says, a million cases
is a comprehensible size. The UK Breast Cancer study followed a mil-
lion women, there are about one million pitches thrown every five years
of major league baseball, and the television quiz show, popular in many
different countries, asks “Who wants to be a millionaire?” (even though
exchange rate differences mean that a million is not worth the same in
every country).
Graphics have played an important role in the presentation of statisti-
cal data for at least two hundred years, if Playfair (2005) is taken as the
starting point. Sometimes they have been used more, sometimes less, and
often statisticians have taken them very seriously. In 1973, Yates wrote
in his preface to the reissue of Fisher’s Statistical Methods for Research
Workers (Fisher; 1991), referring to the first edition of 1925: “Following
the introductory chapter comes a chapter on diagrams. This must have
been an innovation surprising to statisticians of that time.” Talking about
diagrams at all still seems to surprise some statisticians.
After a period in which graphics tended to be downplayed because
early computers could not produce good graphics, they have become in-
creasingly popular in recent years. Presentation of results is one use of
graphics but they are also valuable for exploring and analysing data.
It is difficult to tell if this use has also increased, but it would be very
surprising if it had not, thanks to the ready availability of easy to use
software. Generally speaking, software for working with data is excellent
for looking at small datasets. Although this is what people mostly need,
it is not always enough. Firms now usually have their own data ware-
house with substantial data holdings. Scientific studies can include exten-
sive amounts of automatically recorded data. There are even many large
datasets that have been made public on the web (e.g., TAO, the Tropical
Atmosphere Ocean Project, or GESIS, German Social Survey Data). All
of this puts heavier demands on software, and both analytic and visual-
ization methods need to be revised and improved to deal with these new
challenges.
You could argue that the challenges of dealing with large datasets are
not new at all. More than a century ago, Francis Galton (1892) thought
about what a million might look like. He describes in his 1869 book,
Hereditary Genius: An Inquiry into its Laws and Consequences, count-
ing leaves in a long avenue to give him a sense of how much a million
must be (p.11):
Accordingly, I fixed upon a tree of average bulk and flower, and
drew imaginary lines — first halving the tree, then quartering, and
so on, until I arrived at a subdivision that was not too large to al-
low of my counting the spikes of flowers it included. I did this with
three different trees, and arrived at pretty much the same result:
as well as I recollect, the three estimates were as nine, ten, and
eleven. Then I counted the trees in the avenue, and, multiplying
all together, I found the spikes to be just about 100,000 in number.
Ever since then, whenever a million is mentioned, I recall the long
perspective of the avenue of Bushey Park, with its stately chest-
nuts clothed from top to bottom with spikes of flowers, bright in the
sunshine, and I imagine a similarly continuous floral band, of ten
miles in length.
That is all very well for visualizing the idea of how much the number a
million might be but does not help to visualize the characteristics of a
population of a million. Being Galton, he had considered that too and in
the same book (p. 28) he included a diagram showing how he imagined
the distribution of the heights of a million men would look (Figure 1.1).
Galton suggested that each man would stand with his back to the wall
and a mark would be made registering his height. In effect, this gives
a jittered dotplot, except that there are so many dots in the centre of
the distribution that there is just a solid black blob. (Incidentally, with
Galton’s explanation few people would have problems understanding the
diagram. Would that all graphics were explained so convincingly.)
Fig. 1.1. Galton’s diagram visualizing a million in 1869.
How would a million be visualized today? If you have ever drawn a
histogram or a scatterplot of a million cases, you know that it is
possible, but that there are problems. The screen resolution of a computer
cannot be high enough to show very small bars in the histogram, and in
regions of high density the scatterplots look like black blobs with huge
numbers of points piled on top of one another. (It is noteworthy, and
useful, that the weaknesses of the two kinds of plot arise at opposite
extremes of the distributional densities.) So what should be visualized?
If the distributional form of the bulk of the data is of interest, then
the histogram will be fine for one-dimensional views (and it may give
some information about outliers too). If individual outliers are of
interest, then the scatterplot will be pretty good (and it will give a
fair bit of distributional information as well). One aim might be
described as global, attempting to summarise the main structure, and the
other as local, attempting to identify individual features. Ideally, both
kinds of plot are needed to satisfy both aims.
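The black blob can be quantified. The sketch below is an illustration
under stated assumptions, not a method taken from the text: it simulates
points from a bivariate standard normal distribution, maps them onto a
hypothetical 400 x 400 pixel plotting area covering [-4, 4) in each
direction, and counts how many points fall on each pixel. The function
name and all parameter values are invented for the example.

```python
import random
from collections import Counter

def pixel_occupancy(n_points, width, height, seed=0):
    """Simulate n_points from a bivariate standard normal, map them onto a
    width x height pixel grid covering [-4, 4) x [-4, 4), and count the
    points landing on each pixel. Points outside the window are dropped."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_points):
        x, y = rng.gauss(0, 1), rng.gauss(0, 1)
        if -4 <= x < 4 and -4 <= y < 4:
            # min() guards against rounding up to the edge of the grid
            col = min(int((x + 4) / 8 * width), width - 1)
            row = min(int((y + 4) / 8 * height), height - 1)
            counts[(col, row)] += 1
    return counts

# 200,000 points keep the run short; a million behaves the same, only worse.
counts = pixel_occupancy(n_points=200_000, width=400, height=400)
plotted = sum(counts.values())
print("points inside the plot window:", plotted)
print("distinct pixels inked:", len(counts))
print("largest pile on one pixel:", max(counts.values()))
print("share of points hidden under another point:",
      round(1 - len(counts) / plotted, 3))
```

Whatever the distribution, a 400 x 400 area has only 160,000 pixels, so a
scatterplot of a million points must hide at least 840,000 of them. The
simulation only shows how unevenly the hiding is spread: heavy piles in the
centre, single visible points in the tails.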
1.2 Data Visualization
Many different words can be used to describe graphic representations of
data, but the overall aim is always to visualize the information in the
data and so the term Data Visualization is the best universal term. Other
terms have different connotations. At a seminar in Munich in 2004 where
researchers from all fields interested in visualization met, one person
thought the word ‘plot’ was being used to describe the story in a statistical
graphic, not the graphic itself. Would that every plot had a good plot!
Deciding on which graphics to use is often a matter of taste. What
one person thinks are good graphics for illustrating information may not
appeal to someone else. It may also happen that different people inter-
pret the same graphic in quite different ways. There are no absolutes like
the 5% significance level for statistical tests (and whatever reservations
there may be about relying on such artificial limits, they are a help in
setting widely accepted standards). Buja et al. (1988) and others (e.g.
Gelman; 2004) have suggested comparing a graphic by eye with sets of
similar graphics generated from an appropriate model. This is too much
effort to be applied in every case but can be useful in certain applica-
tions. Like most graphical methods, it relies primarily on the interocu-
lar trauma test (if it hits you between the eyes, it’s significant). Results
inferred from graphics should always be treated with caution. They are
reassuring if they confirm something already found analytically, but they
should otherwise be regarded as speculative until confirmed by other, ide-
ally statistical, means.
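The idea of comparing a graphic with similar graphics generated from a
model can be sketched in a few lines. What follows is only an
illustration: the normal model, the function name make_lineup, and the
choice of 20 panels are assumptions made for the example, and any
appropriate model could be substituted. The observed data are hidden at a
random position among datasets simulated from the fitted model, and each
panel would then be drawn with the same plot type and axes.

```python
import random
import statistics

def make_lineup(observed, n_panels=20, seed=None):
    """Hide the observed sample among n_panels - 1 samples simulated from a
    normal model fitted to it. Returns (panels, position_of_observed)."""
    rng = random.Random(seed)
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    panels = [[rng.gauss(mu, sigma) for _ in observed]
              for _ in range(n_panels - 1)]
    position = rng.randrange(n_panels)
    panels.insert(position, list(observed))
    return panels, position

# Hypothetical skewed data: would its panel stand out among normal ones?
data_rng = random.Random(1)
observed = [data_rng.expovariate(1.0) for _ in range(200)]
panels, position = make_lineup(observed, n_panels=20, seed=42)

# Draw every panel identically, show them to a viewer who does not know
# `position`, and ask which panel looks different from the rest.
print(len(panels), "panels; the real data are in panel", position + 1)
```

If the data really did come from the model, a viewer would pick the true
panel out of 20 with probability 1/20, which gives the eye something
comparable to the 5% level mentioned above.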
Fig. 1.2. An early graphic of a big dataset: the US Population in 1890 from the
official census report.
Graphical displays have always been important in statistics, although
they are used more by some statisticians than by others. Mostly they have
been used to display data distributions, sometimes of raw data, some-
times of model residuals. A fascinating early example for a large dataset
can be found in the official report on the 11th US census, which has
been published on the web (http://www.census.gov/prod/www/abs/decennial/1890.htm).
It includes a horizontal barchart of the 1890
population of 63 million by State and Territory, Figure 1.2. It is fascinat-
ing to observe that California (the most populous state today) was only
of middling size with a population of just over a million. As computer
power has increased and computer software has become more sophisti-
cated, more complex graphical tools have been developed and now not
only data are displayed but also structures such as networks, graphical
models, or trees.
It is appropriate to talk of visualization and not just of graphics. How-
ever, this does not mean the field of Scientific Visualization, where su-
perbly realistic 3-D graphics are presented, nor does it mean Information
Visualization, which is concerned with the visualization of information
of all kinds (see Card et al.; 1999; Fayyad et al.; 1999; Spence; 2000,
and the annual InfoVis meetings, http://www.infovis.org, for more
details as well as for many intriguing and effective displays). Data Visu-
alization is related to Information Visualization, but there are important
differences. Data Visualization is for exploration, for uncovering informa-
tion, as well as for presenting information. It is certainly a goal of Data
Visualization to present any information in the data, but another goal
is to display the raw data themselves, revealing the inherent variability
and uncertainty. A provocative example of an unusual data visualization
can be found in WordCount (http://www.wordcount.org) by Jonathan
Harris (www.number27.org), 2004, Figure 1.3. The 86,800 different words
that appear at least twice in the British National Corpus (a collection of
texts comprising 100 million words in all) are displayed in order of fre-
quency. The word ‘visualization’ was sought and then centered between
the words one rank more frequent and one rank less frequent. The com-
bination of the detail above and the overview below is very effective.
Fig. 1.3. Where the word visualization comes in frequency of use amongst the texts
of the British National Corpus.
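The data structure behind such a display is simple: count the words, keep
those that occur at least twice, order them by frequency, and look up a
word together with its neighbours in rank. The sketch below is our own
illustration on a toy sentence and makes no claim about how WordCount
itself is implemented; the function name is hypothetical.

```python
from collections import Counter

def rank_with_neighbours(words, target, min_count=2):
    """Rank words by frequency (most frequent first), keeping only those
    seen at least min_count times, and return the target's rank together
    with the words one rank above and one rank below it."""
    counts = Counter(words)
    ranked = [w for w, c in counts.most_common() if c >= min_count]
    i = ranked.index(target)  # raises ValueError if the target is too rare
    before = ranked[i - 1] if i > 0 else None
    after = ranked[i + 1] if i + 1 < len(ranked) else None
    return i + 1, before, after

text = "the cat and the dog and the bird saw the cat".split()
rank, more_frequent, less_frequent = rank_with_neighbours(text, "cat")
print(f"'cat' has rank {rank}, between '{more_frequent}' and '{less_frequent}'")
```

Words with equal counts are kept in the order in which they first appear,
which is one arbitrary way of breaking ties.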
1.3 Research Literature
Research literature on Data Visualization can be found in both statis-
tics and computer science. Statisticians are interested in data description
and displays arising out of modelling data structure, reflecting statistical
properties. Computer scientists’ work is closely tied to information visu-
alization. Unfortunately, as a search of the web reveals, the term Data
Visualization means different things to different people. There is a book
The Art of Data Visualization (Post et al.; 2002), which is about computer
graphics (rendering issues and scientific visualization) rather than sta-
tistical graphics. The company Interactive Data Visualization produces
computer games. Many software packages are available, which are capa-
ble of producing extraordinary, though not necessarily informative, dis-
plays of data.
There are several classic books on statistical graphics and each takes
a different view of the topic. Tufte (1983) is full of good advice and
strong opinions, pointing out the misleading effects of elaboration of
statistical graphics. Cleveland (1994) concentrates on the sensible de-
sign of basic plots. Both these books have many interesting examples.
Bertin (1983) attempts to provide a (French) philosophical approach, of-
fering plenty to think about, though the practical implications are not
always clear. Wilkinson (2005) describes a flexible system for imple-
menting software for drawing statistical graphics. None of these books
discusses the difficulty of visualizing large datasets, and when advice is
given (e.g., in Cleveland, sunflower plots for scatterplots with overlap-
ping points), it is of little practical help. All four authors have contributed
much to our overall grasp of statistical graphics and all give good guid-
ance for datasets that are not very large. Friendly has collected a num-
ber of excellent statistical graphics (and some awful ones to show just
how bad they can get) on his web-based Gallery of Data Visualization
(http://www.math.yorku.ca/SCS/Gallery/). Murrell’s book (Murrell;
2005) describes how to draw all kinds of static graphics using the
software R, but does not make any recommendations as to which are good
or bad, nor does he discuss interpretation of graphics.
Practical statisticians have often emphasised the importance of graph-
ical displays, but until PCs and laser printers became available in the
1980s, it was not easy to produce good graphics with a computer. It is
fascinating to read how attitudes to graphics changed as statistical com-
puting developed. While on the computational side many algorithms be-
came practical for the first time as computers (electronic as distinct from
human) took over the burden of calculation, it was difficult to produce
graphics and the quality of reproduction was poor (cf. Figure 1.5).
There is a thought-provoking paragraph in Yates’s Fisher Memorial
lecture of 1966 on graphics:
Graph plotters are another development which is of great impor-
tance to the statistician. Computers can now readily produce di-
agrammatic representations of numerical material, though few of
these devices are yet available in this country. They may well rev-
olutionise the presentation of much statistical material, but their
proper utilisation will require much hard thinking by statisticians.
Given the graphics displays that are to be found in much of the media
nowadays, it is clear that there is still a lot of hard thinking to be done!
Chambers (1967) (the future designer of S) gave a paper at the first
RSS meeting on statistical computing and posed the question “What do
we need to provide in order to take full advantage of the new computers?
More and better graphical techniques, surely, since these provide a more
condensed summary of information on the spot.” He also remarked in con-
nection with the availability of “new input-output media (e.g., graphical
displays units)” that: “There is great potential power here for data anal-
ysis but statisticians are not yet ready to exploit it.” Some of us might
argue that this situation has not changed a great deal in almost 40 years.
Wegman and Carr (1993) prepared a summarising review on graph-
ics for the Handbook of Statistics. With regard to large datasets, they
emphasised the limits of human perceptual ability and considered the
restrictions this placed on statistical displays.
Huber wrote a review of the 1995 meeting on Massive Datasets (Hu-
ber; 1999) including the following comments on visualization. In his opin-
ion (p. 637) visualization “runs into problems just above medium sets.”
This is reasonable, though pessimistic, as a medium-sized dataset would
have only about 10,000 cases and 10 variables (cf. Table 1.1 later on in
the chapter). Perhaps he was thinking of visualizing all individual cases,
because later on in the same article (p. 646) he wrote “direct visualiza-
tion of larger than medium datasets in their entirety is an unsolved (and
possibly unsolvable) problem.”
The most important comment in his paper concerning visualization is
that “visualization for large data is an oxymoron — the art is to reduce
size before one visualizes. The contradiction (and challenge) is that we
may need to visualize first in order to find out how to reduce size.” It is an
art, because standard summarising displays like histograms or barcharts
may not be appropriate or sufficient for the data in hand. In that case,
micro-level or local visualization of cases may be necessary to suggest
what kind of macro-level or global visualization might work.
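Reducing size before visualizing can be as simple as binning. The sketch
below is an illustration only, with hypothetical names and simulated
standard normal data: it reduces a long stream of raw values to 16 counts,
which are then small enough to draw even as text bars. Values outside the
chosen range are tallied separately, since a summarising display should
not silently lose the cases that a local view might need.

```python
import random

def bin_counts(values, lo, hi, n_bins):
    """Reduce raw values to n_bins counts over [lo, hi). Values outside the
    range are tallied separately so that they are not silently lost."""
    counts = [0] * n_bins
    outside = 0
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v < hi:
            # min() guards against rounding up to n_bins at the top edge
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
        else:
            outside += 1
    return counts, outside

N = 200_000  # kept modest so the sketch runs quickly; try 1_000_000
rng = random.Random(0)
counts, outside = bin_counts((rng.gauss(0, 1) for _ in range(N)), -4.0, 4.0, 16)

# The N raw values are now 16 counts; draw them as scaled text bars.
scale = max(counts) / 50
for i, c in enumerate(counts):
    left = -4.0 + i * 0.5
    print(f"{left:5.1f} | {'#' * round(c / scale)}")
print("values outside [-4, 4):", outside)
```

The choice of range and number of bins is exactly the decision that may
need a first look at the data, which is the contradiction Huber points out.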
There are not yet many papers on graphics for large datasets. One
exception is Eick and Karr (2002), on what the authors call visual scalability.
They outline both the limits of human perception and the limits
of the then current equipment. They demonstrate how multi-windowing,
interaction, and good organisation can go a long way to displaying large
datasets effectively. Another is Ward et al. (2004), which emphasises the
need for multiresolution strategies, that is, representing the data at dif-
ferent levels of abstraction or detail. Some computer scientists have pub-
lished research on visualization of large datasets and have experimented
with novel ideas. Keim (2000), for instance, considers pixel-oriented ways
of encoding multivariate information for large datasets.
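One simple way to represent data at different levels of detail is to keep
binned counts at several resolutions. The sketch below is our own minimal
illustration of that idea and is not the procedure of any of the papers
cited; the function names and the example counts are hypothetical. Each
coarser level is obtained by summing neighbouring pairs of bins, so a
display can switch level when the user zooms without going back to the
raw data.

```python
def coarsen(counts):
    """Halve the resolution by summing neighbouring pairs of bins."""
    if len(counts) % 2:
        counts = counts + [0]  # pad so that every bin has a partner
    return [counts[i] + counts[i + 1] for i in range(0, len(counts), 2)]

def pyramid(counts):
    """Return the counts at every resolution, finest first, down to 1 bin."""
    levels = [list(counts)]
    while len(levels[-1]) > 1:
        levels.append(coarsen(levels[-1]))
    return levels

# Hypothetical fine-level bin counts.
levels = pyramid([3, 9, 27, 41, 38, 22, 7, 1])
for level in levels:
    print(f"{len(level):2d} bins:", level)
```

Every level has the same total, so the levels are consistent summaries of
the same cases at different degrees of abstraction.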
In the United States, the National Visualization and Analytics Cen-
ter (NVAC) has published a book (Thomas and Cook; 2005) on analysing
large datasets for combatting terrorism. They refer (p. 4) to taking ad-
vantage of “the human eye’s broad bandwidth pathway into the mind to
allow users to see, explore, and understand large amounts of information
at once.” They seek new methods using “multiple and large-area computer
displays to assist analysts” (p. 84), a different approach to the one in this
book, which concentrates on improving performance on single screens,
the practical situation most of us face.
1.4 How Large Is a Large Dataset?
What is meant by large, when large datasets are referred to, tends to
change over time and depends on what methods are to be applied to
the data. Computers have more storage and more power, and tasks that
were onerous last year become run-of-the-mill this year. The Lanark-
shire Milk Experiment was a large-scale study from 1930, made famous
amongst statisticians by Student’s devastating criticism (Student; 1931).
Figure 1.4 shows the odd pattern of average growth for the 10,000 school-
girls in the study (there was a similar pattern for the boys). Each child
was measured twice, once in winter and again six months later in sum-
mer. The display shows the averages at each age and links them together
to estimate an average growth curve for girls over time. Such a display
would be easy to produce today, but it must have taken substantial effort
seventy years ago to organise the necessary calculations.
It would be interesting to know what statisticians have thought of as
large over the years, but it turns out to be difficult to pin down. There is
Huber’s (1992) classification, where he divided datasets from tiny up to
huge based on storage space required, see Table 1.1. He estimated that
a large dataset might have a million cases with 10 variables. Wegman
(1995) extended the table by a further factor of 10^2
(“monster” datasets).
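As a rough illustration of how Huber’s scale (Table 1.1) and Wegman’s extension might be applied, the sketch below assigns a dataset to the class whose byte count is nearest on a log10 scale. The function name huber_class, the nearest-exponent rule, and the default of 8 bytes per stored value are assumptions made for this example; they are not part of Huber’s or Wegman’s definitions.

```python
import math

# Size classes as powers of ten in bytes: Huber (1992) up to "huge",
# plus Wegman's (1995) "monster" class two orders of magnitude higher.
SIZE_CLASSES = [(2, "tiny"), (4, "small"), (6, "medium"),
                (8, "large"), (10, "huge"), (12, "monster")]


def huber_class(n_cases, n_vars, bytes_per_value=8):
    """Return the size class nearest (on a log10 scale) to the raw storage needed.

    bytes_per_value=8 is an assumption (one double-precision number per value).
    """
    n_bytes = n_cases * n_vars * bytes_per_value
    if n_bytes <= 0:
        raise ValueError("dataset must contain at least one value")
    log_size = math.log10(n_bytes)
    # Pick the class whose exponent is closest to log10 of the byte count.
    _, name = min(SIZE_CLASSES, key=lambda c: abs(c[0] - log_size))
    return name


print(huber_class(1_000_000, 10))  # a million cases with 10 variables
```

Under these assumptions, a million cases with 10 variables comes out as large, in line with Huber’s estimate.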
Once electronic computers were used more commonly in statistics,
their capabilities started to determine how big the datasets that could
be analysed were. There are several levels of looking at this. Firstly, you
can consider the amount of data that can be stored (which is related to
the Huber scale of dataset sizes and depends on the capacity of the stor-
age media available) and identified (which depends on the software avail-
able). Secondly, you can think about what analyses can be carried out on
[Figure 1.4 graphic: Diagram 4, WEIGHT OF GIRLS. Horizontal axis: Age, from 5 1/2 to 12 1/2; vertical axis: Weight in lbs., from 35 to 70. Plotted series: average weight at commencement of experiment and average weight at end of experiment, for Control (C), Raw milk “feeders” (R), and Pasteurised “feeders” (P).
Numbers in each Group:
C 51 688 718 802 820 729 494
R 16 332 335 414 408 373 261
P 26 353 352 410 406 340 246]
Fig. 1.4. A reproduction of one of Student’s displays from the Lanarkshire Milk
Study. Notice the apparent irregularity of weight increase with age, which drew
attention to some of the problems in the design of the study and in how it was
carried out.
the data. The requirements for some methods obviously grow too fast with
numbers of cases (e.g., hierarchical clustering) and some software needs
all the data in main memory, which automatically makes that a limit-
ing factor. Thirdly, you can demand that analyses be carried out within
an “acceptable” time. For instance, interactive analyses need very fast
response times, much faster than anything dreamed about in the days
when users expected to wait a few hours for their output to turn up.
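To see why memory becomes the limiting factor for a method such as hierarchical clustering, consider the following minimal sketch. It assumes an implementation that holds all pairwise distances in main memory, with 8 bytes per distance; both the function name distance_matrix_bytes and these storage assumptions are illustrative and do not describe any particular package.

```python
def distance_matrix_bytes(n_cases, bytes_per_value=8):
    """Bytes needed to hold all n*(n-1)/2 pairwise distances (assumed layout)."""
    n_pairs = n_cases * (n_cases - 1) // 2
    return n_pairs * bytes_per_value


for n in (1_000, 100_000, 1_000_000):
    gigabytes = distance_matrix_bytes(n) / 1e9
    print(f"{n:>9,} cases: {gigabytes:,.3f} GB")
```

The requirement grows with the square of the number of cases: under these assumptions, a thousand cases need a few megabytes, whereas a million cases would need roughly 4 terabytes.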
A search of the major statistical journals using JSTOR threw up a
number of comments in published papers on the size of datasets, but most
Table 1.1. Huber’s Classification of Dataset Sizes
Size     Description                       Bytes
Tiny     Can be written on a blackboard    10^2
Small    Fits on a few pages               10^4
Medium   Fills a floppy disk               10^6
Large    Fills a tape                      10^8
Huge     Needs many tapes                  10^10
were not quantified. They are listed in Table 1.2 for illustration and some
are discussed in more detail in chronological order.
Table 1.2: Quotations on Dataset Size
Year | Comment | Author
1959 | The phrase “If large-scale storage is not available” implies that a data set of 1,000 cases would have been large. | Harris
1965 | “the analysis of the data recorded by TelStar, an early communications satellite, involved tens of thousands of observations and challenged contemporary computing technology.” | Chambers
1966 | “The need for better editing is well known to those concerned with extensive data sets.” | Yates
1967 | Datasets of “modest bulk”. | Page
1967 | “Vast” datasets. | Gower
1975 | “It is now possible to access large data sets directly from magnetic tape.” | McNeill & Tukey
1978 | For SPSS “any one analytical use of the file is limited to using at most 500 variables”, though up to 5,000 could be loaded in all. | Muller
1981 | “There is now a collection of computer subroutines designed to summarize large data sets in histogram form.” In the example he used, statistics were calculated for 20,000 samples of size 50 and histograms with 800 (!) cells were prepared for each statistic. | Dickey
1981 | Restricted in their analysis at one site because the software there could only handle 88,000 real numbers. | Aitken et al.
1981 | “Substantial” data sets in the census | Kruskal
1982 | Moderate data sets have less than 500 cases and large have more than 2,000 (for linear-logistic models). | Koch
1986 | “... allows even very large data sets to be explored interactively” and referred to a regression data set with 11,000 cases. | Gilks
1986 | “The increased use of computing has in turn increased the importance of developing methods for interpretation of large volumes of data. . . ” | Eddy
1987 | “What is large depends on the frame of reference. If available plotting space for a scatterplot is a one-inch square, 500 points can seem large. For our purposes, N is large if plotting or computation times are long, or if plots can have an extensive amount of overplotting.” At that time, 50,000 points was large (based on rendering time) but the authors pointed out: “The representation of 1 million or more data points in each plot is feasible.” | Carr et al.
1987 | “a moderate amount of data, say several hundred observations.” | Becker et al.
1990 | “A regression model for 5,000 cases with 6 variables would be a high sample size for immediate evaluation (c. 3 seconds), but far too big for even rough bootstrapping (estimated to take an hour).” | Sawitzki
1990 | “Computing plain medians was not feasible because there were nRC = 2,621,400,000 data values in all, which could not be stored in central memory.” | Rousseeuw
1991 | “2% of the total census records is a very large data file”. For a UK population of 65 million this would be about a million cases. | Marsh et al.
1993 | “The data are listed in Good and Gaskins (1980) as a histogram of 172 bins of length 10 MeV constructed from the locations of 25,752 events on a mass-spectrum. For such a huge data set in one dimension...” | Gu
1996 | “huge samples (size 100,000)” and “a large number of groups (100 say)” | Sasieni & Royston
1996 | “large surveys such as the NCVS may have 60,000 or more observations, and only recently has research begun on how to plot massive data sets.” | Fesco et al.
1998 | A “large” data set has 3667 cases. (Scatterplots for surveys) | Korn & Graubard
1999 | “We focus our attention on a very small [sic] part of the information available in these data; namely the birth weight of the 4,017,264 registered singleton births.” | Clemons & Pagano
In 1959, Harris described data plotting with an IBM 650. “With a
fairly small table a 650 might handle up to 1,000 non-negative obser-
vations of not over 5 digits each.” The accompanying phrase “If large-
scale storage is not available” implies that such a dataset of 1,000 cases
would have been large. For its time, the display in Figure 1.5 must have
been impressive. Of course, there is “large” and “large” and according to
Fig. 1.5. Reproduction of Figure 1 of Harris’s (1959) paper. Reprinted with per-
mission from The American Statistician. Copyright 1959 by the American Statis-
tical Association. All rights reserved.
Chambers (1999), “Large-scale applications did exist, even at this time [in
1965]; the analysis of the data recorded by TelStar, an early communica-
tions satellite, involved tens of thousands of observations and challenged
contemporary computing technology.”
In his 1966 Fisher memorial lecture Yates wrote:
As an example of the type of work that can now be readily under-
taken I may instance the analysis of the directional recording of
swell from distant storms involving complicated spectral analyses
of 10^6 automatic recordings from three pressure gauges.
He also noted that “A serious fault of many statistical investigations in
the past has been that all available data bearing on the question at is-
sue were not made use of” — a clear cry for more powerful computing
facilities to enable the analysis of larger datasets.
In 1980, Good and Gaskins published an article on what they called
‘bump-hunting’ and included a plot of the distribution of a dataset with
just over 25,000 cases (Figure 1.6). For the 1980s, Carr et al. (1987) may
sound surprisingly ambitious: “The representation of 1 million or more
data points in each plot is feasible.” In fact, the only surprise is that so
few people have followed up on this work. Figure 1.7 shows only a quar-
ter of a million points, but it is clear that a plot could readily have been
drawn for a million. In 1995, the US National Research Council organised
a workshop on “Massive Data Sets”. Several of the papers were revised
and reviewed a few years later and published in an issue of JCGS (Vol.
8, no. 3). There was some reference to numbers, but, according to Ket-
tenring, the main organiser, in a later paper presented at the Interface
meeting in 2001: “It seemed appropriate to stick with a murky definition
Fig. 1.6. Reproduction of Figure A of Good and Gaskins’s (1980) paper. Reprinted
with permission from The Journal of the American Statistical Association. Copy-
right 1980 by the American Statistical Association. All rights reserved.
Fig. 1.7. Reproduction of Figure 10 of Carr et al.’s (1987) paper plotting 243,800
points from a glass-melter simulation. Reprinted with permission from The Jour-
nal of the American Statistical Association. Copyright 1987 by the American Sta-
tistical Association. All rights reserved.
[of massive].” (Kettenring; 2001) He offered one version of a murky def-
inition as: “A massive data set is one for which the size, heterogeneity,
and general complexity cause serious pain for the analyst(s).” A realistic,
if unattractive, description.
In their paper, Eddy et al. (1999) worked with brain image data,
analysing 2800 slices of 128 × 128 voxels each, making up about 256MB
of raw data. Kahn and Braverman (1999) described climate data being
collected at the rate of 80 gigabytes per day (though they did not claim to
analyse datasets of this size). In a later JCGS paper, Braverman (2002)
discussed the analysis of a subset of 5 million cases for 2 variables, i.e.,
large according to Huber’s table. McIntosh (1999) studied telephone net-
works and was able to store 2 to 4 million messages on 128MB. Four years
later, when he revised his paper for publication in JCGS, his storage limit
had jumped to 55 gigabytes!
It is gratifying to be able to show a plot for a genuinely large dataset
(at least in comparison to most of the datasets used so far). Figure 1.8
displays the distribution of reported birthweights for more than 4 million
children. The curious form is due to rounding. (Perhaps there are more
urgent matters to attend to just after a birth than to record the precise
weight of the baby?) Hand et al. (2000) in their paper on Data Mining
give examples of datasets that are potentially much larger than anything
discussed here (Barclaycard’s 350 million credit card transactions a year,
Fig. 1.8. Reproduction of Figure 2 of Clemons and Pagano’s (1999) paper.
Reprinted with permission from The American Statistician. Copyright 1999 by
the American Statistical Association. All rights reserved.
Wal-Mart’s 7 billion transactions a year in the early 1990s, AT&T’s 70
billion long distance calls annually, reported in 1997). But have these
ever been analysed in full? The examples in the paper are certainly not
for small datasets (e.g., Figure 1.9), but they are not as big as that. Wilks
Fig. 1.9. Reproduction of Figure 2 of Hand et al.’s (2000) paper. (Permission to
reprint was granted by the Institute of Mathematical Statistics)
talked at the 2004 Interface meeting in Baltimore about how AT&T stores
details of all telephone calls on its system currently. At the moment, it is
still quite difficult, but clearly it will become progressively easier. An-
other generator of large datasets is, of course, the internet. Marchette
and Wegman (2004) report that their university network was seeing rates
as high as 8 million packets per hour in 2002. As an example for large
numbers of variables, consider the remark of Kettaneh et al. (2005) con-
cerning chemometrics that “In 1970 the number of variables K (matrix
columns) was large when exceeding 20. Today K is large when exceeding,
say, 100,000 or 1,000,000.”
For (almost) the last quotation, it is interesting to look at an article
earlier than any cited above, a discussion of the analysis of the 1890 cen-
sus in the US by Porter (1891). The 1890 census was the first one to count
with Hollerith machines. Porter wrote “After this the 63,000,000 cards
with their thousand million statements must each pass through the tab-
ulating machine five times.” Few of the other quotations come anywhere
near such numbers! (We are indebted to Günther Sawitzki for suggest-
ing looking at analyses of census data for evidence of working with large
datasets.) See also Figure 1.2.
So we have had large, extensive, modest bulk, vast, substantial, huge,
and massive in referring to dataset size, but rarely, probably wisely, any
statements about what the terms might mean. Despite their desire for
precise data to analyse, statisticians can be just as vague as the next
person when it suits them. A plot of the log of dataset size against date
for the quotations listed in this chapter is shown in Figure 1.10. To finish
off this set of quotations, here is a final one and you might like to guess
from what period it came before checking the reference:
Large data objects will usually be read as values from external files
rather than entered during a session at the keyboard.
Now just how large might that be? (It is actually from the 3rd edition,
1999, of the esteemed Modern Applied Statistics with S-Plus by Venables
and Ripley. In the 4th edition, the beginning of the sentence has been
changed to “For all but the smallest data sets. . . ”.)
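The log-scale comparison of dataset size against date mentioned above can be sketched in a few lines. The values below are a subset of the counts quoted in Table 1.2, chosen for illustration; they are not claimed to be the data behind Figure 1.10, and the function name log10_sizes is an assumption of this example.

```python
import math

# (year, number of cases or data values) taken from quotations in Table 1.2;
# only some entries that state an explicit count are included here.
quoted_sizes = [
    (1959, 1_000),       # Harris
    (1986, 11_000),      # Gilks
    (1987, 50_000),      # Carr et al.
    (1993, 25_752),      # Gu
    (1996, 100_000),     # Sasieni & Royston
    (1999, 4_017_264),   # Clemons & Pagano
]


def log10_sizes(entries):
    """Return (year, log10 of size) pairs, rounded to two decimals."""
    return [(year, round(math.log10(size), 2)) for year, size in entries]


for year, log_size in log10_sizes(quoted_sizes):
    print(year, log_size)
```

On a log10 scale each unit corresponds to a factor of ten, so sizes ranging from a thousand to several million fit comfortably on one axis.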
For the purposes of this book we have taken one million as a guideline,
though that still leaves a lot of flexibility. It could be one million bytes
(medium in Huber’s classification), one million cases, one million vari-
ables, one million combinations, or one million tests. It should be some-
thing “large” at any rate.
1.5 The Effects of Largeness
Largeness comes in different forms and has many different effects. Whereas
some tasks remain easy, others become obstinately difficult. Largeness
is not just an increase in dataset size. Fisher’s remark in his preface
to the first edition of Statistical Methods for Research Workers in 1925
that “The elaborate mechanism built on the theory of infinitely large
samples is not accurate enough for simple laboratory data” might today
be re-expressed as “The elaborate mechanisms of classical statistics for
analysing small samples of simple laboratory data are not enough for
large, complex datasets.”
Areas affected by largeness include storage, data quality, dataset com-
plexity, speed, analyses, and, of course, visualization.
Large will usually mean large rather than LARGE in this
book. It will be assumed that the whole dataset resides on the local hard
disk. Problems of retrieving data from distant databases in real-time do
not occur in this case. This is a restrictive assumption but may be justified
on two grounds: many large datasets are small enough to be handled
locally (storing a million cases on a laptop is no big deal) and methods of
visualizing datasets of this size are still not fully developed.
1.5.1 Storage
The most obvious effect of increasing dataset size is the storage space
needed. Although storage was a major limitation in the past (cf. McNeill
and Tukey; 1975), it is far less so now. AT&T may have organisational
problems in storing the information on all the calls that go through its
networks (cf. Wilks’s talk at the Interface meeting 2004), but AT&T can
do it. For the rest of us, it is a case of being able to put on our laptops
datasets of a size that only specialist installations could handle a few
years ago.
Storing data is one side of the coin and retrieving data is the other.
Modern database systems give intelligent access to large amounts of data,
[Figure 1.10 graphic: Log10 dataset size (axis from 3 to 6) plotted against Year (1960 to 1990); legend: “Large”, Handled, Handled interactively.]
Fig. 1.10. A scatterplot of the dataset sizes reported in the quoted papers.
but how can the data be inspected and reviewed? Summaries and dis-
plays are all very well, but often it is valuable to look at the raw data.
Standard statistics software packages offer spreadsheet-like data tables
to view datasets. Whether the data can be stored locally or not, scrolling
through tables is not effective. Perhaps this is the reason why Excel lim-
its itself to a rather miserly 65,536 rows (although why only at most 256
columns are permitted remains a mystery). Tables are fine for viewing
sections of a dataset, but simple scrolling is no longer a practical naviga-
tional option.
Whereas the question of storing the actual data is one problem, an-
other difficulty can arise when converting data between different appli-
cations with different (file) formats. Most analysis or visualization tools
have their own native storage format or rely on the generality of flat
ASCII files. Efficient import functions are crucial, as converting data can
turn out to be a major bottleneck.
1.5.2 Quality
The larger the dataset the more errors there are. Probably there is a
higher proportion of errors too. As Yates (1966) put it: “The need for better
editing is well known to those concerned with extensive datasets.” What
he meant by “extensive” in 1966 is not recorded, but as an example he
gives a dataset Healy reported on in 1952 with around 4,000 cases of 30
variables each. Ripley (2005) wrote in relation to car insurance proposals
that the more questions potential customers are asked the less reliable
their answers will become. Huber (1994) referred to “the rawness of raw
data” and wrote: “We have found that some of the hardest errors to de-
tect by traditional methods are unsuspected gaps in the data collection
(we usually discovered them serendipitously in the course of graphical
checking [our italics]).”
In addition, the larger the dataset the more coding problems there are
(varieties of missings, and coding variations in general, as the dataset
will often have been compiled from different sources). Data cleaning be-
comes a major issue — or rather it should become a major issue. Although
many organisations go to considerable effort to ensure the quality of their
data, many large datasets are still presented in a lamentable condition,
as if the effort and expense of collecting them justifies neglecting any
kind of serious preparatory cleaning. Of course, data analysts have often
been given unchecked datasets to analyse, but with small datasets it is
a simple job to fix the blatant errors and further errors can be checked
and discussed with the dataset owner. This may sometimes lead to a gain
of background information that might otherwise not have occurred. That
kind of interaction with a domain expert is much more difficult to achieve
when the dataset is large and the result of many individuals’ efforts. No
one person may have a complete overview and you cannot expect to dis-
cuss large numbers of individual cases. What all practising data analysts
agree on is that the proportion of project time spent on data cleaning is
huge. Estimates of 75–90% have been suggested.
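One small first step in cleaning a dataset compiled from several sources is to tabulate the different codes used for missing values. The sketch below is only an illustration: the set MISSING_CODES and the function name profile_column are hypothetical, and which codes really denote “missing” (a value such as 999 may be legitimate) has to be settled with whoever supplied the data.

```python
from collections import Counter

# Hypothetical codes that different sources might use for "missing".
MISSING_CODES = {"", "na", "n/a", "null", ".", "-99", "999"}


def profile_column(values, missing_codes=MISSING_CODES):
    """Count how often each candidate missing code occurs in a column of strings."""
    counts = Counter()
    for raw in values:
        token = raw.strip().lower()  # ignore case and surrounding blanks
        if token in missing_codes:
            counts[token] += 1
    return counts


column = ["12", "NA", " 7", "", "-99", "n/a", "15", "NA"]
print(profile_column(column))
```

A table of this kind for every variable quickly shows whether several conventions have been mixed in one dataset.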
Matters do improve with time in some respects. The famous curiosity
of the Indians and the teenage widows from the 1950 US census (Coale
and Stephan; 1962) would probably have been identified and corrected
much more quickly with modern interactive tools.
1.5.3 Complexity
Largeness may mean more complexity — more variables, more detail
(additional categories, special cases), and more structure (temporal or
spatial components, combinations of relational data tables). Again this
is not so much of a problem with small datasets, where the complexity
will be by definition limited, but becomes a major problem with large
datasets. They will often have special features that do not fit the standard
case by variable matrix structure well-known to statisticians. Datasets
of results from the PISA study comparing schoolchildren’s educational
achievements form an interesting example. The sample design used is
weight-adjusted at many different levels to ensure scientific comparabil-
ity but at the expense of transparency and understanding (Adams and
Wu; 2002).
1.5.4 Speed
Largeness affects speed. It takes longer to read and retrieve data (though
locally stored datasets should not have this problem), and it takes longer
for some calculations to be evaluated. Formerly, the rendering of displays
would also take noticeably longer, though this seems barely an issue
nowadays. But it is not just the computer’s speed that is affected, so is
the user’s. There are likely to be more variables to consider, more dis-
plays to manage, and more results to analyse. More thought is needed
between the steps of an analysis, and more time is needed even just to
locate objects. Locating one variable out of three or finding two cases out
of seventy is easy. When there are two hundred variables and one million
cases, both of these tasks require highly organized data, and software
support to match, to be carried out at all.
Speed is relative. The US Census of 1890 was concerned about the
analysis taking longer than 10 years (the 1880 results were first ready
in 1888). Nowadays, interest lies in results fast enough to be considered
interactive.
1.5.5 Analyses
Largeness implies more analyses and more results. Managing the pro-
cess of analysis successfully becomes a central task in its own right. The
days of typing “Statistics all” and picking up a few pages of computer
output are long gone. Even in 1967, serious analysts (Page et al.; 1967)
were getting bogged down in paper. Healy (p. 136) said: “One final thing I would
press for [. . . ] is more use of graphical output, notably destructible graph-
ical output which does not leave heaps of paper lying around when a job
is finished.” The days of trawling through endless volumes of frequency
tables for every variable and of contingency tables for every pair of vari-
ables are still sadly with us. Automatic filtering and storing of results are
essential first steps to help analysts to concentrate on the important is-
sues that require human input to interpret the results. This kind of work
is currently more likely to be carried out under the name Data Mining
than under the name Statistics.
Data Mining is mainly concerned with the analysis of large datasets
(Hand et al.; 2000, mention several extremely large datasets), although
some Data Mining papers only describe working with relatively small
ones. For instance, the 2003 review of Data Mining software packages
(Haughton et al.; 2003) used two datasets, one with 1,580 cases and
one with just under 20,000 (though it did have 200 variables). Generally
speaking, there is less concern in Data Mining for statistical concepts and
more attention paid to computational efficiency and heuristic approaches.
In this way, Data Mining and Statistics complement each other well.
1.5.6 Displays
Largeness affects screen real estate. The more cases and the more points
or bars there are to be displayed, the larger a window tends to have to be
(cf. Carr et al.; 1987). Visibility is obviously an issue for numbers of cases,
but numbers of variables and numbers of displays are a bigger problem.
The larger the dataset, the more displays are likely to be required, both
for the larger number of variables and for the larger number of subsets
that may need to be examined. More and bigger plots mean that win-
dow design and window management become increasingly important. In
the DDB Needham dataset discussed in Putnam’s (2000) Bowling Alone
book, there are more than 20 variables relating to the number and age
of children in the household. Displaying barcharts of all these variables
simultaneously on a screen of 1024 × 768 pixels would allow a maximum
of about 200 × 180 pixels per plot. Positioning the displays in informa-
tive ways, for instance, placing all the variables on ages of the children
in a line in order, would require additional capabilities. Repositioning, re-
ordering, and resizing graphics is difficult, messy work, so it is only done
if it absolutely has to be. Like many other operations, there is a lot you
might want to do with advantage, if you only had the tools to do it. These
are tasks the computer can do much better than a human — provided
that you can tell it what you want.
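The screen real estate arithmetic for the barchart example can be sketched as follows. The regular grid layout and the function name plot_cell_size are assumptions of this example; the book does not prescribe a particular layout rule.

```python
import math


def plot_cell_size(n_plots, screen_width, screen_height, n_columns):
    """Return (width, height) in pixels of each cell in a regular grid of plots."""
    n_rows = math.ceil(n_plots / n_columns)
    return screen_width // n_columns, screen_height // n_rows


print(plot_cell_size(20, 1024, 768, n_columns=5))
```

For 20 plots in five columns on a 1024 × 768 screen this gives cells of 204 × 192 pixels before any space is set aside for window borders and titles, which is of the same order as the figure quoted above.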
1.5.7 Graphical Formats
Dealing with large datasets is also a challenge for the purely technical
process of plotting the data on the screen and saving the resulting graph-
ics to a file. A scatterplot of Fisher’s Iris Data can be plotted by whatever
graphics engine you like and saved into any file format. This is no longer
true when datasets become really big. Plotting hundreds of thousands
of points on the screen is slow, unless the system can take advantage of
the native routines of the graphics board of the computer, today usually
some kind of OpenGL. This is especially true when using smoothing and
sub-pixel-rendering of the plotted objects.
Similar problems arise when saving a graphic to a file. Object oriented
formats like PS or PDF would be the preferred option, but saving tens of
thousands of polygons of a parallel coordinate plot to a pdf-file makes this
file too big for most of the tools available. At this point, bitmap graphics
formats like PNG or JPEG are a more efficient choice, though they offer
poorer reproduction quality. Much care must then be taken to achieve a
high rendering quality on the screen.
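The trade-off between object-oriented and bitmap formats can be made concrete with a back-of-the-envelope estimate. The sketch assumes one polyline per case in a parallel coordinate plot, about 12 bytes of text per coordinate pair, and an uncompressed bitmap with 3 bytes per pixel; these figures and the function names are illustrative assumptions that ignore compression and file overhead.

```python
def vector_size_bytes(n_cases, n_axes, bytes_per_coordinate_pair=12):
    """Rough size of a vector file storing one polyline per case.

    bytes_per_coordinate_pair is an assumption: two numbers written as text.
    """
    return n_cases * n_axes * bytes_per_coordinate_pair


def bitmap_size_bytes(width, height, bytes_per_pixel=3):
    """Size of an uncompressed RGB bitmap; independent of the number of cases."""
    return width * height * bytes_per_pixel


for n in (1_000, 50_000, 1_000_000):
    print(n, vector_size_bytes(n, n_axes=10), bitmap_size_bytes(1024, 768))
```

The vector estimate grows in proportion to the number of cases, whereas the bitmap size depends only on the pixel dimensions, which is why bitmaps become the more economical choice for very large plots.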
For reasons of size, a few of the graphics reproduced in this book are
“only” bitmap graphics.
1.6 What Is in This Book
This chapter has set the scene for what our book is about. The next chap-
ter defines and discusses standard statistical graphics. Standard means
the plots that are used most frequently in data visualization and that
cover the basic data display requirements: barcharts, histograms, scat-
terplots, and so on. It is helpful to set down what is meant in this book
by these terms, as not everyone agrees on the definitions of the basic
plots. There is a particularly varied set of definitions in use for boxplots
— sometimes more than one can be found in the same software package!
Many plots can be used in data visualization: piecharts, rose diagrams,
starplots, stacked barcharts, to name only a few, but the ideas described
in this book can be applied to all of them just as to the standard displays.
Chapter 3 looks at the issue of upscaling graphics to cope with large
datasets. In general, area plots can be upscaled with minor amendments,
whereas point plots require more substantial changes.
One of the ways of extending what can be achieved with statistical
graphics is to add interaction. In the fourth chapter, several develop-
ments of interactive methods are explained, which improve the capability
of graphics for finding information in large datasets.
After these three chapters on general principles follow chapters on
particular types of graphics for applications. These chapters have been
written by experts in their field. Some have given a complete overview; a
couple have chosen to concentrate on special issues.
Chapter 5 discusses specialist plots for multivariate categorical data:
mosaic plots and their variations. For mosaic plots, features like sorting
and redmarking are most valuable. Chapters 6 and 7 consider multivariate
continuous data, firstly experimenting with pre-calculated videos
to enable selection and linking in grand tours for very large datasets and
secondly looking at parallel coordinate plots for continuous data. For parallel
coordinate plots, sorting is also important, but Chapter 7 concentrates
on transforming link functions between the axes and using saturation
brushing.
The emphasis up till now has been on visualization of cases from large
datasets, but the next two chapters discuss visualization of structures —
networks and trees. Like parallel coordinates, networks are drawn with
many lines, and so an increase in magnitude has a more dramatic effect
on networks than it does on point or area plots. The main issue is not
drawing optimal layouts but drawing informative and acceptable layouts
fast enough to be useful. In particular, this chapter makes clear that hav-
ing to analyse applications with a million nodes is not at all unusual. With
trees, the task is different again. Large datasets do not lead to specially
large trees, but complex datasets may lead to many, many trees, and the
visualization here concentrates on the task of combining and summaris-
ing the information from large numbers of trees. A broad range of innova-
tive displays is introduced for these specialist tasks, though they all have
their origins in existing plots.
There are several applications discussed throughout the book. Chap-
ter 10 looks at a major one that everyone meets in their daily work even if
they are only subliminally aware of it: internet packet data. The problems
of sampling the data to produce representative displays are highlighted
and the aptly named ‘mice and elephants plot’ is shown to have good prop-
erties for uncovering features in the data.
No matter how interesting the papers are that are written by re-
searchers to investigate deep theoretical problems, readers usually ap-
preciate some clear-cut, cogent advice. The final short chapter uses an
example to illustrate how a large dataset can be explored with visual-
ization. If you have an immediate data visualization problem for a large
dataset in front of you, the last chapter might give you some ideas.
1.7 Software
At various points in the book, several different software packages are
used or referred to. Some authors give short lists of tools that implement
the concepts they discuss. The packages used to produce the graphics
in the book (in alphabetical order) were Data Desk, GGobi, ExploreN,
KLIMT, Limn, MANET, Mondrian, R, VizML. Current references to these can
be found on the book’s webpage. Many of the graphics could have been
produced by Excel, JMP, SAS-Insight, SPSS, S-Plus, Xlisp-Stat, or indeed by
many other packages.
It is the ideas and concepts that are important and not the current
implementations and so no specific recommendations are made. Everyone
should use the software that they feel comfortable with and they should
demand some, if not all, of the features expounded in this book. Of course,
experimenting with new tools broadens the horizon and encourages the
generation of new ideas. As always, we should hope that much better
software is just around the corner. Every now and again that turns out to
be true.
1.8 What Is on the Website
Although the book is a stand-alone resource, the accompanying website
will offer additional material and information. The website at
http://www.rosuda.org/GOLD will include:
• pdf-files and code/settings for figures;
• up-to-date links to the software mentioned;
• the most important datasets used in the book;
• errata.
1.8.1 Files and Code for Figures
Visualizing large datasets almost always involves some non-trivial graph-
ics. If possible, the code and settings used to produce the graphics will be
included on the webpage so that readers can replicate them.
Colour printing helps a lot to cope with the complexity of some graph-
ics, but a good choice of colour schemes is important. Not all the book’s
graphics are optimal in this respect. Readers may like to experiment with
redrawing them themselves using other colour palettes.
Many displays in the book would need far more printing space than
the page size allows. Therefore, all graphics of the book can be down-
loaded from the website as pdf-files for closer scrutiny (as long as no copy-
right applies).
1.8.2 Links to Software
The book communicates principles, not details of specific implementa-
tions. Nevertheless, even the best considered principle needs an imple-
mentation in software and evaluation by users. The website will give up-
to-date links to the software used in the book and, for each tool, a list of
the figures that were directly drawn by that software (though in many
cases other software could have been used just as well).
1.8.3 Datasets
The index lists around 20 different datasets used in the book. Some of
them are used only once; others are used throughout the book. Some have
millions of observations; some are smaller. The website will offer links
to all datasets mentioned in the book as far as possible. The following
section briefly describes seven of the datasets.
Small Datasets for Illustration
The first part of the book illustrates many principles that can best be
introduced with datasets of small size.
• Italian Olive Oils
The data consist of 572 samples of Italian olive oils. For each sample,
the contents of the 8 major fatty acids have been determined. The data
are grouped according to a hierarchical classification of 3 regions and
9 areas of Italy.
• Athens Decathlon 2004
For the 28 athletes from 22 countries who completed all 10 disciplines
in the 2004 Olympics decathlon in Athens, the individual results of
the disciplines and the resulting point scores have been recorded.
• 2004 Cars Data
Taken from the website of the Journal of Statistics Education at
http://www.amstat.org/publications/jse/jse_data_archive.html,
this dataset contains information on a number of cars including
horsepower, mileage, weight, and country of origin.
Large Datasets
Large datasets often have a very specific structure. The following four
examples cover a broad range of applications.
• Bowling Alone
This dataset is one used in Robert Putnam’s book Bowling Alone.
The DDB Life Style Survey, available on the web from
http://www.bowlingalone.com, is an annual survey over 24 years of around
3,500 different individuals each year with up to just under 400 pieces
of information per case. With 85,000 cases this means that, ignoring
missing values, there are about 30 million pieces of information in the
dataset.
• Bank Deals
Over two years there were approximately 700,000 transactions car-
ried out for firms by a major bank. The record of each deal included
the amount, the book profit, the type of deal, the location, and vari-
ous other pieces of information. Recordkeeping changed between the
two years so that the data are not fully comparable for the two years.
These data are confidential.
• US Current Population Survey – Census Data
The dataset contains data for the 63,756 families that were inter-
viewed in the March 1995 Current Population Survey (CPS). These
include families with husbands and wives, whether military or non-
military; families with male heads only; families with female heads
only; non-family householders and unrelated individuals are included
too — although such individuals do not constitute “families” in the
strict sense of the word. This dataset also includes families in group
quarters. The source of the data is the March 1995 Current Population
Survey, conducted by the US Bureau of the Census for the Bureau of
Labor Statistics. The data can be found at http://www.stat.ucla.edu/data/.
• Internet Usage Data
These data come from a survey conducted by the Graphics and Visualization
Unit at Georgia Tech from October 10 to November 16, 1997. Details
of the survey are available at
http://www.cc.gatech.edu/gvu/user_surveys/survey-1997-10/. The particular
subset of the survey used here is the “general demographics” of internet
users. The full survey is available from the website above, along with
summaries, tables, and graphs of their analyses.
These data were used in the American Statistical Association Statis-
tical Graphics and Computing Sections 1999 Data Exposition.
1.9 Contributing Authors
Antony Unwin is Professor of Computer Oriented Statistics and Data
Analysis at the University of Augsburg.
Martin Theus is a senior researcher at the University of Augsburg.
Heike Hofmann is Professor at Iowa State University.
Dianne Cook is Professor at Iowa State University. Her work is joint
research with Peter Sutherland, Manuel Suarez, Jing Zhang, Hai-Qing You,
and Hu Lan and was supported by funding from National Science Foundation
grant #9982341.
Leslie Miller is Professor at Iowa State University.
Rida Moustafa completed this work on a postdoctoral appointment at the
Center for Computational Statistics, George Mason University, USA.
His work was funded by the Air Force Office of Scientific Research under
contract F49620-01-1-0274.
Ed Wegman is Director, Center for Computational Statistics, George Ma-
son University, USA. His work was funded by the Office of Naval Re-
search under contract DAAD19-99-1-0314 administered by the Army Re-
search Office, by the Air Force Office of Scientific Research under con-
tract F49620-01-1-0274 and contract DAAD19-01-1-0464, the latter also
administered by the Army Research Office, and finally by the Defense Ad-
vanced Research Projects Agency through cooperative agreement 8105-
48267 with the Johns Hopkins University.
Simon Urbanek is in the research department of AT&T, New Jersey.
Graham Wills is Principal Software Engineer at SPSS, Chicago.
Bárbara González-Arévalo is at the University of Louisiana.
Félix Hernández-Campos is at the University of North Carolina.
Steve Marron is at the University of North Carolina. His research was
partially supported by Cornell University’s College of Engineering Mary
Upson Fund and by NSF Grants DMS-9971649 and DMS-0308331.
Cheolwoo Park is at the University of Florida.
2
Statistical Graphics
Martin Theus
Millions of stars that seemed to speak in fire.
John Masefield, The Wanderer
2.1 Introduction
Statistics has its own basic suite of domain-specific visualization tools.
These statistical graphics can best be classified by the kind of data that
they depict. Statistical data are usually characterized by their scale: nom-
inal, ordinal (which are both categorical) or numerical (which is usually
regarded as continuous). What is most important in distinguishing statis-
tical graphics from other graphics is their universality: statistical graph-
ics are not tailored towards only one specific application but are valid for
any data measured on the appropriate scales.
Depending on the data scale, certain standard graphing techniques
have prevailed. Data on a discrete scale, which represent counts of differ-
ent groups within a dataset, are best represented by areas whose sizes are
proportional to the counts they represent, whereas data on a continuous
scale are usually depicted by a single glyph (i.e., a graphical object, usu-
ally a point) per observation. Exceptions are plots based on summaries of
the data like histograms and boxplots.
Using this classification, the standard statistical graphics are intro-
duced in the following sections along with their most common extensions
and modifications. Obviously the range of statistical graphics is much
broader than the basic plots that can be introduced in this chapter.
2.2 Plots for Categorical Data
Plots for categorical data have not received much attention in the past.
This can be explained by several factors. One reason was that software
was not able to read non-numeric values as class labels, and once the
labels had been converted to numbers, it was quite tempting to treat the
data as numerical. Another reason was the difficulty of representing
multivariate categorical data.
Fig. 2.1. Two barcharts for Division in the Census dataset (divisions: New
England, Middle Atlantic, South Atlantic, East South Central, West South
Central, East North Central, West North Central, Mountain, Pacific).
2.2.1 Barcharts and Spineplots for Univariate Categorical Data
A univariate vector of categorical data can be summarized in a one-
dimensional table. The easiest way to depict this summary is a barchart,
where the area of the bar represents the count for its category. For ease
of comparison, all bars are given equal width, so that their heights rep-
resent the counts as well. (The bars can also be drawn horizontally with
equal heights but different lengths.)
Figure 2.1 shows an example of two barcharts. In the plot on the left
the bars are plotted horizontally, and on the right the bars are drawn
vertically. Whereas the vertical plot follows the more natural stacking
principle, the horizontal one is better for labelling the graphic.
Spineplots
In many situations it is desirable to look at the distribution of a sub-
group of a categorical variable. Going back to the example of Figure 2.1,
it might be interesting to know the proportions of females in the different
divisions. Figure 2.2 shows two examples of how this extra information
may be included in a barchart. The left barchart just adds a barchart for
females within the same scale of the barchart for the complete sample.
The right plot is what most spreadsheet-like applications do: separate
barcharts for each group are plotted side by side for all divisions. This
plot is much harder to read, as there are twice as many bars as before.
However, the left plot does not make comparing the proportions of females
across the different divisions easy. Each bar of a subgroup must
be compared to the bar of the complete sample, which is visually not a
straightforward task. A solution for this problem is the spineplot. In a
spineplot all bars have equal length, but proportional width, which retains
the proportionality of the area. As the length of each bar is now
standardized to be 100%, the highlighting can be drawn from 0% to the
proportion of the subgroup in a particular bar. Figure 2.3 illustrates a
highlighted barchart and spineplot. Whereas the barchart enables a good
comparison of the absolute counts of the subgroup, the spineplot enables
a direct comparison of the proportions.
Fig. 2.2. Two barcharts showing the proportion of females (red highlighting) in
each Division in the Census dataset.
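The spineplot geometry described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used by any package named in the book: the function name spineplot_bars and the counts are hypothetical and are not taken from the Census data. Each bar receives a width proportional to its category count, and the highlighted part runs from 0% up to the subgroup's share within that category.

```python
def spineplot_bars(counts, highlighted):
    """Return (category, width_share, highlight_share) for each bar.

    counts:      dict mapping category -> total count
    highlighted: dict mapping category -> count in the selected subgroup
    """
    total = sum(counts.values())
    bars = []
    for category, n in counts.items():
        width = n / total  # bar width is proportional to the count
        # every bar has the same length (100%); highlighting fills this share
        share = highlighted.get(category, 0) / n if n else 0.0
        bars.append((category, width, share))
    return bars


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
counts = {"A": 50, "B": 30, "C": 20}
females = {"A": 10, "B": 15, "C": 10}
bars = spineplot_bars(counts, females)
for category, width, share in bars:
    print(f"{category}: width={width:.2f}, highlighted={share:.0%}")
```

Because all bars share the same length, the highlighted shares can be compared directly across categories, which is the property the spineplot is meant to provide.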
2.2.2 Mosaic Plots for Multi-dimensional Categorical Data
Highlighted spineplots already incorporate information of more than just
one categorical variable in a single plot. Mosaic plots, which were first
introduced by Hartigan and Kleiner (1981), are designed to show the
dependencies and interactions between multiple categorical variables in one
plot. Figure 2.4 shows a barchart and spineplot for marital status with
female heads of household selected. A spineplot can be regarded as a kind of
one-dimensional mosaic plot. Figure 2.5 shows the corresponding mosaic
plot for the data in Figure 2.4. In contrast with a barchart, where the bars
are aligned to an axis, the mosaic plot uses a rectangular region, which is
subdivided into tiles according to the numbers of observations falling into
the different classes. This subdivision is done recursively, or in statistical
terms conditionally, as more variables are included. Figure 2.6 shows
an example of a mosaic plot for the US-Census data. Starting with the
variable Marital Status, the initial rectangle (i.e. the complete square) is
divided along x according to the classes of this variable (cf. Figure 2.6,
left). In the second step, the variable Education, summarized as a binary
response of college education or higher and high school or less, is
incorporated. This is done by dividing each bar or tile of Marital Status
according to Education along y. The plot (Figure 2.6, right) shows
that the groups of people who are married or were never married have
approximately equivalent education levels. Only about one fourth of all
widowed heads of household have some college education. The third
variable in the plot is Gender, which is shown as horizontal highlighting in
all combinations of Marital Status and Education. This plot shows that
Gender and Education are independent, given any class of Marital Status.
At this point it is easier to use highlighting instead of including the
gender information via a third variable as the next split along x.
Fig. 2.3. A spineplot (right) allows the comparison of proportions across categories.
Fig. 2.4. A barchart and spineplot for marital status with female heads of
household selected.
Fig. 2.5. A one-dimensional mosaic plot for marital status.
Fig. 2.6. Development of a mosaic plot including Marital Status and Education
(all females are highlighted).
A detailed discussion on how to detect dependence structures in mo-
saic plots can be found in Theus and Lauer (1999).
The recursive construction of a mosaic plot means that the only limit
for the number of variables included is the number of tiles to display, i.e.
the number of possible combinations of the variables. Labeling of mosaic
plots is not an easy task even with relatively few variables. Usually it
is possible to label the first 2 to 3 variables in the plot; all further
information should be provided by interactive queries (cf. Chapter 4). If
interactive queries are not available, the following strategy has proved to be
helpful. Variables with only few categories should be put in the plot first,
to keep the number of conditioned groups small. If one of the variables in
the plot is a binary response, showing this variable via highlighting will
reduce the number of tiles by half.
Note that the gaps between the tiles are not part of the rectangular
region that is used to build the tiles. The gaps are there to improve visual
discrimination.
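The recursive, conditional subdivision behind a mosaic plot can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of any particular package: the function name mosaic_tiles, the variable names, and the eight observations are hypothetical. The sketch alternates splits along x and y, orders categories alphabetically, and omits the gaps between tiles, which the text notes are not part of the rectangular region.

```python
from collections import Counter


def mosaic_tiles(rows, variables, rect=(0.0, 0.0, 1.0, 1.0), depth=0, key=()):
    """Recursively split rect = (x, y, width, height) by each variable in turn.

    rows:      list of dicts, one per observation
    variables: variable names in split order
    Splits run along x at even depth and along y at odd depth.
    Returns a dict mapping category combinations to tile rectangles.
    """
    if depth == len(variables) or not rows:
        return {key: rect}
    var = variables[depth]
    counts = Counter(row[var] for row in rows)
    x, y, w, h = rect
    tiles = {}
    offset = 0.0
    for level in sorted(counts):
        share = counts[level] / len(rows)  # proportion conditional on the parent tile
        if depth % 2 == 0:
            child = (x + offset * w, y, share * w, h)
        else:
            child = (x, y + offset * h, w, share * h)
        subset = [row for row in rows if row[var] == level]
        tiles.update(mosaic_tiles(subset, variables, child, depth + 1, key + (level,)))
        offset += share
    return tiles


# Hypothetical observations, not the Census data.
rows = (
    [{"marital": "Married", "education": "College"}] * 3
    + [{"marital": "Married", "education": "HighSchool"}] * 3
    + [{"marital": "Widowed", "education": "College"}]
    + [{"marital": "Widowed", "education": "HighSchool"}]
)
tiles = mosaic_tiles(rows, ["marital", "education"])
for combination, tile in tiles.items():
    print(combination, tile)
```

The number of tiles equals the number of observed category combinations, which is why putting variables with few categories first keeps the display manageable.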
2.3 Plots for Continuous Data
The most commonly used plots for continuous data are dotplots, box-
plots, and histograms for one-dimensional data and scatterplots for two-
dimensional data. Methods and plots for higher dimensions of continuous
data include parallel coordinates and the Grand Tour.
2.3.1 Dotplots, Boxplots, and Histograms
The simplest way to plot univariate continuous data is a dotplot. Because
the points are distributed along only one axis, overplotting is a serious
problem, no matter how small the sample is. The usual technique to avoid
overplotting is jittering, i.e., the data are randomly spread along a virtual
second axis. Figure 2.7 shows an example of a jittered dotplot for the
variable Horsepower of the Cars2004 data. Although jittering was used,
the overplotting still masks parts of the distribution, and a quantification
of the density is not easily possible. Nevertheless, the accumulation of
cases around 300 hp and the gap between 400 hp and 450 hp are clearly
visible, which might not be the case in other plots.
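Jittering as described above amounts to pairing each value with a random offset on a second axis that carries no information. The following Python sketch shows the idea; the function name jittered_dotplot, the uniform distribution of the offsets, and the horsepower values are assumptions made for illustration and are not taken from the Cars2004 data.

```python
import random


def jittered_dotplot(values, height=1.0, seed=0):
    """Pair each value with a random offset on a virtual second axis."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    return [(v, rng.uniform(0.0, height)) for v in values]


# Hypothetical values with ties, which would overplot without jittering.
horsepower = [200, 200, 200, 300, 300, 305]
points = jittered_dotplot(horsepower)
for x, y in points:
    print(f"x={x}, jitter={y:.3f}")
```

The data axis is left unchanged, so tied values are separated only along the virtual axis.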
Histograms use area to represent counts of a distribution. This makes
them somewhat related to barcharts and mosaic plots, although the number
or the width of the bins of a histogram is not determined a priori and
the bins are drawn without gaps between them, reflecting the continuous
scale of the data. Whereas barcharts and mosaic plots show the exact
distribution of the sample, a histogram is always just one approximation to
the distribution of the data. Sometimes histograms are also used as crude
density estimators for some “true”, but usually unknown, underlying
distribution for the data. There are much better density estimation methods
that produce smooth distribution displays. Averaged Shifted Histograms
(ASH) (Scott; 1992) are one of them; see also Section 6.2. Figure 2.8
shows a histogram for the same data as in Figure 2.7. The histogram
gives a much better impression of the distribution than the dotplot in
Figure 2.7 does. A good approximation to the shape of the data distribution
depends strongly on the chosen anchorpoint, i.e., the point where the
first bin starts, and the binwidth, i.e., the width of the intervals used to
draw the histogram. For the example, the accumulation of cases at about
300 hp is not visible in Figure 2.8. Changing the anchorpoint from 50 to
70 and the binwidth from 50 to just 20 gives the plot in Figure 2.9, which
now shows all the features found in the dotplot in Figure 2.7. Being able
to change these plot parameters quickly is a typical interactive feature,
which will be discussed in more detail in Chapter 4.
Fig. 2.7. A jittered dotplot of Horsepower for the Cars2004 data.
Fig. 2.8. A default histogram of Horsepower for the Cars2004 data.
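How the anchorpoint and binwidth determine the bins can be made concrete with a small Python sketch. It is an illustration under stated assumptions: the function name histogram_counts and the horsepower values are hypothetical (not the Cars2004 data), and bins are taken to be closed on the left and open on the right, which is one common convention rather than something the text specifies.

```python
import math
from collections import Counter


def histogram_counts(values, anchor, binwidth):
    """Count values per bin [anchor + i*binwidth, anchor + (i+1)*binwidth)."""
    counts = Counter()
    for v in values:
        if v < anchor:
            raise ValueError(f"{v} lies below the anchorpoint {anchor}")
        index = math.floor((v - anchor) / binwidth)
        counts[anchor + index * binwidth] += 1  # key is the left edge of the bin
    return dict(sorted(counts.items()))


# Hypothetical values with a cluster near 300.
horsepower = [103, 115, 140, 165, 210, 225, 295, 300, 302, 305, 340, 493]
coarse = histogram_counts(horsepower, anchor=50, binwidth=50)
fine = histogram_counts(horsepower, anchor=70, binwidth=20)
print(coarse)
print(fine)
```

With the coarse setting the cluster near 300 is split by the bin boundary at 300 and merged with the value 340; with the finer setting the four clustered values fall into one narrow bin starting at 290.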
Boxplots are another popular alternative for univariate continuous
data. Boxplots use a mixture of summary information (as histograms do)
and information on individual points (as dotplots do). The center line in a
boxplot is marked by the median $\tilde{x}_{0.5}$ of the sample. Upper and lower ends
of the box are determined by the upper and lower hinges $x_u$ and $x_l$, which
are the medians of the upper and lower halves of the sample divided at
the median. The dotted lines, the so-called “whiskers”, are drawn
from the box up to the most extreme point that is not further away
from the hinge than 1.5 times the h-spread (defined as the distance between
the hinges, i.e., $x_u - x_l$). All points further away from the center are
marked as outliers by a single dot. If points lie more than 3 times the
h-spread away from the hinge, they are marked as extreme outliers. The
virtual thresholds, the hinges extended outward by 1.5 or 3 times the
h-spread, are called inner fences and outer fences.¹
Fig. 2.9. A histogram of Horsepower for the Cars2004 data, with anchorpoint 70
and binwidth 20.
Fig. 2.10. A boxplot of Horsepower for the Cars2004 data.
Figure 2.10 shows the data from Figures 2.7 and 2.8 in a boxplot. The
boxplot can neither show the second mode around 300 hp nor any gaps
in the data that are not between outliers. Boxplots are very good for com-
paring distributions because they take up little space and can be drawn
parallel to one another.
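The boxplot quantities defined above (hinges, h-spread, fences, whiskers, outliers) can be computed with a short Python sketch. It rests on one assumption that the text does not state: for an odd sample size the median is counted in both halves when the hinges are computed, which is the convention usually attributed to Tukey. The function name tukey_boxplot and the sample values are hypothetical.

```python
from statistics import median


def tukey_boxplot(values):
    """Compute hinges, whisker ends, outliers and extreme outliers."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2  # assumption: for odd n the median belongs to both halves
    lower, upper = median(xs[:half]), median(xs[n - half:])
    h_spread = upper - lower
    inner = (lower - 1.5 * h_spread, upper + 1.5 * h_spread)  # inner fences
    outer = (lower - 3.0 * h_spread, upper + 3.0 * h_spread)  # outer fences
    inside = [x for x in xs if inner[0] <= x <= inner[1]]
    beyond_inner = [x for x in xs if x < inner[0] or x > inner[1]]
    return {
        "median": median(xs),
        "hinges": (lower, upper),
        "whiskers": (inside[0], inside[-1]),  # most extreme points within the inner fences
        "outliers": [x for x in beyond_inner if outer[0] <= x <= outer[1]],
        "extreme_outliers": [x for x in beyond_inner if x < outer[0] or x > outer[1]],
    }


# Hypothetical sample with one outlier and one extreme outlier.
summary = tukey_boxplot([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40])
print(summary)
```

For this sample the hinges are 3.5 and 8.5, so the h-spread is 5, the inner fences lie at -4 and 16, and the outer fences at -11.5 and 23.5.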
To highlight a subgroup of the data in a boxplot, the boxplot for all
data is often modified as shown in Figure 2.11. In the base boxplot show-
ing all data, the whiskers are drawn as light-gray boxes, which allows
plotting of a standard boxplot for the highlighted data on top of the base
boxplot for all data.
Table 2.1 gives an overview of the strengths and weaknesses of the
three plots.
Fig. 2.11. A boxplot of Horsepower for the Cars2004 data, with all 4-cylinder cars
highlighted.
¹ Many statistical software packages do not stick to Tukey’s original definition,
which can make boxplots confusing to interpret.
Table 2.1. A Comparison of the Strengths (+) and Weaknesses (–) of Plots for
Univariate Continuous Data (‘o’ Means Neither Strength nor Weakness)

                                          Dotplot   Histogram   Boxplot
Visualizing the shape of a distribution      –          +          o
Detection of outliers                        +          –          +
Inspection of gaps, discreteness             +          o          –
Size of the sample                           o          o          –
Comparison of distributions                  –          o          +
2.3.2 Scatterplots, Parallel Coordinates, and the Grand Tour
Except for the dotplot, none of the plots presented in the last section has
a natural generalization to more than one dimension. The scatterplot is
the natural counterpart of a dotplot in two dimensions — the 3-D rotating
plot in three dimensions.
Scatterplots and Scatterplot Matrices
The scatterplot is the ideal plot to display the structure of two-dimensional
continuous data. In a scatterplot, two variables are plotted in a cartesian,
i.e., orthogonal coordinate system.
Higher-dimensional structures are often depicted in a so-called scatterplot
matrix or SPLOM. In a SPLOM for k variables, $\binom{k}{2}$ scatterplots
are plotted to display the $k(k-1)/2$ bivariate relationships of all k variables.
The scatterplot of variable i vs. variable j is plotted in the upper
triangle matrix of size k−1 at position (i, j). Figure 2.12 gives an example
of a scatterplot matrix for 5 variables. The 10 scatterplots are arranged
to accommodate all 10 pairs of the 5 variables. Often a univariate plot
of variable i is plotted at position (i, i) in the plot matrix. In Figure 2.12
only the names of the variables are noted on the diagonal, and the 10
transposed plots are shown in the lower triangle matrix.
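The count of panels in a scatterplot matrix can be checked with a short Python sketch that enumerates the upper-triangle positions. The function name splom_positions and the five variable names are hypothetical; the figure's actual variables are not listed in the text.

```python
from itertools import combinations
from math import comb


def splom_positions(names):
    """Map each upper-triangle position (i, j), i < j, to its variable pair."""
    return {
        (i, j): (names[i], names[j])
        for i, j in combinations(range(len(names)), 2)
    }


# Hypothetical variable names for a 5-variable example.
names = ["v1", "v2", "v3", "v4", "v5"]
panels = splom_positions(names)
k = len(names)
print(len(panels), comb(k, 2), k * (k - 1) // 2)
```

Swapping i and j in each key gives the positions of the transposed plots in the lower triangle.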
The natural generalization of a two-dimensional scatterplot is the
three-dimensional rotating plot, which gives a three-dimensional view of
continuous data. The pseudo rotation is generated by a rapid succession
of two-dimensional projections with smoothly changing projection angles.
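The pseudo rotation can be sketched as a sequence of projections. The Python example below is an illustration under assumptions the text does not fix: the rotation is about the vertical (y) axis, the projection is orthographic (the depth coordinate is simply dropped), and the angle advances in steps of 5 degrees. The function name project and the three points are hypothetical.

```python
import math


def project(points, angle):
    """Rotate 3-D points about the y axis by angle (radians), then drop depth."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * z, y) for x, y, z in points]


# Hypothetical 3-D data points.
points = [(1.0, 2.0, 3.0), (0.0, 1.0, 0.0), (-1.0, 0.5, 2.0)]

# One 2-D frame per projection angle; shown in rapid succession they suggest rotation.
frames = [project(points, math.radians(a)) for a in range(0, 360, 5)]
print(len(frames), frames[0][0])
```

Each frame is an ordinary two-dimensional scatterplot, so the smaller the step between angles, the smoother the apparent rotation.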
The Grand Tour
So far, all plots have been rendered on a two-dimensional medium, a sheet
of paper or a computer screen. Even a 3-D rotating plot is “just” a 2-D
out of its mouth the length of its body, and instantaneously catching
the fly, it would go back like a spring. They will drink mutton broth:
how I came to know this was, one day having a plate of broth and
rice on the table where it was: it went to the plate and got half into
it, and began drinking, and trying to take up some of the rice, by
pushing it with its mouth towards the side of the plate, which kept it
from moving, and in a very awkward way taking it into its mouth.
In the autumn of 1868, a pair of Chameleons, in the possession of
the Hon. Lady Cust, of Leasowe Castle, Cheshire, produced nine
active young ones, like little alligators, less than an inch long. Such a
birth has been, it is believed, very rare in this country. It was
remarked, in the above case, that the male and female appeared
altogether indifferent about their progeny.
Whatever may be the cause, the fact seems to be certain, that the
Chameleon has an antipathy to objects of a black colour. One, which
Forbes kept, uniformly avoided a black board which was hung up in
the chamber; and, what is most remarkable, when the Chameleon
was held forcibly before the black board, it trembled violently and
assumed a black colour. [25]
It may be something of the same kind which makes Bulls and
Turkey-cocks dislike the colour of scarlet, a fact of which there can
be no doubt.
FOOTNOTES:
[24] The Fables of John Gay. Illustrated. With Original Memoir,
Introduction, and Annotations. By Octavius Freire Owen, M.A.,
F.S.A. 1854.
[25] This, it will be seen by referring to page 307, does not
correspond with Calmet's statement.
RUNNING TOADS.
THAT the Toad, by common repute ugly and venomous,
should be made a parlour pet, is passing strange; yet such is
the case, and we find in a letter from Dr. Husenbeth, of
Cossey, the following curious instances. Thus he describes a species,
there often met with, the eyes of which have the pupil surrounded
with bright golden-yellow, whereas in the common toad the circle is
red or orange. This remarkable peculiarity Dr. H. has not seen
anywhere noticed. The head is like that of the common sort, but
much more blunt, and rounded off at the nose and mouth, and the
arches over the eyes are more prominent. The most remarkable
difference is a line of yellow running all down the back. Also down
each side this Toad has a row of red pimples, like small beads, which
are tolerably regular, but appear more in some specimens than in
others. The general colour is a yellowish-olive, but the animal is
beautifully marked with black spots, very regularly disposed, and
exactly corresponding on each side of the yellow line down the back.
Like all other Toads, this one occasionally changes its colour,
becoming more brown, or ash-colour, or reddish at times, probably
in certain states of the weather. This species is much more active
than the common Toad. It never leaps, and very seldom crawls, but
makes a short run, stops a little, and then runs on again. If
frightened or pursued, it will run along much quicker than one would
suppose.
During the previous summer Dr. H. kept three Toads of this kind in
succession. The first (says Dr. H.) I procured in July; but after a few
days, when I let him have a run on the carpet of my parlour, he got
into a hole in a corner of the floor, of which I was not aware, and
fell, as I suppose, underneath the floor, into the hollow space below.
I concluded that he could never get up again, and gave him up to
his fate. I then began to keep another Running Toad, which fed well
at first, but after three weeks refused food, and evidently wasted; so
I turned him out into the garden, and have not met with him since.
After more than three weeks, the former Toad reappeared, but how
he came up from beneath the floor I never could conceive, or how
he had picked up a living in the meantime. He was, however, in good
condition, and seemed to have lived well, probably on spiders and
woodlice. He had been seen by a servant running about the carpet,
but I knew nothing of his having come forth again, till in the
evening, when he had got near the door, and it was suddenly
opened so as to pass over the poor creature, and crush it terribly. I
took it up apparently dead. It showed no sign of life; the eyes were
closed, it did not breathe, and the backbone seemed quite broken,
and the animal was crushed almost flat. I found a very curious milky
secretion exuding from it, where it had been most injured and the
skin was most broken. This was perfectly white, and had exactly the
appearance of milk thrown over the toad. It did not bleed, though
much lacerated; but instead of blood appeared this milky fluid, which
had an odour of a most singular kind, different from anything I ever
smelt. It is impossible to describe it. It was not fetid, but of a sickly,
disgusting, and overpowering character, so that I could not endure
to inhale it for a moment. I had read and seen a good deal of the
extraordinary powers of revivification in toads, but was not prepared
for what I witnessed on this occasion. I laid this poor animal,
crushed, flattened, motionless, and to all appearance dead, upon a
cold iron plate of the fireplace. He fell over on one side, and showed
no sign of life for a full hour. After that he had slightly moved one
leg, and so remained for about another half-hour. Then he began to
breathe feebly, and gathered up his legs, and his back began to rise
up into its usual form. In about two hours from the time of the
accident, he had so far recovered as to crawl about, though with
difficulty. The milky liquor was reabsorbed, and gradually
disappeared as the toad recovered. The next morning it was all
gone, and no mark of injury could be seen, except a small hole in his
back, which soon closed. He recovered so far as to move about
pretty well, but his back appeared to have been broken, and one
fore-leg crippled. I therefore thought it best to give him his liberty in
the garden. But so wonderful and speedy a recovery I could never
have believed without ocular testimony.
I then tried my third and last Running Toad. I began to keep him on
Sept. 13th. He was a very fine specimen, and larger than the two
former. He fed well, and amused me exceedingly. He was very tame,
and would sit on my hand quite quiet, and enjoy my stroking him
gently down his head and back. Soon after I got him he began to
cast his skin. I helped him to get rid of it by stripping it down each
side, which he seemed to like much, and sat very quiet during the
operation. The new skin was quite beautiful, and shone as if
varnished. This Toad lived in a crystal palace, or glass jar, where I
had kept all the others before him. He took food freely, and his
appetite was so good that in one day he ate seven large flies and
three bees without stings. He was particularly fond of woodlice and
earwigs, but would take centipedes, moths, and even butterflies.
Being more active than common Toads, he often made great efforts
to get out of his glass jar. I used to let him run about the room
nearly every day for a short time, and often treated him to a run in
the garden. Toads make a slight noise sometimes in the evenings,
uttering a short sound like 'coo,' but I never heard them croak.
Before wet weather, and during its continuance, my Toad was
disinclined for food, and took no notice of flies even walking over his
nose. He would then burrow and hide himself in the moss at the
bottom of his glass palace. Thus I kept him, and found him very
tame and amusing. But after about two months he became more
impatient of confinement, and refused to take any food. I did not
perceive that he fell away, though his feet and toes turned of a dark
colour, which I knew was a sign of being out of condition; and, on
the 10th of November, I found him dead. I have now tried three of
this sort, and have come to the conclusion that the Running Toad
will not live in captivity. This I much regret, as its habits are
interesting, and its ways very amusing.
F. C. Husenbeth, D.D.
FROG AND TOAD CONCERTS.
It would be hard to believe the stories of the vocal powers of Frogs
and Toads were they not related by trustworthy travellers, who tell
of animal concerts,
Wild as the marsh, and tuneful as the harp.
Mr. Priest, the traveller in America, who was himself a musician,
records:—Prepared as I was to hear something extraordinary from
these animals, I confess the first Frog Concert I heard in America
was so much beyond anything I could conceive of the power of
these musicians, that I was truly astonished. This performance was
al fresco, and took place on the eighteenth of April, in a large
swamp, where there were at least 10,000 performers; and I really
believe not two exactly in the same pitch, if the octave can possibly
admit of so many divisions, or shades of semitones.
Professor and Mrs. Louis Agassiz, in their recent Journey in Brazil,
record:—We must not leave Parà without alluding to our evening
concerts from the adjoining woods and swamps. When I first heard
this strange confusion of sounds, I thought it came from a crowd of
men shouting loudly, though at a little distance. To my surprise, I
found that the rioters were the frogs and toads in the
neighbourhood. I hardly know how to describe this Babel of
woodland noises; and, if I could do it justice, I am afraid my account
would hardly be believed. At moments it seems like the barking of
dogs, then like the calling of many voices on different keys; but all
loud, rapid, excited, full of emphasis and variety. I think these frogs,
like ours, must be silent at certain seasons of the year, for on our
first visit to Parà we were not struck by this singular music, with
which the woods now resound at nightfall.
SONG OF THE CICADA.
THE Greeks have been scoffed at for rendering in deathless
verse the song of so insignificant an insect as the Cicada; and
hence it has been asserted that their love for such slender
music must have been either exaggerated or simulated. It is
pleasant, however, to hear an independent observer in the other
hemisphere confirm their testimony. Mr. Lord tells us that in British
Columbia there is one sound or song which is clearer, shriller, and
more singularly tuneful than any other. It never appears to cease,
and it comes from everywhere—from the tops of the trees, from the
trembling leaves of the cotton-wood, from the stunted under-brush,
from the flowers, the grass, the rocks and boulders—nay, the very
stream itself seems vocal with hidden minstrels, all chanting the
same refrain.
An especial feature of the Cicada's song is, that it increases in
intensity when the sun is hottest; and one of the later Latin poets
mentions the time when its music is at its highest, as an alternative
expression for noon. Mr. Tennyson, inadvertently, speaks in Œnone
of the Grasshopper being silent in the grass, and of the Cicada
sleeping when the noonday quiet holds the hill. Keats sings more
truly:—
When all the birds are faint with the hot sun,
And hide in cooling trees, a voice will run
From hedge to hedge about the new-mown mead:
That is the Grasshopper's.
Then the Greek poets show us how intimately the song of the Cicada
is associated with the hottest hours of the day. Aristophanes
describes it as mad for the love of the sun; and Theocritus, as
scorched by the sun. When all things are parched with the heat
(says Alcæus), then from among the leaves issues the song of the
sweet Cicada. His shrill melody is heard in the full glow of noontide,
and the vertical rays of a torrid sun fire him to sing. Over and over
again Mr. Lord met with allusions to the same peculiarity.
Cicadæ are regularly sold for food in the markets of South America.
They are not eaten now, as they were at Athens, as a whet to the
appetite; but they are dried in the sun, powdered, and made into a
cake.
STORIES ABOUT THE BARNACLE GOOSE.
As barnacles turn Poland geese
In th' islands of the Orcades.
—Hudibras.
ONE of the earliest references to this popular error is in the
Natural Magic of Baptista Porta, who says:—Late writers
report that not only in Scotland, but also in the river of
Thames by London, there is a kind of shell-fish in a two-leaved shell,
that hath a foot full of plaits and wrinkles.... They commonly stick to
the keel of some old ship. Some say they come of worms, some of
the boughs of trees which fall into the sea; if any of them be cast
upon shore, they die; but they which are swallowed still into the sea,
live and get out of their shells, and grow to be ducks, or such-like
birds.
Professor Max Müller, in a learned lecture, enters fully into the origin
of the different stories about the Barnacle Goose. He quotes from
the Philosophical Transactions of 1678 a full account by Sir Robert
Moray, who declared that he had seen within the barnacle shell, as
through a concave or diminishing glass, the bill, eyes, head, neck,
breast, wings, tail, feet, and feathers of the Barnacle Goose. The
next witness was John Gerarde, Master in Chirurgerie, who, in 1597,
declared that he had seen the actual metamorphosis of the mussel
into the bird, describing how—
The shell gapeth open, and the first thing that appeareth is the
foresaid lace or string; next come the leg of the birde hanging out, and
as it groweth greater, it openeth the shell by degrees, till at length it
is all come forth, and hangeth only by the bill, and falleth into the
sea, when it gathereth feathers and groweth to a foule, bigger than
a mallart; for the truth hereof, if any doubt, may it please them to
repair unto me, and I shall satisfie them by the testimonies of good
witnesses.
As far back as the thirteenth century, the same story is traced in the
writings of Giraldus Cambrensis. This great divine does not deny the
truth of the miraculous origin of the Barnacle Geese, but he warns
the Irish priests against dining off them during Lent on the plea that
they were not flesh, but fish. For, he writes, If a man during Lent
were to dine off a leg of Adam, who was not born of flesh either, we
should not consider him innocent of having eaten what is flesh. This
modern myth, which, in spite of the protests of such men as
Albertus Magnus, Æneas Sylvius, and others, maintained its ground
for many centuries, and was defended, as late as 1629, in a book by
Count Maier, De volucri arborea, with arguments, physical,
metaphysical, and theological, owed its origin to a play of words.
The mussel shells are called Bernaculæ from the Latin perna, the
mediæval Latin berna; the birds are called Hibernicæ or Hiberniculæ,
abbreviated to Berniculæ. As their names seem one, the creatures
are supposed to be one, and everything conspires to confirm the
first mistake, and to invest what was originally a good Irish story—a
mere canard—with all the dignity of scientific, and all the solemnity
of theological truth. The myth continued to live until the age of
Newton. Specimens of Lepadidæ, prepared by Professor Rolleston of
Oxford, show how the outward appearance of the Anatifera could
have supported the popular superstition which derived the Bernicla,
the goose, from the Bernicula, the shell.
Drayton (1613), in his Poly-olbion, iii., in connexion with the river
Lee, speaks of
Th' anatomised fish and fowls from planchers sprung;
to which a note is appended in Southey's edition, p. 609, that such
fowls were Barnacles, a bird breeding upon old ships. A bunch of
the shells attached to the ship, or to a piece of floating timber, at a
distance appears like flowers in bloom; the foot of the animal has a
similitude to the stalk of a plant growing from the ship's sides, the
shell resembles a calyx, and the flower consists of the tentacula, or
fingers, of the shell-fish. The ancient error was to mistake the foot
for the neck of a goose, the shell for its head, and the tentacula for
feathers. As to the body, non est inventus.
Sir Kenelm Digby was soundly laughed at for relating to a party at
the castle of the Governor of Calais, that the Barnacle, a bird in
Jersey, was first a shell-fish to appearance, and, from that striking
upon old wood, became in time a bird. In 1807, there was exhibited
in Spring-gardens, London, a Wonderful natural curiosity, called the
Goose Tree, Barnacle Tree, or Tree bearing Geese, taken up at sea
on January 12th, and more than twenty men could raise out of the
water. [26]
Sir J. Emerson Tennent asks whether the ready acceptance and
general credence given to so obvious a fable may not have been
derived from giving too literal a construction to the text of the
passage in the first chapter of Genesis:—
And God said, Let the waters bring forth abundantly the moving
creature that hath life, and the fowl that may fly in the open
firmament of heaven.
The Barnacle Goose is a well-known bird, and is eaten on fast-days
in France, by virtue of this old belief in its marine origin. The belief in
the barnacle origin of the bird still prevails on the west coast of
Ireland, and in the Western Highlands of Scotland.
The finding of the Barnacle is thus described by Mr. Sidebotham, to
the Microscopical and Natural History Section of the Literary and
Philosophical Society:—In September, I was at Lytham with my
family. The day was very stormy, and the previous night there had
been a strong south-west wind, and evidences of a very stormy sea
outside the banks. Two of my children came running to tell me of a
very strange creature that had been washed up on the shore. They
had seen it from the pier, and pointed it out to a sailor, thinking it
was a large dog with long hair. On reaching the shore I found a fine
mass of Barnacles, Pentalasmis anatifera, attached to some staves
of a cask, the whole being between four and five feet long. Several
sailors had secured the prize, and were getting it on a truck to carry
it away. The appearance was most remarkable, the hundreds of long
tubes with their curious shells looking like what one would fancy the
fabled Gorgon's head with its snaky locks. The curiosity was carried
to a yard where it was to be exhibited, and the bellman went round
to announce it under the name of the sea-lioness, or the great sea-
serpent. Another mass of Barnacles was washed up at Lytham, and
also one at Blackpool, the same day or the day following. This mass
of Barnacles was evidently just such a one as that seen by Gerard at
the Pile of Foulders. It is rare to have such a specimen on our
coasts. The sailors at Lytham had never seen anything like it,
although some of them were old men who had spent all their lives
on the coast.
FOOTNOTE:
[26] Notes and Queries, No. 201.
LEAVES ABOUT BOOKWORMS.
ON paper, leather, and parchment are found various animals,
popularly known as Bookworms. Johnson describes it as a
worm or mite that eats holes in books, chiefly when damp;
and in the Guardian we find this reference to its habits:—My lion,
like a moth or bookworm, feeds upon nothing but paper.
Many years ago an experienced keeper of the Ashmolean Museum at
Oxford collected these interesting details of Bookworms:—The
larvæ of Crambus pinguinalis will establish themselves upon the
binding of a book, and spinning a robe will do it little injury. A mite,
Acarus eruditus, eats the paste that fastens the paper over the
edges of the binding and so loosens it. The caterpillar of another
little moth takes its station in damp old books, between the leaves,
and there commits great ravages. The little boring wood-beetle also
attacks books, and will even bore through several volumes. An
instance is mentioned of twenty-seven folio volumes being
perforated in a straight line, by the same insect, in such a manner
that by passing a cord through the perfect round hole made by it the
twenty-seven volumes could be raised at once. The wood-beetle also
destroys prints and drawings, whether framed or kept in a portfolio.
There is another Bookworm, which is often confounded with the
Death-watch of the vulgar; but is smaller, and instead of beating at
intervals, as does the Death-watch, continues its noise for a
considerable length of time without intermission. It is usually found
in old wood, decayed furniture, museums, and neglected books. The
female lays her eggs, which are exceedingly small, in dry, dusty
places, where they are least likely to meet with disturbance. They
are generally hatched about the beginning of March, a little sooner
or later, according to the weather. After leaving the eggs, the insects
are so small as to be scarcely discerned without the use of a glass.
They remain in this state about two months, somewhat resembling
in appearance the mites in cheese, after which they undergo their
change into the perfect insect. They feed on dead flies and other
insects; and often, from their numbers and voracity, very much
deface cabinets of natural history. They subsist on various other
substances, and may often be observed carefully hunting for
nutritious particles amongst the dust in which they are found,
turning it over with their heads, and searching about in the manner
of swine. Many live through the winter buried deep in the dust to
avoid the frost.
The best mode of destroying the insects which infest books and
MSS. has often occupied the attention of the possessors of valuable
libraries. Sir Thomas Phillips found the wood of his book-case
attacked, particularly where beech had been introduced, and
appeared to think that the insect was much attracted by the paste
employed in binding. He recommended as preservatives against their
attacks spirits of turpentine and a solution of corrosive sublimate,
and also the latter substance mixed with paste. In some instances
he found the produce of a single impregnated female sufficient to
destroy a book. Turpentine and spirit of tar are also recommended
for their destruction; but the method pursued in the collections of
the British Museum is an abundant supply of camphor, with attention
to keeping the rooms dry, warm, and ventilated. Mr. Macleay states it
is the acari only which feed on the paste employed in binding books,
and the larvæ of the Coleoptera only which pierce the boards and
leaves.
The ravages of the Bookworm would be much more destructive had
there not been a sort of guardian to the literary treasures in the
shape of a spider, who, when examined through a microscope,
resembles a knight in armour. This champion of the library follows
the Worm into the book-case, discovers the pit he has digged,
rushes on his victim, which is about his own size, and devours him.
His repast finished, he rests for about a fortnight, and when his
digestion is completed, he sets out to break another lance with the
enemy.
The Death-watch, already referred to, and which must be acquitted
of destroying books, is chiefly known by the noise which he makes
behind the wainscoting, where he ticks like a clock or watch. How so
loud a noise is produced by so small an insect has never been
properly explained; and the ticking has led to many legends. The
naturalist Degeer relates that one night, in the autumn of 1809,
during an entomological excursion in Brittany, where travellers were
scarce and accommodation bad, he sought hospitality at the house
of a friend. He was from home, and Degeer found a great deal of
trouble in gaining admittance; but at last the peasant who had
charge of the house told Degeer that he would give him the
chamber of death, if he liked. As Degeer was much fatigued, he
accepted the offer. The bed is there, said the man, but no one has
slept in it for some time. Every night the spirit of the officer, who
was surprised and killed in this room by some chouans, comes back.
When the officer was dead, the peasants divided what he had about
him, and the officer's watch fell to my uncle, who was delighted with
the prize, and brought it home to examine it. However, he soon
found out that the watch was broken, and would not go. He then
placed it under his pillow, and went to sleep; he awoke in the night,
and to his terror heard the ticking of a watch. In vain he sold the
watch, and gave the money for masses to be said for the officer's
soul, the ticking continued, and has never ceased. Degeer said that
he would exorcise the chamber, and the peasant left him, after
making the sign of the cross. The naturalist at once guessed the
riddle, and, accustomed to the pursuit of insects, soon had a couple
of Death-watches shut up in a tin case, and the ticking was
reproduced.
Swift has prescribed this destructive remedy by way of ridicule:—
A Wood-worm
That lies in old wood, like a hare in her form:
With teeth or with claws it will bite, or will scratch;
And chambermaids christen this worm a Death-watch,
Because like a watch it always cries click:
Then woe be to those in the house that are sick!
For, sure as a gun, they will give up the ghost
If the maggot cries click when it scratches the post.
But a kettle of scalding hot water ejected,
Infallibly cures the timber affected:
The omen is broken, the danger is over;
The maggot will die, the sick will recover.
BORING MARINE ANIMALS, AND HUMAN ENGINEERS.
WERE a young naturalist asked to exemplify what man has
learned from the lower animals, he could scarcely adduce a
more striking instance than that of a submarine shelly worker
teaching him how to execute some of his noblest works. This we
have learned from the life and labours of the Pholas, of which it has
been emphatically said:—Numerous accounts have been published
during the last fourteen years in every civilized country and language
of the boring process of the Pholas; and machines formed on the
model of its mechanism have for years been tunnelling Mont Cenis.
In the Eastern Zoological Gallery of the British Museum, cases 35
and 36, as well as in the Museum of Economic Geology in Piccadilly,
may be seen specimens of the above very curious order of
Conchifers, most of the members of which are distinguished by their
habits of boring or digging, a process in which they are assisted by
the peculiar formation of the foot, from which they derive their
name. Of these ten families one of the most characteristic is that of
the Razor-shells, which, when the valves are shut, are of a long,
flattened, cylindrical shape, and open at both ends. Projecting its
strong pointed foot at one of these ends, the solen can work itself
down into the sand with great rapidity, while at the upper end its
respiratory tubes are shot out to bring the water to its gills. Of the
Pholadæ, the shells of which are sometimes called multivalve,
because, in addition to the two chief portions, they have a number
of smaller accessory pieces, some bore in hard mud, others in wood,
and others in rocks. They fix themselves firmly by the powerful foot,
and then make the shell revolve; the sharp edges of this commence
the perforation, which is afterwards enlarged by the rasp-like action
of the rough exterior; and though the shell must be constantly worn
down, yet it is replaced by a new formation from the animal, so as
never to be unfit for its purpose. The typical bivalve of this family is
the Pholas, which bores into limestone-rock and other hard material,
and commits ravages on the piers, breakwaters, &c., that it selects
for a home.
In the same family as the above Dr. Gray ranks the Teredo, [27] or
wood-boring mollusc, whose ravages on ships, piles, wooden piers,
&c., at sea resemble those of the white ant on furniture, joints of
houses, &c., on shore. Perforating the timber by exactly the same
process as that by which the Pholas perforates the stones, the
Teredo advances continually, eating out a contorted tube or gallery,
which it lines behind it with calcareous matter, and through which it
continues to breathe the water.
The priority of the demonstration of the Pholas and its boring
habits has been much disputed. The evidence is full of curious
details. It appears that Mr. Harper, of Edinburgh, author of The Sea-side
and Aquarium, having claimed the lead, Mr. Robertson, of
Brighton, writes to dispute the originality; adding that he publicly
exhibited Pholades in the Pavilion at Brighton in July, 1851,
perforating chalk rocks by the raspings of their valves and squirtings
of their syphons. Professor Flourens (says Mr. Robertson) taught my
observations to his class in Paris in 1853; I published them in 1851,
and again more fully in the Journal de Conchyliologie, in 1853; and
M. Emile Blanchard illustrated them in the same year in his
Organisation du Règne Animal. I published a popular account of
the perforating processes in Household Words in 1856. After
obtaining the suffrages of the French authorities, I have been
recently honoured with those of the British naturalist. (See
Woodward's Recent and Fossil Shells, p. 327. Family, Pholadidæ.)
On returning to England last autumn I exhibited perforating
Pholades to all the naturalists who cared to watch them. An
intelligent lady whom I supplied with Pholades has made a really
new and original observation, which I may take this opportunity of
communicating to the public. She observed two Pholades whose
perforations were bringing them nearer and nearer to each other.
Their mutual raspings were wearing away the thin partition which
separated their crypts. She was curious to know what they would do
when they met, and watched them closely. When the two
perforating shell-fish met and found themselves in each other's way,
the stronger just bored right through the weaker Pholas. [28]
Mr. Robertson has communicated to Jameson's Journal, No. 101,
the results of his opportunities of studying the Pholas, during six
months, to discover how this mollusc makes its hole or crypt in the
chalk: by a chemical solvent? by absorption? by ciliary currents? or
by rotatory motions? Between twenty and thirty of these creatures
were at work in lumps of chalk, in sea-water, in a finger-glass, and
open for three months; and by watching their operations, Mr.
Robertson became convinced that the Pholas makes its hole by
grating the chalk with its rasp-like valves, licking it up when
pulverized with its foot, forcing it up through its principal or branchial
syphon, and squirting it out in oblong nodules. The crypt protects
the Pholas from confervæ, which, when they get at it, grow not
merely outside, but even with the lips of the valves, preventing the
action of the syphons. In the foot there is a gelatinous spring or
style, which, even when taken out, has great elasticity, and which
seems the mainspring of the motions of the Pholas.
Upon this Dr. James Stark, of Edinburgh, writes:
—Mr. Robertson, of Brighton, claims the merit of teaching that
Pholades perforate rocks by 'the rasping of their valves and the
squirting of their syphons.' His observations only appear to reach
back to 1851. But the late Mr. John Stark, of Edinburgh, author of
the 'Elements of Natural History,' read a paper before the Royal
Society of Edinburgh, in 1826, which was printed in the Society's
'Transactions' of that year, in which he demonstrated that the
Pholades perforate the shale rocks in which they occur on this coast,
by means of the rasping of their valves, and not by acids or other
secretions. From also finding that their shells scratched limestone
without injury to the fine rasping rugosities, he inferred that it was
by the same agency they perforated the hard limestone rocks.
To this Mr. Robertson replies, that Mr. Osler also, in 1826,
demonstrated that the Pholades perforate the shale rocks by means
of the rasping of their valves; and more, for he actually witnessed a
rotatory movement. But Réaumur and Poli had done as much as this
in the eighteenth and Sibbald in the seventeenth century: and yet I
found the solvent hypothesis in the ascendant among naturalists in
1835, when I first interested myself in the controversy. What I did in
1851 was, I exhibited Pholades at work perforating rocks, and
explained how they did it. What I have done is, I have made future
controversy impossible, by exhibiting the animals at work, and by
discovering the anatomy and the physiology of the perforating
instruments. In the words of M. Flourens, 'I made the animals work
before my eyes,' and I 'made known their mechanism.' The
discovery of the function of the hyaline stylet is not merely a new
discovery, it is the discovery of a kind of instrument as yet unique in
physiology.
Mr. Harper having termed the boring organ of the Pholas the
hyaline stylet, found it to have puzzled some of the disputants,
whereupon Mr. Harper writes:—Its use up to the present time has
been a mystery, but the general opinion of authors seems to be, that
it is the gizzard of the Pholas. This I very much doubt, for it is my
belief that the presence of such an important muscle is solely for the
purpose of aiding the animal's boring operations. Being situated in
the centre of the foot, we can readily conceive the great increase of
strength thus conveyed to the latter member, which is made to act
as a powerful fulcrum, by the exercise of which the animal rotates—
and at the same time presses its shell against and rasps the surface
of the rock. The question being asked, 'How can the stylet be
procured to satisfy curiosity?' I answer, by adopting the following
extremely simple plan. Having disentombed a specimen, with the
point of a sharp instrument cut a slit in the base of its foot, and the
object of your search will be distinctly visible in the shape of, if I may
so term it, an opal cylinder. Sometimes I have seen the point of this
organ spring out beyond the incision, made as above described.
Lastly, Mr. Harper presented the Editor of the Athenæum with a
piece of bored rock, of which he has several specimens. He adds,
On examination, you will perceive that the larger Pholas must have
bored through its smaller and weaker neighbour (how suggestive!),
the shell of the latter, most fortunately, remaining in its own cavity.
Now, Mr. Robertson claimed for his observation of this phenomenon
novelty and originality; but Mr. Harper stoutly maintained it to be as
common to the eye of the practised geologist as rain or sunshine.
The details are curious; though some impatient, and not very
grateful reader, may imagine himself in the condition of the shell of
the smaller Pholas, and will be, as he deserves to remain, in the
minority. [29]
It may be interesting to sum up a few of the opinions of the mode
by which these boring operations are performed. Professor Forbes
states the mode by which Molluscs bore into wood and other
materials is as follows:—Some of the Gasteropods have tongues
covered with silica to enable them to bore, and it was probably by
some process of this kind that all the Molluscs bored.
Mr. Peach never observed the species of Pholas to turn round in their
holes, as stated by some observers, although he had watched them
with great attention. Mr. Charlesworth refers to the fact that, in one
species of shell, not only does the hole in the rock which the animal
occupies increase in size, but also the hole through which it projects
its syphons.
Professor John Phillips, alluding to the theories which have been
given of the mode in which Molluscs bore into the rocks in which
they live, believes that an exclusively mechanical theory will not
account for the phenomenon; and he is inclined to adopt the view of
Dr. T. Williams—that the boring of the Pholades can only be
explained on the principle which involves a chemical as well as a
mechanical agency.
Mr. E. Ray Lankester notices that the boring of Annelids seems quite
unknown; and he mentions two cases, one by a worm called
Leucodore, the other by a Sabella. Leucodore is very abundant on
some shores, where boulders and pebbles may be found worm-
eaten and riddled by them. Only stones composed of carbonate of
lime are bored by them. On coasts where such stones are rare, they
are selected, and others are left. The worms are quite soft, and
armed only with horny bristles. How, then, do they bore? Mr.
Lankester maintains that it is by carbonic acid and other acid
excretions of their bodies, aided by the mechanical action of their
bristles. The selection of a material soluble in these acids is most
noticeable, since the softest chalk and the hardest limestone are
bored with the same facility. This can only be by chemical action. If,
then, we have a case of chemical boring in these worms, is it not
probable that many Molluscs are similarly assisted in their
excavations?
FOOTNOTES:
[27] How Brunel took his construction of the Thames Tunnel from
observing the bore of the Teredo navalis in the keel of a ship, in
1814, is well known.
[28] Athenæum, No. 1640.
[29] See also Life in the Sea, in Strange Stories of the Animal
World, by the author of the present volume. Second Edition.
1868.
INDEX.
ANCIENT Zoological Gardens, 12
Animals, Rare, of London Zoological Society, 16, 17, 18
Annelids, boring, 348
Annelids and Molluscs, Boring Habits of, 348
Ant-Bear in captivity, 76
Ant-Bear, the Great, 72
Ant-Bear at Madrid, 72
Ant-Bear described, 77
Ant-Bear, Domestic, in Paraguay, 75
Ant-Bear, Economy of, 76
Ant-Bear and its Food, 74
Ant-Bears, Fossil, 80, 81
Ant-Bear, Muscular Force of, 79
Ant-Bear, Wallace's Account of, 73
Ant-Bear, Zoological Society's, 76, 82, 84
Ant-Eater, Porcupine, 84
Ant-Bear, Professor Owen on, 80
Ant-Eaters, scarcity of, 80
Ant-Eater, Tamandua, 82
Ant-Eaters, Von Saek's Account of, 83
Aristotle's History of Animals, 279, 280
BARNACLE GEESE, finding of the, 334
Barnacle Goose, Gerarde on, 332
Barnacle Goose, Giraldus Cambrensis on, 332
Barnacle Goose, Max Müller on, 331
Barnacle Goose, name of, 332
Barnacle Goose, Sir E. Tennent on, 334
Barnacle Goose, Sir Kenelm Digby on, 334
Barnacle Goose, Sir R. Moray on, 331
Barnacle Goose, Stories of the, 331-335
Barnacles breeding upon old ships, 333
Barnacle Geese in the Thames, 331
Bat, altivolans, by Gilbert White, 100
Bat, American, by Lesson, 91
Bat, Aristotle on, 85
Bat, Mr. Bell on, 86
Bats, Curiosities of, 85
Bat, described by Calmet, 87
Bat, Flight and Wing of, 96
Bats, in England, 100
Bat, Heber, Stedman, and Waterton on, 91
Bats in Jamaica, 100
Bat, Kalong, of Java, 98
Bat, Long-Eared, by Sowerby, 92, 93-96
Bat, Nycteris, 97
Bat, Rere-mouse and Flitter-mouse, 86
Bat Skeleton, Sir C. Bell on, 87
Bat in Scripture, 85
Bat, Vampire, from Sumatra, 88
Bat, Vampire, Lines on, by Byron, 89
Bat, vulgar errors respecting, 97
Bat-Fowling or Bat-Folding, 92
Berlin Zoological Gardens and Museum, 16
Bible Natural History, 11
Birds, Addison on their Nests and Music, 156, 157
Bird, Australian Bower, Nest of, 167
Bird, Baya, Indian, Nest of, 164
Birds and Animals, Beauty in, 150
Birds, Brain of, 154
Birds, Characteristics of, 145
Birds, Colour of, 148
Bird Confinement, Dr. Livingstone on, 169
Birds' Eggs, large, 162