Biological
Data Mining
Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
PUBLISHED TITLES
UNDERSTANDING COMPLEX DATASETS: Data Mining with Matrix Decompositions
David Skillicorn
COMPUTATIONAL METHODS OF FEATURE SELECTION
Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: Advances in Algorithms, Theory, and Applications
Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND LAW ENFORCEMENT
David Skillicorn
MULTIMEDIA DATA MINING: A Systematic Introduction to Concepts and Theory
Zhongfei Zhang and Ruofei Zhang
NEXT GENERATION OF DATA MINING
Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar
DATA MINING FOR DESIGN AND MARKETING
Yukio Ohsawa and Katsutoshi Yada
THE TOP TEN ALGORITHMS IN DATA MINING
Xindong Wu and Vipin Kumar
GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY, Second Edition
Harvey J. Miller and Jiawei Han
TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS
Ashok N. Srivastava and Mehran Sahami
BIOLOGICAL DATA MINING
Jake Y. Chen and Stefano Lonardi
SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A.
AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge
discovery, while summarizing the computational tools and techniques useful in data analysis. This
series encourages the integration of mathematical, statistical, and computational methods and
techniques through the publication of a broad range of textbooks, reference works, and hand-
books. The inclusion of concrete examples and applications is highly encouraged. The scope of the
series includes, but is not limited to, titles in the areas of data mining and knowledge discovery
methods and applications, modeling, algorithms, theory and foundations, data and knowledge
visualization, data mining systems and tools, and privacy and security issues.
Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
Edited by
Jake Y. Chen
Stefano Lonardi
Biological
Data Mining
Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2010 by Taylor and Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1
International Standard Book Number: 978-1-4200-8684-3 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Biological data mining / editors, Jake Y. Chen, Stefano Lonardi.
p. cm. -- (Data mining and knowledge discovery series)
Includes bibliographical references and index.
ISBN 978-1-4200-8684-3 (hardcover : alk. paper)
1. Bioinformatics. 2. Data mining. 3. Computational biology. I. Chen, Jake. II.
Lonardi, Stefano. III. Title. IV. Series.
QH324.2.B578 2010
570.285--dc22 2009028067
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface
Editors
Contributors
Part I Sequence, Structure, and Function
1 Consensus Structure Prediction for RNA Alignments
Junilda Spirollari and Jason T. L. Wang
2 Invariant Geometric Properties of Secondary Structure Elements in Proteins
Matteo Comin, Concettina Guerra, and Giuseppe Zanotti
3 Discovering 3D Motifs in RNA
Alberto Apostolico, Giovanni Ciriello, Concettina Guerra, and Christine E. Heitsch
4 Protein Structure Classification Using Machine Learning Methods
Yazhene Krishnaraj and Chandan Reddy
5 Protein Surface Representation and Comparison: New Approaches in Structural Proteomics
Lee Sael and Daisuke Kihara
6 Advanced Graph Mining Methods for Protein Analysis
Yi-Ping Phoebe Chen, Jia Rong, and Gang Li
7 Predicting Local Structure and Function of Proteins
Huzefa Rangwala and George Karypis
Part II Genomics, Transcriptomics, and Proteomics
8 Computational Approaches for Genome Assembly Validation
Jeong-Hyeon Choi, Haixu Tang, Sun Kim, and Mihai Pop
9 Mining Patterns of Epistasis in Human Genetics
Jason H. Moore
10 Discovery of Regulatory Mechanisms from Gene Expression Variation by eQTL Analysis
Yang Huang, Jie Zheng, and Teresa M. Przytycka
11 Statistical Approaches to Gene Expression Microarray Data Preprocessing
Megan Kong, Elizabeth McClellan, Richard H. Scheuermann, and Monnie McGee
12 Application of Feature Selection and Classification to Computational Molecular Biology
Paola Bertolazzi, Giovanni Felici, and Giuseppe Lancia
13 Statistical Indices for Computational and Data Driven Class Discovery in Microarray Data
Raffaele Giancarlo, Davide Scaturro, and Filippo Utro
14 Computational Approaches to Peptide Retention Time Prediction for Proteomics
Xiang Zhang, Cheolhwan Oh, Catherine P. Riley, Hyeyoung Cho, and Charles Buck
Part III Functional and Molecular Interaction Networks
15 Inferring Protein Functional Linkage Based on Sequence Information and Beyond
Li Liao
16 Computational Methods for Unraveling Transcriptional Regulatory Networks in Prokaryotes
Dongsheng Che and Guojun Li
17 Computational Methods for Analyzing and Modeling Biological Networks
Nataša Pržulj and Tijana Milenković
18 Statistical Analysis of Biomolecular Networks
Jing-Dong J. Han and Chris J. Needham
Part IV Literature, Ontology, and Knowledge Integration
19 Beyond Information Retrieval: Literature Mining for Biomedical Knowledge Discovery
Javed Mostafa, Kazuhiro Seki, and Weimao Ke
20 Mining Biological Interactions from Biomedical Texts for Efficient Query Answering
Muhammad Abulaish, Lipika Dey, and Jahiruddin
21 Ontology-Based Knowledge Representation of Experiment Metadata in Biological Data Mining
Richard H. Scheuermann, Megan Kong, Carl Dahlke, Jennifer Cai, Jamie Lee, Yu Qian, Burke Squires, Patrick Dunn, Jeff Wiser, Herb Hagler, Barry Smith, and David Karp
22 Redescription Mining and Applications in Bioinformatics
Naren Ramakrishnan and Mohammed J. Zaki
Part V Genome Medicine Applications
23 Data Mining Tools and Techniques for Identification of Biomarkers for Cancer
Mick Correll, Simon Beaulah, Robin Munro, Jonathan Sheldon, Yike Guo, and Hai Hu
24 Cancer Biomarker Prioritization: Assessing the in vivo Impact of in vitro Models by in silico Mining of Microarray Database, Literature, and Gene Annotation
Chia-Ju Lee, Zan Huang, Hongmei Jiang, John Crispino, and Simon Lin
25 Biomarker Discovery by Mining Glycomic and Lipidomic Data
Haixu Tang, Mehmet Dalkilic, and Yehia Mechref
26 Data Mining Chemical Structures and Biological Data
Glenn J. Myatt and Paul E. Blower
Index
Preface
Modern biology has become an information science. Since the invention of a
DNA sequencing method by Sanger in the late seventies, public repositories
of genomic sequences have been growing exponentially, doubling in size every
16 months—a rate often compared to the growth of semiconductor transistor
densities in CPUs known as Moore’s Law. In the nineties, the public–private
race to sequence the human genome further intensified the fervor to gener-
ate high-throughput biomolecular data from highly parallel and miniaturized
instruments. Today, sequencing data from thousands of genomes, including
plants, mammals, and microbial genomes, are accumulating at an unprece-
dented rate. The advent of second-generation DNA sequencing instruments,
high-density cDNA microarrays, tandem mass spectrometers, and high-power
NMRs has fueled the growth of molecular biology into a wide spectrum of
disciplines such as personalized genomics, functional genomics, proteomics,
metabolomics, and structural genomics. Few experiments in molecular biol-
ogy and genetics performed today can afford to ignore the vast amount of
biological information publicly accessible. Suddenly, molecular biology and
genetics have become data rich.
Biological data mining is a data-guzzling turbo engine for postgenomic
biology, driving the competitive race toward unprecedented biological discov-
ery opportunities in the twenty-first century. Classical bioinformatics emerged
from the study of macromolecules in molecular biology, biochemistry, and
biophysics. Analysis, comparison, and classification of DNA and protein se-
quences were the dominant themes of bioinformatics in the early nineties.
Machine learning mainly focused on predicting gene and protein functions
from their sequences and structures. The understanding of cellular functions
and processes underlying complex diseases was out of reach. Bioinformatics
scientists were a rare breed, and their contribution to molecular biology and
genetics was considered marginal, because the computational tools available
then for biomolecular data analysis were far more primitive than the array
of experimental techniques and assays that were available to life scientists.
Today, we are witnessing the reversal of these past trends. Diverse sets
of data types that cover a broad spectrum of genotypes and phenotypes, par-
ticularly those related to human health and diseases, have become available.
Many interdisciplinary researchers, including applied computer scientists, ap-
plied mathematicians, biostatisticians, biomedical researchers, clinical scien-
tists, and biopharmaceutical professionals, have discovered in biology a gold
mine of knowledge leading to many exciting possibilities: the unraveling of the
tree of life, harnessing the power of microbial organisms for renewable energy,
finding new ways to diagnose disease early, and developing new therapeutic
compounds that save lives. Much of the experimental high-throughput biology
data are generated and analyzed “in haste,” therefore leaving plenty of oppor-
tunities for knowledge discovery even after the original data are released. Most
of the bets on the race to separate the wheat from the chaff have been placed
on biological data mining techniques. After all, when easy, straightforward,
first-pass data analysis has not yielded novel biological insights, data mining
techniques must be able to help, or so many presumed.
In reality, biological data mining is still very much an “art,” successfully
practiced by a few bioinformatics research groups that occupy themselves
with solving real-world biological problems. Unlike data mining in business,
where the major concerns are often related to the bottom line—profit—the
goals of biological data mining can be as diverse as the spectrum of biologi-
cal questions that exist. In the business domain, association rules discovered
between sales items are immediately actionable; in biology, any unorthodox
hypothesis produced by computational models has to be first red-flagged and
is lucky to be validated experimentally. In the Internet business domain, clas-
sification, clustering, and visualization of blogs, network traffic patterns, and
news feeds add significant value for regular Internet users who are unaware of
high-level patterns that may exist in the data set; in molecular biology and ge-
netics, any clustering or classification of the data presented to biologists may
promptly elicit questions like “great, but how and why did it happen?” or
“how can you explain these results in the context of the biology I know?” The
majority of general-purpose data mining techniques do not take into consideration
prior domain knowledge of the biological problem, which often leads them to
underperform hypothesis-driven biological investigative techniques. The
high level of variability of measurements inherent in many types of biological
experiments or samples, the general unavailability of experimental replicates,
the large number of hidden variables in the data, and the high correlation of
biomolecular expression measurements also constitute significant challenges in
the application of classical data mining methods in biology. Many biological
data mining projects are attempted and then abandoned, even by experienced
data mining scientists. In extreme cases, large-scale biological data mining
efforts are jokingly labeled fishing expeditions and dismissed in national
grant proposal review panels.
This book represents a culmination of our past research efforts in biolog-
ical data mining. Throughout this book, we wanted to showcase a small, but
noteworthy sample of successful projects involving data mining and molec-
ular biology. Each chapter of the book is authored by a distinguished team
of bioinformatics scientists whom we invited to offer the readers the widest
possible range of application domains. To ensure high-quality standards, each
contributed chapter went through standard peer reviews and a round of revi-
sions. The contributed chapters have been grouped into five major sections.
The first section, entitled Sequence, Structure, and Function, collects contri-
butions on data mining techniques designed to analyze biological sequences
and structures with the objective of discovering novel functional knowledge.
The second section, on Genomics, Transcriptomics, and Proteomics, contains
studies addressing emerging large-scale data mining challenges in analyzing
high-throughput “omics” data. The chapters in the third section, entitled
Functional and Molecular Interaction Networks, address emerging system-
scale molecular properties and their relevance to cellular functions. The fourth
section is about Literature, Ontology, and Knowledge Integration, and it collects
chapters related to knowledge representation, information retrieval, and
data integration for structured and unstructured biological data. The con-
tributed works in the fifth and last section, entitled Genome Medicine Appli-
cations, address emerging biological data mining applications in medicine.
We believe this book can serve as a valuable guide to the field for graduate
students, researchers, and practitioners. We hope that the wide range of topics
covered will allow readers to appreciate the extent of the impact of data mining
in molecular biology and genetics. For us, research in data mining and its
applications to biology and genetics is fascinating and rewarding. It may even
help to save human lives one day. This field offers great opportunities and
rewards if one is prepared to learn molecular biology and genetics, design user-
friendly software tools under the proper biological assumptions, and validate
all discovered hypotheses rigorously using appropriate models.
In closing, we would like to thank all the authors who contributed a chapter
to the book. We are also indebted to Randi Cohen, our outstanding publishing
editor. Randi efficiently managed timelines and deadlines, gracefully handled
the communication with the authors and the reviewers, and took care of every
little detail associated with this project. This book would not have been
possible without her. Our thanks also go to our families for their support
throughout the book project.
Jake Y. Chen
Indianapolis, Indiana
Stefano Lonardi
Riverside, California
Editors
Jake Chen is an assistant professor of informatics at Indiana University
School of Informatics and assistant professor of computer science at Purdue
School of Science, Indiana. He is the founding director of the Indiana Cen-
ter for Systems Biology and Personalized Medicine—the first research center
in the region to promote the development of systems biology tools towards
solving future personalized medicine problems. He is an IEEE senior mem-
ber and a member of several other interdisciplinary Indiana research centers,
including: Center for Computational Biology and Bioinformatics, Center for
Bio-computing, Indiana University Cancer Center, and Indiana Center for En-
vironmental Health. He was a scientific co-founder and chief informatics officer
(2006–2008) of Predictive Physiology and Medicine, Inc. and the founder of
Medeolinx, LLC, Indiana biotech startups developing businesses in emerging
personalized medicine and translational bioinformatics markets.
Dr. Chen received PhD and MS degrees in computer science from the
University of Minnesota at Twin Cities and a BS in molecular biology and
biochemistry from Peking University in China. He has extensive industrial
research and management experience (1998–2003), including developing com-
mercial GeneChip microarrays at Affymetrix, Inc. and mapping the first hu-
man protein interactome at Myriad Proteomics. After rejoining academia in
2004, he concentrated his research on “translational bioinformatics,” studies
aiming to bridge the gaps between bioinformatics research and human health
applications. He has over 60 publications in the areas of biological data man-
agement, biological data mining, network biology, systems biology, and various
disease-related omics applications.
Stefano Lonardi is associate professor of computer science and engineering
at the University of California, Riverside. He is also a faculty member of
the graduate program in genetics, genomics and bioinformatics, the Center
for Plant Cell Biology, the Institute for Integrative Genome Biology, and the
graduate program in cell, molecular and developmental biology.
Dr. Lonardi received his “Laurea cum laude” from the University of Pisa
in 1994 and his PhD, in the summer of 2001, from the Department of Com-
puter Sciences, Purdue University, West Lafayette, IN. He also holds a PhD
in electrical and information engineering from the University of Padua (1999).
During the summer of 1999, he was an intern at Celera Genomics, Department
of Informatics Research, Rockville, MD.
Dr. Lonardi’s recent research interests include the design of algorithms,
computational molecular biology, data compression, and data mining. He has
published more than 30 papers in major theoretical computer science and
computational biology journals and has about 45 publications in refereed in-
ternational conferences. In 2005, he received the CAREER award from the
National Science Foundation.
Contributors
Muhammad Abulaish
Department of Computer Science
Jamia Millia Islamia
New Delhi, India
Alberto Apostolico
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
Simon Beaulah
InforSense, Ltd.
London, United Kingdom
Paola Bertolazzi
Istituto di Analisi dei Sistemi ed
Informatica Antonio Ruberti
Consiglio Nazionale delle Ricerche
Rome, Italy
Paul E. Blower
Department of Pharmacology
Ohio State University
Columbus, Ohio
Charles Buck
Bindley Bioscience Center
Purdue University
West Lafayette, Indiana
Jennifer Cai
Department of Pathology
University of Texas Southwestern
Medical Center
Dallas, Texas
Dongsheng Che
Department of Computer Science
East Stroudsburg University
East Stroudsburg, Pennsylvania
Yi-Ping Phoebe Chen
School of Information Technology
Deakin University
Melbourne, Australia
Hyeyoung Cho
Bindley Bioscience Center
Purdue University
West Lafayette, Indiana
and
Department of Bio and Brain
Engineering
KAIST
Daejeon, South Korea
Jeong-Hyeon Choi
Center for Genomics and
Bioinformatics and
School of Informatics
Indiana University
Bloomington, Indiana
Giovanni Ciriello
Department of Information
Engineering
University of Padova
Padova, Italy
Matteo Comin
Department of Information
Engineering
University of Padua
Padova, Italy
Mick Correll
InforSense, LLC
Cambridge, Massachusetts
John Crispino
Hematology Oncology
Northwestern University
Chicago, Illinois
Carl Dahlke
Health Information Systems
Northrop Grumman, Inc.
Rockville, Maryland
Mehmet Dalkilic
School of Informatics
Indiana University
Bloomington, Indiana
Lipika Dey
Innovation Labs
Tata Consultancy Services
New Delhi, India
Patrick Dunn
Health Information Systems
Northrop Grumman, Inc.
Rockville, Maryland
Giovanni Felici
Istituto di Analisi dei Sistemi ed
Informatica Antonio Ruberti
Consiglio Nazionale delle Ricerche
Rome, Italy
Raffaele Giancarlo
Dipartimento di Matematica ed
Applicazioni
University of Palermo
Palermo, Italy
Concettina Guerra
College of Computing
Georgia Institute of Technology
Atlanta, Georgia
and
Department of Information
Engineering
University of Padua
Padova, Italy
Yike Guo
InforSense, Ltd.
London, United Kingdom
Herb Hagler
Department of Pathology
University of Texas Southwestern
Medical Center
Dallas, Texas
Jing-Dong J. Han
Key Laboratory of Molecular
Developmental Biology
Center for Molecular Systems
Biology
Institute of Genetics and
Developmental Biology
Chinese Academy of Sciences
Beijing, People’s Republic of China
Christine E. Heitsch
School of Mathematics
Georgia Institute of Technology
Atlanta, Georgia
Hai Hu
Windber Research Institute
Windber, Pennsylvania
Yang Huang
National Institutes of Health
Bethesda, Maryland
Zan Huang
Hematology Oncology
Northwestern University
Chicago, Illinois
Hongmei Jiang
Department of Statistics
Northwestern University
Evanston, Illinois
David Karp
Division of Rheumatology
University of Texas Southwestern
Medical Center
Dallas, Texas
George Karypis
Department of Computer Science
University of Minnesota
Minneapolis, Minnesota
Weimao Ke
University of North Carolina
Chapel Hill, North Carolina
Daisuke Kihara
Department of Biological Sciences
and Department of Computer
Science
Markey Center for Structural Biology
College of Science
Purdue University
West Lafayette, Indiana
Sun Kim
Center for Genomics and
Bioinformatics and School of
Informatics
Indiana University
Bloomington, Indiana
Megan Kong
Department of Pathology
University of Texas Southwestern
Medical Center
Dallas, Texas
Yazhene Krishnaraj
Wayne State University
Detroit, Michigan
Giuseppe Lancia
Dipartimento di Matematica e
Informatica
University of Udine
Udine, Italy
Chia-Ju Lee
Biomedical Informatics Center
Northwestern University
Chicago, Illinois
Jamie Lee
Department of Pathology
University of Texas Southwestern
Medical Center
Dallas, Texas
Gang Li
School of Information Technology
Deakin University
Melbourne, Australia
Guojun Li
Department of Biochemistry and
Molecular Biology and Institute of
Bioinformatics
University of Georgia
Athens, Georgia
and
School of Mathematics and System
Sciences
Shandong University
Jinan, People’s Republic of China
Li Liao
Computer and Information Sciences
University of Delaware
Newark, Delaware
Simon Lin
Biomedical Informatics Center
Northwestern University
Chicago, Illinois
Elizabeth McClellan
Division of Biomedical Informatics
University of Texas Southwestern
Medical Center
Dallas, Texas
and
Department of Statistical Science
Southern Methodist University
Dallas, Texas
Monnie McGee
Department of Statistical Science
Southern Methodist University
Dallas, Texas
Yehia Mechref
National Center for Glycomics and
Glycoproteomics
Department of Chemistry
Indiana University
Bloomington, Indiana
Tijana Milenković
Department of Computer Science
University of California
Irvine, California
Jason H. Moore
Computational Genetics Laboratory
Norris-Cotton Cancer Center
Departments of Genetics and
Community and Family Medicine
Dartmouth Medical School
Lebanon, New Hampshire
and
Department of Computer Science
University of New Hampshire
Durham, New Hampshire
and
Department of Computer Science
University of Vermont
Burlington, Vermont
and
Translational Genomics Research
Institute
Phoenix, Arizona
Javed Mostafa
University of North Carolina
Chapel Hill, North Carolina
Robin Munro
InforSense, Ltd.
London, United Kingdom
Glenn J. Myatt
Myatt & Johnson, Inc.
Jasper, Georgia
Chris J. Needham
School of Computing
University of Leeds
Leeds, United Kingdom
Cheolhwan Oh
Bindley Bioscience Center
Purdue University
West Lafayette, Indiana
Mihai Pop
Center for Bioinformatics and
Computational Biology
University of Maryland
College Park, Maryland
Teresa M. Przytycka
National Institutes of Health
Bethesda, Maryland
Nataša Pržulj
Department of Computer Science
University of California
Irvine, California
Yu Qian
Department of Pathology
University of Texas Southwestern
Medical Center
Dallas, Texas
Naren Ramakrishnan
Department of Computer Science
Virginia Tech
Blacksburg, Virginia
Huzefa Rangwala
Department of Computer Science
George Mason University
Fairfax, Virginia
Chandan Reddy
Wayne State University
Detroit, Michigan
Catherine P. Riley
Bindley Bioscience Center
Purdue University
West Lafayette, Indiana
Jia Rong
School of Information Technology
Deakin University
Melbourne, Australia
Lee Sael
Department of Computer Science
Purdue University
West Lafayette, Indiana
Davide Scaturro
Dipartimento di Matematica
ed Applicazioni
University of Palermo
Palermo, Italy
Richard H. Scheuermann
Department of Pathology
Division of Biomedical Informatics
University of Texas Southwestern
Medical Center
Dallas, Texas
Kazuhiro Seki
Organization of Advanced Science
and Technology
Kobe University
Kobe, Japan
Jonathan Sheldon
InforSense Ltd.
London, United Kingdom
Barry Smith
Department of Philosophy
University at Buffalo
Buffalo, New York
Junilda Spirollari
New Jersey Institute of
Technology
Newark, New Jersey
Burke Squires
Department of Pathology
University of Texas
Southwestern Medical Center
Dallas, Texas
Haixu Tang
School of Informatics
National Center for Glycomics
and Glycoproteomics
Indiana University
Bloomington, Indiana
Jahiruddin
Department of Computer Science
Jamia Millia Islamia
New Delhi, India
Filippo Utro
Dipartimento di Matematica
ed Applicazioni
University of Palermo
Palermo, Italy
Jason T. L. Wang
New Jersey Institute of Technology
Newark, New Jersey
Jeff Wiser
Health Information Systems
Northrop Grumman, Inc.
Rockville, Maryland
Mohammed Zaki
Department of Computer Science
Rensselaer Polytechnic Institute
Troy, New York
Giuseppe Zanotti
Department of Biological Chemistry
University of Padua
Padova, Italy
Xiang Zhang
Department of Chemistry
Center of Regulatory and
Environmental Analytical
Metabolomics
University of Louisville
Louisville, Kentucky
Jie Zheng
National Institutes of Health
Bethesda, Maryland
Part I
Sequence, Structure, and
Function
Chapter 1
Consensus Structure Prediction for
RNA Alignments
Junilda Spirollari and Jason T. L. Wang
New Jersey Institute of Technology
1.1 Introduction
1.2 Algorithms
1.2.1 Folding of a single RNA sequence
1.2.1.1 Preliminaries
1.2.1.2 Algorithm
1.2.2 Calculation of covariance scores
1.2.2.1 Covariance score
1.2.2.2 Pairing threshold
1.2.3 Algorithms for RSpredict
1.3 Results
1.3.1 Performance evaluation on Rfam alignments of high similarity
1.3.2 Performance evaluation on Rfam alignments of medium and low similarity
1.4 Conclusions
References
1.1 Introduction
RNA secondary structure prediction has been studied for quite a while.
Many minimum free energy (MFE) methods have been developed for predicting
the secondary structures of single RNA sequences, such as mfold [1],
RNAfold [2], MPGAfold [3], as well as recent tools presented in the
literature [4, 5]. However, the accuracy of predicted structures is far from
perfect. As evaluated by Gardner and Giegerich [6], the accuracy of the MFE
methods for single sequences is 73% when averaged over many different
RNAs.
Recently, a new concept of energy density for predicting the secondary
structures of single RNA sequences was introduced [7]. The normalized free
energy, or energy density, of an RNA substructure is the free energy of that
substructure divided by the length of its underlying sequence. A dynamic
programming algorithm, called Densityfold, was developed, which delocalizes
the thermodynamic cost of computing RNA substructures and improves on
secondary structure prediction via energy density minimization [7]. Here, we
extend the concept used in Densityfold and present a tool, called RSpredict, for
RNA secondary structure prediction. RSpredict computes the RNA structure
with minimum energy density based on the loop decomposition scheme used
in the nearest neighbor energy model [8]. RSpredict focuses on the loops in an
RNA secondary structure, whereas Densityfold considers RNA substructures
where a substructure may contain several loops.
While the energy density model creates a foundation for RNA secondary
structure prediction, there are many limitations in Densityfold, just like in all
other single sequence-based MFE methods. Optimal structures predicted by
these methods do not necessarily represent real structures [9]. This happens
due to several reasons. The thermodynamic model may not be accurate. The
bases of structural RNAs may be chemically modified and these processes
are not included in the prediction model. Finally, some functional RNAs may
not have stable secondary structures [6]. Thus, a more reliable approach is
to use comparative analysis to compute consensus secondary structures from
multiple related RNA sequences [9].
In general, there are three strategies with the comparative approach. The
first strategy is to predict the secondary structures of individual RNA
sequences separately and then align the structures. Tools such as RNAshapes
[10,11], MARNA [12], STRUCTURELAB [13], and RADAR [14,15] are based
on this strategy. RNA Sampler [9] and comRNA [16] compare and find stems
conserved across multiple sequences and then assemble conserved stem blocks
to form consensus structures, in which pseudoknots are allowed.
The second strategy predicts common secondary structures of two or more
RNA sequences through simultaneous alignment and consensus structure
inference. Tools based on this strategy include RNAscf [17], Foldalign [18],
Dynalign [19], stemloc [20], PMcomp [21], MASTR [22], and CARNAC [23].
These tools utilize either folding free energy change parameters or stochastic
context-free grammars (SCFGs) and are considered derivations of Sankoff’s
method [24].
The third strategy is to fold multiple sequence alignments. RNAalifold
[25, 26] uses a dynamic programming algorithm to compute the consensus
secondary structure with MFE by taking into account thermodynamic stability
and sequence covariation, together with RIBOSUM-like scoring matrices [27].
Pfold [28] is an SCFG algorithm that produces a prior probability distribution
of RNA structures. A maximum likelihood approach is used to estimate a
phylogenetic tree for predicting the most likely structure for input sequences.
A limitation of Pfold is that it does not run on alignments of more than 40
sequences and in some cases produces no structures due to underflow errors [6].
Maximum weighted matching (MWM), based on a graph-theoretical approach
and developed by Cary and Stormo [29] and Tabaska et al. [30], is able to
predict common secondary structures allowing pseudoknots. KNetFold [31]
is a recently published machine learning method, implemented using a
hierarchical network of k-nearest neighbor classifiers that analyzes the base pairings
of alignment columns in the input sequences through their mutual information,
Watson–Crick base pairing rules and thermodynamic base pair propensity
derived from RNAfold [2]. The method presented in this chapter, RSpredict,
joins the many tools using the third strategy; it accepts a multiple alignment
of RNA sequences as input data and predicts the consensus secondary
structure for the input sequences via energy density minimization and covariance
score calculation.
We also considered two variants of RSpredict, referred to as RSefold and
RSdfold respectively. Both RSefold and RSdfold use the same covariance score
calculation as in RSpredict. The differences among the three approaches lie in
the folding algorithms they adopt. RSefold predicts the consensus secondary
structure for the input sequences via free energy minimization, as opposed to
energy density minimization used in RSpredict. RSdfold does the prediction
via energy density minimization, though its energy density is calculated based
on RNA substructures as in Densityfold, rather than based on the loops used
in RSpredict.
The rest of the chapter is organized as follows. We first describe the
implementation and algorithms used by RSpredict, and analyze the time complexity
of the algorithms (see Section 1.2). We then present experimental results of
running the RSpredict tool as well as a comparison with existing tools (see
Section 1.3). The experiments were performed on a variety of datasets. Finally,
we discuss some properties of RSpredict and possible ways to improve the tool,
and point out some directions for future research (see Section 1.4).
1.2 Algorithms
RSpredict, which can be freely downloaded from
http://datalab.njit.edu/biology/RSpredict, was implemented in the Java
programming language. The
program accepts, as input data, a multiple sequence alignment in the FASTA
or ClustalW format and outputs the consensus secondary structure of the
input sequences in both the Vienna style dot bracket format [26] and the
connectivity table format [32]. Below, we describe the energy density model
adopted by RSpredict. We then present a dynamic programming algorithm
for folding a single RNA sequence via energy density minimization. Next,
we describe techniques for calculating covariance scores based on the input
alignment. Finally we summarize the algorithms used by RSpredict, combining
both the folding technique and the covariance scores obtained from the input
alignment, and show its time complexity.
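The Vienna style dot bracket output is straightforward to post-process. As an illustrative sketch only (the function name `dot_bracket_to_pairs` is hypothetical and not part of RSpredict, and only nested structures written with `(`, `)` and `.` are handled), the base pairs of a predicted structure can be recovered with a stack:

```python
def dot_bracket_to_pairs(structure):
    """Return 1-based base pairs (i, j), i < j, from a dot-bracket string.

    Hypothetical helper for illustration; pseudoknot brackets such as
    '[' and ']' are not supported.
    """
    open_positions = []
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((open_positions.pop(), pos))
        # Any other symbol (for example '.') marks an unpaired position.
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)
```

The 1-based positions match the column numbering used in the rest of this section.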
1.2.1 Folding of a single RNA sequence
1.2.1.1 Preliminaries
We represent an RNA secondary structure as a fully decomposed set of
loops. In general, a loop L can be one of the following (see Figure 1.1):
i. A hairpin loop (which is a loop enclosed by only one base pair; the
smallest possible hairpin loop consists of three nucleotides enclosed by
a base pair)
ii. A stack, composed of two consecutive base pairs
iii. A bulge loop, if two base pairs are separated only on one side by one or
more unpaired bases
iv. An internal loop, if two base pairs are separated by one or more unpaired
bases on both sides
v. A multibranched loop, if more than two base pairs are separated by zero
or more unpaired bases in the loop
We now introduce some terms and definitions. Let S be an RNA sequence
consisting of nucleotides or bases A, U, C, G. S[i] denotes the base at position
i of the sequence S and S[i, j] is the subsequence starting at position i and
ending at position j in S. A base pair between nucleotides at positions i and
j is denoted as (i, j) or (S[i], S[j]), and its enclosed sequence is S[i, j]. Given
a loop L in the secondary structure R of sequence S, the base pair (i*, j*) in
L is called the exterior pair of L if S[i*] (S[j*], respectively) is closest to the
5′ (3′, respectively) end of R among all nucleotides in L. All other nonexterior
base pairs in L are called interior pairs of L. The length of a loop L is the
number of nucleotides in L. Note that two loops may overlap on a base pair.
For example, the interior pair of a stack may be the exterior pair of another
stack, or the exterior pair of a hairpin loop. Also note that a bulge or an
internal loop has exactly one exterior pair and one interior pair.
We use the energy density concept as follows. Given a secondary structure
R, every base pair (i, j) in R is the exterior pair of some loop L. We assign
(i, j) and L an energy density, which is the free energy of the loop L divided by
the length of L. The set of free energy parameters for nonmultibranched loops
used in our algorithm is acquired from [33]. The free energy of a multibranched
loop is computed based on the approach adopted by mfold [1], which is a
linear function of the number of unpaired bases and the number of base pairs
inside the loop, namely a + b × n1 + c × n2, where a, b, c are constants, n1
is the number of unpaired bases and n2 is the number of base pairs inside
the multibranched loop. We adopt the loop decomposition scheme used in the
nearest neighbor energy model developed by Turner et al. [8]. The secondary
structure R contains multiple loop components and the energy densities of
the loop components are additive. Our folding algorithm computes the total
energy density of R by taking the sum of the energy densities of the loop
components in R. Thus, the RNA folding problem can be formalized as follows.
Given an RNA sequence S, find the set of base pairs (i, j) and loops with
(i, j) as exterior pairs, such that the total energy density of the loops (or
equivalently, the exterior pairs) is minimized. The set of base pairs constitutes
the optimal secondary structure of S.
FIGURE 1.1: Illustration of the loops in an RNA secondary structure, drawn
from the 5′ end to the 3′ end; the labeled loop types are hairpin, stack, bulge,
internal loop, and multibranched loop. Each loop has at least one base pair. A
stem consists of two or more consecutive stacks shown in the figure.
When generalizing the folding of a single sequence to the prediction of the
consensus structure of a multiple sequence alignment, we introduce the notion
of refined alignments. At times, an input alignment may have some columns
each of which contains more than 75% gaps. Some tools including RSpredict
delete these columns to get a refined alignment [28]; some tools simply use the
original input alignment as the refined alignment. Suppose the original input
alignment Ao has N sequences and n_o columns, and the refined alignment A
has N sequences and n columns, n ≤ n_o. Formally, the consensus structure of
the refined alignment A is a secondary structure R together with its sequence
S such that each base pair (S[i], S[j]), 1 ≤ i < j ≤ n, in R corresponds to
the pair of columns i, j in the alignment A, and each base S[i], 1 ≤ i ≤ n,
is the representative base of the ith column in the alignment A. There are
several ways to choose the representative base. For example, S[i] could be
the most frequently occurring nucleotide, excluding gaps, in the ith column
of the alignment A. Furthermore, there is an energy measure value associated
with each base pair (S[i], S[j]) or more precisely its corresponding column pair
(i, j), such that the total energy measure value of all the base pairs in R is
minimized.
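One way to choose the representative bases is sketched here. This is a minimal illustration, not RSpredict's code: the function name, the gap symbols `-` and `.`, the tie-breaking rule, and the symbol used for an all-gap column are assumptions.

```python
from collections import Counter

GAP_CHARS = "-."  # assumption: both symbols denote gaps in the input


def representative_sequence(alignment, all_gap_symbol="-"):
    """Pick the most frequent non-gap base in every column.

    `alignment` is a list of equal-length strings, one per sequence.
    Columns that contain only gaps get `all_gap_symbol` (an assumption;
    the text does not say how such columns are handled).
    """
    if not alignment:
        return ""
    bases = []
    for col in range(len(alignment[0])):
        counts = Counter(
            row[col] for row in alignment if row[col] not in GAP_CHARS
        )
        if counts:
            # most_common(1) breaks ties by first occurrence in the column.
            bases.append(counts.most_common(1)[0][0])
        else:
            bases.append(all_gap_symbol)
    return "".join(bases)
```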
The consensus secondary structure of the original input alignment Ao is
defined as the structure Ro, obtained from R, as follows: (i) the base (base pair,
respectively) for column Co (column pair (Co1, Co2), respectively) in Ao is
identical to the base (base pair, respectively) for the corresponding column C
(column pair (C1, C2), respectively) in A if Co ((Co1, Co2), respectively) is not
deleted when getting A from Ao; (ii) unpaired gaps are inserted into R, such
that each gap corresponds to a column that is deleted when getting A from Ao
(see Figure 1.2). In Figure 1.2, the RSpredict algorithm transforms the original
input alignment Ao to a refined alignment A by deleting the fourth column (the
column in red) of Ao. The algorithm predicts the consensus structure of the
refined alignment A. Then the algorithm generates the consensus structure
of Ao by inserting an unpaired gap to the fourth position of the consensus
structure of A. The numbers inside parentheses in the refined alignment A
represent the original column numbers in Ao.
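The refinement step can be sketched as follows, under stated assumptions: the function name is hypothetical, gaps are written as `-` or `.`, and the alignment is a list of equal-length strings. The sketch returns the kept column indices so that a structure predicted on A can later be mapped back to Ao.

```python
def refine_alignment(alignment, max_gap_fraction=0.75, gap_chars="-."):
    """Drop columns whose gap fraction exceeds `max_gap_fraction`.

    Returns the refined rows and, for each kept column, its 0-based
    index in the original alignment.
    """
    if not alignment:
        return [], []
    n_rows = len(alignment)
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all rows must have the same length")
    kept = []
    for col in range(n_cols):
        gaps = sum(1 for row in alignment if row[col] in gap_chars)
        # A column is deleted only when MORE than 75% of it is gaps.
        if gaps / n_rows <= max_gap_fraction:
            kept.append(col)
    refined = ["".join(row[c] for c in kept) for row in alignment]
    return refined, kept
```

With five sequences, a column holding four gaps (80%) is deleted, which mirrors the deletion of the fourth column in Figure 1.2.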
In what follows, we first present an algorithm for folding a single RNA
sequence based on the energy density concept described here. We then
generalize the algorithm to predict the consensus secondary structure for a set of
aligned RNA sequences.
1.2.1.2 Algorithm
The functions and parameters used in our algorithm are defined below
where S[i, j] is a subsequence of S and R[i, j] is the optimal secondary
structure of S[i, j].
i. NE(i, j) is the total energy density of all loops in R[i, j], where
nucleotides at positions i, j may or may not form a base pair.
ii. NEP(i, j) is the total energy density of all loops in R[i, j] if nucleotides
at positions i, j form a base pair.
iii. eH(i, j) (EH(i, j), respectively) is the free energy (energy density,
respectively) of the hairpin with exterior pair (i, j).
iv. eS(i, j) (ES(i, j), respectively) is the free energy (energy density,
respectively) of the stack with exterior pair (i, j) and interior pair (i + 1, j − 1).
v. eB(i, j, i′, j′) (EB(i, j, i′, j′), respectively) is the free energy (energy
density, respectively) of the bulge or internal loop with exterior pair
(i, j) and interior pair (i′, j′).
vi. eJ(i, j, i′_1, j′_1, i′_2, j′_2, ..., i′_k, j′_k) (EJ(i, j, i′_1, j′_1, i′_2, j′_2, ..., i′_k, j′_k),
respectively) is the free energy (energy density, respectively) of the
multibranched loop with exterior pair (i, j) and interior pairs
(i′_1, j′_1), (i′_2, j′_2), ..., (i′_k, j′_k).
FIGURE 1.2: Illustration of the consensus structure definition used by
RSpredict. The figure shows the original alignment Ao (columns 1 to 10), the
refined alignment A (columns 1 to 9, with the original column numbers 1, 2, 3,
5, 6, 7, 8, 9, 10 in parentheses), the consensus structure of A, and the
consensus structure of Ao.
It is clear that
\[ E_H(i, j) = \frac{e_H(i, j)}{j - i + 1} \tag{1.1} \]
\[ E_S(i, j) = \frac{e_S(i, j)}{4} \tag{1.2} \]
\[ E_B(i, j, i', j') = \frac{e_B(i, j, i', j')}{i' - i + j - j' + 2} \tag{1.3} \]
\[ E_J(i, j, i'_1, j'_1, i'_2, j'_2, \ldots, i'_k, j'_k) = \frac{e_J(i, j, i'_1, j'_1, i'_2, j'_2, \ldots, i'_k, j'_k)}{n_1 + 2 \times n_2} \tag{1.4} \]
Here n1 is the number of unpaired bases and n2 is the number of base
pairs in the multibranched loop in (vi).
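Equations 1.1 through 1.4 each divide a loop's free energy by the number of nucleotides in that loop. A minimal sketch follows; the function names are hypothetical, the free energies are passed in by the caller rather than looked up in the parameter tables of [33], and the constants a, b, c of the multibranched loop energy are left as arguments because the text does not give their values.

```python
def hairpin_density(e_h, i, j):
    """Equation 1.1: a hairpin closed by (i, j) spans j - i + 1 nucleotides."""
    return e_h / (j - i + 1)


def stack_density(e_s):
    """Equation 1.2: a stack consists of two base pairs, i.e. 4 nucleotides."""
    return e_s / 4


def bulge_or_internal_density(e_b, i, j, i_in, j_in):
    """Equation 1.3: exterior pair (i, j), interior pair (i_in, j_in)."""
    return e_b / (i_in - i + j - j_in + 2)


def multibranch_energy(n1, n2, a, b, c):
    """Linear multibranched loop energy a + b*n1 + c*n2 described in the text."""
    return a + b * n1 + c * n2


def multibranch_density(e_j, n1, n2):
    """Equation 1.4: n1 unpaired bases plus two nucleotides per base pair."""
    return e_j / (n1 + 2 * n2)
```

For example, a bulge with exterior pair (1, 10) and interior pair (3, 9) contains 3 − 1 + 10 − 9 + 2 = 5 nucleotides, so its energy density is its free energy divided by 5.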
Thus, the total energy density of all loops in R[i, j] where (i, j) is a base
pair is computed by Equation 1.5:
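\[
NE_P(i, j) = \min
\begin{cases}
E_H(i, j) \\
E_S(i, j) + NE_P(i + 1, j - 1) \\
\min\limits_{i < i' < j' < j} \left\{ E_B(i, j, i', j') + NE_P(i', j') \right\} \\
\min\limits_{i < i'_1 < j'_1 < i'_2 < j'_2 < \cdots < i'_k < j'_k < j} \left\{ E_J(i, j, i'_1, j'_1, i'_2, j'_2, \ldots, i'_k, j'_k) + \sum_{r=1}^{k} NE_P(i'_r, j'_r) \right\}
\end{cases}
\tag{1.5}
\]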
That is, the energy density is calculated by taking the minimum of the
following four cases:
i. (i, j) is the exterior pair of a hairpin, in which case the energy density
NEP (i, j) equals EH(i, j), which is the energy density of the hairpin
ii. (i, j) is the exterior pair of a stack, in which case NEP (i, j) equals the
energy density of the stack, i.e., ES(i, j), plus NEP (i + 1, j − 1)
iii. (i, j) is the exterior pair of a bulge or an internal loop, in which case
NEP (i, j) equals the minimum of the energy density of the bulge or
internal loop EB(i, j, i′, j′) plus NEP (i′, j′) for all i < i′ < j′ < j
iv. (i, j) is the exterior pair of a multibranched loop, in which case NEP (i, j)
equals the minimum of the energy density of the multibranched loop
EJ(i, j, i′_1, j′_1, i′_2, j′_2, ..., i′_k, j′_k) plus the sum of NEP (i′_r, j′_r) over
r = 1, ..., k, for all i < i′_1 < j′_1 < i′_2 < j′_2 < ··· < i′_k < j′_k < j
Equation 1.6 below shows the recurrence formula for calculating NE(i, j):
\[
NE(i, j) = \min
\begin{cases}
NE(i, j - 1) \\
NE(i + 1, j) \\
NE_P(i, j) \\
\min\limits_{i < h < j} \left\{ NE(i, h - 1) + NE(h, j) \right\}
\end{cases}
\tag{1.6}
\]
FIGURE 1.3: Illustration of the cases in Equation 1.6. (a) NE(i, j − 1): the total
normalized energy of all loops in the optimal secondary structure R[i, j − 1] of
subsequence S[i, j − 1]; (b) NE(i + 1, j): the total normalized energy of all loops
in the optimal secondary structure R[i + 1, j] of subsequence S[i + 1, j]; (c)
NEP(i, j): the total normalized energy of all loops in the optimal secondary
structure R[i, j] of subsequence S[i, j], where S[i] and S[j] form a base pair; (d)
the minimum of NE(i, h − 1) plus NE(h, j) for all i < h < j. The dashed line
between two nucleotides means that the two nucleotides may or may not form a
base pair. The solid line between two nucleotides means that the two
nucleotides form a base pair.
That is, the energy density is computed by taking the minimum of the
following four cases:
i. The total energy density of all loops in the optimal secondary structure
R [i, j − 1] of subsequence S [i, j − 1] (Figure 1.3a)
ii. The total energy density of all loops in the optimal secondary structure
R [i + 1, j] of subsequence S [i + 1, j] (Figure 1.3b)
iii. The total energy density of all loops in the optimal secondary structure
R[i, j] of subsequence S[i, j], where S[i] and S[j] form a base pair (Figure
1.3c)
iv. The minimum of NE(i, h − 1) plus NE(h, j) for all i < h < j (Figure
1.3d)
Note that case (iii) of Equation 1.6 is not considered when the nucleotides at
positions i, j are forbidden to form a base pair, i.e., (S[i], S[j]) is a nonstandard
base pair. A standard base pair is any of the following: (A,U), (U,A), (G,C),
(C,G), (G,U), (U,G); all other base pairs are nonstandard.
In calculating the time complexity of the folding algorithm, we must account for
the search for the optimal i′, j′ where i < i′ < j′ < j in case (iii) (the
optimal i′_1, j′_1, i′_2, j′_2, ..., i′_k, j′_k where
i < i′_1 < j′_1 < i′_2 < j′_2 < ··· < i′_k < j′_k < j
in case (iv), respectively) of Equation 1.5. It can be shown that it takes linear
time to compute NEP (i, j) in Equation 1.5. Hence, the time complexity of
the folding algorithm is O(n^3) since we need to calculate NEP (i, j) for all
1 ≤ i < j ≤ n, where n is the number of nucleotides in the given sequence S.
The energy density of the optimal secondary structure R for the sequence S
equals NE(1, n).
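The recurrence in Equation 1.6 can be sketched as a table fill. This is an illustration under simplifying assumptions, not the RSpredict implementation: the NEP values are taken as a precomputed input (a dictionary keyed by 1-based position pairs, with forbidden pairs simply absent), the function name `fill_ne` is hypothetical, and no traceback pointers are kept.

```python
import math


def fill_ne(n, nep):
    """Fill the NE table of Equation 1.6 for a sequence of length n.

    `nep` maps 1-based pairs (i, j) to NE_P(i, j); pairs that are
    forbidden (or not yet evaluated) are absent and count as +infinity.
    Returns a table `ne` such that ne[i][j] = NE(i, j).
    """
    # Subsequences with fewer than two bases contain no loops: NE = 0.
    ne = [[0.0] * (n + 2) for _ in range(n + 2)]
    for span in range(1, n):
        for i in range(1, n - span + 1):
            j = i + span
            best = min(
                ne[i][j - 1],               # case (i)
                ne[i + 1][j],               # case (ii)
                nep.get((i, j), math.inf),  # case (iii)
            )
            for h in range(i + 1, j):       # case (iv): i < h < j
                best = min(best, ne[i][h - 1] + ne[h][j])
            ne[i][j] = best
    return ne
```

Given the NEP values, the three nested loops over span, i and h make the cubic running time visible. For instance, with NEP(1, 4) = −0.5, NEP(5, 10) = −0.7 and NEP(1, 10) = −0.9, the split at h = 5 gives NE(1, 10) = −1.2, which beats pairing positions 1 and 10.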
1.2.2 Calculation of covariance scores
When applying the above folding algorithm to a multiple sequence alignment
Ao, we take into consideration the correlation between columns of the
alignment. In many cases, the sequences in the alignment may have highly
varying lengths. We refine the alignment Ao by deleting columns containing
more than 75% gaps to get a refined alignment A [28]. We will use this refined
alignment throughout the rest of this subsection.
1.2.2.1 Covariance score
We use the covariance score introduced by RNAalifold [25, 26, 34] to
quantify the relationship between two columns in the refined alignment. Let
fij(XY ) be the frequency of finding both base X in column i and base Y
in column j, where X, Y are in the same row of the refined alignment. We
exclude the occurrences of gaps in column i or column j when calculating
fij(XY). The covariation measure for columns i, j, denoted Cij, is calculated
by Equation 1.7:
\[ C_{ij} = \frac{\sum_{XY,\, X'Y'} f_{ij}(XY)\, D_{ij}(XY, X'Y')\, f_{ij}(X'Y')}{2} \tag{1.7} \]
Here, Dij(XY, X′Y′) is the Hamming distance between the two base pairs
(X, Y) and (X′, Y′) if both of the base pairs are standard base pairs, or 0
otherwise. The Hamming distance between (X, Y) and (X′, Y′) is calculated
as follows:
\[ D_{ij}(XY, X'Y') = 2 - \delta(X, X') - \delta(Y, Y') \tag{1.8} \]
where
\[ \delta(X, X') = \begin{cases} 1 & \text{if } X = X' \\ 0 & \text{otherwise} \end{cases} \tag{1.9} \]
Observe that the information acquired from the two base pairs (X, Y)
and (X′, Y′) is the same as that from (X′, Y′) and (X, Y). Thus, we divide
the numerator in Equation 1.7 by two so as to obtain the non-redundant
information between column i and column j in the refined alignment.
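Equations 1.7 through 1.9 can be sketched as below. The function names and 0-based column indices are illustrative choices. One point is an assumption: the text says gaps are excluded when computing the frequencies but does not state the normalizer, so this sketch divides by the number of rows that have no gap in either column.

```python
from collections import Counter

STANDARD_PAIRS = {
    ("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G"),
}
GAP_CHARS = "-."  # assumption: both symbols denote gaps


def pair_frequencies(alignment, i, j):
    """f_ij(XY) over rows with no gap in column i or column j."""
    counts = Counter(
        (row[i], row[j])
        for row in alignment
        if row[i] not in GAP_CHARS and row[j] not in GAP_CHARS
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in counts.items()}


def hamming(pair_a, pair_b):
    """Equation 1.8: 2 - delta(X, X') - delta(Y, Y')."""
    return (pair_a[0] != pair_b[0]) + (pair_a[1] != pair_b[1])


def covariation(alignment, i, j):
    """Equation 1.7; D is 0 unless both base pairs are standard."""
    freqs = pair_frequencies(alignment, i, j)
    total = 0.0
    for pair_a, f_a in freqs.items():
        for pair_b, f_b in freqs.items():
            if pair_a in STANDARD_PAIRS and pair_b in STANDARD_PAIRS:
                total += f_a * hamming(pair_a, pair_b) * f_b
    return total / 2
```

For two columns that read G-C in half of the rows and A-U in the other half, the two ordered combinations each contribute 0.5 × 2 × 0.5, so the measure is (0.5 + 0.5) / 2 = 0.5.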
For every pair of columns i, j in the refined alignment, the covariance score
of the two columns i and j, denoted Covij, is calculated in Equation 1.10:
\[ \mathrm{Cov}_{ij} = C_{ij} + c_1 \times NF_{ij} \tag{1.10} \]
Here, Cij is as defined in Equation 1.7, c1 is a user-defined coefficient (in
the study presented here, c1 has a value of −1), and
\[ NF_{ij} = \frac{NC_{ij}}{N} \tag{1.11} \]
where N is the total number of sequences and NCij is the total number of
conflicting sequences in the refined alignment. A conflicting sequence is one
that has a gap in column i or column j, or has a nonstandard base pair in the
columns i, j of the refined alignment. A sequence with gaps in both columns
i, j is not conflicting.
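The conflict rule and Equations 1.10 and 1.11 can be sketched as follows. The function names are hypothetical, the gap symbols are an assumption, and the covariation measure Cij is passed in as a number so that the sketch stays self-contained.

```python
STANDARD_PAIRS = {
    ("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G"),
}
GAP_CHARS = "-."  # assumption: both symbols denote gaps


def is_conflicting(x, y):
    """True if one column has a gap, or the two bases pair nonstandardly."""
    gap_x, gap_y = x in GAP_CHARS, y in GAP_CHARS
    if gap_x and gap_y:
        return False          # gaps in both columns: not conflicting
    if gap_x or gap_y:
        return True           # gap in exactly one column
    return (x, y) not in STANDARD_PAIRS


def conflict_fraction(alignment, i, j):
    """Equation 1.11: NF_ij = NC_ij / N (0-based column indices)."""
    conflicts = sum(is_conflicting(row[i], row[j]) for row in alignment)
    return conflicts / len(alignment)


def covariance_score(c_ij, nf_ij, c1=-1.0):
    """Equation 1.10 with the coefficient c1 = -1 used in the study."""
    return c_ij + c1 * nf_ij
```

Because c1 is negative, every conflicting sequence lowers the covariance score of the column pair.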
1.2.2.2 Pairing threshold
We say that column i and column j in the refined alignment can possibly
form a base pair if their covariance score is greater than or equal to a pairing
threshold; otherwise, column i and column j are forbidden to form a base pair.
The pairing threshold, η, used in RSpredict is calculated as follows.
It is known that, on average, 54% of the nucleotides in an RNA sequence
S are involved in the base pairs of its secondary structure [35]. We use this
information to calculate an alignment-dependent pairing threshold, observing
that the base pairs in the consensus secondary structure of a sequence alignment
represent the column pairs with the highest covariance scores. Given that
different structures contain different numbers of base pairs, we consider two
different percentages of columns, namely, 30% and 65%, in the sequence alignment.
For each percentage p, there are at most Tp possible base pairs, where
\[ T_p = \frac{(p \times n) \times (p \times n - 1)}{2} \tag{1.12} \]
and n is the number of columns in the sequence alignment.
Now, we calculate the covariance scores of all pairs of columns in the
given refined alignment, and sort the covariance scores in descending order.
We then select the top Tp largest covariance scores and store the covariance
scores in the set STp. Thus, the set ST0.65 contains the top largest covariance
scores that involve 65% of the columns in the refined alignment; the set ST0.30
contains the top largest covariance scores that involve 30% of the columns in
the refined alignment; and ST0.65 \ ST0.30 is the set difference that contains
covariance scores in ST0.65 but not in ST0.30 (see Figure 1.4). The pairing
threshold η used in RSpredict is calculated as the average of the covariance
scores in ST0.65 \ ST0.30, as shown in Equation 1.13:
\[ \eta = \frac{\sum_{\mathrm{Cov}_{ij} \in S_{T_{0.65}} \setminus S_{T_{0.30}}} \mathrm{Cov}_{ij}}{\left| S_{T_{0.65}} \setminus S_{T_{0.30}} \right|} \tag{1.13} \]
where the denominator is the cardinality of the set difference ST0.65 \ ST0.30.
FIGURE 1.4: Illustration of the pairing threshold computation. The pairing
threshold used in RSpredict is computed as the average of the covariance
scores inside the shaded area. The figure labels the sets ST0.30 and ST0.65 and
the cutoffs T0.30 and T0.65.
If the covariance score of columns i and j is greater than or equal to η,
then column i and column j can possibly form a base pair, and we refer to
(i, j) as a pairing column. If the covariance score of the columns i and j is
less than η, we will check the covariance scores of the immediate neighboring
column pairs of i, j to see if they are above a user-defined threshold [31] (in the
study presented here, this threshold is set to 0). The immediate neighboring
column pairs of i, j are i + 1, j − 1 and i − 1, j + 1. If the covariance scores
of both of the immediate neighboring column pairs of i, j are greater than or
equal to max{η, 0}, then (i, j) is still considered as a pairing column.
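The threshold computation and the neighbor rule can be sketched as below. Several details are assumptions made for illustration: the function names, rounding p × n down to a whole number of columns, clipping Tp to the number of available scores, and storing covariance scores in a dictionary keyed by column pairs.

```python
import math


def max_pairs(p, n):
    """Equation 1.12, with p * n rounded down (an assumption)."""
    m = math.floor(p * n + 1e-9)  # small epsilon guards against float error
    return m * (m - 1) // 2


def pairing_threshold(scores, n, p_low=0.30, p_high=0.65):
    """Equation 1.13: mean of the scores ranked between T_0.30 and T_0.65."""
    ranked = sorted(scores, reverse=True)
    t_low = min(max_pairs(p_low, n), len(ranked))
    t_high = min(max_pairs(p_high, n), len(ranked))
    band = ranked[t_low:t_high]   # scores in the larger set but not the smaller
    if not band:
        raise ValueError("no scores between the two cutoffs")
    return sum(band) / len(band)


def is_pairing_column(cov, i, j, eta, neighbor_threshold=0.0):
    """Apply the threshold, then the immediate-neighbor rule."""
    missing = float("-inf")
    if cov.get((i, j), missing) >= eta:
        return True
    cutoff = max(eta, neighbor_threshold)
    neighbors = [(i + 1, j - 1), (i - 1, j + 1)]
    return all(cov.get(pair, missing) >= cutoff for pair in neighbors)
```

For an alignment with n = 10 columns, T0.30 = 3 and T0.65 = 15, so η is the mean of the covariance scores ranked 4th through 15th.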
1.2.3 Algorithms for RSpredict
Given a refined multiple sequence alignment A with N sequences, let (i, j)
be a pairing column in A. Let X_i^S (Y_j^S, respectively) be the nucleotide at
position i (j, respectively) of the sequence S in the alignment A. (X_i^S, Y_j^S)
must be the exterior pair of some loop L in S. We use e(X_i^S, Y_j^S) to represent
the free energy of that loop L. If (X_i^S, Y_j^S) is a nonstandard base pair,
e(X_i^S, Y_j^S) = 0. We assign the pairing column (i, j) a pseudo-energy eij where
\[ e_{ij} = \frac{1}{N} \sum_{S \in A} e\left(X_i^S, Y_j^S\right) + c_2 \times \mathrm{Cov}_{ij} \tag{1.14} \]
Here, c2 is a user-defined coefficient (in the study presented here, c2 = −1).
Thus, every pairing column in the refined alignment A has a pseudo-energy.
We then apply the minimum energy density folding algorithm described in
the beginning of this section to the refined alignment A, treating each pairing
column in A as a possible base pair considered in the folding algorithm.
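Equation 1.14 can be sketched as follows. The `loop_energy` callable is a hypothetical stand-in: in RSpredict the free energy depends on the loop that the pair closes, whereas this illustration looks only at the two bases. The function name and the toy energies in the example are likewise assumptions.

```python
STANDARD_PAIRS = {
    ("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G"),
}


def pseudo_energy(column_i, column_j, loop_energy, cov_ij, c2=-1.0):
    """Equation 1.14 for one pairing column (i, j).

    `column_i` and `column_j` hold the bases of columns i and j, one per
    sequence. `loop_energy(x, y)` is a caller-supplied placeholder for the
    free energy of the loop closed by the pair (x, y).
    """
    total = 0.0
    for x, y in zip(column_i, column_j):
        if (x, y) in STANDARD_PAIRS:
            total += loop_energy(x, y)
        # Nonstandard pairs (including gaps) contribute 0.
    return total / len(column_i) + c2 * cov_ij


def toy_energy(x, y):
    """Made-up energies, for illustration only."""
    return -2.0 if {x, y} == {"G", "C"} else -1.0
```

With columns "GAGA" and "CUCA", the toy energies sum to −5 over four sequences, and a covariance score of 0.5 with c2 = −1 gives a pseudo-energy of −1.25 − 0.5 = −1.75. A high covariance score therefore lowers the pseudo-energy and favors the column pair.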
Notice that when calculating the energy density for the loop L, the sequence
S is in the refined alignment A, which may have fewer columns than
the original input alignment Ao (cf. Figure 1.2). RSpredict computes all energy
densities based on the refined alignment, and the program uses loop lengths
from the refined alignment A rather than the original input alignment Ao.
Let R be the consensus secondary structure, computed by RSpredict, for the
refined alignment A. We obtain the consensus structure Ro of the original
input alignment Ao by inserting unpaired gaps to the positions in R whose
corresponding columns are deleted when getting A from Ao (cf. Figure 1.2).
The following summarizes the algorithms for RSpredict:
1. Input an alignment Ao in the FASTA or ClustalW format.
2. Delete the columns with more than 75% gaps from Ao to obtain a refined
alignment A.
3. Compute the pseudo-energy eij for every pairing column (i, j) in A as
in Equation 1.14.
4. Run the minimum energy density folding algorithm on A, using the
pseudo-energy values obtained from step (3) to produce the consensus
secondary structure R of the refined alignment A. The base at position i
of the consensus secondary structure R is the most frequently occurring
nucleotide, excluding gaps, in the ith column of the refined alignment A.
5. Map the consensus structure R back to the original alignment Ao by
inserting unpaired gaps at the positions of R whose corresponding columns
are deleted in Step (2).
Notice that Equation 1.6 is used to compute the NE values only. To
generate the optimal structure R in Step (4), we maintain a stack of pointers
that point to the substructures of loops with minimum energy density as we
compute the NE values. Once all the NE values are calculated and the energy
density of the optimal secondary structure R is obtained, we pop the
pointers from the stack to extract the optimal predicted structure. In step (5), we
map the bases (base pairs, respectively) for the columns (column pairs,
respectively) in A to their corresponding columns (column pairs, respectively) in Ao.
For example, consider Figure 1.2 again. In the figure, the refined alignment A
is obtained by deleting column 4 from the original input alignment Ao. The
bases for columns 1, 2, 3, 4 in A are mapped to columns 1, 2, 3, 5 in Ao. The
base pair between column 1 and column 9 in A becomes the base pair between
column 1 and column 10 in Ao; the base pair between column 2 and column
8 in A becomes the base pair between column 2 and column 9 in Ao. An
unpaired gap is inserted to the position corresponding to the deleted column
4 in Ao.
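Step (5) can be sketched as below, using the Figure 1.2 example. The function name is hypothetical, and writing the inserted unpaired gap as `.` in the dot-bracket string is an assumption, since the text does not fix the output symbol.

```python
def map_back(structure, pairs, kept_columns, n_original, gap_symbol="."):
    """Map a consensus structure of A back onto the columns of Ao.

    `structure` is the dot-bracket string for A, `pairs` its 1-based base
    pairs, and `kept_columns[k]` is the 1-based column of Ao that column
    k + 1 of A came from. Deleted columns receive `gap_symbol`.
    """
    if len(structure) != len(kept_columns):
        raise ValueError("structure and kept_columns differ in length")
    original = [gap_symbol] * n_original
    for symbol, column in zip(structure, kept_columns):
        original[column - 1] = symbol
    mapped_pairs = [
        (kept_columns[i - 1], kept_columns[j - 1]) for i, j in pairs
    ]
    return "".join(original), mapped_pairs
```

Calling it with kept columns 1, 2, 3, 5, 6, 7, 8, 9, 10 maps the pair (1, 9) of A to (1, 10) of Ao and the pair (2, 8) to (2, 9), as in the example above.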
Let N be the number of sequences and n_o be the number of columns in the
input alignment Ao. Step (2) takes O(N n_o) time. Step (3) takes O(n_o^2) time.
Step (4) takes O(n_o^3) time. Step (5) takes O(n_o) time. Therefore, the time
complexity of RSpredict is O(N n_o + n_o^3), which is approximately O(n_o^3) as
N is usually much smaller than n_o.
1.3 Results
We conducted a series of experiments to evaluate the performance of
RSpredict and compared it with five related tools including KNetFold, Pfold,
RNAalifold, RSefold, and RSdfold. We tested these tools on Rfam [36]
sequence alignments with different similarities. The Rfam sequence alignments
come with consensus structures. For evaluation purposes, we used the Rfam
consensus structures as reference structures and compared them against the
consensus structures predicted by the six tools. The similarity of a sequence
alignment is determined by the average pairwise sequence identity (APSI) of
that alignment [6]. In the study presented here, a sequence alignment is of
high similarity if its APSI value is greater than 75%, is of medium similarity
if its APSI value is between 55% and 75%, or is of low similarity if its APSI
value is less than 55%. The data sets used in testing included 20 Rfam
sequence alignments of high similarity and 36 Rfam sequence alignments of low
and medium similarity. These data sets were chosen to form a collection of
sequence alignments with different (low, medium and high) APSI values,
different numbers of sequences, as well as different sequence alignment lengths.
More specifically, the data sets contained sequence alignments that ranged in
size from 2 to 160 sequences, in length from 33 to 262 nucleotides and had
APSI values ranging from 42% to 99%.
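The similarity classes can be sketched as follows. The thresholds come from the text; the APSI computation itself is an assumption, because the chapter refers to [6] for it. Here identity is counted over columns where neither sequence has a gap, and all function names are hypothetical.

```python
from itertools import combinations

GAP_CHARS = "-."  # assumption: both symbols denote gaps


def pairwise_identity(seq_a, seq_b):
    """Fraction of identical bases over columns without a gap in either row."""
    compared = matches = 0
    for x, y in zip(seq_a, seq_b):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        matches += x == y
    return matches / compared if compared else 0.0


def apsi(alignment):
    """Average pairwise sequence identity, in percent."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return 100.0 * sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


def similarity_class(apsi_percent):
    """High above 75%, medium from 55% to 75%, low below 55%."""
    if apsi_percent > 75:
        return "high"
    if apsi_percent >= 55:
        return "medium"
    return "low"
```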
The performance measures used in our study include sensitivity (SN ) and
selectivity (SL) [6], where
\[ SN = \frac{TP}{TP + FN} \tag{1.15} \]
\[ SL = \frac{TP}{TP + (FP - \xi)} \tag{1.16} \]
Here, TP is the number of correctly predicted base pairs (“true positives”),
FN is the number of base pairs in a reference structure that were not predicted
(“false negatives”) and FP is the number of incorrectly predicted base pairs
(“false positives”). False positives are classified as inconsistent, contradicting
or compatible [6]. When predicting the consensus secondary structure for a
multiple sequence alignment, a predicted base pair (i, j) is inconsistent if
column i in the alignment is paired with column q, q ≠ j, or column j is paired
with column p, p ≠ i, and p, q form a base pair in the reference structure of the
alignment. A base pair (i, j) is contradicting if there exists a base pair (p, q) in
the reference structure of the alignment, such that i < p < j < q. A base pair
(i, j) is compatible if it is a false positive but is neither inconsistent nor
contradicting. The ξ in SL represents the number of compatible base pairs, which are
considered neutral with respect to algorithmic accuracy. Therefore ξ is
subtracted from FP. Finally, we used the Matthews correlation coefficient (MCC)
to combine the sensitivity and selectivity, where MCC is approximated by the
geometric mean of the two measures, i.e., MCC ≈ √(SN × SL) [18]. The larger
MCC, SN, SL values a tool has, the better performance that tool achieves and
the more accurate that tool is.
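The three measures can be sketched as below. This is an illustration rather than the evaluation code used in the study: the function name is hypothetical, base pairs are 1-based tuples (i, j) with i < j, a false positive is treated as inconsistent when either of its columns is paired in the reference structure, and the contradicting test checks only the ordering i < p < j < q stated in the text.

```python
import math


def evaluate(predicted, reference):
    """Return SN, SL and the approximate MCC for two sets of base pairs."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)
    fn = len(reference - predicted)
    false_positives = predicted - reference
    paired_in_reference = {pos for pair in reference for pos in pair}

    compatible = 0
    for i, j in false_positives:
        # (i, j) is absent from the reference, so a paired i or j must
        # have a different partner there.
        inconsistent = i in paired_in_reference or j in paired_in_reference
        # Only the ordering given in the text is checked here.
        contradicting = any(i < p < j < q for p, q in reference)
        if not (inconsistent or contradicting):
            compatible += 1

    fp = len(false_positives)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sl_denominator = tp + (fp - compatible)
    sl = tp / sl_denominator if sl_denominator else 0.0
    return {
        "SN": sn,
        "SL": sl,
        "MCC": math.sqrt(sn * sl),
        "compatible": compatible,
    }
```

For a reference structure with pairs (1, 10), (2, 9), (3, 8) and a prediction with pairs (1, 10), (2, 9), (4, 7), (3, 6), the pair (4, 7) is compatible and (3, 6) is inconsistent, so SN = 2/3 and SL = 2/(2 + 2 − 1) = 2/3.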
1.3.1 Performance evaluation on Rfam alignments
of high similarity
The first data set consisted of seed alignments of high similarity taken
from 20 families in Rfam. The APSI values of these seed alignments ranged
from 77% to 99%. The alignments ranged in size from 2 to 160 sequences and
in length from 33 to 159 nucleotides. Table 1.1 presents the accession number,
description, number of sequences, and length of the seed alignment of each of
the 20 Rfam families used in the experiment; the last column gives the APSI
value of each seed alignment. The families are sorted, from top to bottom, in
ascending order of APSI. All six tools (RSpredict, KNetFold, RNAalifold,
Pfold, RSefold, and RSdfold) were tested on this data set.
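As a rough illustration of how an APSI (average pairwise sequence identity)
value can be obtained for an alignment, the sketch below averages the identity
over all sequence pairs. The exact identity definition used for the tables is
not given in this section, so the choice made here (matching columns divided by
columns in which at least one of the two sequences has a residue) is an
assumption, as are the function names and the toy alignment.

```python
from itertools import combinations

GAPS = set("-.")


def pairwise_identity(a, b):
    """Fraction of identical columns, ignoring columns that are gaps in both."""
    cols = [(x, y) for x, y in zip(a.upper(), b.upper())
            if not (x in GAPS and y in GAPS)]
    if not cols:
        return 0.0
    # Double-gap columns were removed, so x == y implies a matching residue.
    matches = sum(1 for x, y in cols if x == y)
    return matches / len(cols)


def apsi(alignment):
    """Average pairwise sequence identity of an alignment, in percent."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("APSI needs at least two aligned sequences")
    return 100.0 * sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical three-sequence alignment.
toy_alignment = ["GGCAU", "GGCAU", "GGC-U"]
print(f"APSI = {apsi(toy_alignment):.1f}%")
```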
The graphs in Figure 1.5 plot the MCC, SN, and SL values of each tool, sorted
in descending order. The X-axis therefore shows the rank of each value (MCC,
SN, or SL, respectively) from highest to lowest. For example, number 1 on the
X-axis corresponds to the highest score achieved by each tool. The Y-axis
represents the MCC, SN, and SL, respectively.
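The rank-ordered curves described above can be produced by sorting each tool's
scores independently. The following minimal sketch uses made-up scores and
hypothetical tool names; it does not reproduce the data behind the figure.

```python
def rank_curve(scores):
    """Return (rank, score) points with scores sorted from highest to lowest."""
    ordered = sorted(scores, reverse=True)
    return list(enumerate(ordered, start=1))


# Hypothetical per-family MCC values for two tools.
mcc_by_tool = {
    "tool_a": [0.5, 0.9, 0.7],
    "tool_b": [0.8, 0.4, 0.6],
}
curves = {tool: rank_curve(values) for tool, values in mcc_by_tool.items()}
for tool, points in curves.items():
    print(tool, points)
```

Because every tool is sorted on its own, a given rank on the X-axis may
correspond to different families for different tools; the curves compare score
distributions rather than family-by-family results.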
It can be seen from Figure 1.5 that RSpredict performed the best while
RSdfold performed the worst among the six tools. The Pfold tool had good
performance in selectivity but did not perform well in sensitivity and as a
result in MCC. It also suffered from a size limitation (the Pfold web server
can accept a multiple alignment of up to 40 sequences). Only 17 out of the 20
sequence alignments used in the experiment were accepted by the Pfold server;
the other three alignments (RF00386, RF00041, and RF00389) had more than
40 sequences and therefore could not be run on the Pfold server. RSpredict
performed stably, with the best mean MCC of 0.85 (standard deviation 0.16),
while the MCC values of the other methods varied more widely, with means
ranging from 0.37 to 0.82 and standard deviations ranging from 0.24 to 0.34.
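The effect of the Pfold size limit can be checked directly against the sequence
counts listed in Table 1.1. The sketch below is an illustration only; the
variable and function names are hypothetical, while the counts and the limit of
40 sequences are taken from the text.

```python
PFOLD_MAX_SEQUENCES = 40  # size limit of the Pfold web server stated in the text

# Number of sequences per seed alignment, copied from Table 1.1.
TABLE_1_1_SEQUENCES = {
    "RF00460": 8, "RF00326": 8, "RF00560": 38, "RF00453": 12,
    "RF00386": 160, "RF00421": 9, "RF00302": 8, "RF00465": 20,
    "RF00501": 14, "RF00041": 60, "RF00575": 4, "RF00362": 16,
    "RF00105": 23, "RF00467": 23, "RF00389": 42, "RF00384": 7,
    "RF00098": 22, "RF00607": 2, "RF00320": 2, "RF00318": 3,
}


def split_by_size_limit(n_sequences, limit=PFOLD_MAX_SEQUENCES):
    """Split accession numbers into those within and those above the limit."""
    accepted = [acc for acc, n in n_sequences.items() if n <= limit]
    rejected = [acc for acc, n in n_sequences.items() if n > limit]
    return accepted, rejected


accepted, rejected = split_by_size_limit(TABLE_1_1_SEQUENCES)
print(len(accepted), "alignments accepted; rejected:", rejected)
```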
TABLE 1.1: Rfam alignments of high similarity.
Accession  Description                                               Number of sequences  Length  APSI
RF00460    U1A polyadenylation inhibition element (PIE)              8                    75      77%
RF00326    Small nucleolar RNA Z155                                  8                    81      79%
RF00560    Small nucleolar RNA SNORA17                               38                   132     82%
RF00453    Cardiovirus cis-acting replication element (CRE)          12                   33      82%
RF00386    Enterovirus 5′ cloverleaf cis-acting replication element  160                  91      83%
RF00421    Small nucleolar RNA SNORA32                               9                    122     84%
RF00302    Small nucleolar RNA SNORA65                               8                    130     84%
RF00465    Japanese encephalitis virus (JEV) hairpin structure       20                   60      86%
RF00501    Rotavirus cis-acting replication element (CRE)            14                   68      87%
RF00041    Enteroviral 3′ UTR element                                60                   123     87%
RF00575    Small nucleolar RNA SNORD70                               4                    88      89%
RF00362    Pospiviroid RY motif stem loop                            16                   79      92%
RF00105    Small nucleolar RNA SNORD115                              23                   82      92%
RF00467    Rous sarcoma virus (RSV) primer binding site (PBS)        23                   75      93%
RF00389    Bamboo mosaic virus satellite RNA cis-regulatory element  42                   159     93%
RF00384    Poxvirus AX element late mRNA cis-regulatory element      7                    62      93%
RF00098    Snake H/ACA box small nucleolar RNA                       22                   150     93%
RF00607    Small nucleolar RNA SNORD98                               2                    67      98%
RF00320    Small nucleolar RNA Z185                                  2                    86      98%
RF00318    Small nucleolar RNA Z175                                  3                    81      99%

1.3.2 Performance evaluation on Rfam alignments of medium and low similarity
In the second experiment, we compared RSpredict with the other five
methods on multiple sequence alignments of low and medium similarity. The
test dataset included seed alignments of 36 families taken from Rfam [36].
The APSI values of the seed alignments ranged from 42% to 75%, the number
of sequences in the alignments ranged from 3 to 114, and the alignment lengths
ranged from 43 to 262 nucleotides. Table 1.2 presents the accession number,
description, number of sequences, and length of the seed alignment of each
of the 36 Rfam families used in the experiment; the last column gives the APSI
value of each seed alignment. The families are sorted, from top to bottom, in
ascending order of APSI.
[Figure 1.5, plots not reproduced: three panels titled Matthews correlation
coefficient, Sensitivity, and Selectivity. X-axis: rank, 1 to 20. Y-axis:
score, 0.00 to 1.20. One curve per tool: KNetFold, Pfold, RNAalifold, RSefold,
RSdfold, RSpredict.]
FIGURE 1.5: Comparison of the MCC, SN, and SL values of the six tools
under analysis on the seed alignments of high similarity taken from the 20
families listed in Table 1.1.
TABLE 1.2: Rfam alignments of low and medium similarity.
Accession  Description                                           Number of sequences  Length  APSI
RF00230    T-box leader                                          103                  262     42%
RF00080    yybP-ykoY leader                                      50                   131     44%
RF00515    PyrR binding site                                     72                   125     47%
RF00557    Ribosomal protein L10 leader                          66                   149     48%
RF00504    Glycine riboswitch                                    93                   111     50%
RF00029    Group II catalytic intron                             114                  94      52%
RF00458    Cripavirus internal ribosome entry site (IRES)        7                    203     54%
RF00559    Ribosomal protein L21 leader                          33                   81      54%
RF00234    glmS glucosamine-6-phosphate activated ribozyme       11                   218     55%
RF00556    Ribosomal protein L19 leader                          24                   43      55%
RF00519    suhB                                                  13                   80      56%
RF00379    ydaO/yuaA leader                                      25                   150     58%
RF00380    ykoK leader                                           36                   172     59%
RF00445    mir-399 microRNA precursor family                     13                   119     59%
RF00522    PreQ1 riboswitch                                      22                   47      59%
RF00095    Pyrococcus C/D box small nucleolar RNA                25                   59      60%
RF00442    ykkC-yxkD leader                                      11                   111     60%
RF00430    Small nucleolar RNA SNORA54                           5                    134     60%
RF00521    SAM riboswitch (alpha-proteobacteria)                 12                   79      61%
RF00049    Small nucleolar RNA SNORD36                           20                   82      63%
RF00513    Tryptophan operon leader                              11                   100     63%
RF00309    Small nucleolar RNA snR60/Z15/Z230/Z193/J17           23                   106     63%
RF00451    mir-395 microRNA precursor family                     21                   112     64%
RF00464    mir-92 microRNA precursor family                      33                   80      64%
RF00507    Coronavirus frameshifting stimulation element         23                   85      66%
RF00388    Qa RNA                                                5                    103     70%
RF00357    Small nucleolar RNA R44/J54/Z268 family               19                   105     70%
RF00434    Luteovirus cap-independent translation element (BTE)  17                   108     71%
RF00525    Flavivirus DB element                                 111                  76      71%
RF00581    Small nucleolar SNORD12/SNORD106                      8                    91      71%
RF00238    ctRNA                                                 48                   88      72%
RF00477    Small nucleolar RNA snR66                             5                    105     72%
RF00608    Small nucleolar RNA SNORD99                           3                    80      72%
RF00468    Hepatitis C virus stem-loop VII                       110                  66      74%
RF00489    ctRNA                                                 14                   80      74%
RF00113    QUAD RNA                                              14                   150     75%
The MCC, SN, and SL values are sorted in descending order for each tool
under analysis and placed in the graphs in Figure 1.6. The X-axis shows, there-
fore, the rank of the MCC (SN and SL, respectively) from highest to lowest.
For example, number 1 in the X-axis corresponds to the highest score achieved
by each tool. The Y-axis represents the MCC, SN, and SL, respectively.
[Figure 1.6, plots not reproduced: three panels titled Matthews correlation
coefficient, Sensitivity, and Selectivity. X-axis: rank, with ticks from 1 to
35. Y-axis: score, 0.00 to 1.20. One curve per tool: KNetFold, Pfold,
RNAalifold, RSefold, RSdfold, RSpredict.]
FIGURE 1.6: Comparison of the MCC, SN, and SL values of the six tools
under analysis on the seed alignments of low and medium similarity taken
from the 36 families listed in Table 1.2.
Comparing Figures 1.5 and 1.6, we see that the methods under analysis
generally performed better on sequence alignments of medium and low
similarity than on sequence alignments of high similarity. As in the previous
experiment (cf. Figure 1.5), RSdfold performed the worst. The structures
predicted by RSdfold tend to be stem-like structures; therefore, many
structures, particularly those containing multibranched loops, were
mispredicted. For this reason, RSdfold yielded very low MCC, SN, and SL values.
RSpredict outperformed the other five methods based on the three per-
formance measures used in the experiment. The tool achieved a high mean
value of 0.94 in MCC, better than those of KNetFold (0.86), Pfold (0.88)
and RNAalifold (0.89). Similar results were observed for sensitivity and se-
lectivity values. Furthermore, RSpredict exhibited stable performance across
all the families tested in the experiment. The tool had an MCC, SN and SL
standard deviation of 0.08, 0.09 and 0.08, respectively. These numbers were
better than the standard deviation values obtained from the other five meth-
ods, which ranged from 0.11 to 0.34. Pfold suffered from a size limitation; it
could not generate a structure for the large seed alignments with more than
40 sequences in 9 families, including RF00230, RF00080, RF00515, RF00557,
RF00504, RF00029, RF00525, RF00238 and RF00468.
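The stability comparison above rests on per-tool means and standard deviations
of the three measures. A minimal sketch of such a summary is given below. The
scores are invented, the use of the sample standard deviation is an assumption,
and skipping families for which a tool produced no structure (marked None) is
also an assumption; the text does not say how such cases were averaged.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) over the available scores."""
    valid = [s for s in scores if s is not None]
    if len(valid) < 2:
        raise ValueError("need at least two scores to estimate a deviation")
    return mean(valid), stdev(valid)


# Hypothetical per-family MCC values; None marks a family with no prediction.
hypothetical_mcc = {
    "tool_a": [0.9, 1.0, 0.8],
    "tool_b": [0.9, None, 0.5, 0.7],
}
summary = {tool: summarize(values) for tool, values in hypothetical_mcc.items()}
for tool, (avg, sd) in summary.items():
    print(f"{tool}: mean={avg:.2f} sd={sd:.2f}")
```

A smaller standard deviation at a comparable mean is what the text refers to as
stable performance across families.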
1.4 Conclusions
In this chapter we presented a software tool, called RSpredict, capable
of predicting the consensus secondary structure for a set of aligned RNA
sequences via energy density minimization and covariance score calculation.
Our experimental results showed that RSpredict is competitive with some
widely used tools, including RNAalifold and Pfold, on the tested datasets,
suggesting that RSpredict is a viable choice when biologists need to predict
RNA secondary structures of multiple sequence alignments, especially those with
low and medium similarity. Note that RSpredict differs from KNetFold [31]
in that KNetFold is a machine learning method that relies on precompiled
training data derived from existing RNA secondary structures. RSpredict, on
the other hand, is based on a dynamic programming algorithm for folding
sequences and does not utilize training data.
Given a multiple sequence alignment Ao, our work is focused on predicting
the consensus structure of the aligned sequences in Ao, rather than folding each
individual sequence in Ao. Our approach is to first transform Ao to a refined
alignment A by deleting columns with more than 75% gaps from Ao, then pre-
dict the consensus structure for A, and finally extend the consensus structure
by inserting gaps to the positions corresponding to the deleted columns in Ao
(cf. Figure 1.2). The predicted structure may not correspond exactly to any
individual sequence in the original alignment Ao. As an example, assume for
simplicity that Ao is the same as A, i.e., no columns are deleted when getting
A from Ao. Consider a particular sequence S in Ao. Assume that the position
(column) i of S has a gap due to the alignment with the other sequences in
Ao. On the other hand, the position i in the consensus structure of Ao has
the most frequently occurring nucleotide in column i of Ao, which cannot
be a gap. As a result, the consensus structure of Ao, which is at least one
nucleotide longer than S, cannot be mapped exactly back onto S. In future
work we plan to look into ways for improving on consensus structure predic-
tion. Possible ways include the utilization of evolutionary information [37],
more sophisticated models of covariance scoring, and training data for more
accurate pairing thresholds.
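The three-step procedure summarized above (delete gap-rich columns, predict on
the refined alignment, re-insert gaps at the deleted positions) can be sketched
as follows. This is an illustration under stated assumptions, not the RSpredict
implementation: the structure is assumed to be a dot-bracket string, the
prediction step itself is replaced by a hand-written placeholder, and all
function names are hypothetical. Only the 75% gap threshold and the rule that a
consensus position takes the most frequent nucleotide come from the text.

```python
from collections import Counter

GAPS = set("-.")


def refine_alignment(alignment, max_gap_fraction=0.75):
    """Delete columns whose gap fraction exceeds the threshold.

    Returns the refined alignment and the indices of the columns kept.
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    kept = [
        c for c in range(n_cols)
        if sum(seq[c] in GAPS for seq in alignment) / n_seqs <= max_gap_fraction
    ]
    refined = ["".join(seq[c] for c in kept) for seq in alignment]
    return refined, kept


def consensus_sequence(alignment):
    """Most frequent nucleotide per column; gaps are never chosen."""
    consensus = []
    for column in zip(*alignment):
        counts = Counter(ch for ch in column if ch not in GAPS)
        # "N" is a placeholder for an all-gap column, which cannot occur
        # after refinement with a threshold below 100%.
        consensus.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(consensus)


def extend_structure(structure, kept, n_cols, gap="-"):
    """Map a structure of the refined alignment back to the original columns."""
    extended = [gap] * n_cols
    for symbol, column in zip(structure, kept):
        extended[column] = symbol
    return "".join(extended)


# Hypothetical alignment: column 2 is a gap in 4 of 5 sequences (80% gaps).
original = ["GC-AGC", "GC-AGC", "GC-AGC", "GCUAGC", "GC-AGC"]
refined, kept = refine_alignment(original)
predicted = "((.))"  # placeholder standing in for the prediction on `refined`
extended = extend_structure(predicted, kept, n_cols=len(original[0]))
print(consensus_sequence(refined), extended)
```

The sketch also makes the mapping issue discussed above concrete: the consensus
sequence contains a nucleotide at every kept column, so it can be longer than
an individual sequence that has gaps in those columns.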
References
[1] Zuker, M. 2003. Mfold web server for nucleic acid folding and hybridiza-
tion prediction. Nucleic Acids Res. 31:3406–3415.
[2] Hofacker, I.L. 2003. Vienna RNA secondary structure server. Nucleic
Acids Res. 31:3429–3431.
[3] Shapiro, B.A., Kasprzak, W., Grunewald, C., Aman, J. 2006. Graphi-
cal exploratory data analysis of RNA secondary structure dynamics pre-
dicted by the massively parallel genetic algorithm. J. Mol. Graph. Model.
25:514–531.
[4] Bellamy-Royds, A.B., Turcotte, M. 2007. Can Clustal-style progressive
pairwise alignment of multiple sequences be used in RNA secondary struc-
ture prediction? BMC Bioinformatics 8:190.
[5] Horesh, Y., Doniger, T., Michaeli, S., Unger, R. 2007. RNAspa: a shortest
path approach for comparative prediction of the secondary structure of ncRNA
molecules. BMC Bioinformatics 8:366.
[6] Gardner, P.P., Giegerich, R. 2004. A comprehensive comparison of com-
parative RNA structure prediction approaches. BMC Bioinformatics
5:140.
[7] Alkan, C., Karakoc, E., Sahinalp, S.C., Unrau, P., Alexander, E., Zhang,
K., Buhler, J. 2006. RNA secondary structure prediction via energy
density minimization. In Proceedings of the Research in Computational
Molecular Biology (RECOMB), Springer Berlin/Heidelberg, Venice, Italy,
130–142.
[8] Xia, T., SantaLucia, J., Burkard, M.E., Kierzek, R., Schroeder, S.J.,
Jiao, X., Cox, C., Turner, D.H. 1998. Thermodynamic parameters for an
expanded nearest-neighbor model for formation of RNA duplexes with
Watson-Crick base pairs. Biochemistry 37:14719–14735.
[9] Xu, X., Yongmei, J., Stormo, G.D. 2007. RNA Sampler: a new sampling
based algorithm for common RNA secondary structure prediction and
structural alignment. Bioinformatics 23:1883–1891.
[10] Giegerich, R., Voss, B., Rehmsmeier, M. 2004. Abstract shapes of RNA.
Nucleic Acids Res. 32:4843–4851.
[11] Steffen, P., Voss, B., Rehmsmeier, M., Reeder, J., Giegerich, R. 2006.
RNAshapes: an integrated RNA analysis package based on abstract
shapes. Bioinformatics 22:500–503.
[12] Siebert, S., Backofen, R. 2005. MARNA: multiple alignment and consen-
sus structure prediction of RNAs based on sequence structure compar-
isons. Bioinformatics 21:3352–3359.
[13] Shapiro, B.A., Bengali, D., Kasprzak, W., Wu, J.C. 2001. RNA folding
pathway functional intermediates: their prediction and analysis. J. Mol.
Biol. 312:27–44.
[14] Khaladkar, M., Bellofatto, V., Wang, J.T.L., Tian, B., Shapiro, B.A.
2007. RADAR: a web server for RNA data analysis and research. Nucleic
Acids Res. 35:W300–W304.
[15] Liu, J., Wang, J.T.L., Hu, J., Tian, B. 2005. A method for aligning RNA
secondary structures and its application to RNA motif detection. BMC
Bioinformatics 6:89.
[16] Ji, Y., Xu, X., Stormo, G.D. 2004. A graph theoretical approach for pre-
dicting common RNA secondary structure motifs including pseudoknots
in unaligned sequences. Bioinformatics 20:1591–1602.
[17] Bafna, V., Tang, H., Zhang, S. 2006. Consensus folding of unaligned RNA
sequences revisited. J. Comput. Biol. 13:283–295.
[18] Gorodkin, J., Stricklin, S.L., Stormo, G.D. 2001. Discovering com-
mon stem-loop motifs in unaligned RNA sequences. Nucleic Acids Res.
29:2135–2144.
[19] Mathews, D.H., Turner, D.H. 2002. Dynalign: an algorithm for finding
the secondary structure common to two RNA sequences. J. Mol. Biol.
317:191–203.
[20] Holmes, I., Rubin, G.M. 2002. Pairwise RNA structure comparison with
stochastic context-free grammars. In Proceedings of the Pacific Sympo-
sium Biocomputing, Lihue, Hawaii, 163–174.
[21] Hofacker, I.L., Bernhart, S.H.F., Stadler, P.F. 2004. Alignment of RNA
base pairing probability matrices. Bioinformatics 20:2222–2227.
[22] Lindgreen, S., Gardner, P.P., Krogh, A. 2007. MASTR: multiple align-
ment and structure prediction of non-coding RNAs using simulated an-
nealing. Bioinformatics 23:3304–3311.
[23] Touzet, H., Perriquet, O. 2004. CARNAC: folding families of related
RNAs. Nucleic Acids Res. 32:W142–W145.
[24] Sankoff, D. 1985. Simultaneous solution of the RNA folding, alignment
and protosequence problems. SIAM J. Appl. Math. 45:810–825.
[25] Hofacker, I.L., Fekete, M., Stadler, P.F. 2002. Secondary structure pre-
diction for aligned RNA sequences. J. Mol. Biol. 319:1059–1066.
[26] Bernhart, S.H., Hofacker, I.L., Will, S., Gruber, A.R., Stadler, P.F. 2008.
RNAalifold: improved consensus structure prediction for RNA align-
ments. BMC Bioinformatics 9:474.
[27] Klein, R.J., Eddy, S.R. 2003. RSEARCH: finding homologs of single struc-
tured RNA sequences. BMC Bioinformatics 4:44.
[28] Knudsen, B., Hein, J. 2003. Pfold: RNA secondary structure prediction
using stochastic context-free grammars. Nucleic Acids Res. 31:3423–3428.
[29] Cary, R.B., Stormo, G.D. 1995. Graph-theoretic approach to RNA mod-
eling using comparative data. In Proceedings of the Third International
Conference on Intelligent Systems for Molecular Biology, AAAI Press,
Menlo Park, CA, 75–80.
[30] Tabaska, J.E., Cary, R.B., Gabow, H.N., Stormo, G.D. 1998. An RNA
folding method capable of identifying pseudoknots and base triples.
Bioinformatics 14:691–699.
[31] Bindewald, E., Shapiro, B.A. 2006. RNA secondary structure prediction
from sequence alignments using a network of k-nearest neighbor classi-
fiers. RNA 12:342–352.
[32] Mathews, D.H., Disney, M.D., Childs, J.L., Schroeder, S.J., Zuker, M.,
Turner, D.H. 2004. Incorporating chemical modification constraints into a
dynamic programming algorithm for prediction of RNA secondary struc-
ture. Proc. Natl. Acad. Sci. USA. 101:7287–7292.
[33] Mathews, D.H., Sabina, J., Zuker, M., Turner, D.H. 1999. Expanded se-
quence dependence of thermodynamic parameters provides robust pre-
diction of RNA secondary structure. J. Mol. Biol. 288:911–940.
[34] Lindgreen, S., Gardner, P.P., Krogh, A. 2006. Measuring covariation in
RNA alignments: physical realism improves information measures. Bioin-
formatics 22:2988–2995.
[35] Mathews, D.H., Banerjee, A.R., Luan, D.D., Eickbush, T.H., Turner,
D.H. 1997. Secondary structure model of the RNA recognized by the
reverse transcriptase from the R2 retrotransposable element. RNA 3:1–16.
[36] Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A., Eddy, S.R.
2003. Rfam: an RNA family database. Nucleic Acids Res. 31:439–441.
[37] Seemann, S.E., Gorodkin, J., Backofen, R. 2008. Unifying evolutionary
and thermodynamic information for RNA folding of multiple alignments.
Nucleic Acids Res. 36:6355–6362.
Chapter 2
Invariant Geometric Properties of Secondary Structure Elements in Proteins
Matteo Comin
University of Padua
Concettina Guerra
Georgia Institute of Technology and University of Padua
Giuseppe Zanotti
University of Padua
2.1 Introduction
2.1.1 The dilemma of protein folding
2.1.2 Protein classification and the discovery of hidden rules
2.2 The Use of Geometric Invariants and Hashing for a Simplified Representation of Secondary Structure Elements (SSEs)
2.2.1 Simplified representations of three-dimensional (3D) structures
2.2.2 Segment approximation of secondary structure element (SSE)
2.2.3 Building of the hash table for triplets of secondary structure element (SSE)
2.2.4 Building the hash table
2.3 The Use of Geometric Invariants for Three-Dimensional (3D) Structures Comparison
2.3.1 Retrieving similarity from the table
2.3.2 Pair-wise alignment of secondary structures
2.3.3 Ranking candidate proteins
2.3.4 Atomic superposition
2.3.5 Benchmark applications
2.4 Statistical Analysis of Triplets and Quartets of Secondary Structure Element (SSE)
2.4.1 Methodology for the analysis of angular patterns
2.4.2 Results of the statistical analysis
2.4.3 Selection of subsets containing secondary structure element (SSE) in close contact
2.5 Conclusions
References
2.1 Introduction
2.1.1 The dilemma of protein folding
Proteins and nucleic acids represent the two major classes of biological
macromolecules present in living organisms. They both are necessary to a cell
to perform most of its functions, but their role is profoundly different: whilst
in nucleic acids the information content is kept in the form of a string, i.e., it
resides in the linear sequence of the four bases, the most important aspect of
a protein (at least of the globular ones) is its three-dimensional (3D) architec-
ture. Using the 20 different amino acids that can constitute a protein (we are
neglecting here posttranslational modifications, which can be physiologically
very important, but are not relevant for the problem of folding), it is in
principle possible to build an impressive number of different sequences:∗
considering, for example, a polypeptide chain of only 100 amino acids, this
number is 20^100.
Only a very small fraction of these sequences is actually present in a cell. For
example, the genome of a simple gram-negative bacterium, like Escherichia
coli, codes for less than 2000 genes, whilst the genome of a complex organ-
ism, like a man, contains many more genes (according to different estimates,
between 20,000 and 30,000 genes) and consequently many more proteins. The
previous numbers drastically decrease if we consider tertiary structures. It is
in fact well known that the 3D structure of a protein is much more conserved
than its amino acid sequence, and proteins with different primary structure
can display the same fold.†
Quite often the same fold corresponds to the same
function, and this is one of the reasons why it is necessary to know the 3D
structure of a protein and not simply its amino acid sequence; but there are
also common protein folds that correspond to totally different functions. We
will not discuss here if the latter phenomenon has to be ascribed to conver-
gent or divergent evolution, but the practical consequence of this fact is the
relatively limited number of different protein folds present in nature. If we
consider the Protein Data Bank (PDB, http://www.rcsb.org), the database that
collects all the 3D structures of biological macromolecules experimentally
determined so far, either through X-ray or electron diffraction or NMR, there
are at present about 47,000 structures of proteins deposited. They correspond
to about 1,050 different folds according to SCOP (Murzin et al., 1995) or to
850 according to CATH (Orengo and Thornton, 2005). We do not know yet if
they can be considered representative of all the possible folds present in living
organisms: until some years ago it was estimated that the possible folds could
have been about 1000; since completely new folds have not been discovered in
the last four years, it is quite reasonable to assume that the number of folds
we know is probably quite close to the total number of existing ones. If
so, this means that in nature a limited number of 3D architectures have been
developed, and those are used to perform all the necessary functions of cells
and organisms. Interestingly, similar 3D folds can be present in proteins that
bear small or even undetectable sequence homology, but at present we are
not yet able, given an amino acid sequence unrelated to that of previously
known 3D structures, to predict with sufficient reliability which fold that
particular sequence will assume.
∗ The amino acid sequence is also called the primary structure. The organization
of a protein includes three other levels: the secondary structure considers how
the polypeptide chain folds on itself, forming pieces of repeated conformation;
the tertiary structure describes how secondary structure elements (SSEs)
organize in 3D space; the quaternary structure (which is not present in all
proteins) describes the organization of more than one polypeptide chain.
† The term “fold” is used to indicate the way SSEs are arranged in space and is roughly
a synonym of “tertiary structure.”
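To give a sense of scale for the sequence-space figure quoted in this section
(20 possible amino acids at each of 100 positions), the short check below
evaluates the number exactly with Python integers. The variable names are
illustrative only.

```python
from math import log10

N_AMINO_ACIDS = 20
CHAIN_LENGTH = 100

# Exact count of distinct sequences of the given length.
n_sequences = N_AMINO_ACIDS ** CHAIN_LENGTH
n_digits = len(str(n_sequences))
order_of_magnitude = CHAIN_LENGTH * log10(N_AMINO_ACIDS)

print(f"20^100 has {n_digits} digits, i.e. about 10^{order_of_magnitude:.1f}")
```

The result is on the order of 10^130, far larger than the gene counts and the
number of known folds mentioned in the text.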
2.1.2 Protein classification and the discovery of hidden rules
The concept of fold similarity is not exempt from ambiguities. Do protein
families really exist, or is it more likely that there is a sort of “continuum” of
similarities? The idea of grouping proteins into “families” according to their
fold similarity possibly derives from our need for classification and
categorization (Gibrat et al., 1996). Whilst in some cases two proteins clearly
share the same fold, in others this similarity is questionable, and, in fact,
different programs estimate different numbers of total folds and disagree on
whether some proteins belong to the same family (Figure 2.1). This need for categorization
has, however, a great practical relevance, both in structure prediction and in
function assignment. The experimental determination of the 3D structure of a
protein, either by X-ray or NMR, takes nowadays months or, in difficult cases,
years, while the sequences of entire genomes, and consequently of the proteins
coded by them, are determined at a very high rate.∗
In this respect, the ability
of predicting the 3D structure of a protein is of paramount importance. At the
same time, the recognition of structural similarities in proteins that present
limited or nonexistent sequence similarity can sometimes be used to assign a
biological role to a protein of unknown function, when its 3D structure has
been determined.
Sometimes similarity does not involve entire structures, but only a por-
tion of them: it is limited to a single domain, i.e., to a substructure that can
be defined as an independent structural unit inside a larger protein. In order
to detect similarities, at least in all cases that are not self-evident, the pa-
rameters and the algorithm used become relevant and can strongly influence
the final results. Different algorithms have been devised to compare and su-
perimpose protein structures, but none of them is completely free of failures.
Some impose the constraint of continuity of the matched atoms along the pri-
mary sequence, in other words preserve the sequential order of the matched
atoms; other methods try to minimize the so-called “soap-bubble area” be-
tween two structures, or involve other techniques, like lattice fitting (surveys
∗ At the time of writing of this chapter 680 genomes of bacteria
(http://www.ebi.ac.uk/genomes/bacteria.html) and 33 of eukaryotes
(http://www.ebi.ac.uk/2can/genomes/eukaryotes.html) are available, and many
others are in progress.
Other documents randomly have
different content
There had been loss of life here—no great amount as loss of life is
measured these times in this country, but attended by conditions
that made the disaster hideous and distressing. The blood of victims
still trickled in runlets between the paving stones where we walked,
and there were mangled bodies stretched on the floor of an
improvised morgue across the way—mainly bodies of poor working
women, and one, I heard, the body of a widow with half a dozen
children, who now would be doubly orphaned, since their father was
dead at the Front.
Back again at my hotel after a forenoon packed with curious
experiences, I found in my quarters a very badly scared
chambermaid, trying to tidy a room with fingers that shook. In my
best French, which I may state is the worst possible French, I was
trying to explain to her that the bombardment had probably ended—
and for a fact there had been a forty-minute lull in the new
frightfulness—when one of the shells struck and went off among the
trees and flowerbeds of a public breathing place not a hundred and
fifty yards away. With a shriek the maid fell on her knees and buried
her head, ostrich fashion, in a nest of sofa pillows.
I stepped through my bedroom window upon a little balcony in
time to see the dust cloud rise in a column and to follow with my
eyes the frenzied whirlings of a great flock of wood pigeons flighting
high into the air from their roosting perches in the park plot. The
next instant I felt a violent tugging at the back breadth of the
leather harness that I wore. Unwittingly, in her panic the maid had
struck upon the only possible use to which a Sam Browne belt may
be put—other than the ornamental, and that is a moot point among
fanciers of the purely decorative in the matter of military gearing for
the human form. By accident she had divined its one utilitarian
purpose. She had risen and with both hands had laid hold upon the
crosspiece of my main surcingle and was striving to drag me inside. I
rather gathered from the tenor of her contemporaneous remarks,
which she uttered at the top of her voice and into which she
interjected the names of several saints, that she feared the sight of
me in plain view on that stone ledge might incite the invisible
marauder to added excesses.
But I was the larger and stronger of the two, and my buckles held,
and I had the advantage of an iron railing to cling to. After a short
struggle my would-be rescuer lost. She turned loose of my kicking
straps and breech bands, and making hurried reference to various
names in the calendar of the canonised she fled from my presence. I
heard her falling down the stairs to the floor below. The next day I
had a new chambermaid; this one had tendered her resignation.
Not until the middle of the afternoon was the proper explanation
for the phenomenon forthcoming. It came then from the Ministry of
War, in the bald and unembroidered laconics of a formal
communiqué. At the first time of hearing it the announcement
seemed so inconceivable, so manifestly impossible that official
sanction was needed to make men believe Teuton ingenuity had
found a way to upset all the previously accepted principles touching
on gravity and friction; on arcs and orbits; on aims and directions;
on projectiles and projectives; on the resisting tensility of steel bores
and on the carrying power of gun charges—by producing a cannon
with a ranging scope of somewhere between sixty and ninety miles.
Days of bombardment followed—days which culminated on that
never-to-be-forgotten Good Friday when malignant chance sped a
shell to wreck one of the oldest churches in Paris and to kill seventy-
five and wound ninety worshippers gathered beneath its roof.
After the first flurry of uncertainty the populace for the most part
grew tranquil; now that they knew the origin of the far-flung
punishment there was measurably less dread of the consequences
among the masses of the people. On days when the shells exploded
futilely the daily press and the comedians in the music halls made
jokes at the expense of Big Bertha; as, for example, on a day when
a fragment of shell took the razor out of the hand of a man who was
shaving himself, without doing him the slightest injury; and again
when a whole shell wrecked a butcher shop and strewed the
neighbourhood with kidneys and livers and rib ends of beef, but
spared the butcher and his family. On days when the colossal piece
scored a murderous coup for its masters and took innocent life, the
papers printed the true death lists without attempt at concealment
of the ravages of the monster. And on all the bombardment days,
women went shopping in the Rue de la Paix; children played in the
parks; the flower women of the Madeleine sold their wares to
customers with the reverberations of the explosions booming in their
ears; the crowds that sat sipping coloured drinks at small tables in
front of the boulevard cafés on fair afternoons were almost as
numerous as they had been before the persistent thing started; and
unless the sound was very loud indeed the average promenader
barely lifted his or her head at each recurring report. In America we
look upon the French as an excitable race, but here they offered to
the world a pattern for the practice of fortitude.
A good many people departed from Paris to the southward.
However, there was calmness under constant danger. Our own
people, who were in Paris in numbers mounting up into the
thousands, likewise set a fine example of sang-froid. On the evening
of the opening day of the bombarding, when any one might have
been pardoned for being a bit jumpy, an audience of enlisted men
which packed the American Soldiers and Sailors' Club in the Rue
Royale was gathered to hear a jazz band play Yankee tunes and
afterward to hear an amateur speaker make an address. The cannon
had suspended its annoying performances with the going down of
the sun, but just as the speaker stood up by the piano the alerte for
an air attack—which, by the way, proved to be a false alarm, after all
—was heard outside.
There was a little pause, and a rustling of bodies.
Then the man, who was on his feet, spoke up. “I'll stay as long as
any one else does,” he said. “Anyhow, I don't know which is likely to
be the worse of two evils—my poor attempts at entertaining you
inside or the boche's threatened performances outside.”
A great yell of approval went up and not a single person left the
building until after the chairman announced that the programme for
the evening had reached its conclusion. I know this to be a fact
because I was among those present.
To be sure, the strain of the harassment got upon the nerves of
some; that would be inevitable, human nature being what it is.
Attendance at the theatres, especially for the matinées, fell off
appreciably; this, though, being attributable, I think, more to fear of
panic inside the buildings than to fear of what the missiles might do
to the buildings themselves. And there was no record of any
individual, whether man or woman, quitting a post of responsibility
because of the personal peril to which all alike were exposed.
Likewise on those days when the great gun functioned promptly at
twenty-minute intervals one would see men sitting in drinking places
with their eyes glued to the faces of their wrist watches while they
waited for the next crash. For those whose nerves lay close to their
skins this damnable regularity of it was the worst phase of the thing.
There was something so characteristically and atrociously German,
something so hellishly methodical in the tormenting certainty that
each hour would be divided into three equal parts by three
descending steel tubes of potential destruction.
Big Bertha operated on a perfect schedule. She opened up daily at
seven a. m. sharp; she quit at six-twenty p. m. It was as though the
crew that tended her carried union cards. They were never tardy.
Neither did they work overtime. But if the Prussians counted upon
bedeviling the people into panic and distracting the industrial and
social economies of Paris they missed their guess. They made some
people desperately unhappy, no doubt, and they frightened some;
but the true organism of the community remained serene and
unimpaired.
Some share of this, I figure, might be attributed to the facts that in
a city as great as Paris the chances of any one individual being killed
were so greatly reduced that the very size of the town served to
envelop its inhabitants with a sense of comparative immunity; the
number of buildings, and their massiveness inspired a feeling of
partial security. I know I felt safer than I have felt out in the open
when the enemy's playful batteries were searching out the terrain
round about. In a smaller city this condition probably would not have
been manifest to the same degree. There almost everybody would
be likely to know personally the latest victim or to be familiar with
the latest scene of damage and this would serve doubtlessly to bring
the apprehension home to all households. Howsoever, be the
underlying cause what it might, Paris weathered the brunt of the
ordeal with splendid fortitude and an admirable coolness.
Being frequently in Paris between visits to one or another sector of
the front, I was able to keep a fairly accurate score in the ravages of
the bombardment and to get a fairly average appraisal of the effects
upon the Parisian temper. Likewise by reading translated extracts out
of German newspapers I got impressions of another phase of the
tragedy which almost was as vivid as though I had been an eye
witness to events which I knew of only at second-hand from the
published descriptions of them.
I had the small advantage though on my side of being able to
visualise the setting in the Forest of St. Gobain, to the west of Laon,
for I was there once in German company. I could conjure up a
presentiment of the scene there enacted on the day when Big
Bertha's makers and masters sprang their well-guarded surprise,
which so carefully and so secretly had been evolved during months
of planning and constructing and experimentations. Behold then the
vision: It is a fine spring morning. There is dew on the grass and
there is song in the throats of the birds and young foliage is upon
the trees. The great grey gun—it is nearly ninety feet long and
according to inspired Teutonic chronicles resembles a vast metal
crone—squats its misshapen mass upon a prepared concrete base in
the edge of the woods, just on the timbered shoulder of a hill. Its
long muzzle protrudes at an angle from the interlacing boughs of the
thicket where it hides; at a very steep angle, too, since the charge it
will fire must ascend twenty miles into the air in order to reach its
objective. Behind it is a stenciling of white birches and slender
poplars flung up against the sky line; in front of it is a disused
meadow where the newly minted coinage of a prodigal springtime—
dandelions that are like gold coins and wild marguerites that are like
silver ones—spangle the grass as though the profligate season had
strewn its treasures broadcast there. The gunners make ready the
monster for its dedication. They open its great navel and slide into
its belly a steel shell nine inches thick and three feet long nearly and
girthed with beltings of spun brass. The supreme moment is at
hand.
From a group of staff officers advances a small man, grown old
beyond his time; this man wears the field uniform of a Prussian field
marshal. He has a sword at his side and spurs on his booted feet
and a spiked helmet upon his head. He has a withered arm which
dangles abortively, foreshortened out of its proper length. His hair is
almost snow-white and his moustache with its fiercely upturned and
tufted ends is white. From between slitted lids imbedded in his skull
behind unhealthy dropsical pouches of flesh his brooding, morbid
eyes show as two blue dots, like touches of pale light glinting on
twin disks of shallow polished agate. He bears himself with a mien
that either is imperial or imperious, depending upon one's point of
view.
While all about him bow almost in the manner of priests making
obeisance before a shrine, he touches with one sacred finger the
button of an electrical controller. The air is blasted and the earth
rocks then to the loudest crash that ever issued from the mouth of a
gun; for all its bulk and weight the cannon recoils on its carriage and
shakes itself; the tree tops quiver in a palsy. The young grass is
flattened as though by a sudden high wind blowing along the
ground; the frightened birds flutter about and are mute.
The bellowing echoes die away in a fainter and yet fainter
cadence. The-Anointed-of-God turns up his good wrist to consider
the face of the watch strapped thereon; his staff follow his royal
example. One minute passes in a sort of sacerdotal silence. There is
drama in the pause; a fine theatricalism in the interlude. Two
minutes, two minutes and a half pass. This is one part of the
picture; there is another part of it:
Seventy miles away in a spot where a busy street opens out into a
paved plaza all manner of common, ordinary work-a-day persons are
busied about their puny affairs. In addition to being common and
ordinary these folks do not believe in the divine right of kings; truly
a high crime and misdemeanour. Moreover, they persist in the
heretical practice of republicanism; they believe actually that all men
were born free and equal; that all men have the grace and the
authority within them to choose their own rulers; that all men have
the right to live their own lives free from foreign dictation and alien
despotism. But at this particular moment they are not concerned in
the least with politics or policies. Their simple day is starting. A
woman in a sidewalk kiosk is ranging morning papers on her narrow
shelf. A half-grown girl in a small booth set in the middle of the
square where the tracks of the tramway end, is selling street car
tickets to working men in blouses and baggy corduroy trousers.
Hucksters and barrow-men have established a small market along
the curbing of the pavement. A waiter is mopping the metal tops of
a row of little round tables under the glass marquee of a café. Wains
and wagons are passing with a rumble of wheels. Here there is no
drama except the simple homely drama of applied industry.
Three minutes pass: Far away to the north, where the woods are
quiet again and the birds have mustered up courage to sing once
more, The Regal One drops his arm and looks about him at his
officers, nodding and smiling. Smiling, they nod back in chorus, like
well-trained automatons. There is a murmur of interchanged
congratulations. The effort upon which so much invaluable time and
so much scientific thought have been expended, stands unique and
accomplished. Unless all calculations have failed the nine-inch shell
has reached its mark, has scored its bull's eye, has done its
predestined job.
It has; those calculations could not go wrong. Out of the kindly
and smiling heavens, with no warning except the shriek of its
clearing passage through the skies, the bolt descends in the busy
square. The glass awning over the café front becomes a darting rain
of sharp-edged javelins; the paving stones rise and spread in
hurtling fragments from a smoking crater in the roadway. There are
a few minutes of mad frenzy among those people assembled there.
Then a measure of quiet succeeds to the tumult. The work of rescue
starts. The woman who vended papers is a crushed mass under the
wreckage of her kiosk; the girl who sold car tickets is dead and
mangled beneath her flattened booth; the waiter who wiped the
table-tops off lies among his tables now, the whole crown of his
head sliced away by slivers of glass; here and there in the square
are scattered small motionless clumps that resemble heaps of
bloodied and torn rags. Wounded men and women are being carried
away, groaning and screaming as they go. But in the edge of the
woods at St. Gobain the Kaiser is climbing into his car to ride to his
headquarters. It is his breakfast-time and past it and he has a fine
appetite this morning. The picture is complete. The campaign for
Kultur in the world has scored another triumph, the said score
standing: Seven dead; fifteen injured.
CHAPTER XV. WANTED: A FOOL-PROOF WAR
There was a transportload of newly made officers coming over
for service here in France. There was on board one gentleman
in uniform who bore himself, as the saying goes, with an air.
By reason of that air and by reason of a certain intangible
atmospheric something about him difficult to define in words he
seemed intent upon establishing himself upon a plane far remote
from and inaccessible to these fellow voyagers of his who were
crossing the sea to serve in the line, or to act as interpreters, or to
go on staffs, or to work with the Red Cross or the Y. M. C. A. or the
K. of C. or what not. He had what is called the superior manner, if
you get what I mean—and you should get what I mean, reader, if
ever you had lived, as I have, for a period of years hard by and
adjacent to that particular stretch of the eastern seaboard of North
America where, as nowhere else along the Atlantic Ocean or in the
interior, are to be found in numbers those favoured beings who
acquire merit unutterable by belonging to, or by being distantly
related to, or by being socially acquainted with, the families that
have nothing but.
Nevertheless, and to the contrary notwithstanding, divers of his
brother travellers failed to keep their distance. Toward this
distinguished gentleman they deported themselves with a familiarity
and an offhandedness that must have been acutely distasteful to
one unaccustomed to moving in a mixed and miscellaneous
company.
Accordingly he took steps on the second day out to put them in
their proper places. A list was being circulated to get up a
subscription for something or other, and almost the very first person
to whom this list came in its rounds of the first cabin was the person
in question. He took out a gold-mounted fountain pen from his
pocket and in a fair round hand inscribed himself thus:
“Bejones of Tuxedo”
There were no initials—royalty hath not need for initials—but just
the family name and the name of the town so fortunate as to
number among its residents this notable—which names for good
reasons I have purposely changed. Otherwise the impressive
incident occurred as here narrated.
But those others just naturally refused to be either abashed or
abated. They must have been an irreverent, sacrilegious lot, by all
accounts. The next man to whom the subscription was carried took
note of the new fashion in signatures and then gravely wrote himself
down as “Spirits of Niter”; and the next man called himself “Henri of
Navarre”; and the third, it developed, was no other than “Cream of
Tartar”; and the next was “Timon of Athens”; and the next “Mother
of Vinegar”—and so on and so forth, while waves of ribald and
raucous laughter shook the good ship from stem to stern.
However, the derisive ones reckoned without their host. For them
the superior mortal had a yet more formidable shot in the locker. On
the following day he approached three of the least impressed of his
temporary associates as they stood upon the promenade deck, and
apropos of nothing that was being said or done at the moment he,
speaking in a clear voice, delivered himself of the following crushing
remark:
“When I was born there were only two houses in the city of New
York that had porte-cochères, and I—I was born in one of them.”
Inconceivable though it may appear, the fact is to be recorded that
even this disclosure failed to silence the tongues of ridicule aboard
that packet boat. Rather did it enhance them, seeming but to spur
the misguided vulgarians on and on to further evidences of
disrespect. There are reasons for believing that Bejones of Tuxedo,
who had been born in the drafty semipublicity of a porte-cochère,
left the vessel upon its arrival with some passing sense of relief,
though it should be stated that up until the moment of his
debarkation he continued ever, while under the eye of the plebes
and commoners about him, to bear himself after a mode and a port
befitting the station to which Nature had called him. He vanished
into the hinterland of France and was gone to take up his duties; but
he left behind him, among those who had travelled hither in his
company, a recollection which neither time nor vicissitude can
efface. Presumably he is still in the service, unless it be that ere now
the service has found out what was the matter with it.
I have taken the little story concerning him as a text for this
article, not because Bejones of Tuxedo is in any way typical of any
group or subgroup of men in our new Army—indeed I am sure that
he, like the blooming of the century plant, is a thing which happens
only once in a hundred years, and not then unless all the conditions
are salubrious. I have chosen the little tale to keynote my narrative
for the reason that I believe it may serve in illustration of a situation
that has arisen in Europe, and especially in France, these last few
months—a condition that does not affect our Army so much as it
affects sundry side issues connected more or less indirectly with the
presence on European soil of an army from the United States. Like
most of the nations having representative forms of government that
have gone into this war, we went in as an amateur nation so far as
knowledge of the actual business of modern warfare was concerned.
Like them, we have had to learn the same hard lessons that they
learned, in the same hard school of experience. Our national
amateurishness beforehand was not altogether to our discredit;
neither was it altogether to our credit. Nobody now denies that we
should have been better prepared for eventualities than we were. On
the other hand it was hardly to be expected that a peaceful
commercial country such as ours—which until lately had been
politically remote as it was geographically aloof upon its own
hemisphere from the political storm-centres of the Old World, and in
which there was no taint of the militarism that has been Germany's
curse, and will yet be her undoing—should in times of peace greatly
concern itself with any save the broad general details of the game of
war, except as a heart-moving spectacle enacted upon the stage of
another continent and viewed by us with sympathetic and sorrowing
eyes across three or four thousand miles of salt water. Prior to our
advent into it the war had no great appeal upon the popular
conscience of the United States. Out of the fulness of our hearts and
out of the abundance of our prosperity we gave our dollars, and
gave and gave and kept on giving them for the succour of the
victims of the world catastrophe; but a sense of the impending peril
for our own institutions came home to but few among us. Here and
there were individuals who scented the danger; but they were as
prophets crying in the wilderness; the masses either could not or
would not see it. They would not make ready against the evil days
ahead.
So we went into this most highly specialised industry, which war
has become, as amateurs mainly. Our Navy was no amateur navy, as
very speedily developed, and before this year's fighting is over our
enemy is going to realise that our Army is not an amateur army. We
may have been greenhorns at the trade wherein Germans were
experts by training and education; still we fancy ourselves as a
reasonably adaptable breed. But if the truth is to be told it must be
confessed that in certain of the allied branches of the business we
are yet behaving like amateurs. After more than a year of actual and
potential participation in the conflict we even now are doing things
and suffering things to be done which would make us the
laughingstock of our allies if they had time or temper for laughing. I
am not speaking of the conduct of our operations in the field or in
the camps or on the high seas. I am speaking with particular
reference to what might be called some of the by-products.
None of us is apt to forget, or cease to remember with pride, the
flood of patriotic sacrifice that swept our country in the spring of
1917. No other self-governing people ever adopted a universal draft
before their shores had been invaded and before any of their
manhood had fallen in battle. No other self-governing people ever
accepted the restrictions of a food-rationing scheme before any of
the actual provisions concerning that food-rationing scheme had
been embodied into the written laws. Other countries did it under
compulsion, after their resources showed signs of exhaustion. We
did it voluntarily; and it was all the more wonderful that we should
have done it voluntarily when all about us was human provender in
a prodigal fullness. There was plenty for our own tables.
By self-imposed regulations we cut down our supplies so that our
allies might be fed with the surplus thus made available. Outside of
a few sorry creatures there was scarcely to be found in America an
individual, great or small, who did not give, and give freely, of the
work of his or her heart and hands to this or that phase of the
mighty undertaking upon which our Government had embarked and
to which our President, speaking for us all, had solemnly dedicated
all that we were or had been or ever should be.
All sorts of commissions, some useful and important beyond
telling, some unutterably unuseful and incredibly unimportant,
sprang into being. And to and fro in the land, in numbers amounting
to a vast multitude, went the woman who wanted to do her part,
without having the least idea of what that part would be or how she
would go about doing it. She knew nothing of nursing; kitchen work,
a vulgar thing, was abhorrent to her nature and to her manicured
nails; she could not cook, neither could she sew or sweep—but she
must do her part.
She was not satisfied to stay on at home and by hard endeavour
to fit herself for helping in the task confronting every rational and
willing being between the two oceans. No, sir-ree, that would be too
prosaic, too commonplace an employment for her. Besides, the
working classes could attend to that job. She must do her part
abroad—either in France within sound of the guns or in racked and
desolated Belgium. Of course her intentions were good. The
intentions of such persons are nearly always good, because they
change them before they have a chance to go stale.
I think the average woman of this type had a mental conception
of herself wearing a wimple and a coif of purest white, in a frock
that was all crisp blue linen and big pearl buttons, with one red cross
blazing upon her sleeve and another on her cap, sitting at the side of
a spotless bed in a model hospital that was fragrant with flowers,
and ministering daintily to a splendid wounded hero with the face of
a demigod and the figure of a model for an underwear ad.
Preferably this youth would be a gallant aviator, and his wound
would be in the head so that from time to time she might adjust the
spotless bandage about his brow.
I used to wish sometimes when I met such a lady that I might
have drawn for her the picture of reality as I had seen it more times
than once—tired, earnest, competent women who slept, what sleep
they got, in lousy billets that were barren of the simplest comforts,
sleeping with gas masks under their pillows, and who for ten or
twelve or fifteen or eighteen hours on a stretch performed the most
nauseating and the most necessary offices for poor suffering
befouled men lying on blankets upon straw pallets in wrecked dirty
houses or in half-ruined stables from which the dung had hurriedly
been shoveled out in order to make room for suffering soldiers—
stables that reeked with the smells of carbolic and iodoform and with
much worse smells. It is an extreme case that I am describing, but
then the picture is a true picture, whereas the idealistic fancy
painted by the lady who just must do her part at the Front had no
existence except in the movies or in her own imagination.
It never occurred to her that there would be slop jars to be
emptied or filthy bodies, alive with crawling vermin, to be cleansed.
It never occurred to her that she would take up room aboard ship
that might better be filled with horse collars or hardtack or insect
powder; nor that while over here she would consume food that
otherwise would stay the stomach of a fighting man or a working
woman; nor that if ever she reached the battle zone she would
encounter living conditions appallingly bare and primitive beyond
anything she could conceive; nor that she could not care for herself,
and was fitted neither by training nor instinct to help care for any
one else.
When I left America last winter a great flow of national sanity had
already begun to rise above the remaining scourings of national
hysteria; and the lady whose portrait I have tried in the foregoing
paragraphs to sketch was not quite so numerous or so vociferous as
she had been in those first few exalted weeks and months following
our entrance into the war as a full partner in the greatest of
enterprises. My surprise was all the greater therefore to find that she
had beaten me across the water. She had pretty well disappeared at
home.
One typical example of this strange species crossed in the same
ship with me. Heaven alone knows what political or social influence
had availed to secure her passport for her. But she had it, and with it
credentials from an organisation that should have known better. She
was a woman of independent wealth seemingly, and her motives
undoubtedly were of the best; but as somebody might have said:
Good motives butter no parsnips, and hell is paved with buttered
parsnips. Her notion was to drive a car at the Front—an ambulance
or a motor truck or a general's automobile or something. She had
owned cars, but she had never driven one, as she confessed; but
that was a mere detail. She would learn how, some day after she got
to Europe, and then somebody or other would provide her with a car
and she would start driving it; such was her intention. Unaided she
could no more have wrested a busted tire off of a rusted rim than
she could have marcelled her own back hair; and so far as her
knowledge of practical mechanics went, I am sure no reasonably
prudent person would have trusted her with a nutpick; but she had
the serene confidence of an inspired and magnificent ignorance.
She had her uniform too. She had brought it with her and she
wore it constantly. She said she designed it herself, but I think she
fibbed there. No one but a Fifth Avenue mantuamaker of the sex
which used to be the gentler sex before it got the vote could have
thought up a vestment so ornate, so swagger and so complicated.
It was replete with shoulder straps and abounding in pleats and
gores and gussets and things. Just one touch was needed to make it
a finished confection: By rights it should have buttoned up the back.
The woman who had the cabin next to hers in confidence told a
group of us that she had it from the stewardess that it took the lady
a full hour each day to get herself properly harnessed into her
caparisons. Still I must say the effect, visually speaking, was worthy
of the effort; and besides, the woman who told us may have been
exaggerating. She was a registered and qualified nurse who knew
her trade and wore matter-of-fact garments and flat-heeled,
broad-soled shoes. She was not very exciting to look at, but she radiated
efficiency. She knew exactly what she would do when she got over
here and exactly how she would do it. We agreed among ourselves
that if we were in quest of the ornamental we would search out the
lady who meant to drive the car—provided there was any car; but
that if anything serious ailed any of us we would rather have the
services of one of the plain nursing sisterhood than a whole skating-
rinkful of the other kind round.
In the latter part of 1917 there landed in France a young woman
hailing from a Far Western city whose family is well known on the
Pacific Slope. She brought with her letters of introduction signed by
imposing names and a comfortable sum of money, which had been
subscribed partly out of her own pocket and partly out of the
pockets of well-meaning persons in her home state whom she had
succeeded in interesting in her particular scheme of wartime
endeavour. She was very fair to see and her uniform, by all
accounts, was very sweet to look upon, it being a horizon-blue in
colour with much braiding upon the sleeves and collar. It has been
my observation since coming over that when in doubt regarding
their vocations and their intentions these unattached lady zealots go
in very strongly for striking effects in the matter of habiliments.
Along the boulevards and in the tearooms I have encountered a
considerable number who appeared to have nothing to do except to
wear their uniforms.
However, this young person had no doubt whatever concerning
her motives and her purposes. The whole thing was all mapped out
in her head, as developed when she called upon a high official of our
Expeditionary Forces at his headquarters in the southern part of
France. She told him she had come hither for the express purpose of
feeding our starving aviators. He might have told her that so long as
there continued to be served fried potato chips free at the Crillon bar
there was but little danger of any airman going hungry, in Paris at
least. What he did tell her when he had rallied somewhat from the
shock was that he saw no way to gratify her in her benevolent desire
unless he could catch a few aviators and lock them up and starve
them for two or three days, and he rather feared the young men
might object to such treatment. As a matter of fact, I understand he
so forgot himself as to laugh at the young woman.
At any rate his attitude was so unsympathetic that he practically
spoiled the whole war for her, and she gave him a piece of her
mind and went away. She had departed out of the country before I
arrived in it, and I learned of her and her uniform and her mission
and her disappointment at its unfulfillment by hearsay only; but I
have no doubt, in view of some of the things I have myself seen,
that the account which reached me was substantially correct. Along
this line I am now prepared to believe almost anything.
Here, on the other hand, is a case of which I have direct and first-
hand knowledge. I encountered a group of young women attached
to one of the larger American organisations engaged in systematised
charities and mercies on this side of the water. Now, plainly these
young women were inspired by the very highest ideals; that there
was no discounting. They were full of the spirit of service and
sacrifice. Mainly they were college graduates. Without exception
they were well bred; almost without exception they were well
educated.
The particular tasks for which they had been detailed were to care
for pauperised repatriates returning to France through Switzerland
from areas of their country occupied by the enemy, and to aid these
poor folks in reestablishing their home life and to give them lessons
in domestic science. To the success of their ministrations there was
just one drawback: They were dealing with peasants mostly—furtive,
shy, secretive folks who under ordinary circumstances would be
bitterly resentful of any outside interference by aliens with their
mode of life, and who in these cases had been rendered doubly
suspicious by reason of the misfortunes they had endured while
under the thumb of the Germans.
To understand them, to plumb diplomatically the underlying
reasons for their prejudices, to get upon a basis of helpful sympathy
with them, it was highly essential that those dealing with them not
only should have infinite tact and finesse but should be able to
fathom the meaning of a nod or a gesture, a sidelong glance of the
eyes or the inflection of a muttered word. And yet of those zealous
young women who had been assigned to this delicate task there was
scarcely one in six who spoke any French at all. It inevitably followed
that the bulk of their patient labours should go for naught;
moreover, while they continued in this employment they were merely
occupying space in an already crowded country and consuming food
in an already needy country; the both of which—space and food—
were needed for people who could accomplish effective things.
An American woman who is reputed to be a dietetic specialist
came over not long ago, backed by funds donated in the States. Her
instructions were to establish cafeterias at some of the larger French
munition works. Probably her chagrin was equalled only by her
astonishment when she learned that for reasons which seemed to it
good and sufficient—and which no doubt were—the French
Government did not want any American-plan cafeterias established
at any of its munition works. Apparently it had not seemed feasible
and proper to the sponsors of the diet specialist to find out before
dispatching her overseas whether the plan would be agreeable to
the authorities here; or whether there already were eating places
suitable to the desires of the working people at these munition
plants; or how long it would take, given the most favourable
conditions, to cure the workers of their tenacious instinct for eating
the kind of midday meal they have been eating for some hundreds
of years and accustom them and their palates and their stomachs to
the Yankee quick lunch with its baked pork and beans, its buckwheat
cakes with maple sirup and its four kinds of pie. In their zeal the
promoters, it would seem, had entirely overlooked those essential
details. It is just such omissions as this one that the fine frenzy of
helping out in wartime appears to develop in a nation that is given
to boasting of its business efficiency and that vaunts itself that it
knows how to give generously without wasting foolishly.
The field manager of an organisation that is doing a great deal for
the comfort of our soldiers and the soldiers of our allies told me of
one of his experiences. He had a sense of humour and he could
laugh over it, but I think I noted a suggestion of resentment behind
the laughter. He said that some months before he set up and
assumed charge of a plant well up toward the trenches in a sector
that had been taken over by the American troops. It was a large and
elaborate concern, as these concerns are rated in the field. The men
were pleased with its accommodations and facilities, and the field
manager was proud of it.
One day there appeared a businesslike young woman who
introduced herself as belonging to a kindred organisation that was
charged with the work of decorating the interiors of such
establishments as the one over which he presided. Somewhat
puzzled, he showed her, first of all, his canteen. It was as most such
places are: There were boxes of edibles upon counters, in open
boxes, so that the soldier customers might appraise the wares
before investing; upon the shelves there were soft drinks and
smoking materials and all manner of small articles of wearing
apparel; likewise baseballs and safety razors and soap, toilet kits and
the rest of it. Altogether the manager and his two assistants were
rather pleased with the arrangement.
The newly arrived young woman swept the scene with a cold
professional eye.
“On the whole this will do fairly well,” she said with a certain
briskness in her tone. “Yes, I may say it will do very well indeed—
with certain changes, certain touches.”
“As for example, what, please?” inquired the superintendent.
“Well,” she said, “for one thing we must put up some bright
curtains at the windows; and to lighten up the background I think
we'll run a stenciled pattern in some cheerful colour round the walls
at the top.”
It was not for the manager to inquire how the decorator meant to
get her curtains and her stencils and her wall paints up over a road
that was being alternately gassed and shelled at nights and on which
the traffic capacity was already taxed to the utmost by the business
of bringing up supplies, munitions and rations from the base some
fifteen miles in the rear. He merely bowed and awaited the lady's
further commands. “And now,” she said, “where is the rest room?”
“The rest room, did you say?”
“Certainly, the rest room—the recreation hall, the place where
these poor men may go for privacy and innocent amusement?”
“Well, you see, thus close up near the Front we haven't been able
to make provision for a regular rest room,” explained the manager.
“Besides, in case of a withdrawal or an attack we might have to pull
out in a hurry and leave behind everything that is not readily
portable on wagons or trucks. The nearest approach that we have to
a rest room is here at the rear.” He led the way to a room at the
back. It contained such plenishings as one generally finds in
improvised quarters in the field—that is to say, it contained a curious
equipment made up partly of crude bits of furniture collected on the
spot out of villagers' abandoned homes and partly of makeshift
stools and tables coopered together from barrels and boxes and
stray bits of planking. Also it contained at this time as many soldiers
as could crowd into it. A phonograph was grinding out popular airs,
and divers games of checkers and cards were in progress, each with
its fringe of interested onlookers ringing in the players.
“Oh, but this will never do—never!” stated the inspecting lady. “It
is too bare, too cheerless! It lacks atmosphere. It lacks coziness; it
lacks any appeal to the senses—in short it lacks everything! We must
have some immediate improvements here by all means.”
The man was beginning to lose his temper. By an effort he
retained it.
“The men seem fairly well satisfied; at least I have heard no
complaint,” he said. “What would you suggest in the way of
changes?”
As she answered, the visitor ticked off the items of her mental
inventory of essentials on her fingers.
“Well, to begin with we must clear all this litter out of here,” she
said. “Then we must install some really comfortable chairs and at
least two or three roomy sofas and some simple couches where the
men may lie down. I should also like to see a piano here. That, with
the addition of some curtains at the windows and some simple
treatment of the walls and a few appropriate pictures properly
spaced and properly hung, will be different, I think.”
“Yes,” demurred the manager, “but admitting that we could get
the things you have enumerated up here, another problem would
arise: This room, which, as you see, is not large, would be so
crowded with the furnishings that there would be room in it for very
many less men than usually come here. There are probably fifty men
in it now. If it were filled up with sofas and couches and a piano I
doubt whether we could crowd twenty men inside of it.”
“Very well, then,” stated the lady decorator calmly, “you must
admit only twenty men at a time.”
“Quite so; but how,” he demanded—“how am I going to select the
twenty?”
The young woman considered the question for a moment. Then a
solution came to her.
“I should select the twenty neatest ones,” she said.
Whereupon the manager excused himself and went out to frame a
dispatch to headquarters embodying an ultimatum, which ultimatum
was that the lady decorator went away from there forthwith or his
resignation must take effect, coincident with his immediate
departure from his present post. The home office must have called
the lady off, because when I saw him he was still in harness, and
swinging a man-size job in a competent way.
I would not have the reader believe that I am casting discredit
upon either the patriotic impulses or the honest motives of the bulk
of the lay workers who have journeyed to Europe, paying their own
way and their own living expenses. Often they arrive, many of them,
to strike hands with the military authorities in the task which faces
our nation on Continental soil. There is room and a welcome in
France, in Italy, in England and in Flanders for every civilian recruit
who really knows how to do something helpful and who has the
strength, the self-reliance and the hardihood to perform that
particular function under difficult and complicated conditions, which
nearly always are physically uncomfortable and which may become
physically dangerous.
Nor would I wish any one to assume that I am deprecating by
inference or by frontal attack the very fine things that are being
accomplished every day by fine American women and girls who
answered the first call for trained helpers, to serve in hospitals or
canteens or huts, in settlement work or at telephone exchanges. It
will make any American thrill with pride to enter a ward where the
American Red Cross is in charge, or where a medical unit from one
of the great hospitals or one of our great universities back home has
control. The French and the British are quick enough to speak in
terms of highest praise of the achievements of American surgeons,
American nurses and American ambulance drivers. They say, and
with good reason for saying it, that our people have pluck and that
they have skill and that they above all are amazingly resourceful.
Personally I know of no smarter exhibition of native wit and
courage that the war has produced than was shown by that group of
Smith College girls who had been organising and directing
colonisation work among the peasants in the reclaimed districts of
Northern France and who were driven out by the great spring
advance of the Germans. I met some of those young women. They
were modest enough in describing their adventure. It was by
gathering a shred of a story there and a scrap of an anecdote here
that I was able to piece together a fairly accurate estimate of the
self-imposed discipline, the clean-strained grit and the initiative
which marked their conduct through three trying weeks.
Perhaps it was a mistake in their instance, as in the instances of
divers similar organisations, that the work of resettling the wasted
lands above the Aisne and the Oise should have been undertaken at
points that would be menaced in the event of a quick onslaught by
the Prussian high command. The British, I understand, privately
objected to the undertakings on the ground that the presence of
American women in villages which might fall again into the foe's
hands—and which as it turned out did fall again into his hands—
entailed an added burden and an added responsibility upon the
fighting forces. The British were right. Practically all of the
repatriated peasants had to flee for the second time, abandoning
their rebuilt homes and their newly sowed fields.
On the heels of their flight, improvements which represented many
thousands of American dollars and many months of painstaking
labour on the part of devoted American women went up in flames.
The torch was applied rather than that the little model houses and
the tons of donated supplies on hand should go into hostile hands.
Those Smith College girls did not run away, though, until the
Germans were almost upon them. Up to the very last minute they
stayed at their posts, feeding and housing not only refugees but
many exhausted soldiers, British and French, who staggered in,
spent and sped after alternately fighting and retreating through a
period of days and nights. When finally they did come away each
one of them came driving her own truck and bearing in it a load of
worn-out and helpless natives. One girl brought out a troop of
frightened dwarfs from a stranded travelling caravan. Another
ministered day and night to a blind woman nearly ninety years old
and a family of orphaned babies. The passengers of a third were
four inmates of a little communal blind asylum that happened to be
in the invader's path.
On the way, in addition to tending their special charges, they
cooked and served hundreds of meals for hungry soldiers and
hungry civilians. They spent the nights in towns under shell fire, and
when at length the German drive had been checked they assembled
their forces in Beauvais. There, with characteristic adaptability,
some became drivers of ambulances and supply trucks plying along
the lines of communication, and some opened a kitchen for the
benefit of passing soldiers at the local railway station. If the faculty
and the students and the alumnæ of Smith College did not hold a
celebration when the true story of what happened in March and April
reached them they were lacking in appreciation—that's all I have to
say about it.
Right here seems a good-enough place for me to slip in a few
words of approbation for the work which another organisation has
accomplished in France since we put our men into the field. Nobody
asked me to speak in its favour because so far as I can find out it
has no publicity department. I am referring to the Salvation Army—
may it live forever for the service which, without price and without
any boasting on the part of its personnel, it is rendering to our boys
in France!
A good many of us who hadn't enough religion, and a good many
more of us who mayhap had too much religion, looked rather
contemptuously upon the methods of the Salvationists. Some have
gone so far as to intimate that the Salvation Army was vulgar in its
methods and lacking in dignity and even in reverence. Some have
intimated that converting a sinner to the tap of a bass drum or the
tinkle of a tambourine was an improper process altogether. Never
again, though, shall I hear the blare of the cornet as it cuts into the
chorus of hallelujah whoops where a ring of blue-bonneted women
and blue-capped men stand exhorting on a city street corner under
the gas lights, without recalling what some of their enrolled brethren
—and sisters—have done and are doing in Europe.
The American Salvation Army in France is small, but, believe me, it
is powerfully busy! Its war delegation came over without any fanfare
of the trumpets of publicity. It has no paid press agents here and no
impressive headquarters. There are no well-known names, other
than the names of its executive heads, on its rosters or on its
advisory boards. None of its members is housed at an expensive
hotel and none of them has handsome automobiles in which to
travel about from place to place. No campaigns to raise nation-wide
millions of dollars for the cost of its ministrations overseas were ever
held at home. I imagine it is the pennies of the poor that mainly fill
its war chest.
I imagine, too, that sometimes its finances are an uncertain
quantity. Incidentally I am assured that not one of its male workers
here is of draft age unless he holds exemption papers to prove his
physical unfitness for military service. The Salvationists are taking
care to purge themselves of any suspicion that potential slackers
have joined their ranks in order to avoid the possibility of having to
perform duties in khaki.
Among officers as well as among enlisted men one occasionally
hears criticism—which may or may not be based on a fair judgment
—for certain branches of certain activities of certain organisations.
But I have yet to meet any soldier, whether a brigadier or a private,
who, if he spoke at all of the Salvation Army, did not speak in terms
of fervent gratitude for the aid that the Salvationists are rendering so
unostentatiously and yet so very effectively. Let a sizable body of
troops move from one station to another, and hard on its heels there
comes a squad of men and women of the Salvation Army. An army
truck may bring them, or it may be they have a battered jitney to
move them and their scanty outfits. Usually they do not ask for help
from any one in reaching their destinations. They find lodgment in a
wrecked shell of a house or in the corner of a barn. By main force
and awkwardness they set up their equipment, and very soon the
word has spread among the troopers that at such-and-such a place
the Salvation Army is serving free hot drinks and free doughnuts and
free pies. It specialises in doughnuts, the Salvation Army in the field
does—the real old-fashioned homemade ones that taste of home to
a homesick soldier boy.
I did not see this, but one of my associates did. He saw it last
winter in a dismal place on the Toul sector. A file of our troops were
finishing a long hike through rain and snow over roads knee-deep in
half-thawed icy slush. Cold and wet and miserable, they came
tramping into a cheerless, half-empty town within sound and range
of the German guns. They found a reception committee awaiting
them there—in the person of two Salvation Army lassies and a
Salvation Army captain. The women had a fire going in the
dilapidated oven of a vanished villager's kitchen. One of them was
rolling out the batter on a plank with an old wine bottle for a rolling
pin and using the top of a tin can to cut the dough into circular
strips. The other woman was cooking the doughnuts, and as fast as
they were cooked the man served them out, spitting hot, to hungry
wet boys clamouring about the door, and nobody was asked to pay a
cent.
At the risk of giving mortal affront to ultra-doctrinal practitioners
of applied theology I am firmly committed to the belief that by the
grace of God and the grease of doughnuts those three humble
benefactors that day strengthened their right to a place in the
Heavenly Kingdom.
As I said a bit ago, there is in France room and to spare and the
heartiest sort of welcome for competent, sincere lay workers, both
men and women. But there is no room, and if truth be known, there
is no welcome for any other sort. These people over here long ago
passed out of the experimental period in the handling of industrial
and special problems that have grown up out of war. They have
entirely emerged from the amateur stage of endeavour and
direction. If any man doubts the truth of this he has only to see, as I
have seen, the thousands of women who have taken men's jobs in
the cities in order that the men might go to the colours; has only to
see the overalled women in the big munition plants; has only to see
how the peasant women of France are labouring in the fields and
how the girls of the British auxiliary legions—the members of the W.
A. A. C. for a conspicuous example—are carrying their share of the
burden; has only to see women of high degree and low, each doing
her part sanely, systematically and unflinchingly—to appreciate that,
though Britain and France can find employment for every pair of
willing and able hands somewhere behind the lines, they have no
use whatsoever for the unorganised applicant or for the purely
ornamental variety of volunteer or yet for the mere notoriety seeker.
I make so bold as to suggest that it is time we were taking the
same lesson to heart; time to start the sifting process ourselves. I
have seen in Paris a considerable number of American women who
appeared to have no business here except to air their most
becoming uniforms in public places and to tell in a vague broad way
of the things they hope to do. The French, proverbially, are a polite
race, and the French Government will endure a great deal of this
kind of infliction rather than run the risk of engendering friction,
even to the most minute extent, with the people or the
administration of an Allied nation. But in wartime especially, too
much patience becomes a dubious virtue, and if practiced for
overlong may become a fault.
As yet there has been no intimation from any official source that
the French would rather our State Department did not issue quite so
many passports to Americans who have no set and definite purpose
in making the journey to these shores, but even a superficial
knowledge of the French language and the most casual
acquaintance with the French nature enable one to get at what the
French people are thinking. I am sure that had the prevalent
condition been reversed our papers would have voiced the popular
protest at the imposition long before now. Some of these days,
unless we apply the preventive measures on our own side of the
Atlantic, the perfectly justifiable resentment of the hard-pressed
French is going to find utterance; and then quite a number of well-
intentioned but utterly inutile persons will be going back home with
their feelings all harrowed up.
CHAPTER XVI. CONDUCTING WAR
BY DELEGATION
PLEASE do not think that because I have mainly dwelt thus far
upon the women offenders that there are no American men in
France who do not belong here, because that would be a
wrong assumption. I merely have mentioned the women first
because by reason of their military garbing—or what some of them
fondly mistake for military garbing—they offer rather more
conspicuous showing to the casual eye than the male civilian dress.
The men are abundantly on hand though; make no mistake about
that! Some of them come burdened with frock-coated dignity as
members of special commissions or special delegations; in certain
quarters there appears to be a somewhat hazy but very lively
inclination to try to run our share of this war by commission. Some, I
am sure, came for the same reason that the young man in the
limerick went to the stranger's funeral—because they are fond of a
ride. Some I think came in the hope of enjoying an exciting sort of
junketing expedition, and some because they were all dressed up
and had nowhere to go.
As well as may be judged by one who has been away from home
for going on five months now, the special-commission notion is
being rather overdone. Individuals and groups of individuals bearing
credentials from this fraternal organisation or that religious
organisation or the other research society reach England on nearly
every steamer that penetrates through the U-boat zone. Almost
invariably these gentlemen carry letters of introduction testifying to
their personal probity and their collective importance, which letters
are signed by persons sitting in high places.
It may be that the English are thereby deceived into believing that
the visitors are entitled to special consideration—as indeed some of
them are, and indeed some of them most distinctly are not. Or then
again it may be that the English are not aware of a device very
common among our men of affairs for getting rid of a bore who is
intent on going somewhere to see somebody and craves to be
properly vouched for upon his arrival. In certain circles this habit is
called passing the buck. In others it is known as writing letters of
introduction.
At any rate the English take no chances on offending the right
party, even at the risk of favouring the wrong one. When a half
dozen Yankees appear at the Foreign Office laden with letters
addressed “To Whom it May Concern” the Foreign Office immediately
becomes concerned.
How is a guileless Britisher intrenched behind a flat-top desk to
know that the August and Imperial Order of Supreme Potentates
whose chosen emissaries are now present desirous of having a look
at the war, and afterward to approve of it in a report to the Grand
Lodge at its next annual convention, if so be they do see fit to
approve of it—how, I repeat, is he to know that the August and
Imperial Order of Supreme Potentates has a membership largely
composed of class-C bartenders? Not knowing, he acts in
accordance with the best dictates of his ignorance.
The commission or the delegation or the presentation, whatever it
calls itself, is provided with White Passes all round. On the strength
of these White Passes the investigators are at the public expense
transferred across the Channel and housed temporarily at the
American Visitors' Château. From there they are taken in
automobiles and under escort of very bored officers on a kind of
glorified Cook's tour behind the British Front. Thereafter they are
turned over to the French Mission or to the American forces for
similar treatment.
As a result they accumulate an assortment of soft-boiled and
yolkless impressions which they incubate into the spoken or the
written word on the way back home, after they have held a meeting
to decide whether they like the way the war is going on or whether
they do not like the way the war is going on. Always there is the
possibility that as a result of the dissemination of underdone and
undigested misinformations which they have managed to acquire
these persons, though actuated by the best intentions in the world,
may do considerable harm in shaping public opinion in America. And
likewise one may be very sure a lot of pestered British and French
functionaries are left to wonder what sort of folks the masses of
American citizenship must be if these are typical samples of the
thought-moulding class.
I am not exaggerating much when I touch on this particular phase
of the topic now engaging me, for I have seen two delegations in
Europe, of the variety I have sought briefly to describe in the lines
immediately foregoing; and we are expecting more in on the next
boat. There was no imaginable reason why those whom I saw
should be in a country that is at war at such a time of crisis as this
time is, but the main point was that they were here, eating three
large rectangular meals a day apiece and taking up the valuable time
of overworked military men who accompanied them while they
week-ended at the war. How many more such delegations will sift
through the State Department and seep by the passport bureau and
journey hither during the latter half of 1918 unless the
Administration at Washington shuts down on the game no man can
with accuracy calculate.
Away down in the south of France I ran into a gentleman of a
clerical aspect who lost no time in telling me about himself. He was
tall and slender like a wand, and of a willowy suppleness of figure,
and he was terribly serious touching on his mission. He represented
a religious denomination that has several hundreds of thousands of
communicants in the United States. He had been dispatched across,
he said, by the governing body of his church. His purpose, he
explained, was to inquire into the bodily and spiritual well-being of
his coreligionists who were on foreign service in the Army and the
Navy, with a view subsequently to suggesting reforms for any
existing evil in the military and naval systems when he reported back
to the main board of his church.
Biological Data Mining Chapman Hall Crc Data Mining And Knowledge Discovery Series 1st Edition Jake Y Chen

  • 7. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series
    SERIES EDITOR
    Vipin Kumar, University of Minnesota, Department of Computer Science and Engineering, Minneapolis, Minnesota, U.S.A.
    AIMS AND SCOPE
    This series aims to capture new developments and applications in data mining and knowledge discovery, while summarizing the computational tools and techniques useful in data analysis. This series encourages the integration of mathematical, statistical, and computational methods and techniques through the publication of a broad range of textbooks, reference works, and handbooks. The inclusion of concrete examples and applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of data mining and knowledge discovery methods and applications, modeling, algorithms, theory and foundations, data and knowledge visualization, data mining systems and tools, and privacy and security issues.
    PUBLISHED TITLES
    UNDERSTANDING COMPLEX DATASETS: Data Mining with Matrix Decompositions, David Skillicorn
    COMPUTATIONAL METHODS OF FEATURE SELECTION, Huan Liu and Hiroshi Motoda
    CONSTRAINED CLUSTERING: Advances in Algorithms, Theory, and Applications, Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
    KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND LAW ENFORCEMENT, David Skillicorn
    MULTIMEDIA DATA MINING: A Systematic Introduction to Concepts and Theory, Zhongfei Zhang and Ruofei Zhang
    NEXT GENERATION OF DATA MINING, Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar
    DATA MINING FOR DESIGN AND MARKETING, Yukio Ohsawa and Katsutoshi Yada
    THE TOP TEN ALGORITHMS IN DATA MINING, Xindong Wu and Vipin Kumar
    GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY, Second Edition, Harvey J. Miller and Jiawei Han
    TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS, Ashok N. Srivastava and Mehran Sahami
    BIOLOGICAL DATA MINING, Jake Y. Chen and Stefano Lonardi
  • 8. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series Edited by Jake Y. Chen Stefano Lonardi Biological Data Mining
  • 9. Chapman & Hall/CRC, Taylor & Francis Group, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
    © 2010 by Taylor and Francis Group, LLC. Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business.
    No claim to original U.S. Government works. Printed in the United States of America on acid-free paper. 10 9 8 7 6 5 4 3 2 1
    International Standard Book Number: 978-1-4200-8684-3 (Hardback)
    This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
    Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
    For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
    Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
    Library of Congress Cataloging-in-Publication Data: Biological data mining / editors, Jake Y. Chen, Stefano Lonardi. p. cm. -- (Data mining and knowledge discovery series) Includes bibliographical references and index. ISBN 978-1-4200-8684-3 (hardcover : alk. paper) 1. Bioinformatics. 2. Data mining. 3. Computational biology. I. Chen, Jake. II. Lonardi, Stefano. III. Title. IV. Series. QH324.2.B578 2010 570.285--dc22 2009028067
    Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com
  • 10. Contents
    Preface
    Editors
    Contributors
    Part I Sequence, Structure, and Function
    1 Consensus Structure Prediction for RNA Alignments (Junilda Spirollari and Jason T. L. Wang)
    2 Invariant Geometric Properties of Secondary Structure Elements in Proteins (Matteo Comin, Concettina Guerra, and Giuseppe Zanotti)
    3 Discovering 3D Motifs in RNA (Alberto Apostolico, Giovanni Ciriello, Concettina Guerra, and Christine E. Heitsch)
    4 Protein Structure Classification Using Machine Learning Methods (Yazhene Krishnaraj and Chandan Reddy)
    5 Protein Surface Representation and Comparison: New Approaches in Structural Proteomics (Lee Sael and Daisuke Kihara)
    6 Advanced Graph Mining Methods for Protein Analysis (Yi-Ping Phoebe Chen, Jia Rong, and Gang Li)
    7 Predicting Local Structure and Function of Proteins (Huzefa Rangwala and George Karypis)
  • 11. Contents (continued)
    Part II Genomics, Transcriptomics, and Proteomics
    8 Computational Approaches for Genome Assembly Validation (Jeong-Hyeon Choi, Haixu Tang, Sun Kim, and Mihai Pop)
    9 Mining Patterns of Epistasis in Human Genetics (Jason H. Moore)
    10 Discovery of Regulatory Mechanisms from Gene Expression Variation by eQTL Analysis (Yang Huang, Jie Zheng, and Teresa M. Przytycka)
    11 Statistical Approaches to Gene Expression Microarray Data Preprocessing (Megan Kong, Elizabeth McClellan, Richard H. Scheuermann, and Monnie McGee)
    12 Application of Feature Selection and Classification to Computational Molecular Biology (Paola Bertolazzi, Giovanni Felici, and Giuseppe Lancia)
    13 Statistical Indices for Computational and Data Driven Class Discovery in Microarray Data (Raffaele Giancarlo, Davide Scaturro, and Filippo Utro)
    14 Computational Approaches to Peptide Retention Time Prediction for Proteomics (Xiang Zhang, Cheolhwan Oh, Catherine P. Riley, Hyeyoung Cho, and Charles Buck)
    Part III Functional and Molecular Interaction Networks
    15 Inferring Protein Functional Linkage Based on Sequence Information and Beyond (Li Liao)
    16 Computational Methods for Unraveling Transcriptional Regulatory Networks in Prokaryotes (Dongsheng Che and Guojun Li)
    17 Computational Methods for Analyzing and Modeling Biological Networks (Nataša Pržulj and Tijana Milenković)
  • 12. Contents (continued)
    18 Statistical Analysis of Biomolecular Networks (Jing-Dong J. Han and Chris J. Needham)
    Part IV Literature, Ontology, and Knowledge Integration
    19 Beyond Information Retrieval: Literature Mining for Biomedical Knowledge Discovery (Javed Mostafa, Kazuhiro Seki, and Weimao Ke)
    20 Mining Biological Interactions from Biomedical Texts for Efficient Query Answering (Muhammad Abulaish, Lipika Dey, and Jahiruddin)
    21 Ontology-Based Knowledge Representation of Experiment Metadata in Biological Data Mining (Richard H. Scheuermann, Megan Kong, Carl Dahlke, Jennifer Cai, Jamie Lee, Yu Qian, Burke Squires, Patrick Dunn, Jeff Wiser, Herb Hagler, Barry Smith, and David Karp)
    22 Redescription Mining and Applications in Bioinformatics (Naren Ramakrishnan and Mohammed J. Zaki)
    Part V Genome Medicine Applications
    23 Data Mining Tools and Techniques for Identification of Biomarkers for Cancer (Mick Correll, Simon Beaulah, Robin Munro, Jonathan Sheldon, Yike Guo, and Hai Hu)
    24 Cancer Biomarker Prioritization: Assessing the in vivo Impact of in vitro Models by in silico Mining of Microarray Database, Literature, and Gene Annotation (Chia-Ju Lee, Zan Huang, Hongmei Jiang, John Crispino, and Simon Lin)
    25 Biomarker Discovery by Mining Glycomic and Lipidomic Data (Haixu Tang, Mehmet Dalkilic, and Yehia Mechref)
    26 Data Mining Chemical Structures and Biological Data (Glenn J. Myatt and Paul E. Blower)
    Index
  • 14. Preface Modern biology has become an information science. Since the invention of a DNA sequencing method by Sanger in the late seventies, public repositories of genomic sequences have been growing exponentially, doubling in size every 16 months, a rate often compared to the growth of semiconductor transistor densities in CPUs known as Moore’s Law. In the nineties, the public–private race to sequence the human genome further intensified the fervor to generate high-throughput biomolecular data from highly parallel and miniaturized instruments. Today, sequencing data from thousands of genomes, including plants, mammals, and microbial genomes, are accumulating at an unprecedented rate. The advent of second-generation DNA sequencing instruments, high-density cDNA microarrays, tandem mass spectrometers, and high-power NMRs has fueled the growth of molecular biology into a wide spectrum of disciplines such as personalized genomics, functional genomics, proteomics, metabolomics, and structural genomics. Few experiments in molecular biology and genetics performed today can afford to ignore the vast amount of biological information publicly accessible. Suddenly, molecular biology and genetics have become data rich.
    Biological data mining is a data-guzzling turbo engine for postgenomic biology, driving the competitive race toward unprecedented biological discovery opportunities in the twenty-first century. Classical bioinformatics emerged from the study of macromolecules in molecular biology, biochemistry, and biophysics. Analysis, comparison, and classification of DNA and protein sequences were the dominant themes of bioinformatics in the early nineties. Machine learning mainly focused on predicting gene and protein functions from their sequences and structures. The understanding of cellular functions and processes underlying complex diseases was out of reach.
Bioinformatics scientists were a rare breed, and their contribution to molecular biology and genetics was considered marginal, because the computational tools available then for biomolecular data analysis were far more primitive than the array of experimental techniques and assays that were available to life scientists. Today, we are witnessing the reversal of these past trends. Diverse sets of data types that cover a broad spectrum of genotypes and phenotypes, particularly those related to human health and diseases, have become available. Many interdisciplinary researchers, including applied computer scientists, applied mathematicians, biostatisticians, biomedical researchers, clinical scientists, and biopharmaceutical professionals, have discovered in biology a gold
  • 15. mine of knowledge leading to many exciting possibilities: unraveling the tree of life, harnessing the power of microbial organisms for renewable energy, finding new ways to diagnose disease early, and developing new therapeutic compounds that save lives. Much of the experimental high-throughput biology data are generated and analyzed “in haste,” therefore leaving plenty of opportunities for knowledge discovery even after the original data are released. Most of the bets on the race to separate the wheat from the chaff have been placed on biological data mining techniques. After all, when easy, straightforward, first-pass data analysis has not yielded novel biological insights, data mining techniques must be able to help, or so many presumed.
    In reality, biological data mining is still very much an “art,” successfully practiced by a few bioinformatics research groups that occupy themselves with solving real-world biological problems. Unlike data mining in business, where the major concerns are often related to the bottom line (profit), the goals of biological data mining can be as diverse as the spectrum of biological questions that exist. In the business domain, association rules discovered between sales items are immediately actionable; in biology, any unorthodox hypothesis produced by computational models has to be first red-flagged and is lucky to be validated experimentally.
In the Internet business domain, classification, clustering, and visualization of blogs, network traffic patterns, and news feeds add significant value for regular Internet users who are unaware of high-level patterns that may exist in the data set; in molecular biology and genetics, any clustering or classification of the data presented to biologists may promptly elicit questions like “great, but how and why did it happen?” or “how can you explain these results in the context of the biology I know?” The majority of general-purpose data mining techniques do not take into consideration prior domain knowledge of the biological problem, leading them to often underperform hypothesis-driven biological investigative techniques. The high level of variability of measurements inherent in many types of biological experiments or samples, the general unavailability of experimental replicates, the large number of hidden variables in the data, and the high correlation of biomolecular expression measurements also constitute significant challenges in the application of classical data mining methods in biology. Many biological data mining projects are attempted and then abandoned, even by experienced data mining scientists. In extreme cases, large-scale biological data mining efforts are jokingly labeled “fishing expeditions” and dismissed in national grant proposal review panels.
    This book represents a culmination of our past research efforts in biological data mining. Throughout this book, we wanted to showcase a small but noteworthy sample of successful projects involving data mining and molecular biology. Each chapter of the book is authored by a distinguished team of bioinformatics scientists whom we invited to offer the readers the widest possible range of application domains. To ensure high-quality standards, each contributed chapter went through standard peer reviews and a round of revisions.
The contributed chapters have been grouped into five major sections.
  • 16. The first section, entitled Sequence, Structure, and Function, collects contributions on data mining techniques designed to analyze biological sequences and structures with the objective of discovering novel functional knowledge. The second section, on Genomics, Transcriptomics, and Proteomics, contains studies addressing emerging large-scale data mining challenges in analyzing high-throughput “omics” data. The chapters in the third section, entitled Functional and Molecular Interaction Networks, address emerging system-scale molecular properties and their relevance to cellular functions. The fourth section is about Literature, Ontology, and Knowledge Integration, and it collects chapters related to knowledge representation, information retrieval, and data integration for structured and unstructured biological data. The contributed works in the fifth and last section, entitled Genome Medicine Applications, address emerging biological data mining applications in medicine.
    We believe this book can serve as a valuable guide to the field for graduate students, researchers, and practitioners. We hope that the wide range of topics covered will allow readers to appreciate the extent of the impact of data mining in molecular biology and genetics. For us, research in data mining and its applications to biology and genetics is fascinating and rewarding. It may even help to save human lives one day. This field offers great opportunities and rewards if one is prepared to learn molecular biology and genetics, design user-friendly software tools under the proper biological assumptions, and validate all discovered hypotheses rigorously using appropriate models.
    In closing, we would like to thank all the authors who contributed a chapter to the book. We are also indebted to Randi Cohen, our outstanding publishing editor.
Randi efficiently managed timelines and deadlines, gracefully handled the communication with the authors and the reviewers, and took care of every little detail associated with this project. This book would not have been possible without her. Our thanks also go to our families for their support throughout the book project.
    Jake Y. Chen, Indianapolis, Indiana
    Stefano Lonardi, Riverside, California
  • 18. Editors
    Jake Chen is an assistant professor of informatics at Indiana University School of Informatics and assistant professor of computer science at Purdue School of Science, Indiana. He is the founding director of the Indiana Center for Systems Biology and Personalized Medicine, the first research center in the region to promote the development of systems biology tools towards solving future personalized medicine problems. He is an IEEE senior member and a member of several other interdisciplinary Indiana research centers, including: Center for Computational Biology and Bioinformatics, Center for Bio-computing, Indiana University Cancer Center, and Indiana Center for Environmental Health. He was a scientific co-founder and chief informatics officer (2006–2008) of Predictive Physiology and Medicine, Inc. and the founder of Medeolinx, LLC, both Indiana biotech startups developing businesses in emerging personalized medicine and translational bioinformatics markets.
    Dr. Chen received PhD and MS degrees in computer science from the University of Minnesota at Twin Cities and a BS in molecular biology and biochemistry from Peking University in China. He has extensive industrial research and management experience (1998–2003), including developing commercial GeneChip microarrays at Affymetrix, Inc. and mapping the first human protein interactome at Myriad Proteomics. After rejoining academia in 2004, he concentrated his research on “translational bioinformatics,” studies aiming to bridge the gaps between bioinformatics research and human health applications. He has over 60 publications in the areas of biological data management, biological data mining, network biology, systems biology, and various disease-related omics applications.
    Stefano Lonardi is associate professor of computer science and engineering at the University of California, Riverside.
He is also a faculty member of the graduate program in genetics, genomics and bioinformatics, the Center for Plant Cell Biology, the Institute for Integrative Genome Biology, and the graduate program in cell, molecular and developmental biology. Dr. Lonardi received his “Laurea cum laude” from the University of Pisa in 1994 and his PhD, in the summer of 2001, from the Department of Computer Sciences, Purdue University, West Lafayette, IN. He also holds a PhD in electrical and information engineering from the University of Padua (1999). During the summer of 1999, he was an intern at Celera Genomics, Department of Informatics Research, Rockville, MD.
  • 19. Dr. Lonardi’s recent research interests include the design of algorithms, computational molecular biology, data compression, and data mining. He has published more than 30 papers in major theoretical computer science and computational biology journals and has about 45 publications in refereed international conferences. In 2005, he received the CAREER award from the National Science Foundation.
  • 20. Contributors
    Muhammad Abulaish: Department of Computer Science, Jamia Millia Islamia, New Delhi, India
    Alberto Apostolico: College of Computing, Georgia Institute of Technology, Atlanta, Georgia
    Simon Beaulah: InforSense, Ltd., London, United Kingdom
    Paola Bertolazzi: Istituto di Analisi dei Sistemi ed Informatica Antonio Ruberti, Consiglio Nazionale delle Ricerche, Rome, Italy
    Paul E. Blower: Department of Pharmacology, Ohio State University, Columbus, Ohio
    Charles Buck: Bindley Bioscience Center, Purdue University, West Lafayette, Indiana
    Jennifer Cai: Department of Pathology, University of Texas Southwestern Medical Center, Dallas, Texas
    Dongsheng Che: Department of Computer Science, East Stroudsburg University, East Stroudsburg, Pennsylvania
    Yi-Ping Phoebe Chen: School of Information Technology, Deakin University, Melbourne, Australia
    Hyeyoung Cho: Bindley Bioscience Center, Purdue University, West Lafayette, Indiana; and Department of Bio and Brain Engineering, KAIST, Daejeon, South Korea
    Jeong-Hyeon Choi: Center for Genomics and Bioinformatics and School of Informatics, Indiana University, Bloomington, Indiana
    Giovanni Ciriello: Department of Information Engineering, University of Padova, Padova, Italy
  • 21. Contributors (continued)
    Matteo Comin: Department of Information Engineering, University of Padua, Padova, Italy
    Mick Correll: InforSense, LLC, Cambridge, Massachusetts
    John Crispino: Hematology Oncology, Northwestern University, Chicago, Illinois
    Carl Dahlke: Health Information Systems, Northrop Grumman, Inc., Rockville, Maryland
    Mehmet Dalkilic: School of Informatics, Indiana University, Bloomington, Indiana
    Lipika Dey: Innovation Labs, Tata Consultancy Services, New Delhi, India
    Patrick Dunn: Health Information Systems, Northrop Grumman, Inc., Rockville, Maryland
    Giovanni Felici: Istituto di Analisi dei Sistemi ed Informatica Antonio Ruberti, Consiglio Nazionale delle Ricerche, Rome, Italy
    Raffaele Giancarlo: Dipartimento di Matematica ed Applicazioni, University of Palermo, Palermo, Italy
    Concettina Guerra: College of Computing, Georgia Institute of Technology, Atlanta, Georgia; and Department of Information Engineering, University of Padua, Padova, Italy
    Yike Guo: InforSense, Ltd., London, United Kingdom
    Herb Hagler: Department of Pathology, University of Texas Southwestern Medical Center, Dallas, Texas
    Jing-Dong J. Han: Key Laboratory of Molecular Developmental Biology, Center for Molecular Systems Biology, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, People’s Republic of China
    Christine E. Heitsch: School of Mathematics, Georgia Institute of Technology, Atlanta, Georgia
  • 22. Contributors (continued)
    Hai Hu: Windber Research Institute, Windber, Pennsylvania
    Yang Huang: National Institutes of Health, Bethesda, Maryland
    Zan Huang: Hematology Oncology, Northwestern University, Chicago, Illinois
    Hongmei Jiang: Department of Statistics, Northwestern University, Evanston, Illinois
    David Karp: Division of Rheumatology, University of Texas Southwestern Medical Center, Dallas, Texas
    George Karypis: Department of Computer Science, University of Minnesota, Minneapolis, Minnesota
    Weimao Ke: University of North Carolina, Chapel Hill, North Carolina
    Daisuke Kihara: Department of Biological Sciences and Department of Computer Science, Markey Center for Structural Biology, College of Science, Purdue University, West Lafayette, Indiana
    Sun Kim: Center for Genomics and Bioinformatics and School of Informatics, Indiana University, Bloomington, Indiana
    Megan Kong: Department of Pathology, University of Texas Southwestern Medical Center, Dallas, Texas
    Yazhene Krishnaraj: Wayne State University, Detroit, Michigan
    Giuseppe Lancia: Dipartimento di Matematica e Informatica, University of Udine, Udine, Italy
    Chia-Ju Lee: Biomedical Informatics Center, Northwestern University, Chicago, Illinois
    Jamie Lee: Department of Pathology, University of Texas Southwestern Medical Center, Dallas, Texas
    Gang Li: School of Information Technology, Deakin University, Melbourne, Australia
  • 23. Contributors (continued)
    Guojun Li: Department of Biochemistry and Molecular Biology and Institute of Bioinformatics, University of Georgia, Athens, Georgia; and School of Mathematics and System Sciences, Shandong University, Jinan, People’s Republic of China
    Li Liao: Computer and Information Sciences, University of Delaware, Newark, Delaware
    Simon Lin: Biomedical Informatics Center, Northwestern University, Chicago, Illinois
    Elizabeth McClellan: Division of Biomedical Informatics, University of Texas Southwestern Medical Center, Dallas, Texas; and Department of Statistical Science, Southern Methodist University, Dallas, Texas
    Monnie McGee: Department of Statistical Science, Southern Methodist University, Dallas, Texas
    Yehia Mechref: National Center for Glycomics and Glycoproteomics, Department of Chemistry, Indiana University, Bloomington, Indiana
    Tijana Milenković: Department of Computer Science, University of California, Irvine, California
    Jason H. Moore: Computational Genetics Laboratory, Norris-Cotton Cancer Center, Departments of Genetics and Community and Family Medicine, Dartmouth Medical School, Lebanon, New Hampshire; and Department of Computer Science, University of New Hampshire, Durham, New Hampshire; and Department of Computer Science, University of Vermont, Burlington, Vermont; and Translational Genomics Research Institute, Phoenix, Arizona
    Javed Mostafa: University of North Carolina, Chapel Hill, North Carolina
    Robin Munro: InforSense, Ltd., London, United Kingdom
    Glenn J. Myatt: Myatt & Johnson, Inc., Jasper, Georgia
    Chris J. Needham: School of Computing, University of Leeds, Leeds, United Kingdom
  • 24. Contributors (continued)
    Cheolhwan Oh: Bindley Bioscience Center, Purdue University, West Lafayette, Indiana
    Mihai Pop: Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland
    Teresa M. Przytycka: National Institutes of Health, Bethesda, Maryland
    Nataša Pržulj: Department of Computer Science, University of California, Irvine, California
    Yu Qian: Department of Pathology, University of Texas Southwestern Medical Center, Dallas, Texas
    Naren Ramakrishnan: Department of Computer Science, Virginia Tech, Blacksburg, Virginia
    Huzefa Rangwala: Department of Computer Science, George Mason University, Fairfax, Virginia
    Chandan Reddy: Wayne State University, Detroit, Michigan
    Catherine P. Riley: Bindley Bioscience Center, Purdue University, West Lafayette, Indiana
    Jia Rong: School of Information Technology, Deakin University, Melbourne, Australia
    Lee Sael: Department of Computer Science, Purdue University, West Lafayette, Indiana
    Davide Scaturro: Dipartimento di Matematica ed Applicazioni, University of Palermo, Palermo, Italy
    Richard H. Scheuermann: Department of Pathology, Division of Biomedical Informatics, University of Texas Southwestern Medical Center, Dallas, Texas
    Kazuhiro Seki: Organization of Advanced Science and Technology, Kobe University, Kobe, Japan
    Jonathan Sheldon: InforSense Ltd., London, United Kingdom
    Barry Smith: Department of Philosophy, University at Buffalo, Buffalo, New York
  • 25. Contributors (continued)
    Junilda Spirollari: New Jersey Institute of Technology, Newark, New Jersey
    Burke Squires: Department of Pathology, University of Texas Southwestern Medical Center, Dallas, Texas
    Haixu Tang: School of Informatics, National Center for Glycomics and Glycoproteomics, Indiana University, Bloomington, Indiana
    Jahiruddin: Department of Computer Science, Jamia Millia Islamia, New Delhi, India
    Filippo Utro: Dipartimento di Matematica ed Applicazioni, University of Palermo, Palermo, Italy
    Jason T. L. Wang: New Jersey Institute of Technology, Newark, New Jersey
    Jeff Wiser: Health Information Systems, Northrop Grumman, Inc., Rockville, Maryland
    Mohammed Zaki: Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York
    Giuseppe Zanotti: Department of Biological Chemistry, University of Padua, Padova, Italy
    Xiang Zhang: Department of Chemistry, Center of Regulatory and Environmental Analytical Metabolomics, University of Louisville, Louisville, Kentucky
    Jie Zheng: National Institutes of Health, Bethesda, Maryland
  • 26. Part I Sequence, Structure, and Function
  • 28. Chapter 1 Consensus Structure Prediction for RNA Alignments
    Junilda Spirollari and Jason T. L. Wang, New Jersey Institute of Technology
    1.1 Introduction
    1.2 Algorithms
    1.2.1 Folding of a single RNA sequence
    1.2.1.1 Preliminaries
    1.2.1.2 Algorithm
    1.2.2 Calculation of covariance scores
    1.2.2.1 Covariance score
    1.2.2.2 Pairing threshold
    1.2.3 Algorithms for RSpredict
    1.3 Results
    1.3.1 Performance evaluation on Rfam alignments of high similarity
    1.3.2 Performance evaluation on Rfam alignments of medium and low similarity
    1.4 Conclusions
    References
    1.1 Introduction
    RNA secondary structure prediction has been studied for quite a while. Many minimum free energy (MFE) methods have been developed for predicting the secondary structures of single RNA sequences, such as mfold [1], RNAfold [2], MPGAfold [3], as well as recent tools presented in the literature [4, 5]. However, the accuracy of predicted structures is far from perfect. As evaluated by Gardner and Giegerich [6], the accuracy of the MFE methods for single sequences is 73% when averaged over many different RNAs. Recently, a new concept of energy density for predicting the secondary structures of single RNA sequences was introduced [7].
The normalized free energy, or energy density, of an RNA substructure is the free energy of that substructure divided by the length of its underlying sequence. A dynamic
  • 29. Biological Data Mining
programming algorithm, called Densityfold, was developed, which delocalizes the thermodynamic cost of computing RNA substructures and improves on secondary structure prediction via energy density minimization [7]. Here, we extend the concept used in Densityfold and present a tool, called RSpredict, for RNA secondary structure prediction. RSpredict computes the RNA structure with minimum energy density based on the loop decomposition scheme used in the nearest neighbor energy model [8]. RSpredict focuses on the loops in an RNA secondary structure, whereas Densityfold considers RNA substructures where a substructure may contain several loops.

While the energy density model creates a foundation for RNA secondary structure prediction, there are many limitations in Densityfold, just as in all other single sequence-based MFE methods. Optimal structures predicted by these methods do not necessarily represent real structures [9]. This happens for several reasons. The thermodynamic model may not be accurate. The bases of structural RNAs may be chemically modified, and these processes are not included in the prediction model. Finally, some functional RNAs may not have stable secondary structures [6]. Thus, a more reliable approach is to use comparative analysis to compute consensus secondary structures from multiple related RNA sequences [9].

In general, there are three strategies within the comparative approach. The first strategy is to predict the secondary structures of individual RNA sequences separately and then align the structures. Tools such as RNAshapes [10,11], MARNA [12], STRUCTURELAB [13], and RADAR [14,15] are based on this strategy. RNA Sampler [9] and comRNA [16] compare and find stems conserved across multiple sequences and then assemble conserved stem blocks to form consensus structures, in which pseudoknots are allowed.
The second strategy predicts common secondary structures of two or more RNA sequences through simultaneous alignment and consensus structure inference. Tools based on this strategy include RNAscf [17], Foldalign [18], Dynalign [19], stemloc [20], PMcomp [21], MASTR [22], and CARNAC [23]. These tools utilize either folding free energy change parameters or stochastic context-free grammars (SCFGs) and are considered derivations of Sankoff's method [24].

The third strategy is to fold multiple sequence alignments. RNAalifold [25, 26] uses a dynamic programming algorithm to compute the consensus secondary structure with MFE by taking into account thermodynamic stability and sequence covariation together with RIBOSUM-like scoring matrices [27]. Pfold [28] is an SCFG algorithm that produces a prior probability distribution of RNA structures. A maximum likelihood approach is used to estimate a phylogenetic tree for predicting the most likely structure for input sequences. A limitation of Pfold is that it does not run on alignments of more than 40 sequences and in some cases produces no structures due to underflow errors [6]. Maximum weighted matching (MWM), based on a graph-theoretical approach and developed by Cary and Stormo [29] and Tabaska et al. [30], is able to
  • 30. predict common secondary structures allowing pseudoknots. KNetFold [31] is a recently published machine learning method, implemented using a hierarchical network of k-nearest neighbor classifiers that analyzes the base pairings of alignment columns in the input sequences through their mutual information, Watson–Crick base pairing rules and thermodynamic base pair propensity derived from RNAfold [2].

The method presented in this chapter, RSpredict, joins the many tools using the third strategy; it accepts a multiple alignment of RNA sequences as input data and predicts the consensus secondary structure for the input sequences via energy density minimization and covariance score calculation. We also considered two variants of RSpredict, referred to as RSefold and RSdfold, respectively. Both RSefold and RSdfold use the same covariance score calculation as RSpredict. The differences among the three approaches lie in the folding algorithms they adopt. RSefold predicts the consensus secondary structure for the input sequences via free energy minimization, as opposed to the energy density minimization used in RSpredict. RSdfold does the prediction via energy density minimization, though its energy density is calculated based on RNA substructures as in Densityfold, rather than based on the loops used in RSpredict.

The rest of the chapter is organized as follows. We first describe the implementation and algorithms used by RSpredict, and analyze the time complexity of the algorithms (see Section 1.2). We then present experimental results of running the RSpredict tool as well as a comparison with the existing tools (see Section 1.3). The experiments were performed on a variety of datasets. Finally, we discuss some properties of RSpredict and possible ways to improve the tool, and point out some directions for future research (see Section 1.4).
1.2 Algorithms
RSpredict, which can be freely downloaded from http://datalab.njit.edu/biology/RSpredict, was implemented in the Java programming language. The program accepts, as input data, a multiple sequence alignment in the FASTA or ClustalW format and outputs the consensus secondary structure of the input sequences in both the Vienna style dot bracket format [26] and the connectivity table format [32]. Below, we describe the energy density model adopted by RSpredict. We then present a dynamic programming algorithm for folding a single RNA sequence via energy density minimization. Next, we describe techniques for calculating covariance scores based on the input alignment. Finally, we summarize the algorithms used by RSpredict, combining both the folding technique and the covariance scores obtained from the input alignment, and show its time complexity.
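As a reading aid for the dot bracket output mentioned above, the following minimal Python sketch converts a dot bracket string into a list of base pairs. It is not part of RSpredict (which is written in Java); the function name `dot_bracket_to_pairs` and the 1-based indexing are assumptions made for illustration, and pseudoknot notation is not handled.

```python
def dot_bracket_to_pairs(structure: str) -> list[tuple[int, int]]:
    """Return the 1-based (i, j) base pairs encoded by a dot bracket string.

    Hypothetical helper for illustration only; not taken from RSpredict.
    """
    open_positions = []  # positions of '(' still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((open_positions.pop(), pos))
        # any other symbol (such as '.') is treated as an unpaired position
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


if __name__ == "__main__":
    print(dot_bracket_to_pairs("(((...)))"))
```

Each returned pair (i, j) corresponds to a base pair (S[i], S[j]) in the notation used in the next section.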
  • 31. 1.2.1 Folding of a single RNA sequence
1.2.1.1 Preliminaries
We represent an RNA secondary structure as a fully decomposed set of loops. In general, a loop L can be one of the following (see Figure 1.1):
i. A hairpin loop (which is a loop enclosed by only one base pair; the smallest possible hairpin loop consists of three nucleotides enclosed by a base pair)
ii. A stack, composed of two consecutive base pairs
iii. A bulge loop, if two base pairs are separated only on one side by one or more unpaired bases
iv. An internal loop, if two base pairs are separated by one or more unpaired bases on both sides
v. A multibranched loop, if more than two base pairs are separated by zero or more unpaired bases in the loop

We now introduce some terms and definitions. Let S be an RNA sequence consisting of nucleotides or bases A, U, C, G. S[i] denotes the base at position i of the sequence S and S[i, j] is the subsequence starting at position i and ending at position j in S. A base pair between nucleotides at positions i and j is denoted as (i, j) or (S[i], S[j]), and its enclosed sequence is S[i, j]. Given a loop L in the secondary structure R of sequence S, the base pair (i*, j*) in L is called the exterior pair of L if S[i*] (S[j*], respectively) is closest to the 5′ (3′, respectively) end of R among all nucleotides in L. All other nonexterior base pairs in L are called interior pairs of L. The length of a loop L is the number of nucleotides in L. Note that two loops may overlap on a base pair. For example, the interior pair of a stack may be the exterior pair of another stack, or the exterior pair of a hairpin loop. Also note that a bulge or an internal loop has exactly one exterior pair and one interior pair.

We use the energy density concept as follows. Given a secondary structure R, every base pair (i, j) in R is the exterior pair of some loop L.
We assign (i, j) and L an energy density, which is the free energy of the loop L divided by the length of L. The set of free energy parameters for nonmultibranched loops used in our algorithm is acquired from [33]. The free energy of a multibranched loop is computed based on the approach adopted by mfold [1], which is a linear function of the number of unpaired bases and the number of base pairs inside the loop, namely a + b × n1 + c × n2, where a, b, c are constants, n1 is the number of unpaired bases and n2 is the number of base pairs inside the multibranched loop. We adopt the loop decomposition scheme used in the nearest neighbor energy model developed by Turner et al. [8]. The secondary structure R contains multiple loop components and the energy densities of
  • 32. the loop components are additive. Our folding algorithm computes the total energy density of R by taking the sum of the energy densities of the loop components in R. Thus, the RNA folding problem can be formalized as follows. Given an RNA sequence S, find the set of base pairs (i, j) and loops with (i, j) as exterior pairs, such that the total energy density of the loops (or equivalently, the exterior pairs) is minimized. The set of base pairs constitutes the optimal secondary structure of S.

FIGURE 1.1: Illustration of the loops in an RNA secondary structure. Each loop has at least one base pair. A stem consists of two or more consecutive stacks shown in the figure. [Figure not reproduced; its labels mark hairpin, stack, bulge, internal, and multibranched loops between the 5′ and 3′ ends.]

When generalizing the folding of a single sequence to the prediction of the consensus structure of a multiple sequence alignment, we introduce the notion of refined alignments. At times, an input alignment may have some columns each of which contains more than 75% gaps. Some tools including RSpredict delete these columns to get a refined alignment [28]; some tools simply use the
  • 33. original input alignment as the refined alignment. Suppose the original input alignment $A_o$ has N sequences and $n_o$ columns, and the refined alignment A has N sequences and n columns, $n \le n_o$. Formally, the consensus structure of the refined alignment A is a secondary structure R together with its sequence S such that each base pair (S[i], S[j]), $1 \le i < j \le n$, in R corresponds to the pair of columns i, j in the alignment A, and each base S[i], $1 \le i \le n$, is the representative base of the ith column in the alignment A. There are several ways to choose the representative base. For example, S[i] could be the most frequently occurring nucleotide, excluding gaps, in the ith column of the alignment A. Furthermore, there is an energy measure value associated with each base pair (S[i], S[j]) or more precisely its corresponding column pair (i, j), such that the total energy measure value of all the base pairs in R is minimized.

The consensus secondary structure of the original input alignment $A_o$ is defined as the structure $R_o$, obtained from R, as follows: (i) the base (base pair, respectively) for column $C_o$ (column pair $(C_{o1}, C_{o2})$, respectively) in $A_o$ is identical to the base (base pair, respectively) for the corresponding column C (column pair $(C_1, C_2)$, respectively) in A if $C_o$ ($(C_{o1}, C_{o2})$, respectively) is not deleted when getting A from $A_o$; (ii) unpaired gaps are inserted into R, such that each gap corresponds to a column that is deleted when getting A from $A_o$ (see Figure 1.2).

In Figure 1.2, the RSpredict algorithm transforms the original input alignment $A_o$ to a refined alignment A by deleting the fourth column (the column in red) of $A_o$. The algorithm predicts the consensus structure of the refined alignment A. Then the algorithm generates the consensus structure of $A_o$ by inserting an unpaired gap at the fourth position of the consensus structure of A.
The numbers inside parentheses in the refined alignment A represent the original column numbers in $A_o$.

In what follows, we first present an algorithm for folding a single RNA sequence based on the energy density concept described here. We then generalize the algorithm to predict the consensus secondary structure for a set of aligned RNA sequences.

1.2.1.2 Algorithm
The functions and parameters used in our algorithm are defined below, where S[i, j] is a subsequence of S and R[i, j] is the optimal secondary structure of S[i, j].
i. $NE(i, j)$ is the total energy density of all loops in R[i, j], where nucleotides at positions i, j may or may not form a base pair.
ii. $NE_P(i, j)$ is the total energy density of all loops in R[i, j] if nucleotides at positions i, j form a base pair.
iii. $e_H(i, j)$ ($E_H(i, j)$, respectively) is the free energy (energy density, respectively) of the hairpin with exterior pair (i, j).
  • 34. FIGURE 1.2: Illustration of the consensus structure definition used by RSpredict. [Figure not reproduced; panels: original alignment $A_o$, refined alignment A, consensus structure of A, consensus structure of $A_o$.]

iv. $e_S(i, j)$ ($E_S(i, j)$, respectively) is the free energy (energy density, respectively) of the stack with exterior pair (i, j) and interior pair (i + 1, j − 1).
v. $e_B(i, j, i', j')$ ($E_B(i, j, i', j')$, respectively) is the free energy (energy density, respectively) of the bulge or internal loop with exterior pair (i, j) and interior pair $(i', j')$.
vi. $e_J(i, j, i'_1, j'_1, i'_2, j'_2, \ldots, i'_k, j'_k)$ ($E_J(i, j, i'_1, j'_1, i'_2, j'_2, \ldots, i'_k, j'_k)$, respectively) is the free energy (energy density, respectively) of the multibranched loop with exterior pair (i, j) and interior pairs $(i'_1, j'_1), (i'_2, j'_2), \ldots, (i'_k, j'_k)$.
  • 35. It is clear that
\[ E_H(i, j) = \frac{e_H(i, j)}{j - i + 1} \tag{1.1} \]
\[ E_S(i, j) = \frac{e_S(i, j)}{4} \tag{1.2} \]
\[ E_B(i, j, i', j') = \frac{e_B(i, j, i', j')}{i' - i + j - j' + 2} \tag{1.3} \]
\[ E_J(i, j, i'_1, j'_1, i'_2, j'_2, \ldots, i'_k, j'_k) = \frac{e_J(i, j, i'_1, j'_1, i'_2, j'_2, \ldots, i'_k, j'_k)}{n_1 + 2 \times n_2} \tag{1.4} \]
Here $n_1$ is the number of unpaired bases and $n_2$ is the number of base pairs in the multibranched loop in (vi). Thus, the total energy density of all loops in R[i, j] where (i, j) is a base pair is computed by Equation 1.5:
\[
NE_P(i, j) = \min
\begin{cases}
E_H(i, j) \\
E_S(i, j) + NE_P(i + 1, j - 1) \\
\min_{i < i' < j' < j} \{ E_B(i, j, i', j') + NE_P(i', j') \} \\
\min_{i < i'_1 < j'_1 < i'_2 < j'_2 < \cdots < i'_k < j'_k < j} \{ E_J(i, j, i'_1, j'_1, i'_2, j'_2, \ldots, i'_k, j'_k) + \sum_{r=1}^{k} NE_P(i'_r, j'_r) \}
\end{cases}
\tag{1.5}
\]
That is, the energy density is calculated by taking the minimum of the following four cases:
i. (i, j) is the exterior pair of a hairpin, in which case the energy density $NE_P(i, j)$ equals $E_H(i, j)$, which is the energy density of the hairpin
ii. (i, j) is the exterior pair of a stack, in which case $NE_P(i, j)$ equals the energy density of the stack, i.e., $E_S(i, j)$, plus $NE_P(i + 1, j - 1)$
iii. (i, j) is the exterior pair of a bulge or an internal loop, in which case $NE_P(i, j)$ equals the minimum of the energy density of the bulge or internal loop $E_B(i, j, i', j')$ plus $NE_P(i', j')$ for all $i < i' < j' < j$
iv. (i, j) is the exterior pair of a multibranched loop, in which case $NE_P(i, j)$ equals the minimum of the energy density of the multibranched loop $E_J(i, j, i'_1, j'_1, i'_2, j'_2, \ldots, i'_k, j'_k)$ plus $\sum_{r=1}^{k} NE_P(i'_r, j'_r)$, for all $i < i'_1 < j'_1 < i'_2 < j'_2 < \cdots < i'_k < j'_k < j$

Equation 1.6 below shows the recurrence formula for calculating $NE(i, j)$:
\[
NE(i, j) = \min
\begin{cases}
NE(i, j - 1) \\
NE(i + 1, j) \\
NE_P(i, j) \\
\min_{i < h < j} \{ NE(i, h - 1) + NE(h, j) \}
\end{cases}
\tag{1.6}
\]
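The loop energy densities of Equations 1.1 through 1.4, together with the linear multibranched-loop free energy a + b × n1 + c × n2 described earlier, can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the free energy values passed in are made-up numbers rather than parameters from [33], and the constants a, b, c are placeholders.

```python
def hairpin_density(e_h: float, i: int, j: int) -> float:
    """Eq. 1.1: hairpin free energy divided by the loop length j - i + 1."""
    return e_h / (j - i + 1)


def stack_density(e_s: float) -> float:
    """Eq. 1.2: a stack always contains four nucleotides."""
    return e_s / 4


def bulge_or_internal_density(e_b: float, i: int, j: int, ip: int, jp: int) -> float:
    """Eq. 1.3: exterior pair (i, j), interior pair (ip, jp)."""
    return e_b / ((ip - i) + (j - jp) + 2)


def multibranch_energy(a: float, b: float, c: float, n1: int, n2: int) -> float:
    """Linear multibranched-loop model a + b*n1 + c*n2 (a, b, c are placeholders)."""
    return a + b * n1 + c * n2


def multibranch_density(e_j: float, n1: int, n2: int) -> float:
    """Eq. 1.4: n1 unpaired bases and n2 base pairs, each pair counting two nucleotides."""
    return e_j / (n1 + 2 * n2)


if __name__ == "__main__":
    # Hypothetical free energies, chosen only to exercise the formulas.
    print(hairpin_density(-3.5, 3, 9))
    print(stack_density(-2.0))
    print(bulge_or_internal_density(-2.0, 1, 10, 2, 9))
    print(multibranch_density(multibranch_energy(3.4, 0.0, 0.4, 2, 3), 2, 3))
```

One consistency check falls out of the formulas: with the interior pair at (i + 1, j − 1), the denominator of Equation 1.3 becomes 4, which matches the fixed length used for a stack in Equation 1.2.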
  • 36. FIGURE 1.3: Illustration of the cases in Equation 1.6. (a) The total normalized energy of all loops in the optimal secondary structure R[i, j − 1] of subsequence S[i, j − 1]; (b) the total normalized energy of all loops in the optimal secondary structure R[i + 1, j] of subsequence S[i + 1, j]; (c) the total normalized energy of all loops in the optimal secondary structure R[i, j] of subsequence S[i, j], where S[i] and S[j] form a base pair; (d) the minimum of NE(i, h − 1) plus NE(h, j) for all i < h < j. The dashed line between two nucleotides means that the two nucleotides may or may not form a base pair. The solid line between two nucleotides means that the two nucleotides form a base pair. [Figure not reproduced.]

That is, the energy density is computed by taking the minimum of the following four cases:
i. The total energy density of all loops in the optimal secondary structure R[i, j − 1] of subsequence S[i, j − 1] (Figure 1.3a)
ii. The total energy density of all loops in the optimal secondary structure R[i + 1, j] of subsequence S[i + 1, j] (Figure 1.3b)
iii. The total energy density of all loops in the optimal secondary structure R[i, j] of subsequence S[i, j], where S[i] and S[j] form a base pair (Figure 1.3c)
  • 37. iv. The minimum of NE(i, h − 1) plus NE(h, j) for all i < h < j (Figure 1.3d)

Note that case (iii) of Equation 1.6 is not considered when the nucleotides at positions i, j are forbidden to form a base pair, i.e., (S[i], S[j]) is a nonstandard base pair. A standard base pair is any of the following: (A,U), (U,A), (G,C), (C,G), (G,U), (U,G); all other base pairs are nonstandard.

In calculating the time complexity of the folding algorithm, we must account for the search for the optimal $i', j'$ where $i < i' < j' < j$ in case (iii) (the optimal $i'_1, j'_1, i'_2, j'_2, \ldots, i'_k, j'_k$ where $i < i'_1 < j'_1 < i'_2 < j'_2 < \cdots < i'_k < j'_k < j$ in case (iv), respectively) of Equation 1.5. It can be shown that it takes linear time to compute $NE_P(i, j)$ in Equation 1.5. Hence, the time complexity of the folding algorithm is $O(n^3)$, since we need to calculate $NE_P(i, j)$ for all $1 \le i < j \le n$, where n is the number of nucleotides in the given sequence S. The energy density of the optimal secondary structure R for the sequence S equals NE(1, n).

1.2.2 Calculation of covariance scores
When applying the above folding algorithm to a multiple sequence alignment $A_o$, we take into consideration the correlation between columns of the alignment. In many cases, the sequences in the alignment may have highly varying lengths. We refine the alignment $A_o$ by deleting columns containing more than 75% gaps to get a refined alignment A [28]. We will use this refined alignment throughout the rest of this subsection.

1.2.2.1 Covariance score
We use the covariance score introduced by RNAalifold [25, 26, 34] to quantify the relationship between two columns in the refined alignment. Let $f_{ij}(XY)$ be the frequency of finding both base X in column i and base Y in column j, where X, Y are in the same row of the refined alignment. We exclude the occurrences of gaps in column i or column j when calculating $f_{ij}(XY)$.
The covariation measure for columns i, j, denoted $C_{ij}$, is calculated by Equation 1.7:
\[ C_{ij} = \frac{\sum_{XY,\, X'Y'} f_{ij}(XY)\, D_{ij}(XY, X'Y')\, f_{ij}(X'Y')}{2} \tag{1.7} \]
Here, $D_{ij}(XY, X'Y')$ is the Hamming distance between the two base pairs $(X, Y)$ and $(X', Y')$ if both of the base pairs are standard base pairs, or 0 otherwise. The Hamming distance between $(X, Y)$ and $(X', Y')$ is calculated as follows:
\[ D_{ij}(XY, X'Y') = 2 - \delta(X, X') - \delta(Y, Y') \tag{1.8} \]
where
\[ \delta(X, X') = \begin{cases} 1 & \text{if } X = X' \\ 0 & \text{otherwise} \end{cases} \tag{1.9} \]
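Equations 1.7 through 1.9 can be made concrete with a short Python sketch that scores two alignment columns. The function names are hypothetical, and one detail is an assumption rather than something the text specifies: the frequencies $f_{ij}(XY)$ are normalized here by the number of rows that have no gap in either column.

```python
from collections import Counter
from itertools import product

STANDARD_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def pair_distance(pair_a: tuple[str, str], pair_b: tuple[str, str]) -> int:
    """Eq. 1.8 and 1.9, returning 0 when either pair is nonstandard."""
    if pair_a not in STANDARD_PAIRS or pair_b not in STANDARD_PAIRS:
        return 0
    return 2 - int(pair_a[0] == pair_b[0]) - int(pair_a[1] == pair_b[1])


def covariation(col_i: str, col_j: str, gap: str = "-") -> float:
    """Eq. 1.7 for two columns given as strings with one character per sequence."""
    rows = [(x, y) for x, y in zip(col_i, col_j) if x != gap and y != gap]
    if not rows:
        return 0.0
    # Assumption: frequencies are relative to the gap-free rows only.
    freq = {pair: count / len(rows) for pair, count in Counter(rows).items()}
    total = sum(freq[p] * pair_distance(p, q) * freq[q] for p, q in product(freq, repeat=2))
    return total / 2  # each unordered pair of base pairs was counted twice


if __name__ == "__main__":
    # Two G-C rows and two C-G rows: a fully compensated double mutation.
    print(covariation("GGCC", "CCGG"))
    # A perfectly conserved G-C column pair carries no covariation signal.
    print(covariation("GGGG", "CCCC"))
```

In this sketch a column pair that switches between G-C and C-G scores higher than a perfectly conserved pair, which is the behavior the covariation measure is meant to capture.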
  • 38. Observe that the information acquired from the two base pairs $(X, Y)$ and $(X', Y')$ is the same as that from $(X', Y')$ and $(X, Y)$. Thus, we divide the numerator in Equation 1.7 by two so as to obtain the non-redundant information between column i and column j in the refined alignment.

For every pair of columns i, j in the refined alignment, the covariance score of the two columns i and j, denoted $Cov_{ij}$, is calculated in Equation 1.10:
\[ Cov_{ij} = C_{ij} + c_1 \times NF_{ij} \tag{1.10} \]
Here, $C_{ij}$ is as defined in Equation 1.7, $c_1$ is a user-defined coefficient (in the study presented here, $c_1$ has a value of −1), and
\[ NF_{ij} = \frac{NC_{ij}}{N} \tag{1.11} \]
where N is the total number of sequences and $NC_{ij}$ is the total number of conflicting sequences in the refined alignment. A conflicting sequence is one that has a gap in column i or column j, or has a nonstandard base pair in the columns i, j of the refined alignment. A sequence with gaps in both columns i, j is not conflicting.

1.2.2.2 Pairing threshold
We say that column i and column j in the refined alignment can possibly form a base pair if their covariance score is greater than or equal to a pairing threshold; otherwise, column i and column j are forbidden to form a base pair. The pairing threshold, η, used in RSpredict is calculated as follows. It is known that, on average, 54% of the nucleotides in an RNA sequence S are involved in the base pairs of its secondary structure [35]. We use this information to calculate an alignment-dependent pairing threshold, observing that the base pairs in the consensus secondary structure of a sequence alignment represent the column pairs with the highest covariance scores. Given that different structures contain different numbers of base pairs, we consider two different percentages of columns, namely, 30% and 65%, in the sequence alignment.
For each percentage p, there are at most $T_p$ possible base pairs, where
\[ T_p = \frac{(p \times n) \times (p \times n - 1)}{2} \tag{1.12} \]
and n is the number of columns in the sequence alignment. Now, we calculate the covariance scores of all pairs of columns in the given refined alignment, and sort the covariance scores in descending order. We then select the $T_p$ largest covariance scores and store them in the set $ST_p$. Thus, the set $ST_{0.65}$ contains the largest covariance scores that involve 65% of the columns in the refined alignment; the set $ST_{0.30}$ contains the largest covariance scores that involve 30% of the columns in the refined alignment; and $ST_{0.65} \setminus ST_{0.30}$ is the set difference that contains covariance scores in $ST_{0.65}$ but not in $ST_{0.30}$ (see Figure 1.4). The pairing
  • 39. threshold η used in RSpredict is calculated as the average of the covariance scores in $ST_{0.65} \setminus ST_{0.30}$, as shown in Equation 1.13:
\[ \eta = \frac{\sum_{Cov_{ij} \in ST_{0.65} \setminus ST_{0.30}} Cov_{ij}}{\left| ST_{0.65} \setminus ST_{0.30} \right|} \tag{1.13} \]
where the denominator is the cardinality of the set difference $ST_{0.65} \setminus ST_{0.30}$. If the covariance score of columns i and j is greater than or equal to η, then column i and column j can possibly form a base pair, and we refer to (i, j) as a pairing column. If the covariance score of the columns i and j is less than η, we check the covariance scores of the immediate neighboring column pairs of i, j to see if they are above a user-defined threshold [31] (in the study presented here, this threshold is set to 0). The immediate neighboring column pairs of i, j are i + 1, j − 1 and i − 1, j + 1. If the covariance scores of both of the immediate neighboring column pairs of i, j are greater than or equal to max{η, 0}, then (i, j) is still considered a pairing column.

FIGURE 1.4: Illustration of the pairing threshold computation. The pairing threshold used in RSpredict is computed as the average of the covariance scores inside the shaded area. [Figure not reproduced; it marks the sets $ST_{0.30}$ and $ST_{0.65}$ and the cutoffs $T_{0.30}$ and $T_{0.65}$.]

1.2.3 Algorithms for RSpredict
Given a refined multiple sequence alignment A with N sequences, let (i, j) be a pairing column in A. Let $X_i^S$ ($Y_j^S$, respectively) be the nucleotide at position i (j, respectively) of the sequence S in the alignment A. $(X_i^S, Y_j^S)$ must be the exterior pair of some loop L in S. We use $e(X_i^S, Y_j^S)$ to represent the free energy of that loop L. If $(X_i^S, Y_j^S)$ is a nonstandard base pair, $e(X_i^S, Y_j^S) = 0$. We assign the pairing column (i, j) a pseudo-energy $e_{ij}$ where
\[ e_{ij} = \frac{1}{N} \sum_{S \in A} e(X_i^S, Y_j^S) + c_2 \times Cov_{ij} \tag{1.14} \]
Here, $c_2$ is a user-defined coefficient (in the study presented here, $c_2 = -1$). Thus, every pairing column in the refined alignment A has a pseudo-energy.
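The pairing threshold of Section 1.2.2.2 (Equations 1.12 and 1.13) and the neighbor rule for pairing columns can be sketched as follows. This is a hedged illustration, not RSpredict's code: the function names are hypothetical, the scores are assumed to arrive as a flat list with one entry per column pair, and truncating p × n to a whole number of columns is an assumption, since the text does not say how fractional values are handled.

```python
import math


def max_pairs(p: float, n: int) -> int:
    """Eq. 1.12, with p * n truncated to a whole number of columns (an assumption)."""
    m = math.floor(p * n + 1e-9)  # small epsilon guards against float round-off
    return m * (m - 1) // 2


def pairing_threshold(scores: list[float], n: int, low: float = 0.30, high: float = 0.65) -> float:
    """Eq. 1.13: mean of the scores ranked after the top T_low and within the top T_high."""
    ranked = sorted(scores, reverse=True)
    band = ranked[max_pairs(low, n):max_pairs(high, n)]
    if not band:
        raise ValueError("alignment too short to define the threshold band")
    return sum(band) / len(band)


def is_pairing_column(cov: dict, i: int, j: int, eta: float, neighbor_threshold: float = 0.0) -> bool:
    """Apply the threshold, then the rule on the neighbors (i+1, j-1) and (i-1, j+1)."""
    if cov[(i, j)] >= eta:
        return True
    bar = max(eta, neighbor_threshold)
    neighbors = [(i + 1, j - 1), (i - 1, j + 1)]
    # A neighbor that falls outside the alignment is treated as failing the test.
    return all(cov.get(pair, float("-inf")) >= bar for pair in neighbors)


if __name__ == "__main__":
    toy_scores = [float(s) for s in range(1, 46)]  # 45 column pairs for n = 10
    print(max_pairs(0.30, 10), max_pairs(0.65, 10))
    print(pairing_threshold(toy_scores, 10))
```

With n = 10 columns the sketch gives T_0.30 = 3 and T_0.65 = 15, so the threshold is the mean of the scores ranked 4th through 15th.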
We then apply the minimum energy density folding algorithm described at the beginning of this section to the refined alignment A, treating each pairing column in A as a possible base pair considered in the folding algorithm.

Notice that when calculating the energy density for the loop L, the sequence S is in the refined alignment A, which may have fewer columns than
  • 40. the original input alignment $A_o$ (cf. Figure 1.2). RSpredict computes all energy densities based on the refined alignment, and the program uses loop lengths from the refined alignment A rather than the original input alignment $A_o$.

Let R be the consensus secondary structure, computed by RSpredict, for the refined alignment A. We obtain the consensus structure $R_o$ of the original input alignment $A_o$ by inserting unpaired gaps at the positions in R whose corresponding columns are deleted when getting A from $A_o$ (cf. Figure 1.2).

The following summarizes the algorithms for RSpredict:
1. Input an alignment $A_o$ in the FASTA or ClustalW format.
2. Delete the columns with more than 75% gaps from $A_o$ to obtain a refined alignment A.
3. Compute the pseudo-energy $e_{ij}$ for every pairing column (i, j) in A as in Equation 1.14.
4. Run the minimum energy density folding algorithm on A, using the pseudo-energy values obtained from step (3) to produce the consensus secondary structure R of the refined alignment A. The base at position i of the consensus secondary structure R is the most frequently occurring nucleotide, excluding gaps, in the ith column of the refined alignment A.
5. Map the consensus structure R back to the original alignment $A_o$ by inserting unpaired gaps at the positions of R whose corresponding columns are deleted in step (2).

Notice that Equation 1.6 is used to compute the NE values only. To generate the optimal structure R in step (4), we maintain a stack of pointers that point to the substructures of loops with minimum energy density as we compute the NE values. Once all the NE values are calculated and the energy density of the optimal secondary structure R is obtained, we pop the pointers from the stack to extract the optimal predicted structure.
In step (5), we map the bases (base pairs, respectively) for the columns (column pairs, respectively) in A to their corresponding columns (column pairs, respectively) in $A_o$. For example, consider Figure 1.2 again. In the figure, the refined alignment A is obtained by deleting column 4 from the original input alignment $A_o$. The bases for columns 1, 2, 3, 4 in A are mapped to columns 1, 2, 3, 5 in $A_o$. The base pair between column 1 and column 9 in A becomes the base pair between column 1 and column 10 in $A_o$; the base pair between column 2 and column 8 in A becomes the base pair between column 2 and column 9 in $A_o$. An unpaired gap is inserted at the position corresponding to the deleted column 4 in $A_o$.

Let N be the number of sequences and $n_o$ be the number of columns in the input alignment $A_o$. Step (2) takes $O(N n_o)$ time. Step (3) takes $O(n_o^2)$ time. Step (4) takes $O(n_o^3)$ time. Step (5) takes $O(n_o)$ time. Therefore, the time complexity of RSpredict is $O(N n_o + n_o^3)$, which is approximately $O(n_o^3)$ as N is usually much smaller than $n_o$.
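Steps (2) and (5), together with the choice of representative bases in step (4), can be sketched in Python as below. The sketch is illustrative and not RSpredict's Java implementation: the function names and the toy alignment are hypothetical, and rendering an inserted unpaired gap as '.' in the dot bracket string is an assumption.

```python
from collections import Counter


def refine_alignment(rows: list[str], max_gap_fraction: float = 0.75, gap: str = "-"):
    """Step 2: drop columns with more than 75% gaps; return rows and kept 1-based columns."""
    kept = [c for c in range(len(rows[0]))
            if sum(row[c] == gap for row in rows) / len(rows) <= max_gap_fraction]
    refined = ["".join(row[c] for c in kept) for row in rows]
    return refined, [c + 1 for c in kept]


def representative_bases(rows: list[str], gap: str = "-") -> str:
    """Most frequent non-gap nucleotide per column (ties broken by first occurrence)."""
    consensus = []
    for column in zip(*rows):
        counts = Counter(base for base in column if base != gap)
        consensus.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(consensus)


def map_structure_back(structure: str, kept_columns: list[int], original_length: int,
                       gap_symbol: str = ".") -> str:
    """Step 5: place each symbol at its original column; deleted columns stay unpaired."""
    out = [gap_symbol] * original_length
    for symbol, column in zip(structure, kept_columns):
        out[column - 1] = symbol
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical 10-column alignment whose fourth column is 80% gaps.
    alignment = ["GGC-AAAGCC", "GGC-AUAGCC", "GAC-AAAGUC", "GGC-AAAGCC", "GGCUAAAGCC"]
    refined, kept = refine_alignment(alignment)
    print(kept)
    print(representative_bases(refined))
    print(map_structure_back("(((...)))", kept, len(alignment[0])))
```

In this toy case the pair between columns 1 and 9 of the refined alignment lands on columns 1 and 10 of the original alignment, mirroring the Figure 1.2 example described above.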
  • 41. 1.3 Results
We conducted a series of experiments to evaluate the performance of RSpredict and compared it with five related tools including KNetFold, Pfold, RNAalifold, RSefold, and RSdfold. We tested these tools on Rfam [36] sequence alignments with different similarities. The Rfam sequence alignments come with consensus structures. For evaluation purposes, we used the Rfam consensus structures as reference structures and compared them against the consensus structures predicted by the six tools. The similarity of a sequence alignment is determined by the average pairwise sequence identity (APSI) of that alignment [6]. In the study presented here, a sequence alignment is of high similarity if its APSI value is greater than 75%, is of medium similarity if its APSI value is between 55% and 75%, or is of low similarity if its APSI value is less than 55%. The data sets used in testing included 20 Rfam sequence alignments of high similarity and 36 Rfam sequence alignments of low and medium similarity. These data sets were chosen to form a collection of sequence alignments with different (low, medium and high) APSI values, different numbers of sequences, as well as different sequence alignment lengths. More specifically, the data sets contained sequence alignments that ranged in size from 2 to 160 sequences, in length from 33 to 262 nucleotides and had APSI values ranging from 42% to 99%.

The performance measures used in our study include sensitivity (SN) and selectivity (SL) [6], where
\[ SN = \frac{TP}{TP + FN} \tag{1.15} \]
\[ SL = \frac{TP}{TP + (FP - \xi)} \tag{1.16} \]
Here, TP is the number of correctly predicted base pairs ("true positives"), FN is the number of base pairs in a reference structure that were not predicted ("false negatives") and FP is the number of incorrectly predicted base pairs ("false positives"). False positives are classified as inconsistent, contradicting or compatible [6].
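Equations 1.15 and 1.16 translate directly into code. The Python sketch below uses hypothetical function names and takes ξ as an input, namely the number of false positives classified as compatible; it also includes the geometric-mean approximation of the Matthews correlation coefficient that the chapter uses to combine the two measures. The counts in the usage example are invented.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Eq. 1.15: fraction of reference base pairs that were predicted."""
    return tp / (tp + fn)


def selectivity(tp: int, fp: int, compatible: int) -> float:
    """Eq. 1.16: compatible false positives (xi) are not counted against the tool."""
    return tp / (tp + (fp - compatible))


def mcc_approx(sn: float, sl: float) -> float:
    """Geometric mean of sensitivity and selectivity, used as an approximation of MCC."""
    return math.sqrt(sn * sl)


if __name__ == "__main__":
    # Invented counts: 8 correct pairs, 2 missed pairs, 3 false positives, 1 of them compatible.
    sn = sensitivity(tp=8, fn=2)
    sl = selectivity(tp=8, fp=3, compatible=1)
    print(sn, sl, mcc_approx(sn, sl))
```

The functions assume at least one reference pair and at least one counted prediction; a caller would need to handle the zero-denominator cases separately.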
When predicting the consensus secondary structure for a multiple sequence alignment, a predicted base pair (i, j) is inconsistent if column i in the alignment is paired with column q, q ≠ j, or column j is paired with column p, p ≠ i, and p, q form a base pair in the reference structure of the alignment. A base pair (i, j) is contradicting if there exists a base pair (p, q) in the reference structure of the alignment, such that i < p < j < q. A base pair (i, j) is compatible if it is a false positive but is neither inconsistent nor contradicting. The ξ in SL represents the number of compatible base pairs, which are considered neutral with respect to algorithmic accuracy. Therefore ξ is subtracted from FP. Finally, we used the Matthews correlation coefficient (MCC) to combine the sensitivity and selectivity, where MCC is approximated by the
  • 42. geometric mean of the two measures, i.e., $MCC \approx \sqrt{SN \times SL}$ [18]. The larger the MCC, SN, and SL values of a tool, the better its performance and the more accurate it is.

1.3.1 Performance evaluation on Rfam alignments of high similarity
The first data set consisted of seed alignments of high similarity taken from 20 families in Rfam. The APSI values of these seed alignments ranged from 77% to 99%. The alignments ranged in size from 2 to 160 sequences and in length from 33 to 159 nucleotides. Table 1.1 presents the accession number, description, number of sequences, and length of the seed alignment of each of the 20 Rfam families used in the experiment. The seed alignments of the 20 families are of high similarity; their APSI values are shown in the last column of the table. The families are sorted, from top to bottom, in ascending order of APSI value.

All six tools (RSpredict, KNetFold, RNAalifold, Pfold, RSefold and RSdfold) were tested on this data set. The graphs in Figure 1.5 show the trend of the MCC, SN, and SL, which are sorted in descending order for each tool under analysis. The X-axis therefore shows the rank of the MCC (SN and SL, respectively) from highest to lowest. For example, number 1 on the X-axis corresponds to the highest score achieved by each tool. The Y-axis represents the MCC, SN, and SL, respectively.

It can be seen from Figure 1.5 that RSpredict performed the best while RSdfold performed the worst among the six tools. The Pfold tool had good performance in selectivity but did not perform well in sensitivity and, as a result, in MCC. It also suffered from a size limitation (the Pfold web server can accept a multiple alignment of up to 40 sequences).
Only 17 out of the 20 sequence alignments used in the experiment were accepted by the Pfold server; the other three alignments (RF00386, RF00041, and RF00389) had more than 40 sequences and therefore could not be run on the Pfold server. RSpredict had stable performance, with the best mean MCC of 0.85 (standard deviation 0.16), while the other methods' MCC values varied widely, with means ranging from 0.37 to 0.82 and standard deviations ranging from 0.24 to 0.34.

1.3.2 Performance evaluation on Rfam alignments of medium and low similarity
In the second experiment, we compared RSpredict with the other five methods on multiple sequence alignments of low and medium similarity. The test dataset included seed alignments of 36 families taken from Rfam [36]. The APSI values of the seed alignments ranged from 42% to 75%, the number of sequences in the alignments ranged from 3 to 114, and the alignment lengths ranged from 43 to 262 nucleotides. Table 1.2 presents the accession number,
  • 43. TABLE 1.1: Rfam alignments of high similarity.

Accession | Description | Number of sequences | Length | APSI
RF00460 | U1A polyadenylation inhibition element (PIE) | 8 | 75 | 77%
RF00326 | Small nucleolar RNA Z155 | 8 | 81 | 79%
RF00560 | Small nucleolar RNA SNORA17 | 38 | 132 | 82%
RF00453 | Cardiovirus cis-acting replication element (CRE) | 12 | 33 | 82%
RF00386 | Enterovirus 5′ cloverleaf cis-acting replication element | 160 | 91 | 83%
RF00421 | Small nucleolar RNA SNORA32 | 9 | 122 | 84%
RF00302 | Small nucleolar RNA SNORA65 | 8 | 130 | 84%
RF00465 | Japanese encephalitis virus (JEV) hairpin structure | 20 | 60 | 86%
RF00501 | Rotavirus cis-acting replication element (CRE) | 14 | 68 | 87%
RF00041 | Enteroviral 3′ UTR element | 60 | 123 | 87%
RF00575 | Small nucleolar RNA SNORD70 | 4 | 88 | 89%
RF00362 | Pospiviroid RY motif stem loop | 16 | 79 | 92%
RF00105 | Small nucleolar RNA SNORD115 | 23 | 82 | 92%
RF00467 | Rous sarcoma virus (RSV) primer binding site (PBS) | 23 | 75 | 93%
RF00389 | Bamboo mosaic virus satellite RNA cis-regulatory element | 42 | 159 | 93%
RF00384 | Poxvirus AX element late mRNA cis-regulatory element | 7 | 62 | 93%
RF00098 | Snake H/ACA box small nucleolar RNA | 22 | 150 | 93%
RF00607 | Small nucleolar RNA SNORD98 | 2 | 67 | 98%
RF00320 | Small nucleolar RNA Z185 | 2 | 86 | 98%
RF00318 | Small nucleolar RNA Z175 | 3 | 81 | 99%

description, number of sequences, and length of the seed alignment of each of the 36 Rfam families used in the experiment. The seed alignments of the 36 families are of low and medium similarity; their APSI values are shown in the last column of the table. The families are sorted, from top to bottom, in ascending order on the APSI values.
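The Pfold size limit mentioned in Section 1.3.1 can be checked directly against Table 1.1. The following is a minimal sketch, not part of any tool discussed in the chapter: the sequence counts are copied from Table 1.1, the 40-sequence limit is the one stated in the text, and the names `TABLE_1_1` and `split_by_size_limit` are hypothetical.

```python
# (accession, number of sequences) pairs copied from Table 1.1.
TABLE_1_1 = [
    ("RF00460", 8), ("RF00326", 8), ("RF00560", 38), ("RF00453", 12),
    ("RF00386", 160), ("RF00421", 9), ("RF00302", 8), ("RF00465", 20),
    ("RF00501", 14), ("RF00041", 60), ("RF00575", 4), ("RF00362", 16),
    ("RF00105", 23), ("RF00467", 23), ("RF00389", 42), ("RF00384", 7),
    ("RF00098", 22), ("RF00607", 2), ("RF00320", 2), ("RF00318", 3),
]

# Limit of the Pfold web server as stated in the text.
PFOLD_MAX_SEQUENCES = 40


def split_by_size_limit(rows, limit=PFOLD_MAX_SEQUENCES):
    """Split accessions into those within the size limit and those above it."""
    accepted = [acc for acc, n_seqs in rows if n_seqs <= limit]
    rejected = [acc for acc, n_seqs in rows if n_seqs > limit]
    return accepted, rejected


accepted, rejected = split_by_size_limit(TABLE_1_1)
print(len(accepted), "alignments fit the limit; too large:", rejected)
```

The result agrees with the text: 17 alignments fit the limit, and RF00386, RF00041, and RF00389 exceed it.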
  • 44. [Figure 1.5: three panels (Matthews correlation coefficient, Sensitivity, Selectivity). X-axis: rank, 1 to 20. Y-axis: score, 0.00 to 1.20. One curve per tool: KNetFold, Pfold, RNAalifold, RSefold, RSdfold, RSpredict.]
FIGURE 1.5: Comparison of the MCC, SN, and SL values of the six tools under analysis on the seed alignments of high similarity taken from the 20 families listed in Table 1.1.
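The scores plotted in these rank curves can be illustrated with a short sketch. It assumes the commonly used base-pair definitions SN = TP / (TP + FN) and SL = TP / (TP + FP); the chapter's own definitions lie outside this excerpt and may differ in detail (some benchmarks, for instance, treat compatible pairs specially). The function names and the toy base pairs are hypothetical.

```python
import math


def score_prediction(predicted_pairs, reference_pairs):
    """Return (SN, SL, approximate MCC) for a predicted set of base pairs.

    Assumed definitions: SN = TP / (TP + FN), SL = TP / (TP + FP),
    and MCC approximated by the geometric mean sqrt(SN * SL).
    """
    predicted = set(predicted_pairs)
    reference = set(reference_pairs)
    tp = len(predicted & reference)   # correctly predicted pairs
    fp = len(predicted - reference)   # predicted pairs not in the reference
    fn = len(reference - predicted)   # reference pairs that were missed
    sn = tp / (tp + fn) if tp + fn else 0.0
    sl = tp / (tp + fp) if tp + fp else 0.0
    return sn, sl, math.sqrt(sn * sl)


def rank_curve(scores):
    """Sort scores from highest to lowest, as plotted on the X-axis."""
    return sorted(scores, reverse=True)


# Toy example: three of four reference pairs recovered, one wrong pair.
reference = {(1, 20), (2, 19), (3, 18), (4, 17)}
predicted = {(1, 20), (2, 19), (3, 18), (5, 16)}
sn, sl, mcc = score_prediction(predicted, reference)
print(sn, sl, mcc)
print(rank_curve([0.4, 0.9, 0.7]))
```

Sorting each tool's scores independently means that rank 1 for one tool and rank 1 for another may refer to different families, which is why the curves show overall trends rather than family-by-family comparisons.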
  • 45. TABLE 1.2: Rfam alignments of low and medium similarity.

Accession | Description | Number of sequences | Length | APSI
RF00230 | T-box leader | 103 | 262 | 42%
RF00080 | yybP-ykoY leader | 50 | 131 | 44%
RF00515 | PyrR binding site | 72 | 125 | 47%
RF00557 | Ribosomal protein L10 leader | 66 | 149 | 48%
RF00504 | Glycine riboswitch | 93 | 111 | 50%
RF00029 | Group II catalytic intron | 114 | 94 | 52%
RF00458 | Cripavirus internal ribosome entry site (IRES) | 7 | 203 | 54%
RF00559 | Ribosomal protein L21 leader | 33 | 81 | 54%
RF00234 | glmS glucosamine-6-phosphate activated ribozyme | 11 | 218 | 55%
RF00556 | Ribosomal protein L19 leader | 24 | 43 | 55%
RF00519 | suhB | 13 | 80 | 56%
RF00379 | ydaO/yuaA leader | 25 | 150 | 58%
RF00380 | ykoK leader | 36 | 172 | 59%
RF00445 | mir-399 microRNA precursor family | 13 | 119 | 59%
RF00522 | PreQ1 riboswitch | 22 | 47 | 59%
RF00095 | Pyrococcus C/D box small nucleolar RNA | 25 | 59 | 60%
RF00442 | ykkC-yxkD leader | 11 | 111 | 60%
RF00430 | Small nucleolar RNA SNORA54 | 5 | 134 | 60%
RF00521 | SAM riboswitch (alpha-proteobacteria) | 12 | 79 | 61%
RF00049 | Small nucleolar RNA SNORD36 | 20 | 82 | 63%
RF00513 | Tryptophan operon leader | 11 | 100 | 63%
RF00309 | Small nucleolar RNA snR60/Z15/Z230/Z193/J17 | 23 | 106 | 63%
RF00451 | mir-395 microRNA precursor family | 21 | 112 | 64%
RF00464 | mir-92 microRNA precursor family | 33 | 80 | 64%
RF00507 | Coronavirus frameshifting stimulation element | 23 | 85 | 66%
RF00388 | Qa RNA | 5 | 103 | 70%
RF00357 | Small nucleolar RNA R44/J54/Z268 family | 19 | 105 | 70%
RF00434 | Luteovirus cap-independent translation element (BTE) | 17 | 108 | 71%
RF00525 | Flavivirus DB element | 111 | 76 | 71%
RF00581 | Small nucleolar SNORD12/SNORD106 | 8 | 91 | 71%
RF00238 | ctRNA | 48 | 88 | 72%
RF00477 | Small nucleolar RNA snR66 | 5 | 105 | 72%
RF00608 | Small nucleolar RNA SNORD99 | 3 | 80 | 72%
RF00468 | Hepatitis C virus stem-loop VII | 110 | 66 | 74%
RF00489 | ctRNA | 14 | 80 | 74%
RF00113 | QUAD RNA | 14 | 150 | 75%

The MCC, SN, and SL values are sorted in descending order for each tool under analysis and placed in the graphs in Figure 1.6. The X-axis shows, therefore, the rank of the MCC (SN and SL, respectively) from highest to lowest.
For example, number 1 on the X-axis corresponds to the highest score achieved by each tool. The Y-axis represents the MCC, SN, and SL, respectively.
  • 46. [Figure 1.6: three panels (Matthews correlation coefficient, Sensitivity, Selectivity). X-axis: rank. Y-axis: score, 0.00 to 1.20. One curve per tool: KNetFold, Pfold, RNAalifold, RSefold, RSdfold, RSpredict.]
FIGURE 1.6: Comparison of the MCC, SN, and SL values of the six tools under analysis on the seed alignments of low and medium similarity taken from the 36 families listed in Table 1.2.
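Tables 1.1 and 1.2 characterise each alignment by its APSI (average pairwise sequence identity). The sketch below shows one way such a value could be computed. The gap handling is an assumption: columns in which both sequences have a gap are ignored, and a gap aligned to a nucleotide counts as a mismatch. The definition used for the tables may differ, and the function names and the toy alignment are hypothetical.

```python
from itertools import combinations


def pairwise_identity(seq_a, seq_b, gap="-"):
    """Fraction of identical positions between two aligned sequences.

    Assumption: columns where both sequences have a gap are skipped;
    a gap opposite a nucleotide counts as a mismatch.
    """
    matches = 0
    compared = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else 0.0


def apsi(alignment):
    """Average the pairwise identity over all pairs of aligned sequences."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        return 0.0
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


toy_alignment = ["GGCA-U", "GGCAAU", "GACA-U"]
toy_apsi = apsi(toy_alignment)
print(f"APSI = {toy_apsi:.1%}")
```

Because the number of pairs grows quadratically with the number of sequences, a large seed alignment such as RF00386 (160 sequences) already involves 12,720 pairwise comparisons.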
  • 47. Comparing Figures 1.5 and 1.6, we see that the methods under analysis generally performed better on sequence alignments of medium and low similarity than on sequence alignments of high similarity. As in the previous experiment, RSdfold performed the worst (cf. Figure 1.5). The structures predicted by RSdfold tend to be stem-like structures; therefore, many structures, particularly those containing multibranched loops, were mispredicted. For this reason, RSdfold yielded very low MCC, SN, and SL values.

RSpredict outperformed the other five methods on all three performance measures used in the experiment. The tool achieved a high mean MCC of 0.94, better than those of KNetFold (0.86), Pfold (0.88), and RNAalifold (0.89). Similar results were observed for sensitivity and selectivity. Furthermore, RSpredict exhibited stable performance across all the families tested in the experiment. The tool had MCC, SN, and SL standard deviations of 0.08, 0.09, and 0.08, respectively. These numbers were better than the standard deviations obtained from the other five methods, which ranged from 0.11 to 0.34. Pfold suffered from a size limitation; it could not generate a structure for the large seed alignments with more than 40 sequences in 9 families: RF00230, RF00080, RF00515, RF00557, RF00504, RF00029, RF00525, RF00238, and RF00468.

1.4 Conclusions

In this chapter we presented a software tool, called RSpredict, capable of predicting the consensus secondary structure for a set of aligned RNA sequences via energy density minimization and covariance score calculation. Our experimental results showed that RSpredict is competitive with some widely used tools, including RNAalifold and Pfold, on the tested data sets, suggesting that RSpredict can be a choice when biologists need to predict RNA secondary structures of multiple sequence alignments, especially those with low and medium similarity.
Note that RSpredict differs from KNetFold [31] in that KNetFold is a machine learning method that relies on precompiled training data derived from existing RNA secondary structures. RSpredict, on the other hand, is based on a dynamic programming algorithm for folding sequences and does not use training data. Given a multiple sequence alignment Ao, our work is focused on predicting the consensus structure of the aligned sequences in Ao, rather than folding each individual sequence in Ao. Our approach is to first transform Ao to a refined alignment A by deleting columns with more than 75% gaps from Ao, then predict the consensus structure for A, and finally extend the consensus structure by inserting gaps at the positions corresponding to the deleted columns in Ao (cf. Figure 1.2). The predicted structure may not correspond exactly to any individual sequence in the original alignment Ao. As an example, assume for
  • 48. simplicity that Ao is the same as A, i.e., no columns are deleted when getting A from Ao. Consider a particular sequence S in Ao. Assume that position (column) i of S has a gap due to the alignment with the other sequences in Ao. On the other hand, position i in the consensus structure of Ao has the most frequently occurring nucleotide in column i of Ao, which cannot be a gap. As a result, the consensus structure of Ao, which is at least one nucleotide longer than S, cannot be mapped exactly back onto S. In future work we plan to look into ways of improving consensus structure prediction. Possible ways include the utilization of evolutionary information [37], more sophisticated models of covariance scoring, and training data for more accurate pairing thresholds.

References

[1] Zuker, M. 2003. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 31:3406–3415.
[2] Hofacker, I.L. 2003. Vienna RNA secondary structure server. Nucleic Acids Res. 31:3429–3431.
[3] Shapiro, B.A., Kasprzak, W., Grunewald, C., Aman, J. 2006. Graphical exploratory data analysis of RNA secondary structure dynamics predicted by the massively parallel genetic algorithm. J. Mol. Graph. Model. 25:514–531.
[4] Bellamy-Royds, A.B., Turcotte, M. 2007. Can Clustal-style progressive pairwise alignment of multiple sequences be used in RNA secondary structure prediction? BMC Bioinformatics 8:190.
[5] Horesh, Y., Doniger, T., Michaeli, S., Unger, R. 2007. RNAspa: a shortest path approach for comparative prediction of the secondary structure of ncRNA molecules. BMC Bioinformatics 8:366.
[6] Gardner, P.P., Giegerich, R. 2004. A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics 5:140.
[7] Alkan, C., Karakoc, E., Sahinalp, S.C., Unrau, P., Alexander, E., Zhang, K., Buhler, J. 2006. RNA secondary structure prediction via energy density minimization. In Proceedings of the Research in Computational Molecular Biology (RECOMB), Springer Berlin/Heidelberg, Venice, Italy, 130–142.
[8] Xia, T., SantaLucia, J., Burkard, M.E., Kierzek, R., Schroeder, S.J., Jiao, X., Cox, C., Turner, D.H. 1998. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 37:14719–14735.
[9] Xu, X., Ji, Y., Stormo, G.D. 2007. RNA Sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment. Bioinformatics 23:1883–1891.
[10] Giegerich, R., Voss, B., Rehmsmeier, M. 2004. Abstract shapes of RNA. Nucleic Acids Res. 32:4843–4851.
[11] Steffen, P., Voss, B., Rehmsmeier, M., Reeder, J., Giegerich, R. 2006. RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics 22:500–503.
[12] Siebert, S., Backofen, R. 2005. MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons. Bioinformatics 21:3352–3359.
[13] Shapiro, B.A., Bengali, D., Kasprzak, W., Wu, J.C. 2001. RNA folding pathway functional intermediates: their prediction and analysis. J. Mol. Biol. 312:27–44.
[14] Khaladkar, M., Bellofatto, V., Wang, J.T.L., Tian, B., Shapiro, B.A. 2007. RADAR: a web server for RNA data analysis and research. Nucleic Acids Res. 35:W300–W304.
[15] Liu, J., Wang, J.T.L., Hu, J., Tian, B. 2005. A method for aligning RNA secondary structures and its application to RNA motif detection. BMC Bioinformatics 6:89.
[16] Ji, Y., Xu, X., Stormo, G.D. 2004. A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned sequences. Bioinformatics 20:1591–1602.
[17] Bafna, V., Tang, H., Zhang, S. 2006. Consensus folding of unaligned RNA sequences revisited. J. Comput. Biol. 13:283–295.
[18] Gorodkin, J., Stricklin, S.L., Stormo, G.D. 2001. Discovering common stem-loop motifs in unaligned RNA sequences. Nucleic Acids Res. 29:2135–2144.
[19] Mathews, D.H., Turner, D.H. 2002. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J. Mol. Biol. 317:191–203.
[20] Holmes, I., Rubin, G.M. 2002. Pairwise RNA structure comparison with stochastic context-free grammars. In Proceedings of the Pacific Symposium on Biocomputing, Lihue, Hawaii, 163–174.
[21] Hofacker, I.L., Bernhart, S.H.F., Stadler, P.F. 2004. Alignment of RNA base pairing probability matrices. Bioinformatics 20:2222–2227.
[22] Lindgreen, S., Gardner, P.P., Krogh, A. 2007. MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing. Bioinformatics 23:3304–3311.
[23] Touzet, H., Perriquet, O. 2004. CARNAC: folding families of related RNAs. Nucleic Acids Res. 32:W142–W145.
[24] Sankoff, D. 1985. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math. 45:810–825.
[25] Hofacker, I.L., Fekete, M., Stadler, P.F. 2002. Secondary structure prediction for aligned RNA sequences. J. Mol. Biol. 319:1059–1066.
[26] Bernhart, S.H., Hofacker, I.L., Will, S., Gruber, A.R., Stadler, P.F. 2008. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics 9:474.
[27] Klein, R.J., Eddy, S.R. 2003. RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics 4:44.
[28] Knudsen, B., Hein, J. 2003. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res. 31:3423–3428.
[29] Cary, R.B., Stormo, G.D. 1995. Graph-theoretic approach to RNA modeling using comparative data. In Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, 75–80.
[30] Tabaska, J.E., Cary, R.B., Gabow, H.N., Stormo, G.D. 1998. An RNA folding method capable of identifying pseudoknots and base triples. Bioinformatics 14:691–699.
[31] Bindewald, E., Shapiro, B.A. 2006. RNA secondary structure prediction from sequence alignments using a network of k-nearest neighbor classifiers. RNA 12:342–352.
[32] Mathews, D.H., Disney, M.D., Childs, J.L., Schroeder, S.J., Zuker, M., Turner, D.H. 2004. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Natl. Acad. Sci. USA 101:7287–7292.
[33] Mathews, D.H., Sabina, J., Zuker, M., Turner, D.H. 1999. Expanded sequence dependence of thermodynamic parameters provides robust prediction of RNA secondary structure. J. Mol. Biol. 288:911–940.
[34] Lindgreen, S., Gardner, P.P., Krogh, A. 2006. Measuring covariation in RNA alignments: physical realism improves information measures. Bioinformatics 22:2988–2995.
[35] Mathews, D.H., Banerjee, A.R., Luan, D.D., Eickbush, T.H., Turner, D.H. 1997. Secondary structure model of the RNA recognized by the reverse transcriptase from the R2 retrotransposable element. RNA 3:1–16.
[36] Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A., Eddy, S.R. 2003. Rfam: an RNA family database. Nucleic Acids Res. 31:439–441.
[37] Seemann, S.E., Gorodkin, J., Backofen, R. 2008. Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Res. 36:6355–6362.
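The refinement procedure summarised in Section 1.4 (delete columns with more than 75% gaps, predict a consensus structure on the refined alignment, then reinsert gaps at the deleted positions) can be sketched as follows. This is an illustration only and not the RSpredict implementation: the structure prediction step is replaced by a hard-coded placeholder, dot-bracket notation is an assumption, and the function names and the toy alignment are hypothetical.

```python
def refine_alignment(alignment, max_gap_fraction=0.75, gap="-"):
    """Drop columns whose gap fraction exceeds max_gap_fraction.

    Returns the refined alignment and the indices of the kept columns.
    """
    n_rows = len(alignment)
    n_cols = len(alignment[0])
    kept = [
        j for j in range(n_cols)
        if sum(row[j] == gap for row in alignment) / n_rows <= max_gap_fraction
    ]
    refined = ["".join(row[j] for j in kept) for row in alignment]
    return refined, kept


def extend_structure(structure, kept, original_length, gap="-"):
    """Map a structure for the refined alignment back to the original
    column numbering, placing a gap at every deleted column."""
    extended = [gap] * original_length
    for symbol, j in zip(structure, kept):
        extended[j] = symbol
    return "".join(extended)


# Toy alignment: column 3 has gaps in 4 of 5 sequences (80% > 75%).
original = ["GGA-UCC", "GGAAUCC", "GCA-UGC", "GGA-UCC", "GGA-UCC"]
refined, kept = refine_alignment(original)

# Placeholder for the consensus structure predicted on the refined
# alignment; a real predictor would be called here instead.
placeholder_structure = "((..))"
extended = extend_structure(placeholder_structure, kept, len(original[0]))
print(refined[0], kept, extended)
```

Keeping the list of retained column indices makes the final extension step a simple lookup, so the predicted structure always lines up with the column numbering of the original alignment.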
  • 52. Chapter 2
Invariant Geometric Properties of Secondary Structure Elements in Proteins

Matteo Comin, University of Padua
Concettina Guerra, Georgia Institute of Technology and University of Padua
Giuseppe Zanotti, University of Padua

2.1 Introduction
2.1.1 The dilemma of protein folding
2.1.2 Protein classification and the discovery of hidden rules
2.2 The Use of Geometric Invariants and Hashing for a Simplified Representation of Secondary Structure Elements (SSEs)
2.2.1 Simplified representations of three-dimensional (3D) structures
2.2.2 Segment approximation of secondary structure element (SSE)
2.2.3 Building of the hash table for triplets of secondary structure element (SSE)
2.2.4 Building the hash table
2.3 The Use of Geometric Invariants for Three-Dimensional (3D) Structures Comparison
2.3.1 Retrieving similarity from the table
2.3.2 Pair-wise alignment of secondary structures
2.3.3 Ranking candidate proteins
2.3.4 Atomic superposition
2.3.5 Benchmark applications
2.4 Statistical Analysis of Triplets and Quartets of Secondary Structure Element (SSE)
2.4.1 Methodology for the analysis of angular patterns
2.4.2 Results of the statistical analysis
2.4.3 Selection of subsets containing secondary structure element (SSE) in close contact
2.5 Conclusions
References
  • 53. 2.1 Introduction

2.1.1 The dilemma of protein folding

Proteins and nucleic acids represent the two major classes of biological macromolecules present in living organisms. Both are necessary for a cell to perform most of its functions, but their roles are profoundly different: whilst in nucleic acids the information content is kept in the form of a string, i.e., it resides in the linear sequence of the four bases, the most important aspect of a protein (at least of the globular ones) is its three-dimensional (3D) architecture. Using the 20 different amino acids that can constitute a protein (we are neglecting here posttranslational modifications, which can be physiologically very important, but are not relevant for the problem of folding), it is in principle possible to build an impressive number of different sequences:∗ considering, for example, a polypeptide chain of only 100 amino acids, this number is 20^100. Only a very small fraction of these sequences is actually present in a cell. For example, the genome of a simple gram-negative bacterium, like Escherichia coli, codes for less than 2000 genes, whilst the genome of a complex organism, such as a human, contains many more genes (according to different estimates, between 20,000 and 30,000 genes) and consequently many more proteins. These numbers decrease drastically if we consider tertiary structures. It is in fact well known that the 3D structure of a protein is much more conserved than its amino acid sequence, and proteins with different primary structures can display the same fold.† Quite often the same fold corresponds to the same function, and this is one of the reasons why it is necessary to know the 3D structure of a protein and not simply its amino acid sequence; but there are also common protein folds that correspond to totally different functions.
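To give a sense of scale for the sequence count quoted above, the short calculation below evaluates 20^100 exactly and compares it with the upper gene-count estimate given in the text. It is a plain arithmetic illustration; the variable names are arbitrary.

```python
AMINO_ACIDS = 20      # number of different amino acids
CHAIN_LENGTH = 100    # length of the example polypeptide chain

# Each position can hold any of the 20 amino acids independently.
n_sequences = AMINO_ACIDS ** CHAIN_LENGTH

# Upper estimate of human genes quoted in the text.
HUMAN_GENES_UPPER = 30_000

n_digits = len(str(n_sequences))
print(f"20^100 is approximately {n_sequences:.3e} ({n_digits} decimal digits)")
print(f"Sequences per human gene: about {n_sequences // HUMAN_GENES_UPPER:.3e}")
```

Python integers have arbitrary precision, so the exact value is available even though it far exceeds the range of a 64-bit integer.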
We will not discuss here whether the latter phenomenon has to be ascribed to convergent or divergent evolution, but the practical consequence of this fact is the relatively limited number of different protein folds present in nature. If we consider the Protein Data Bank (PDB, http://www.rcsb.org), the database that collects all the 3D structures of biological macromolecules experimentally determined so far, either through X-ray or electron diffraction or NMR, there are at present about 47,000 structures of proteins deposited. They correspond to about 1,050 different folds according to SCOP (Murzin et al., 1995) or to 850 according to CATH (Orengo and Thornton, 2005). We do not know yet if they can be considered representative of all the possible folds present in living

∗ The amino acid sequence is also called the primary structure. The organization of a protein includes three other levels: the secondary structure considers how the polypeptide chain folds on itself, forming pieces of repeated conformation; the tertiary structure describes how secondary structure elements (SSEs) organize in 3D space; the quaternary structure (which is not present in all proteins) describes the organization of more than one polypeptide chain.
† The term “fold” is used to indicate the way SSEs are arranged in space and is roughly a synonym of “tertiary structure.”
  • 54. organisms: until some years ago it was estimated that there could be about 1000 possible folds; since completely new folds have not been discovered in the last four years, it is quite reasonable to assume that the number of folds we know is probably quite close to the total number of existing ones. If so, this means that in nature a limited number of 3D architectures have been developed, and those are used to perform all the necessary functions of cells and organisms. Interestingly, similar 3D folds can be present in proteins that bear small or even undetectable sequence homology, but at present we are not yet able, given an amino acid sequence unrelated to that of previously known 3D structures, to predict with sufficient reliability which fold that particular sequence will assume.

2.1.2 Protein classification and the discovery of hidden rules

The concept of fold similarity is not exempt from ambiguities. Do protein families really exist, or is it more likely that there is a sort of “continuum” of similarities? The idea of grouping proteins into “families” according to their fold similarity possibly derives from our need for classification and categorization (Gibrat et al., 1996). Whilst in some cases two proteins clearly share the same fold, in others this similarity is questionable, and, in fact, different programs estimate different numbers of total folds and disagree on whether some proteins belong to the same family (Figure 2.1). This need for categorization has, however, great practical relevance, both in structure prediction and in function assignment.
The experimental determination of the 3D structure of a protein, either by X-ray or NMR, nowadays takes months or, in difficult cases, years, while the sequences of entire genomes, and consequently of the proteins coded by them, are determined at a very high rate.∗ In this respect, the ability to predict the 3D structure of a protein is of paramount importance. At the same time, the recognition of structural similarities in proteins that present limited or nonexistent sequence similarity can sometimes be used to assign a biological role to a protein of unknown function, when its 3D structure has been determined.

Sometimes similarity does not involve entire structures, but only a portion of them: it is limited to a single domain, i.e., to a substructure that can be defined as an independent structural unit inside a larger protein. In order to detect similarities, at least in all cases that are not self-evident, the parameters and the algorithm used become relevant and can strongly influence the final results. Different algorithms have been devised to compare and superimpose protein structures, but none of them is completely free of failures. Some impose the constraint of continuity of the matched atoms along the primary sequence, in other words they preserve the sequential order of the matched atoms; other methods try to minimize the so-called “soap-bubble area” between two structures, or involve other techniques, like lattice fitting (surveys

∗ At the time of writing of this chapter 680 genomes of bacteria (http://www.ebi.ac.uk/genomes/bacteria.html) and 33 of eukaryotes (http://www.ebi.ac.uk/2can/genomes/eukaryotes.html) are available, and many others are in progress.
  • 56. There had been loss of life here—no great amount as loss of life is measured these times in this country, but attended by conditions that made the disaster hideous and distressing. The blood of victims still trickled in runlets between the paving stones where we walked, and there were mangled bodies stretched on the floor of an improvised morgue across the way—mainly bodies of poor working women, and one, I heard, the body of a widow with half a dozen children, who now would be doubly orphaned, since their father was dead at the Front. Back again at my hotel after a forenoon packed with curious experiences, I found in my quarters a very badly scared chambermaid, trying to tidy a room with fingers that shook. In my best French, which I may state is the worst possible French, I was trying to explain to her that the bombardment had probably ended— and for a fact there had been a forty-minute lull in the new frightfulness—when one of the shells struck and went off among the trees and flowerbeds of a public breathing place not a hundred and fifty yards away. With a shriek the maid fell on her knees and buried her head, ostrich fashion, in a nest of sofa pillows. I stepped through my bedroom window upon a little balcony in time to see the dust cloud rise in a column and to follow with my eyes the frenzied whirlings of a great flock of wood pigeons flighting high into the air from their roosting perches in the park plot. The next instant I felt a violent tugging at the back breadth of the leather harness that I wore. Unwittingly, in her panic the maid had struck upon the only possible use to which a Sam Browne belt may be put—other than the ornamental, and that is a moot point among fanciers of the purely decorative in the matter of military gearing for the human form. By accident she had divined its one utilitarian purpose. She had risen and with both hands had laid hold upon the crosspiece of my main surcingle and was striving to drag me inside. 
I rather gathered from the tenor of her contemporaneous remarks, which she uttered at the top of her voice and into which she interjected the names of several saints, that she feared the sight of
  • 57. me in plain view on that stone ledge might incite the invisible marauder to added excesses. But I was the larger and stronger of the two, and my buckles held, and I had the advantage of an iron railing to cling to. After a short struggle my would-be rescuer lost. She turned loose of my kicking straps and breech bands, and making hurried reference to various names in the calendar of the canonised she fled from my presence. I heard her falling down the stairs to the floor below. The next day I had a new chambermaid; this one had tendered her resignation. Not until the middle of the afternoon was the proper explanation for the phenomenon forthcoming. It came then from the Ministry of War, in the bald and unembroidered laconics of a formal communiqué. At the first time of hearing it the announcement seemed so inconceivable, so manifestly impossible that official sanction was needed to make men believe Teuton ingenuity had found a way to upset all the previously accepted principles touching on gravity and friction; on arcs and orbits; on aims and directions; on projectiles and projectives; on the resisting tensility of steel bores and on the carrying power of gun charges—by producing a cannon with a ranging scope of somewhere between sixty and ninety miles. Days of bombardment followed—days which culminated on that never-to-be-forgotten Good Friday when malignant chance sped a shell to wreck one of the oldest churches in Paris and to kill seventy- five and wound ninety worshippers gathered beneath its roof. After the first flurry of uncertainty the populace for the most part grew tranquil; now that they knew the origin of the far-flung punishment there was measurably less dread of the consequences among the masses of the people. 
On days when the shells exploded futilely the daily press and the comedians in the music halls made jokes at the expense of Big Bertha; as, for example, on a day when a fragment of shell took the razor out of the hand of a man who was shaving himself, without doing him the slightest injury; and again when a whole shell wrecked a butcher shop and strewed the neighbourhood with kidneys and livers and rib ends of beef, but spared the butcher and his family. On days when the colossal piece
  • 58. scored a murderous coup for its masters and took innocent life, the papers printed the true death lists without attempt at concealment of the ravages of the monster. And on all the bombardment days, women went shopping in the Rue de la Paix; children played in the parks; the flower women of the Madeleine sold their wares to customers with the reverberations of the explosions booming in their ears; the crowds that sat sipping coloured drinks at small tables in front of the boulevard cafés on fair afternoons were almost as numerous as they had been before the persistent thing started; and unless the sound was very loud indeed the average promenader barely lifted his or her head at each recurring report. In America we look upon the French as an excitable race, but here they offered to the world a pattern for the practice of fortitude. A good many people departed from Paris to the southward. However, there was calmness under constant danger. Our own people, who were in Paris in numbers mounting up into the thousands, likewise set a fine example of sang-froid. On the evening of the opening day of the bombarding, when any one might have been pardoned for being a bit jumpy, an audience of enlisted men which packed the American Soldiers and Sailors' Club in the Rue Royale was gathered to hear a jazz band play Yankee tunes and afterward to hear an amateur speaker make an address. The cannon had suspended its annoying performances with the going down of the sun, but just as the speaker stood up by the piano the alerte for an air attack—which, by the way, proved to be a false alarm, after all —was heard outside. There was a little pause, and a rustling of bodies. Then the man, who was on his feet, spoke up. “I'll stay as long as any one else does,” he said. 
“Anyhow, I don't know which is likely to be the worse of two evils—my poor attempts at entertaining you inside or the boche's threatened performances outside.” A great yell of approval went up and not a single person left the building until after the chairman announced that the programme for the evening had reached its conclusion. I know this to be a fact because I was among those present.
  • 59. To be sure, the strain of the harassment got upon the nerves of some; that would be inevitable, human nature being what it is. Attendance at the theatres, especially for the matinées, fell off appreciably; this, though, being attributable, I think, more to fear of panic inside the buildings than to fear of what the missiles might do to the buildings themselves. And there was no record of any individual, whether man or woman, quitting a post of responsibility because of the personal peril to which all alike were exposed. Likewise on those days when the great gun functioned promptly at twenty-minute intervals one would see men sitting in drinking places with their eyes glued to the faces of their wrist watches while they waited for the next crash. For those whose nerves lay close to their skins this damnable regularity of it was the worst phase of the thing. There was something so characteristically and atrociously German, something so hellishly methodical in the tormenting certainty that each hour would be divided into three equal parts by three descending steel tubes of potential destruction. Big Bertha operated on a perfect schedule. She opened up daily at seven a. m. sharp; she quit at six-twenty p. m. It was as though the crew that tended her carried union cards. They were never tardy. Neither did they work overtime. But if the Prussians counted upon bedeviling the people into panic and distracting the industrial and social economies of Paris they missed their guess. They made some people desperately unhappy, no doubt, and they frightened some; but the true organism of the community remained serene and unimpaired. 
Some share of this, I figure, might be attributed to the facts that in a city as great as Paris the chances of any one individual being killed were so greatly reduced that the very size of the town served to envelop its inhabitants with a sense of comparative immunity; the number of buildings and their massiveness inspired a feeling of partial security. I know I felt safer than I have felt out in the open when the enemy's playful batteries were searching out the terrain round about. In a smaller city this condition probably would not have been manifest to the same degree. There almost everybody would
be likely to know personally the latest victim or to be familiar with the latest scene of damage and this would serve doubtlessly to bring the apprehension home to all households. Howsoever, be the underlying cause what it might, Paris weathered the brunt of the ordeal with splendid fortitude and an admirable coolness. Being frequently in Paris between visits to one or another sector of the front, I was able to keep a fairly accurate score in the ravages of the bombardment and to get a fairly average appraisal of the effects upon the Parisian temper. Likewise by reading translated extracts out of German newspapers I got impressions of another phase of the tragedy which almost was as vivid as though I had been an eye witness to events which I knew of only at second-hand from the published descriptions of them. I had the small advantage though on my side of being able to visualise the setting in the Forest of St. Gobain, to the west of Laon, for I was there once in German company. I could conjure up a presentiment of the scene there enacted on the day when Big Bertha's makers and masters sprang their well-guarded surprise, which so carefully and so secretly had been evolved during months of planning and constructing and experimentations. Behold then the vision: It is a fine spring morning. There is dew on the grass and there is song in the throats of the birds and young foliage is upon the trees. The great grey gun—it is nearly ninety feet long and according to inspired Teutonic chronicles resembles a vast metal crone—squats its misshapen mass upon a prepared concrete base in the edge of the woods, just on the timbered shoulder of a hill. Its long muzzle protrudes at an angle from the interlacing boughs of the thicket where it hides; at a very steep angle, too, since the charge it will fire must ascend twenty miles into the air in order to reach its objective.
Behind it is a stenciling of white birches and slender poplars flung up against the sky line; in front of it is a disused meadow where the newly minted coinage of a prodigal springtime—dandelions that are like gold coins and wild marguerites that are like silver ones—spangle the grass as though the profligate season had strewn its treasures broadcast there. The gunners make ready the
monster for its dedication. They open its great navel and slide into its belly a steel shell nine inches thick and three feet long nearly and girthed with beltings of spun brass. The supreme moment is at hand. From a group of staff officers advances a small man, grown old beyond his time; this man wears the field uniform of a Prussian field marshal. He has a sword at his side and spurs on his booted feet and a spiked helmet upon his head. He has a withered arm which dangles abortively, foreshortened out of its proper length. His hair is almost snow-white and his moustache with its fiercely upturned and tufted ends is white. From between slitted lids imbedded in his skull behind unhealthy dropsical pouches of flesh his brooding, morbid eyes show as two blue dots, like touches of pale light glinting on twin disks of shallow polished agate. He bears himself with a mien that either is imperial or imperious, depending upon one's point of view. While all about him bow almost in the manner of priests making obeisance before a shrine, he touches with one sacred finger the button of an electrical controller. The air is blasted and the earth rocks then to the loudest crash that ever issued from the mouth of a gun; for all its bulk and weight the cannon recoils on its carriage and shakes itself; the tree tops quiver in a palsy. The young grass is flattened as though by a sudden high wind blowing along the ground; the frightened birds flutter about and are mute. The bellowing echoes die away in a fainter and yet fainter cadence. The-Anointed-of-God turns up his good wrist to consider the face of the watch strapped thereon; his staff follow his royal example. One minute passes in a sort of sacerdotal silence. There is drama in the pause; a fine theatricalism in the interlude. Two minutes, two minutes and a half pass.
This is one part of the picture; there is another part of it: Seventy miles away in a spot where a busy street opens out into a paved plaza all manner of common, ordinary work-a-day persons are busied about their puny affairs. In addition to being common and ordinary these folks do not believe in the divine right of kings; truly
a high crime and misdemeanour. Moreover, they persist in the heretical practice of republicanism; they believe actually that all men were born free and equal; that all men have the grace and the authority within them to choose their own rulers; that all men have the right to live their own lives free from foreign dictation and alien despotism. But at this particular moment they are not concerned in the least with politics or policies. Their simple day is starting. A woman in a sidewalk kiosk is ranging morning papers on her narrow shelf. A half-grown girl in a small booth set in the middle of the square where the tracks of the tramway end, is selling street car tickets to working men in blouses and baggy corduroy trousers. Hucksters and barrow-men have established a small market along the curbing of the pavement. A waiter is mopping the metal tops of a row of little round tables under the glass marquee of a café. Wains and wagons are passing with a rumble of wheels. Here there is no drama except the simple homely drama of applied industry. Three minutes pass: Far away to the north, where the woods are quiet again and the birds have mustered up courage to sing once more, The Regal One drops his arm and looks about him at his officers, nodding and smiling. Smiling, they nod back in chorus, like well-trained automatons. There is a murmur of interchanged congratulations. The effort upon which so much invaluable time and so much scientific thought have been expended, stands unique and accomplished. Unless all calculations have failed the nine-inch shell has reached its mark, has scored its bull's eye, has done its predestined job. It has; those calculations could not go wrong. Out of the kindly and smiling heavens, with no warning except the shriek of its clearing passage through the skies, the bolt descends in the busy square.
The glass awning over the café front becomes a darting rain of sharp-edged javelins; the paving stones rise and spread in hurtling fragments from a smoking crater in the roadway. There are a few minutes of mad frenzy among those people assembled there. Then a measure of quiet succeeds to the tumult. The work of rescue starts. The woman who vended papers is a crushed mass under the
wreckage of her kiosk; the girl who sold car tickets is dead and mangled beneath her flattened booth; the waiter who wiped the table-tops off lies among his tables now, the whole crown of his head sliced away by slivers of glass; here and there in the square are scattered small motionless clumps that resemble heaps of bloodied and torn rags. Wounded men and women are being carried away, groaning and screaming as they go. But in the edge of the woods at St. Gobain the Kaiser is climbing into his car to ride to his headquarters. It is his breakfast-time and past it and he has a fine appetite this morning. The picture is complete. The campaign for Kultur in the world has scored another triumph, the said score standing: Seven dead; fifteen injured.
CHAPTER XV. WANTED: A FOOL-PROOF WAR
There was a transportload of newly made officers coming over for service here in France. There was on board one gentleman in uniform who bore himself, as the saying goes, with an air. By reason of that air and by reason of a certain intangible atmospheric something about him difficult to define in words he seemed intent upon establishing himself upon a plane far remote from and inaccessible to these fellow voyagers of his who were crossing the sea to serve in the line, or to act as interpreters, or to go on staffs, or to work with the Red Cross or the Y. M. C. A. or the K. of C. or what not. He had what is called the superior manner, if you get what I mean—and you should get what I mean, reader, if ever you had lived, as I have, for a period of years hard by and adjacent to that particular stretch of the eastern seaboard of North America where, as nowhere else along the Atlantic Ocean or in the interior, are to be found in numbers those favoured beings who acquire merit unutterable by belonging to, or by being distantly related to, or by being socially acquainted with, the families that have nothing but. Nevertheless, and to the contrary notwithstanding, divers of his brother travellers failed to keep their distance. Toward this distinguished gentleman they deported themselves with a familiarity and an offhandedness that must have been acutely distasteful to one unaccustomed to moving in a mixed and miscellaneous company. Accordingly he took steps on the second day out to put them in their proper places. A list was being circulated to get up a subscription for something or other, and almost the very first person to whom this list came in its rounds of the first cabin was the person
in question. He took out a gold-mounted fountain pen from his pocket and in a fair round hand inscribed himself thus: “Bejones of Tuxedo” There were no initials—royalty hath not need for initials—but just the family name and the name of the town so fortunate as to number among its residents this notable—which names for good reasons I have purposely changed. Otherwise the impressive incident occurred as here narrated. But those others just naturally refused to be either abashed or abated. They must have been an irreverent, sacrilegious lot, by all accounts. The next man to whom the subscription was carried took note of the new fashion in signatures and then gravely wrote himself down as “Spirits of Niter”; and the next man called himself “Henri of Navarre”; and the third, it developed, was no other than “Cream of Tartar”; and the next was “Timon of Athens”; and the next “Mother of Vinegar”—and so on and so forth, while waves of ribald and raucous laughter shook the good ship from stem to stern. However, the derisive ones reckoned without their host. For them the superior mortal had a yet more formidable shot in the locker. On the following day he approached three of the least impressed of his temporary associates as they stood upon the promenade deck, and apropos of nothing that was being said or done at the moment he, speaking in a clear voice, delivered himself of the following crushing remark: “When I was born there were only two houses in the city of New York that had porte-cochères, and I—I was born in one of them.” Inconceivable though it may appear, the fact is to be recorded that even this disclosure failed to silence the tongues of ridicule aboard that packet boat. Rather did it enhance them, seeming but to spur the misguided vulgarians on and on to further evidences of disrespect.
There are reasons for believing that Bejones of Tuxedo, who had been born in the drafty semipublicity of a porte-cochère, left the vessel upon its arrival with some passing sense of relief, though it should be stated that up until the moment of his
debarkation he continued ever, while under the eye of the plebes and commoners about him, to bear himself after a mode and a port befitting the station to which Nature had called him. He vanished into the hinterland of France and was gone to take up his duties; but he left behind him, among those who had travelled hither in his company, a recollection which neither time nor vicissitude can efface. Presumably he is still in the service, unless it be that ere now the service has found out what was the matter with it. I have taken the little story concerning him as a text for this article, not because Bejones of Tuxedo is in any way typical of any group or subgroup of men in our new Army—indeed I am sure that he, like the blooming of the century plant, is a thing which happens only once in a hundred years, and not then unless all the conditions are salubrious. I have chosen the little tale to keynote my narrative for the reason that I believe it may serve in illustration of a situation that has arisen in Europe, and especially in France, these last few months—a condition that does not affect our Army so much as it affects sundry side issues connected more or less indirectly with the presence on European soil of an army from the United States. Like most of the nations having representative forms of government that have gone into this war, we went in as an amateur nation so far as knowledge of the actual business of modern warfare was concerned. Like them, we have had to learn the same hard lessons that they learned, in the same hard school of experience. Our national amateurishness beforehand was not altogether to our discredit; neither was it altogether to our credit. Nobody now denies that we should have been better prepared for eventualities than we were.
On the other hand it was hardly to be expected that a peaceful commercial country such as ours—which until lately had been politically remote as it was geographically aloof upon its own hemisphere from the political storm-centres of the Old World, and in which there was no taint of the militarism that has been Germany's curse, and will yet be her undoing—should in times of peace greatly concern itself with any save the broad general details of the game of war, except as a heart-moving spectacle enacted upon the stage of
another continent and viewed by us with sympathetic and sorrowing eyes across three or four thousand miles of salt water. Prior to our advent into it the war had no great appeal upon the popular conscience of the United States. Out of the fulness of our hearts and out of the abundance of our prosperity we gave our dollars, and gave and gave and kept on giving them for the succour of the victims of the world catastrophe; but a sense of the impending peril for our own institutions came home to but few among us. Here and there were individuals who scented the danger; but they were as prophets crying in the wilderness; the masses either could not or would not see it. They would not make ready against the evil days ahead. So we went into this most highly specialised industry, which war has become, as amateurs mainly. Our Navy was no amateur navy, as very speedily developed, and before this year's fighting is over our enemy is going to realise that our Army is not an amateur army. We may have been greenhorns at the trade wherein Germans were experts by training and education; still we fancy ourselves as a reasonably adaptable breed. But if the truth is to be told it must be confessed that in certain of the allied branches of the business we are yet behaving like amateurs. After more than a year of actual and potential participation in the conflict we even now are doing things and suffering things to be done which would make us the laughingstock of our allies if they had time or temper for laughing. I am not speaking of the conduct of our operations in the field or in the camps or on the high seas. I am speaking with particular reference to what might be called some of the by-products. None of us is apt to forget, or cease to remember with pride, the flood of patriotic sacrifice that swept our country in the spring of 1917.
No other self-governing people ever adopted a universal draft before their shores had been invaded and before any of their manhood had fallen in battle. No other self-governing people ever accepted the restrictions of a food-rationing scheme before any of the actual provisions concerning that food-rationing scheme had been embodied into the written laws. Other countries did it under
compulsion, after their resources showed signs of exhaustion. We did it voluntarily; and it was all the more wonderful that we should have done it voluntarily when all about us was human provender in a prodigal fullness. There was plenty for our own tables. By self-imposed regulations we cut down our supplies so that our allies might be fed with the surplus thus made available. Outside of a few sorry creatures there was scarcely to be found in America an individual, great or small, who did not give, and give freely, of the work of his or her heart and hands to this or that phase of the mighty undertaking upon which our Government had embarked and to which our President, speaking for us all, had solemnly dedicated all that we were or had been or ever should be. All sorts of commissions, some useful and important beyond telling, some unutterably unuseful and incredibly unimportant, sprang into being. And to and fro in the land, in numbers amounting to a vast multitude, went the woman who wanted to do her part, without having the least idea of what that part would be or how she would go about doing it. She knew nothing of nursing; kitchen work, a vulgar thing, was abhorrent to her nature and to her manicured nails; she could not cook, neither could she sew or sweep—but she must do her part. She was not satisfied to stay on at home and by hard endeavour to fit herself for helping in the task confronting every rational and willing being between the two oceans. No, sir-ree, that would be too prosaic, too commonplace an employment for her. Besides, the working classes could attend to that job. She must do her part abroad—either in France within sound of the guns or in racked and desolated Belgium. Of course her intentions were good. The intentions of such persons are nearly always good, because they change them before they have a chance to go stale.
I think the average woman of this type had a mental conception of herself wearing a wimple and a coif of purest white, in a frock that was all crisp blue linen and big pearl buttons, with one red cross blazing upon her sleeve and another on her cap, sitting at the side of a spotless bed in a model hospital that was fragrant with flowers,
and ministering daintily to a splendid wounded hero with the face of a demigod and the figure of a model for an underwear ad. Preferably this youth would be a gallant aviator, and his wound would be in the head so that from time to time she might adjust the spotless bandage about his brow. I used to wish sometimes when I met such a lady that I might have drawn for her the picture of reality as I had seen it more times than once—tired, earnest, competent women who slept, what sleep they got, in lousy billets that were barren of the simplest comforts, sleeping with gas masks under their pillows, and who for ten or twelve or fifteen or eighteen hours on a stretch performed the most nauseating and the most necessary offices for poor suffering befouled men lying on blankets upon straw pallets in wrecked dirty houses or in half-ruined stables from which the dung had hurriedly been shoveled out in order to make room for suffering soldiers—stables that reeked with the smells of carbolic and iodoform and with much worse smells. It is an extreme case that I am describing, but then the picture is a true picture, whereas the idealistic fancy painted by the lady who just must do her part at the Front had no existence except in the movies or in her own imagination. It never occurred to her that there would be slop jars to be emptied or filthy bodies, alive with crawling vermin, to be cleansed. It never occurred to her that she would take up room aboard ship that might better be filled with horse collars or hardtack or insect powder; nor that while over here she would consume food that otherwise would stay the stomach of a fighting man or a working woman; nor that if ever she reached the battle zone she would encounter living conditions appallingly bare and primitive beyond anything she could conceive; nor that she could not care for herself, and was fitted neither by training nor instinct to help care for any one else.
When I left America last winter a great flow of national sanity had already begun to rise above the remaining scourings of national hysteria; and the lady whose portrait I have tried in the foregoing paragraphs to sketch was not quite so numerous or so vociferous as
she had been in those first few exalted weeks and months following our entrance into the war as a full partner in the greatest of enterprises. My surprise was all the greater therefore to find that she had beaten me across the water. She had pretty well disappeared at home. One typical example of this strange species crossed in the same ship with me. Heaven alone knows what political or social influence had availed to secure her passport for her. But she had it, and with it credentials from an organisation that should have known better. She was a woman of independent wealth seemingly, and her motives undoubtedly were of the best; but as somebody might have said: Good motives butter no parsnips, and hell is paved with buttered parsnips. Her notion was to drive a car at the Front—an ambulance or a motor truck or a general's automobile or something. She had owned cars, but she had never driven one, as she confessed; but that was a mere detail. She would learn how, some day after she got to Europe, and then somebody or other would provide her with a car and she would start driving it; such was her intention. Unaided she could no more have wrested a busted tire off of a rusted rim than she could have marcelled her own back hair; and so far as her knowledge of practical mechanics went, I am sure no reasonably prudent person would have trusted her with a nutpick; but she had the serene confidence of an inspired and magnificent ignorance. She had her uniform too. She had brought it with her and she wore it constantly. She said she designed it herself, but I think she fibbed there. No one but a Fifth Avenue mantuamaker of the sex which used to be the gentler sex before it got the vote could have thought up a vestment so ornate, so swagger and so complicated. It was replete with shoulder straps and abounding in pleats and gores and gussets and things. Just one touch was needed to make it a finished confection: By rights it should have buttoned up the back.
The woman who had the cabin next to hers in confidence told a group of us that she had it from the stewardess that it took the lady a full hour each day to get herself properly harnessed into her caparisons. Still I must say the effect, visually speaking, was worthy
of the effort; and besides, the woman who told us may have been exaggerating. She was a registered and qualified nurse who knew her trade and wore matter-of-fact garments and flat-heeled, broad-soled shoes. She was not very exciting to look at, but she radiated efficiency. She knew exactly what she would do when she got over here and exactly how she would do it. We agreed among ourselves that if we were in quest of the ornamental we would search out the lady who meant to drive the car—provided there was any car; but that if anything serious ailed any of us we would rather have the services of one of the plain nursing sisterhood than a whole skating-rinkful of the other kind round. In the latter part of 1917 there landed in France a young woman hailing from a Far Western city whose family is well known on the Pacific Slope. She brought with her letters of introduction signed by imposing names and a comfortable sum of money, which had been subscribed partly out of her own pocket and partly out of the pockets of well-meaning persons in her home state whom she had succeeded in interesting in her particular scheme of wartime endeavour. She was very fair to see and her uniform, by all accounts, was very sweet to look upon, it being a horizon-blue in colour with much braiding upon the sleeves and collar. It has been my observation since coming over that when in doubt regarding their vocations and their intentions these unattached lady zealots go in very strongly for striking effects in the matter of habiliments. Along the boulevards and in the tearooms I have encountered a considerable number who appeared to have nothing to do except to wear their uniforms. However, this young person had no doubt whatever concerning her motives and her purposes. The whole thing was all mapped out in her head, as developed when she called upon a high official of our Expeditionary Forces at his headquarters in the southern part of France.
She told him she had come hither for the express purpose of feeding our starving aviators. He might have told her that so long as there continued to be served fried potato chips free at the Crillon bar there was but little danger of any airman going hungry, in Paris at
least. What he did tell her when he had rallied somewhat from the shock was that he saw no way to gratify her in her benevolent desire unless he could catch a few aviators and lock them up and starve them for two or three days, and he rather feared the young men might object to such treatment. As a matter of fact, I understand he so forgot himself as to laugh at the young woman. At any rate his attitude was so unsympathetic that he practically spoiled the whole war for her, and she gave him a piece of her mind and went away. She had departed out of the country before I arrived in it, and I learned of her and her uniform and her mission and her disappointment at its unfulfillment by hearsay only; but I have no doubt, in view of some of the things I have myself seen, that the account which reached me was substantially correct. Along this line I am now prepared to believe almost anything. Here, on the other hand, is a case of which I have direct and first-hand knowledge. I encountered a group of young women attached to one of the larger American organisations engaged in systematised charities and mercies on this side of the water. Now, plainly these young women were inspired by the very highest ideals; that there was no discounting. They were full of the spirit of service and sacrifice. Mainly they were college graduates. Without exception they were well bred; almost without exception they were well educated. The particular tasks for which they had been detailed were to care for pauperised repatriates returning to France through Switzerland from areas of their country occupied by the enemy, and to aid these poor folks in reestablishing their home life and to give them lessons in domestic science.
To the success of their ministrations there was just one drawback: They were dealing with peasants mostly—furtive, shy, secretive folks who under ordinary circumstances would be bitterly resentful of any outside interference by aliens with their mode of life, and who in these cases had been rendered doubly suspicious by reason of the misfortunes they had endured while under the thumb of the Germans.
To understand them, to plumb diplomatically the underlying reasons for their prejudices, to get upon a basis of helpful sympathy with them, it was highly essential that those dealing with them not only should have infinite tact and finesse but should be able to fathom the meaning of a nod or a gesture, a sidelong glance of the eyes or the inflection of a muttered word. And yet of those zealous young women who had been assigned to this delicate task there was scarcely one in six who spoke any French at all. It inevitably followed that the bulk of their patient labours should go for naught; moreover, while they continued in this employment they were merely occupying space in an already crowded country and consuming food in an already needy country; the both of which—space and food—were needed for people who could accomplish effective things. An American woman who is reputed to be a dietetic specialist came over not long ago, backed by funds donated in the States. Her instructions were to establish cafeterias at some of the larger French munition works. Probably her chagrin was equalled only by her astonishment when she learned that for reasons which seemed to it good and sufficient—and which no doubt were—the French Government did not want any American-plan cafeterias established at any of its munition works.
Apparently it had not seemed feasible and proper to the sponsors of the diet specialist to find out before dispatching her overseas whether the plan would be agreeable to the authorities here; or whether there already were eating places suitable to the desires of the working people at these munition plants; or how long it would take, given the most favourable conditions, to cure the workers of their tenacious instinct for eating the kind of midday meal they have been eating for some hundreds of years and accustom them and their palates and their stomachs to the Yankee quick lunch with its baked pork and beans, its buckwheat cakes with maple sirup and its four kinds of pie. In their zeal the promoters, it would seem, had entirely overlooked those essential details. It is just such omissions as this one that the fine frenzy of helping out in wartime appears to develop in a nation that is given
to boasting of its business efficiency and that vaunts itself that it knows how to give generously without wasting foolishly. The field manager of an organisation that is doing a great deal for the comfort of our soldiers and the soldiers of our allies told me of one of his experiences. He had a sense of humour and he could laugh over it, but I think I noted a suggestion of resentment behind the laughter. He said that some months before he set up and assumed charge of a plant well up toward the trenches in a sector that had been taken over by the American troops. It was a large and elaborate concern, as these concerns are rated in the field. The men were pleased with its accommodations and facilities, and the field manager was proud of it. One day there appeared a businesslike young woman who introduced herself as belonging to a kindred organisation that was charged with the work of decorating the interiors of such establishments as the one over which he presided. Somewhat puzzled, he showed her, first of all, his canteen. It was as most such places are: There were boxes of edibles upon counters, in open boxes, so that the soldier customers might appraise the wares before investing; upon the shelves there were soft drinks and smoking materials and all manner of small articles of wearing apparel; likewise baseballs and safety razors and soap, toilet kits and the rest of it. Altogether the manager and his two assistants were rather pleased with the arrangement. The newly arrived young woman swept the scene with a cold professional eye. “On the whole this will do fairly well,” she said with a certain briskness in her tone. “Yes, I may say it will do very well indeed—with certain changes, certain touches.” “As for example, what, please?” inquired the superintendent.
“Well,” she said, “for one thing we must put up some bright curtains at the windows; and to lighten up the background I think we'll run a stenciled pattern in some cheerful colour round the walls at the top.”
It was not for the manager to inquire how the decorator meant to get her curtains and her stencils and her wall paints up over a road that was being alternately gassed and shelled at nights and on which the traffic capacity was already taxed to the utmost by the business of bringing up supplies, munitions and rations from the base some fifteen miles in the rear. He merely bowed and awaited the lady's further commands. “And now,” she said, “where is the rest room?” “The rest room, did you say?” “Certainly, the rest room—the recreation hall, the place where these poor men may go for privacy and innocent amusement?” “Well, you see, thus close up near the Front we haven't been able to make provision for a regular rest room,” explained the manager. “Besides, in case of a withdrawal or an attack we might have to pull out in a hurry and leave behind everything that is not readily portable on wagons or trucks. The nearest approach that we have to a rest room is here at the rear.” He led the way to a room at the back. It contained such plenishings as one generally finds in improvised quarters in the field—that is to say, it contained a curious equipment made up partly of crude bits of furniture collected on the spot out of villagers' abandoned homes and partly of makeshift stools and tables coopered together from barrels and boxes and stray bits of planking. Also it contained at this time as many soldiers as could crowd into it. A phonograph was grinding out popular airs, and divers games of checkers and cards were in progress, each with its fringe of interested onlookers ringing in the players. “Oh, but this will never do—never!” stated the inspecting lady. “It is too bare, too cheerless! It lacks atmosphere. It lacks coziness; it lacks any appeal to the senses—in short it lacks everything! We must have some immediate improvements here by all means.” The man was beginning to lose his temper. By an effort he retained it.
“The men seem fairly well satisfied; at least I have heard no complaint,” he said. “What would you suggest in the way of changes?”
As she answered, the visitor ticked off the items of her mental inventory of essentials on her fingers. “Well, to begin with we must clear all this litter out of here,” she said. “Then we must install some really comfortable chairs and at least two or three roomy sofas and some simple couches where the men may lie down. I should also like to see a piano here. That, with the addition of some curtains at the windows and some simple treatment of the walls and a few appropriate pictures properly spaced and properly hung, will be different, I think.” “Yes,” demurred the manager, “but admitting that we could get the things you have enumerated up here, another problem would arise: This room, which, as you see, is not large, would be so crowded with the furnishings that there would be room in it for very many less men than usually come here. There are probably fifty men in it now. If it were filled up with sofas and couches and a piano I doubt whether we could crowd twenty men inside of it.” “Very well, then,” stated the lady decorator calmly, “you must admit only twenty men at a time.” “Quite so; but how,” he demanded—“how am I going to select the twenty?” The young woman considered the question for a moment. Then a solution came to her. “I should select the twenty neatest ones,” she said. Whereupon the manager excused himself and went out to frame a dispatch to headquarters embodying an ultimatum, which ultimatum was that the lady decorator went away from there forthwith or his resignation must take effect, coincident with his immediate departure from his present post. The home office must have called the lady off, because when I saw him he was still in harness, and swinging a man-size job in a competent way. I would not have the reader believe that I am casting discredit upon either the patriotic impulses or the honest motives of the bulk of the lay workers who have journeyed to Europe, paying their own way and their own living expenses.
Often they arrive, many of them,
to strike hands with the military authorities in the task which faces our nation on Continental soil. There is room and a welcome in France, in Italy, in England and in Flanders for every civilian recruit who really knows how to do something helpful and who has the strength, the self-reliance and the hardihood to perform that particular function under difficult and complicated conditions, which nearly always are physically uncomfortable and which may become physically dangerous. Nor would I wish any one to assume that I am deprecating by inference or by frontal attack the very fine things that are being accomplished every day by fine American women and girls who answered the first call for trained helpers, to serve in hospitals or canteens or huts, in settlement work or at telephone exchanges. It will make any American thrill with pride to enter a ward where the American Red Cross is in charge, or where a medical unit from one of the great hospitals or one of our great universities back home has control. The French and the British are quick enough to speak in terms of highest praise of the achievements of American surgeons, American nurses and American ambulance drivers. They say, and with good reason for saying it, that our people have pluck and that they have skill and that they above all are amazingly resourceful. Personally I know of no smarter exhibition of native wit and courage that the war has produced than was shown by that group of Smith College girls who had been organising and directing colonisation work among the peasants in the reclaimed districts of Northern France and who were driven out by the great spring advance of the Germans. I met some of those young women. They were modest enough in describing their adventure.
It was by gathering a shred of a story there and a scrap of an anecdote here that I was able to piece together a fairly accurate estimate of the self-imposed discipline, the clean-strained grit and the initiative which marked their conduct through three trying weeks. Perhaps it was a mistake in their instance, as in the instances of divers similar organisations, that the work of resettling the wasted lands above the Aisne and the Oise should have been undertaken at
points that would be menaced in the event of a quick onslaught by the Prussian high command. The British, I understand, privately objected to the undertakings on the ground that the presence of American women in villages which might fall again into the foe's hands—and which as it turned out did fall again into his hands—entailed an added burden and an added responsibility upon the fighting forces. The British were right. Practically all of the repatriated peasants had to flee for the second time, abandoning their rebuilt homes and their newly sowed fields. On the heels of these, improvements which represented many thousands of American dollars and many months of painstaking labour on the part of devoted American women went up in flames. The torch was applied rather than that the little model houses and the tons of donated supplies on hand should go into hostile hands. Those Smith College girls did not run away, though, until the Germans were almost upon them. Up to the very last minute they stayed at their posts, feeding and housing not only refugees but many exhausted soldiers, British and French, who staggered in, spent and sped after alternately fighting and retreating through a period of days and nights. When finally they did come away each one of them came driving her own truck and bearing in it a load of worn-out and helpless natives. One girl brought out a troop of frightened dwarfs from a stranded travelling caravan. Another ministered day and night to a blind woman nearly ninety years old and a family of orphaned babies. The passengers of a third were four inmates of a little communal blind asylum that happened to be in the invader's path. On the way, in addition to tending their special charges, they cooked and served hundreds of meals for hungry soldiers and hungry civilians. They spent the nights in towns under shell fire, and when at length the German drive had been checked they assembled their forces in Beauvais.
Thus and with characteristic adaptability some became drivers of ambulances and supply trucks plying along the lines of communication, and some opened a kitchen for the benefit of passing soldiers at the local railway station. If the faculty
and the students and the alumnæ of Smith College did not hold a celebration when the true story of what happened in March and April reached them they were lacking in appreciation—that's all I have to say about it. Right here seems a good-enough place for me to slip in a few words of approbation for the work which another organisation has accomplished in France since we put our men into the field. Nobody asked me to speak in its favour because so far as I can find out it has no publicity department. I am referring to the Salvation Army—may it live forever for the service which, without price and without any boasting on the part of its personnel, it is rendering to our boys in France! A good many of us who hadn't enough religion, and a good many more of us who mayhap had too much religion, look rather contemptuously upon the methods of the Salvationists. Some have gone so far as to intimate that the Salvation Army was vulgar in its methods and lacking in dignity and even in reverence. Some have intimated that converting a sinner to the tap of a bass drum or the tinkle of a tambourine was an improper process altogether. Never again, though, shall I hear the blare of the cornet as it cuts into the chorus of hallelujah whoops where a ring of blue-bonneted women and blue-capped men stand exhorting on a city street corner under the gas lights, without recalling what some of their enrolled brethren—and sisters—have done and are doing in Europe. The American Salvation Army in France is small, but, believe me, it is powerfully busy! Its war delegation came over without any fanfare of the trumpets of publicity. It has no paid press agents here and no impressive headquarters. There are no well-known names, other than the names of its executive heads, on its rosters or on its advisory boards. None of its members is housed at an expensive hotel and none of them has handsome automobiles in which to travel about from place to place.
No campaigns to raise nation-wide millions of dollars for the cost of its ministrations overseas were ever held at home. I imagine it is the pennies of the poor that mainly fill its war chest.
I imagine, too, that sometimes its finances are an uncertain quantity. Incidentally I am assured that not one of its male workers here is of draft age unless he holds exemption papers to prove his physical unfitness for military service. The Salvationists are taking care to purge themselves of any suspicion that potential slackers have joined their ranks in order to avoid the possibility of having to perform duties in khaki. Among officers as well as among enlisted men one occasionally hears criticism—which may or may not be based on a fair judgment—for certain branches of certain activities of certain organisations. But I have yet to meet any soldier, whether a brigadier or a private, who, if he spoke at all of the Salvation Army, did not speak in terms of fervent gratitude for the aid that the Salvationists are rendering so unostentatiously and yet so very effectively. Let a sizable body of troops move from one station to another, and hard on its heels there came a squad of men and women of the Salvation Army. An army truck may bring them, or it may be they have a battered jitney to move them and their scanty outfits. Usually they do not ask for help from any one in reaching their destinations. They find lodgment in a wrecked shell of a house or in the corner of a barn. By main force and awkwardness they set up their equipment, and very soon the word has spread among the troopers that at such-and-such a place the Salvation Army is serving free hot drinks and free doughnuts and free pies. It specialises in doughnuts, the Salvation Army in the field does—the real old-fashioned homemade ones that taste of home to a homesick soldier boy. I did not see this, but one of my associates did. He saw it last winter in a dismal place on the Toul sector. A file of our troops were finishing a long hike through rain and snow over roads knee-deep in half-thawed icy slush.
Cold and wet and miserable, they came tramping into a cheerless, half-empty town within sound and range of the German guns. They found a reception committee awaiting them there—in the person of two Salvation Army lassies and a Salvation Army captain. The women had a fire going in the dilapidated oven of a vanished villager's kitchen. One of them was
rolling out the batter on a plank with an old wine bottle for a rolling pin and using the top of a tin can to cut the dough into circular strips. The other woman was cooking the doughnuts, and as fast as they were cooked the man served them out, spitting hot, to hungry wet boys clamouring about the door, and nobody was asked to pay a cent. At the risk of giving mortal affront to ultra-doctrinal practitioners of applied theology I am firmly committed to the belief that by the grace of God and the grease of doughnuts those three humble benefactors that day strengthened their right to a place in the Heavenly Kingdom. As I said a bit ago, there is in France room and to spare and the heartiest sort of welcome for competent, sincere lay workers, both men and women. But there is no room, and if truth be known, there is no welcome for any other sort. These people over here long ago passed out of the experimental period in the handling of industrial and special problems that have grown up out of war. They have entirely emerged from the amateur stage of endeavour and direction. If any man doubts the truth of this he has only to see, as I have seen, the thousands of women who have taken men's jobs in the cities in order that the men might go to the colours; has only to see the overalled women in the big munition plants; has only to see how the peasant women of France are labouring in the fields and how the girls of the British auxiliary legions—the members of the W. A. A. C. for a conspicuous example—are carrying their share of the burden; has only to see women of high degree and low, each doing her part sanely, systematically and unflinchingly—to appreciate that, though Britain and France can find employment for every pair of willing and able hands somewhere behind the lines, they have no use whatsoever for the unorganised applicant or for the purely ornamental variety of volunteer or yet for the mere notoriety seeker.
I make so bold as to suggest that it is time we were taking the same lesson to heart; time to start the sifting process ourselves. I have seen in Paris a considerable number of American women who appeared to have no business here except to air their most
becoming uniforms in public places and to tell in a vague broad way of the things they hope to do. The French, proverbially, are a polite race, and the French Government will endure a great deal of this kind of infliction rather than run the risk of engendering friction, even to the most minute extent, with the people or the administration of an Allied nation. But in wartime especially, too much patience becomes a dubious virtue, and if practiced for overlong may become a fault. As yet there has been no intimation from any official source that the French would rather our State Department did not issue quite so many passports to Americans who have no set and definite purpose in making the journey to these shores, but even a superficial knowledge of the French language and the most casual acquaintance with the French nature enable one to get at what the French people are thinking. I am sure that had the prevalent condition been reversed our papers would have voiced the popular protest at the imposition long before now. Some of these days, unless we apply the preventive measures on our own side of the Atlantic, the perfectly justifiable resentment of the hard-pressed French is going to find utterance; and then quite a number of well-intentioned but utterly inutile persons will be going back home with their feelings all harrowed up.
CHAPTER XVI. CONDUCTING WAR BY DELEGATION
Please do not think that because I have mainly dwelt thus far upon the women offenders that there are no American men in France who do not belong here, because that would be a wrong assumption. I merely have mentioned the women first because by reason of their military garbing—or what some of them fondly mistake for military garbing—they offer rather more conspicuous showing to the casual eye than the male civilian dress. The men are abundantly on hand though; make no mistake about that! Some of them come burdened with frock-coated dignity as members of special commissions or special delegations; in certain quarters there appears to be a somewhat hazy but very lively inclination to try to run our share of this war by commission. Some, I am sure, came for the same reason that the young man in the limerick went to the stranger's funeral—because they are fond of a ride. Some I think came in the hope of enjoying an exciting sort of junketing expedition, and some because they were all dressed up and had nowhere to go. As well as may be judged by one who has been away from home for going on five months now, the special-commission notion is being rather overdone. Individuals and groups of individuals bearing credentials from this fraternal organisation or that religious organisation or the other research society reach England on nearly every steamer that penetrates through the U-boat zone. Almost invariably these gentlemen carry letters of introduction testifying to their personal probity and their collective importance, which letters are signed by persons sitting in high places. It may be that the English are thereby deceived into believing that the visitors are entitled to special consideration—as indeed some of
them are, and indeed some of them most distinctly are not. Or then again it may be that the English are not aware of a device very common among our men of affairs for getting rid of a bore who is intent on going somewhere to see somebody and craves to be properly vouched for upon his arrival. In certain circles this habit is called passing the buck. In others it is known as writing letters of introduction. At any rate the English take no chances on offending the right party, even at the risk of favouring the wrong one. When a half dozen Yankees appear at the Foreign Office laden with letters addressed “To Whom it May Concern” the Foreign Office immediately becomes concerned. How is a guileless Britisher intrenched behind a flat-top desk to know that the August and Imperial Order of Supreme Potentates whose chosen emissaries are now present desirous of having a look at the war, and afterward to approve of it in a report to the Grand Lodge at its next annual convention, if so be they do see fit to approve of it—how, I repeat, is he to know that the August and Imperial Order of Supreme Potentates has a membership largely composed of class-C bartenders? Not knowing, he acts in accordance with the best dictates of his ignorance. The commission or the delegation or the presentation, whatever it calls itself, is provided with White Passes all round. On the strength of these White Passes the investigators are at the public expense transferred across the Channel and housed temporarily at the American Visitors' Château. From there they are taken in automobiles and under escort of very bored officers on a kind of glorified Cook's tour behind the British Front. Thereafter they are turned over to the French Mission or to the American forces for similar treatment.
As a result they accumulate an assortment of soft-boiled and yolkless impressions which they incubate into the spoken or the written word on the way back home, after they have held a meeting to decide whether they like the way the war is going on or whether they do not like the way the war is going on. Always there is the
possibility that as a result of the dissemination of underdone and undigested misinformations which they have managed to acquire these persons, though actuated by the best intentions in the world, may do considerable harm in shaping public opinion in America. And likewise one may be very sure a lot of pestered British and French functionaries are left to wonder what sort of folks the masses of American citizenship must be if these are typical samples of the thought-moulding class. I am not exaggerating much when I touch on this particular phase of the topic now engaging me, for I have seen two delegations in Europe, of the variety I have sought briefly to describe in the lines immediately foregoing; and we are expecting more in on the next boat. There was no imaginable reason why those whom I saw should be in a country that is at war at such a time of crisis as this time is, but the main point was that they were here, eating three large rectangular meals a day apiece and taking up the valuable time of overworked military men who accompanied them while they week-ended at the war. How many more such delegations will sift through the State Department and seep by the passport bureau and journey hither during the latter half of 1918 unless the Administration at Washington shuts down on the game no man can with accuracy calculate. Away down in the south of France I ran into a gentleman of a clerical aspect who lost no time in telling me about himself. He was tall and slender like a wand, and of a willowy suppleness of figure, and he was terribly serious touching on his mission. He represented a religious denomination that has several hundreds of thousands of communicants in the United States. He had been dispatched across, he said, by the governing body of his church.
His purpose, he explained, was to inquire into the bodily and spiritual well-being of his coreligionists who were on foreign service in the Army and the Navy, with a view subsequently to suggesting reforms for any existing evil in the military and naval systems when he reported back to the main board of his church.