Active Mining New Directions of Data Mining Frontiers in Artificial Intelligence and Applications 1st Edition by Hiroshi Motoda ISBN 158603264X 9781586032647
ACTIVE MINING
Frontiers in Artificial Intelligence
and Applications
Series Editors: J. Breuker, R. Lopez de Mantaras, M. Mohammadian, S. Ohsuga and
W. Swartout
Volume 79
Volume 3 in the subseries
Knowledge-Based Intelligent Engineering Systems
Editor: L.C.Jain
Previously published in this series:
Vol. 78, T. Vidal and P. Liberatore (Eds.), STAIRS 2002
Vol. 77, F. van Harmelen (Ed.), ECAI 2002
Vol. 76, P. Sinčák et al. (Eds.), Intelligent Technologies - Theory and Applications
Vol. 75, I.F. Cruz et al. (Eds.), The Emerging Semantic Web
Vol. 74, M. Blay-Fornarino et al. (Eds.), Cooperative Systems Design
Vol. 73, H. Kangassalo et al. (Eds.), Information Modelling and Knowledge Bases XIII
Vol. 72, A. Namatame et al. (Eds.), Agent-Based Approaches in Economic and Social Complex Systems
Vol. 71, J.M. Abe and J.I. da Silva Filho (Eds.), Logic, Artificial Intelligence and Robotics
Vol. 70, B. Verheij et al. (Eds.), Legal Knowledge and Information Systems
Vol. 69, N. Baba et al. (Eds.), Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies
Vol. 68, J.D. Moore et al. (Eds.), Artificial Intelligence in Education
Vol. 67, H. Jaakkola et al. (Eds.), Information Modelling and Knowledge Bases XII
Vol. 66, H.H. Lund et al. (Eds.), Seventh Scandinavian Conference on Artificial Intelligence
Vol. 65, In production
Vol. 64, J. Breuker et al. (Eds.), Legal Knowledge and Information Systems
Vol. 63, I. Gent et al. (Eds.), SAT2000
Vol. 62, T. Hruska and M. Hashimoto (Eds.), Knowledge-Based Software Engineering
Vol. 61, E. Kawaguchi et al. (Eds.), Information Modelling and Knowledge Bases XI
Vol. 60, P. Hoffman and D. Lemke (Eds.), Teaching and Learning in a Network World
Vol. 59, M. Mohammadian (Ed.), Advances in Intelligent Systems: Theory and Applications
Vol. 58, R. Dieng et al. (Eds.), Designing Cooperative Systems
Vol. 57, M. Mohammadian (Ed.), New Frontiers in Computational Intelligence and its Applications
Vol. 56, M.I. Torres and A. Sanfeliu (Eds.), Pattern Recognition and Applications
Vol. 55, G. Cumming et al. (Eds.), Advanced Research in Computers and Communications in Education
Vol. 54, W. Horn (Ed.), ECAI 2000
Vol. 53, E. Motta, Reusable Components for Knowledge Modelling
Vol. 52, In production
Vol. 51, H. Jaakkola et al. (Eds.), Information Modelling and Knowledge Bases X
Vol. 50, S.P. Lajoie and M. Vivet (Eds.), Artificial Intelligence in Education
Vol. 49, P. McNamara and H. Prakken (Eds.), Norms, Logics and Information Systems
Vol. 48, P. Navrat and H. Ueno (Eds.), Knowledge-Based Software Engineering
Vol. 47, M.T. Escrig and F. Toledo, Qualitative Spatial Reasoning: Theory and Practice
Vol. 46, N. Guarino (Ed.), Formal Ontology in Information Systems
Vol. 45, P.-J. Charrel et al. (Eds.), Information Modelling and Knowledge Bases IX
ISSN: 0922-6389
Active Mining
New Directions of Data Mining
Edited by
Hiroshi Motoda
Division of Intelligent Systems Science,
The Institute of Scientific and Industrial Research,
Osaka University, Osaka, Japan
IOS Press
Ohmsha
Amsterdam • Berlin • Oxford • Tokyo • Washington, DC
© 2002, Hiroshi Motoda
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted,
in any form or by any means, without the prior written permission from the publisher.
ISBN 1 58603 264 X (IOS Press)
ISBN 4 274 90521 7 C3055 (Ohmsha)
Library of Congress Control Number: 2002106944
Publisher
IOS Press
Nieuwe Hemweg 6B
1013BG Amsterdam
The Netherlands
fax:+31 206203419
e-mail: order@iospress.nl
Distributor in the UK and Ireland
IOS Press/Lavis Marketing
73 Lime Walk
Headington
Oxford OX3 7AD
England
fax:+44 1865750079
Distributor in the USA and Canada
IOS Press, Inc.
5795-G Burke Centre Parkway
Burke, VA 22015
USA
fax: +1 703 323 3668
e-mail: iosbooks@iospress.com
Distributor in Germany, Austria and Switzerland
IOS Press/LSL.de
Gerichtsweg 28
D-04103 Leipzig
Germany
fax:+49 341 995 4255
Distributor in Japan
Ohmsha, Ltd.
3-1 Kanda Nishiki-cho
Chiyoda-ku, Tokyo 101-8460
Japan
fax:+81 3 3233 2426
LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.
PRINTED IN THE NETHERLANDS
Preface
Our ability to collect data, be it in business, government, science, or perhaps personal life,
has been increasing at a dramatic rate. However, our ability to analyze and understand
massive data lags far behind our ability to collect them. The value of data is no longer in
"how much of it we have". Rather, the value is in how quickly and how effectively the
data can be reduced, explored, manipulated and managed.
Knowledge Discovery and Data Mining (KDD) is an emerging technique that extracts
implicit, previously unknown, and potentially useful information (or patterns) from data.
Recent advances made through extensive studies and real-world applications reveal
that no matter how powerful computers are now or will be in the future, KDD researchers
and practitioners must consider how to manage ever-growing data (which is, ironically, due
to the extensive use of computers and the ease of data collection), ever-increasing forms of data
which different applications require us to handle, and ever-changing requirements for new
data and mining targets as new evidence is collected and new findings are made. In short,
the need for 1) identifying and collecting the relevant data from a huge information search
space, 2) mining useful knowledge from different forms of massive data efficiently and
effectively, and 3) promptly reacting to situation changes and giving necessary feedback to
both data collection and mining steps, is ever increasing in this era of information overload.
Active mining is a collection of activities, each solving a part of the above need, but
collectively achieving the various mining objectives. By "collectively achieving" we mean
that the total effect outperforms the simple sum of the effects that each individual effort can
bring. Said differently, a spiral effect of these three interleaving steps is the target to be
pursued. To achieve this goal, the initial action is to explore mechanisms of 1) active
information collection, where necessary information is effectively searched and
pre-processed, 2) user-centered active mining, where various forms of information sources are
effectively mined, and 3) active user reaction, where the mined knowledge is easily assessed
and prompt feedback is made possible.
This book is a joint effort from leading and active researchers in Japan with a theme
about active mining. It provides a forum for a wide variety of research work to be presented
ranging from theories, methodologies, algorithms, to their applications. It is a timely report
on the forefront of data mining. It offers a contemporary overview of modern solutions with
real-world applications, shares hard-learned experiences, and sheds light on future
development of active mining.
This collection evolved from a project on active mining and the papers in this
collection were selected from among over 40 submissions.
The book consists of 3 parts. Each part corresponds to one of the three mechanisms
mentioned above. Namely, part I consists of chapters on Data Collection, part II on User-
centered Mining, and part III on User Reaction and Interaction. Some of the chapters
overlap each other but have to be placed in one of these three parts. The topics covered in
27 chapters include online text mining, clustering for information gathering, online
monitoring of Web page updates, technical term classification, active information
gathering, substructure mining from Web and graph structured data, web community
discovery and classification, spatial data mining, automatic configuration of mining tools,
worst case analysis of exceptional rule mining, data squashing applied to boosting, outlier
detection, meta-learning for evidence-based medicine, knowledge acquisition from both
human experts and data, data visualization, active mining in the business application world,
meta analysis and many more.
This book is intended for a wide audience, from graduate students who wish to learn
basic concepts and principles of data mining to seasoned practitioners and researchers who
want to take advantage of the state-of-the-art development for active mining. The book can
be used as a reference to find recent techniques and their applications, as a starting point to
find other related research topics on data collection, data mining and user interaction, or as
a stepping stone to develop novel theories and techniques meeting the exciting challenges
ahead of us.
Active mining is a new direction in the knowledge discovery process for real-world
applications handling huge amounts of data with actual user need.
Hiroshi Motoda
Acknowledgments
As the field of data mining advances, the interest in as well as the need for integrating
various components intensifies for effective and successful data mining. A lot of research
ensues. This book project resulted from the active mining initiatives that started during
2001 as a grant-in-aid for scientific research on priority area by the Japanese Ministry of
Education, Science, Culture, Sports and Technology. We received many suggestions and
support from researchers in machine learning, data mining and database communities from
the very beginning of this book project. The completion of this book is particularly due to
the contributors from all areas of data mining research in Japan, their ardent and creative
research work. The editorial members of this project have kindly provided their detailed
and constructive comments and suggestions to help clarify terms, concepts, and writing in
this truly multi-disciplinary collection. I wish to express my sincere thanks to the following
members: Masayuki Numao, Yukio Ohsawa, Einoshin Suzuki, Takao Terano, Shusaku
Tsumoto and Takahira Yamaguchi.
We are also grateful to the editorial staff of IOS Press, especially Carry Koolbergen
and Anne Marie de Rover for their swift and timely help in bringing this book to a
successful conclusion.
During the development of this book, I was generously supported by our
colleagues and friends at Osaka University.
Contents
Preface, Hiroshi Motoda
Acknowledgments
I. Data Collection
Toward Active Mining from On-line Scientific Text Abstracts Using Pre-existing
Sources, TuanNam Tran and Masayuki Numao
Data Mining on the WAVEs - Word-of-mouth-Assisting Virtual Environments,
Masayuki Numao, Masashi Yoshida and Yusuke Ito
Immune Network-based Clustering for WWW Information Gathering/Visualization,
Yasufumi Takama and Kaoru Hirota
Interactive Web Page Retrieval with Relational Learning-based Filtering Rules,
Masayuki Okabe and Seiji Yamada
Monitoring Partial Update of Web Pages by Interactive Relational Learning,
Seiji Yamada and Yuki Nakai
Context-based Classification of Technical Terms Using Support Vector Machines,
Masashi Shimbo, Hiroyasu Yamada and Yuji Matsumoto
Intelligent Tickers: An Information Integration Scheme for Active Information
Gathering, Yasukiro Kitamura
II. User Centered Mining
Discovery of Concept Relation Rules Using an Incomplete Key Concept Dictionary,
Shigeaki Sakurai, Yumi Ichimura and Akihiro Suyama
Mining Frequent Substructures from Web, Kenji Abe, Shinji Kawasoe, Tatsuya Asai,
Hiroki Arimura, Hiroshi Sakamoto and Setsuo Arikawa
Towards the Discovery of Web Communities from Input Keywords to a Search Engine,
Tsuyoshi Murata
Temporal Spatial Index Techniques for OLAP in Traffic Data Warehouse,
Hiroyuki Kawano
Knowledge Discovery from Structured Data by Beam-wise Graph-Based Induction,
Takashi Matsuda, Hiroshi Motoda, Tetsuya Yoshida and Takashi Washio
PAGA Discovery: A Worst-Case Analysis of Rule Discovery for Active Mining,
Einoshin Suzuki
Evaluating the Automatic Composition of Inductive Applications Using StatLog
Repository of Data Set, Hidenao Abe and Takahira Yamaguchi
Fast Boosting Based on Iterative Data Squashing, Yuta Choki and Einoshin Suzuki
Reducing Crossovers in Reconciliation Graphs Using the Coupling Cluster Exchange
Method with a Genetic Algorithm, Hajime Kitakami and Yasuma Mori
Outlier Detection using Cluster Discriminant Analysis, Arata Sato, Takashi Suenaga
and Hitoshi Sakano
III. User Reaction and Interaction
Evidence-Based Medicine and Data Mining: Developing a Causal Model via
Meta-Learning Methodology, Masanori Inada and Takao Terano
KeyGraph for Classifying Web Communities, Yukio Ohsawa, Yutaka Matsuo, Naohiro
Natsumura, Hirotaka Soma and Masaki Usui
Case Generation Method for Constructing an RDR Knowledge Base, Keisei Fujiwara,
Tetsuya Yoshida, Hiroshi Motoda and Takashi Washio
Acquiring Knowledge from Both Human Experts and Accumulated Data in an
Unstable Environment, Takuya Wada, Tetsuya Yoshida, Hiroshi Motoda and
Takashi Washio
Active Participation of Users with Visualization Tools in the Knowledge Discovery
Process, Tu Bao Ho, Trong Dung Nguyen, Duc Dung Nguyen and Saori
Kawasaki
The Future Direction of Active Mining in the Business World, Katsutoshi Yada
Topographical Expression of a Rule for Active Mining, Takashi Okada
The Effect of Spatial Representation of Information on Decision Making in Purchase,
Hiroko Shoji and Koichi Hori
A Hybrid Approach of Multiscale Matching and Rough Clustering to Knowledge
Discovery in Temporal Medical Databases, Shoji Hirano and Shusaku Tsumoto
Meta Analysis for Data Mining, Shusaku Tsumoto
Author Index
I. DATA COLLECTION
Active Mining
H. Motoda (Ed.)
IOS Press, 2002
Toward Active Mining from On-line Scientific Text
Abstracts Using Pre-existing Sources
TuanNam Tran and Masayuki Numao
tt-nam@nm.cs.titech.ac.jp, numao@cs.titech.ac.jp
Department of Computer Science,
Tokyo Institute of Technology
2-12-1 O-okayama, Meguro-ku, Tokyo 152-8552, JAPAN
Abstract. As biomedical research enters the post-genome era, most
new information relevant to biology research is still recorded as free
text, and there is a rapidly growing need to extract information
from biological literature databases such as MEDLINE. Unlike
previous work, this paper presents a framework for mining
MEDLINE by making use of a pre-existing biological database on a
yeast called S.cerevisiae. Our framework is based on an active
mining perspective and consists of two tasks: an information retrieval task
that actively selects articles in accordance with users' interests, and a
text data mining task using association rule mining and term extraction
techniques. The preliminary results indicate that the proposed method
may be useful for consistency checking and error detection in the annotation
of MeSH terms in MEDLINE records. The proposed
approach of combining information retrieval based on pre-existing
databases with text data mining could be extended to other fields such
as Web mining.
1 Introduction
Because of the rapid growth of computer hardware and network technologies, a vast
amount of information can be accessed through a variety of databases and sources.
Biology research plays an essential role in this century, producing a large
number of papers and on-line databases in this field. However, even though the number
and the size of sequence databases are growing rapidly, most new information relevant
to biology research is still recorded as free text. As biomedical research enters the
post-genome era, new kinds of databases that contain information beyond simple sequences
are needed, for example, information on protein-protein interactions, gene regulation,
etc. Most early work on literature data mining for biology concentrated on
analytical tasks such as identifying protein names [5], on simple techniques such as word
co-occurrence [12] and pattern matching [8], or on more general natural language
parsers that can handle considerably more complex sentences [9], [15].
In this paper, a different approach is proposed for literature data mining
from MEDLINE, a biomedical literature database which contains a vast amount of
useful information on medicine and bioinformatics. Our approach is based on active
mining, which focuses on active information gathering and data mining in accordance
with the purposes and interests of the users. In detail, our current system contains two
subtasks: the first task exploits existing databases and machine learning techniques
to select useful articles, and the second uses association rule mining and term
extraction techniques to conduct text data mining from the set of documents obtained
by the first task.
The remainder of this paper is organized as follows. Section 2 gives a brief overview
of literature data mining. Section 3 describes in detail the task of making use of existing
databases to retrieve relevant documents (the information retrieval task). Given the
results obtained in Section 3, Section 4 introduces the text mining task, which uses
association rule mining and term extraction. Section 5 describes some directions for
future work. Finally, Section 6 presents our conclusions.
2 Overview on literature data mining for biology
In this section we give a brief overview of current work on literature data mining for
biology. As described above, even though the number and the size of sequence databases
are growing rapidly, most new information relevant to biology research is still recorded
as free text. As a result, biologists need the information contained in text in order to integrate
information across articles and update databases. Current automated natural language
systems can be classified as information retrieval systems (which return documents
relevant to a subject), information extraction systems (which identify entities or
relations among entities in text) and question answering systems (which answer factual
questions using large document collections). However, it should be noted that most of
these systems work on newswire, and text mining for biology is considered to be harder
because the syntax is more complex, new terms are introduced constantly and there is
confusion between genes and proteins [6].
On the other hand, since natural language processing offers the tools to make
information in text accessible, an increasing number of groups are working on natural
language processing for biology. Fukuda et al. [5] attempt to identify protein
names in biological papers. Andrade and Valencia [2] also concentrate on the extraction
of keywords, not on mining factual assertions. There have been many approaches to the
extraction of factual assertions using natural language processing techniques such as
syntactic parsing. Sekimizu et al. [11] attempt to generate automatic database entries
containing relations extracted from MEDLINE abstracts. Their approach is to parse,
determine noun phrases, spot the frequently occurring verbs and choose the most likely
subject and object from the candidate NPs in the surrounding text. Rindflesch [10]
uses a stochastic part-of-speech tagger to generate an underspecified syntactic parse
and then uses semantic and pragmatic information to construct its assertions. This
system can only extract mentions of well-characterized genes, drugs and cell types, not the
interactions among them. Thomas et al. [13] use an existing information extraction
system called SRI's Highlight for gathering data on protein interactions. Their work
concentrates on finding relations directly between proteins. Blaschke et al. [3]
attempt to generate functional relationship maps from abstracts; however, their approach requires a
pre-defined list of all named entities and cannot handle syntactically complex sentences.
3 Retrieving relevant documents by making use of existing database
We describe our information retrieval task, which can be considered a specific task of
retrieving relevant documents from MEDLINE. Current systems for accessing MEDLINE,
such as PubMed (http://www.ncbi.nlm.nih.gov/PubMed/), accept keyword-based queries to
text sources and return documents that are hopefully relevant to the query. Since MEDLINE
contains an enormous number of papers and the current MEDLINE search engines are
keyword-based, the number of returned documents is often large, and many of them are in fact
non-relevant. Our approach to this issue is to make use of existing databases of
organisms such as S.cerevisiae together with supervised machine learning techniques.
Figure 1 illustrates the information retrieval task. In this figure, YPD
(Yeast Protein Database) is a biological database which contains
genetic functions and other characteristics of a yeast called S.cerevisiae. Given
a certain organism X, the goal of this task is to retrieve its relevant documents, i.e.
documents containing useful genetic information for biological research.
[Figure 1 diagram; recoverable labels: Collection of S.cerevisiae (MS); Negative Examples (MS-YS); Collection of target organism (MX)]
Figure 1: Outline of the information retrieval task
Let MX and MS be the sets of documents retrieved from MEDLINE by querying for
the target organism X and for S.cerevisiae respectively (without any machine learning
filtering), and let YS be the set of documents found by querying for the YPD terms for
S.cerevisiae (YS is omitted from Figure 1 for simplicity). The sets of
positive and negative examples are then collected as the intersection of MS and YS and
the difference MS - YS respectively. Given the training examples, OX is the output set of
documents obtained by applying a Naive Bayes classifier to MX.
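The construction of the training sets can be sketched with plain set operations. This is only an illustration: the document identifiers and the function name `build_training_sets` are hypothetical, and in practice MS and YS would be filled from MEDLINE query results.

```python
# Hypothetical document IDs; real sets would come from MEDLINE queries.
ms = {"pmid1", "pmid2", "pmid3", "pmid4"}   # MS: hits for S.cerevisiae
ys = {"pmid2", "pmid4", "pmid9"}            # YS: hits for the YPD terms


def build_training_sets(ms, ys):
    """Return (positives, negatives) as defined in the text."""
    positives = ms & ys   # intersection of MS and YS
    negatives = ms - ys   # difference MS - YS
    return positives, negatives


positives, negatives = build_training_sets(ms, ys)
print(sorted(positives))
print(sorted(negatives))
```

Documents that appear only in YS (here `pmid9`) fall into neither set, since both sets are defined relative to MS.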
3.1 Naive Bayes classifier
Naive Bayes classifiers [7] are among the most successful known algorithms for learning
to classify text documents. A naive Bayes classifier is constructed by using the training
data to estimate the probability of each category given the document feature values of
a new instance. The probability that an instance d belongs to a class c_k is estimated by Bayes'
theorem as follows:
$$P(C = c_k \mid d) = \frac{P(C = c_k)\, P(d \mid C = c_k)}{P(d)}$$
Since P(d | C = c_k) is often impractical to compute without simplifying assumptions,
the Naive Bayes classifier assumes that the features X_1, X_2, ..., X_n are conditionally
independent, given the category variable C. As a result:
$$P(d \mid C = c_k) = \prod_{i=1}^{n} P(X_i \mid C = c_k)$$
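A minimal sketch of such a classifier is given below. The paper does not state the feature model, so word-count (multinomial) features with add-one smoothing are assumed here; the toy documents, labels and function names are hypothetical. Log probabilities are used to avoid numerical underflow, and P(d) is dropped because it is the same for every class.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels):
    """Estimate log P(C=c_k) and log P(X_i | C=c_k) from tokenized documents.

    Assumption: multinomial word features with add-one (Laplace) smoothing.
    """
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + len(vocab)
        log_like[c] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return log_prior, log_like, vocab


def classify(doc, log_prior, log_like, vocab):
    """Return the class maximizing log P(c_k) + sum_i log P(x_i | c_k)."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are ignored.
        scores[c] = log_prior[c] + sum(log_like[c][w] for w in doc if w in vocab)
    return max(scores, key=scores.get)


# Toy training data standing in for the positive and negative example sets.
docs = [["gene", "protein", "yeast"], ["protein", "kinase", "gene"],
        ["patient", "clinical", "trial"], ["clinical", "infection", "patient"]]
labels = ["relevant", "relevant", "non-relevant", "non-relevant"]
model = train_naive_bayes(docs, labels)
print(classify(["yeast", "kinase", "gene"], *model))
```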
3.2 Experimental results of information retrieval task
Our experiments use YPD as the existing database. From this database we obtain 14572
articles pertaining to S.cerevisiae. For the target organisms, we initially collect 3073
and 8945 articles for two yeasts, Pombe and Candida, respectively. After
conducting the experiments as in Figure 1, we obtain outputs containing 1764 and 285
articles for Pombe and Candida respectively.
A certain number of documents (50 in this experiment) are drawn at random from each
dataset and checked by hand for relevance. Figure 2 shows the Recall-Precision
curve for Pombe and Candida. It can be seen from this figure that using
machine learning remarkably improves the precision. The recall
for Candida is considerably lower than for Pombe because Pombe
shares more genetic characteristics with S.cerevisiae than Candida does.
Figure 2: Recall-Precision curve for Pombe and Candida
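For reference, precision and recall on a hand-checked sample can be computed as in the sketch below. The sample IDs and relevance judgments are invented purely for illustration and are not the paper's data; `precision_recall` is a hypothetical helper name.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a set of relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical hand-checked sample: IDs 0..9, of which six are judged relevant.
relevant = {0, 1, 2, 3, 4, 5}
retrieved = {0, 1, 2, 3, 9}   # hypothetical classifier output on the sample
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

A recall-precision curve such as the one in Figure 2 is obtained by repeating this computation while varying the classifier's decision threshold.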
4 Mining MEDLINE by combining term extraction and association rule
mining
In this section, we attempt to mine the set of MEDLINE documents obtained in the
previous section by combining term extraction and association rule mining.
The text mining task from the collected dataset consists of two main modules:
the Term Extraction module and the Association-Rule Generation module. The Term
Extraction module itself includes the following stages:
• XML translation: This stage translates the MEDLINE record from HTML form
into an XML-like form, conducting some pre-processing to deal with punctuation.
• Part-of-speech tagging: Here, the rule-based Brill part-of-speech tagger [4] was
used for tagging the title and the abstract part.
• Term Generation: Sequences of tagged words are selected as potential term
candidates on the basis of relevant morpho-syntactic patterns (such as "Noun
Noun", "Noun Adjective Noun", "Adjective Noun", "Noun Preposition Noun",
etc.). For example, "in vivo" and "saccharomyces cerevisiae" are terms extracted
in this stage.
• Stemming: A stemming algorithm was used to find variations of the same word.
Stemming transforms variations of the same word into a single form, reducing
the vocabulary size.
• Term Filtering: In order to decrease the number of "bad terms", in the abstract
part, only sentences containing verbs listed in the "verbs related to biological
events" table in [14] have been used for the Term Generation stage.
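The Term Generation and Stemming stages above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the input is assumed to be already tagged (a hand-tagged fragment replaces the Brill tagger), the one-letter tag names are hypothetical, and `crude_stem` is a toy suffix stripper standing in for a real stemming algorithm.

```python
# Tag patterns from the text: N = Noun, A = Adjective, P = Preposition.
PATTERNS = [("N", "N"), ("N", "A", "N"), ("A", "N"), ("N", "P", "N")]


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def generate_terms(tagged):
    """Return stemmed candidate terms whose tag sequence matches a pattern."""
    terms = []
    for start in range(len(tagged)):
        for pattern in PATTERNS:
            window = tagged[start:start + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                terms.append("_".join(crude_stem(w.lower()) for w, _ in window))
    return terms


# Hand-tagged example fragment (tags are assumed, not real tagger output).
tagged = [("cell", "N"), ("cycle", "N"), ("of", "P"), ("mutants", "N")]
print(generate_terms(tagged))
```

Joining the stemmed words with underscores mirrors the form of the terms that appear in the rule listings, such as `cell_cycle`.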
After the necessary terms have been generated by the Term Extraction module, the
Association-Rule Generation module applies the Apriori algorithm [1] to the set
of generated terms to produce association rules (each line of the input file of the
Apriori-based program consists of all the terms extracted from one MEDLINE record in the
dataset).
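To make the support and confidence measures concrete, the sketch below mines rules between single terms only. It is a deliberately reduced version of Apriori, covering itemsets of size one and two; the records, thresholds and the function name `mine_pair_rules` are hypothetical.

```python
from itertools import combinations
from collections import Counter


def mine_pair_rules(records, min_support, min_confidence):
    """Find rules `head <- body` between single terms.

    Only itemsets of size 1 and 2 are considered (the first two Apriori
    levels); candidate pairs are built from frequent single terms only.
    """
    n = len(records)
    item_counts = Counter(t for rec in records for t in set(rec))
    frequent = {t for t, c in item_counts.items() if c / n >= min_support}
    pair_counts = Counter()
    for rec in records:
        for pair in combinations(sorted(set(rec) & frequent), 2):
            pair_counts[pair] += 1
    rules = []
    for (a, b), c in pair_counts.items():
        support = c / n                     # fraction of records with both terms
        if support < min_support:
            continue
        for body, head in ((a, b), (b, a)):
            confidence = c / item_counts[body]   # estimate of P(head | body)
            if confidence >= min_confidence:
                rules.append((head, body, support, confidence))
    return sorted(rules)


# Each record holds the terms extracted from one (hypothetical) MEDLINE record.
records = [
    ["cell_cycle", "period", "mutant"],
    ["cell_cycle", "period"],
    ["cell_cycle", "mutant"],
    ["mutant", "meiosy"],
]
for head, body, s, c in mine_pair_rules(records, 0.5, 0.75):
    print(f"{head} <- {body} ({s:.1%}, {c:.1%})")
```

The printed form `head <- body (support, confidence)` follows the layout of the rule listings in Figure 3.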
Figure 3 and Figure 4 list twenty of the obtained rules, demonstrating
the relationships among extracted terms for Pombe and Candida respectively.
For example, the 5th rule in Figure 4 states that the rule "if aspartyl proteinases
occurs in a MEDLINE record, then this record is published in the Journal
of Bacteriology" has a support of 1.3% and a confidence of 100.0%. This example
shows that a relation between a journal name and terms extracted from the title and the
abstract has been discovered. Figures 3 and 4 also show
that making use of terms can produce interesting rules that cannot be obtained using
only single words.
5 Future Work
5.1 For the information retrieval task
Although using an existing database of S.cerevisiae yields a high precision for
other yeasts and organisms, the recall is still low, especially for yeasts which
differ markedly from S.cerevisiae. Since yeasts such as Candida might have
many unique attributes, we may improve the recall by feeding the documents checked
by hand back to the classifier and conducting the learning process again. The negative
training set still contains many positive examples, so we need to reduce this noise
by making use of the learning results.
5.2 For the text mining task
By combining term extraction and association rule mining, it is able to obtain inter-
esting rules such as the relations among journal names and terms, terms and terms.
Particularly, the relations among MeSH terms and "Substances" may be useful for error
detection in annotation of MeSH terms in MEDLINE records. However, the current al-
gorithm treats extracted terms such as "cdc37_caryogamy_defect","cdc37_injnitosy",
T. Tran and M. Numao / Toward Active Mining
1: fission_yeast_schizosaccharomyc_pomb <-
transcript_control (0.3%, 80.0%)
2: cell_cycle <- period (0.6%, 77.8%)
3: mutant <- other_mutant (0.4%, 83.3%)
4: essenty <- gene_disrupt_expery (0.5%, 75.0%)
5: mitosy <- passag_through_start (0.3%, 80.0%)
6: transcript <- mat2-mat3_interval (0.3%, 80.0%)
7: embo_j <- p34cdc2_kinas_activity (0.5%, 75.0%)
8: nucleu <- periphery (0.3%, 80.0%)
9: structur <- function_similar (0.3%, 80.0%)
10: meiosy <- premeiot_dna_synthesy (0.5%, 75.0%)
11: meiosy <- pair (0.3%, 80.0%)
12: s_phase <- complet_of_s_phase (0.4%, 83.3%)
13: amino_acid_sequ <- alignment (0.4%, 83.3%)
14: amino_acid_sequ <- _residu (0.3%, 80.0%)
15: human <- mous_homolog (0.3%, 80.0%)
16: open_read_frame <- uninterrupt (0.4%, 83.3%)
17: subunit <- rpb2 (0.3%, 80.0%)
18: centromer <- central_core (0.4%, 83.3%)
19: centromer <- centromer_function (0.4%, 83.3%)
20: wee1 <- mik1 (0.5%, 85.7%)
Figure 3: First twenty rules obtained for the set of Pombe documents obtained in Section 3
(minimum support = 0.003, minimum confidence = 0.75)
"cdc37_mutat" to be mutually independent. It may be necessary to construct semi-
automatically a term taxonomy; for instance, users could select only the interesting
rules or terms and feed them back to the system.
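The rules in Figures 3 and 4 are written as "head <- body (support, confidence)". As a minimal sketch of how such figures can be computed, the following assumes the usual association rule definitions (support is the fraction of documents containing both terms, confidence is that count divided by the number of documents containing the body). The function names and the toy corpus are hypothetical and are not taken from the system described here.

```python
def rule_stats(documents, body, head):
    """Return (support, confidence) of the rule `head <- body`.

    documents: list of sets of extracted terms, one set per document.
    """
    n_docs = len(documents)
    n_body = sum(1 for terms in documents if body in terms)
    n_both = sum(1 for terms in documents if body in terms and head in terms)
    support = n_both / n_docs if n_docs else 0.0
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


def mine_pair_rules(documents, min_support, min_confidence):
    """Enumerate single-term rules `head <- body` that meet both thresholds."""
    vocabulary = sorted(set().union(*documents))
    rules = []
    for body in vocabulary:
        for head in vocabulary:
            if head == body:
                continue
            support, confidence = rule_stats(documents, body, head)
            if support >= min_support and confidence >= min_confidence:
                rules.append((head, body, support, confidence))
    return rules


# Toy corpus, invented for illustration only.
docs = [
    {"white", "opaqu", "candida"},
    {"white", "opaqu", "cell_wall"},
    {"white", "opaqu"},
    {"opaqu", "cell_wall"},
    {"white", "candida"},
]

for head, body, support, confidence in mine_pair_rules(docs, 0.5, 0.75):
    print(f"{head} <- {body} ({support:.1%}, {confidence:.1%})")
```

This brute-force enumeration is only practical for small vocabularies; a level-wise algorithm such as Apriori prunes candidates by support before computing confidence.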
5.3 Mutual benefits between two tasks
Gaining mutual benefits between the two tasks is also an important issue for future work.
First, by applying the text mining results, we may be able to decrease the
number of documents "leaked" in the information retrieval task and, as a result,
improve the recall. Conversely, since the current text mining algorithm
creates many unnecessary rules (from the viewpoint of biological research), it is also
possible to apply the information retrieval task first to filter relevant documents,
and then apply the text mining task to them, decreasing the number of unnecessary rules
obtained and improving the quality of the text mining task.
6 Conclusions
This paper has introduced a framework for mining MEDLINE by making use of existing
biological databases. Two tasks concerning information extraction from MEDLINE
have been presented. The first task retrieves documents useful for biology
research with high precision. Given the obtained set of documents, the second task
applies association rule mining and term extraction to mine these documents.
This paper has shown that the obtained results are useful
for consistency checking and error detection in the annotation of MeSH terms in MEDLINE
records. In future work, combining the two tasks may be essential to gain
mutual benefits for both.
1: open_read_frame <- molecular_weight (1.8%, 75.0%)
2: open_read_frame <- molecular_mass (1.8%, 75.0%)
3: open_read_frame <- cdna_clone (1.3%, 100.0%)
4: virul <- growth_rate (1.8%, 75.0%)
5: j_bacteriol <- aspartyl_proteinas (1.3%, 100.0%)
6: j_bacteriol <- gene_code (1.3%, 100.0%)
7: j_bacteriol <- sucros (1.3%, 100.0%)
8: organism <- immunoelectron_microscopy
(1.3%, 100.0%)
9: resist <- transport (1.8%, 75.0%)
10: similar <- hyphal_growth (1.8%, 75.0%)
11: clone <- southern_blot (1.3%, 100.0%)
12: white <- opaqu (1.8%, 75.0%)
13: white <- opaqu_phase (1.8%, 75.0%)
14: white <- opaqu_cell (1.8%, 75.0%)
15: amino_acid_sequ <- comparison (2.7%, 83.3%)
16: amino_acid_sequ <- escherichia_coly (1.8%, 75.0%)
17: amino_acid_sequ <- alignment (1.8%, 75.0%)
18: fragment <- molecular_mass (1.8%, 75.0%)
19: cell_wall <- moiety (1.3%, 100.0%)
20: cell_wall <- immunoelectron_microscopy
(1.3%, 100.0%)
Figure 4: First twenty rules obtained for the set of Candida documents obtained in Section 3
(minimum support = 0.01, minimum confidence = 0.75)
References
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings
of the 20th International Conference on Very Large Databases, 1994.
[2] M.A. Andrade and A. Valencia. Automatic annotation for biological sequences by
extraction of keywords from MEDLINE abstracts, development of a prototype system. In
Proceedings of the 5th International Conference on Intelligent Systems for Molecular
Biology, 1997.
[3] C. Blaschke, M.A. Andrade, C. Ouzounis, and A. Valencia. Automatic extraction of
biological information from scientific text: protein-protein interactions. In Proceedings
of the 7th International Conference on Intelligent Systems for Molecular Biology, 1999.
[4] E. Brill. A simple rule-based part of speech tagger. In Proceedings of the Third Conference
on Applied Natural Language Processing, 1992.
[5] K. Fukuda, A. Tamura, T. Tsunoda, and T. Takagi. Toward information extraction:
identifying protein names from biological papers. In Proceedings of the Pacific Symposium
on Biocomputing, 1998.
[6] L. Hirschman. Mining the biomedical literature: Creating a challenge evaluation.
Technical report, The MITRE Corporation, 2001.
[7] D.D. Lewis and M. Ringuette. A comparison of two learning algorithms for text
categorization. In Third Annual Symposium on Document Analysis and Information Retrieval,
1994.
[8] S. K. Ng and M. Wong. Toward routine automatic pathway discovery from on-line
scientific text abstracts. Genome Informatics, 10:104-11, December 1999.
[9] J. C. Park, H. S. Kim, and J. J. Kim. Bidirectional incremental parsing for automatic
pathway identification with combinatory categorial grammar. In Proceedings of the
Pacific Symposium on Biocomputing, 2001.
[10] T.C. Rindflesch. Edgar: Extraction of drugs, genes and relations from the biomedical
literature. In Proceedings of the Pacific Symposium on Biocomputing, 2000.
[11] T. Sekimizu, H.S. Park, and J. Tsujii. Identifying the interaction between genes and
gene products based on frequently seen verbs in MEDLINE abstracts. Genome Informatics,
pages 62-71, 1998.
[12] B. J. Stapley and G. Benoit. Biobibliometrics: Information retrieval and visualization
from co-occurrences of gene names in MEDLINE abstracts. In Proceedings of the Pacific
Symposium on Biocomputing, 2000.
[13] J. Thomas, D. Milward, C. Ouzounis, S. Pulman, and M. Carroll. Automatic extraction
of protein interactions from scientific abstracts. In Proceedings of the Pacific Symposium
on Biocomputing, 2000.
[14] J. Tsujii. Information extraction from scientific texts. In Proceedings of the Pacific
Symposium on Biocomputing, 2001.
[15] A. Yakushiji, Y. Tateisi, Y. Miyao, and J. Tsujii. Event extraction from biomedical
papers using a full parser. In Proceedings of the Pacific Symposium on Biocomputing,
2001.
Active Mining
H. Motoda (Ed.)
IOS Press, 2002
Data Mining on the WAVEs —
Word-of-mouth-Assisting Virtual Environments
Masayuki Numao, Masashi Yoshida and Yusuke Ito
numao@cs.titech.ac.jp
http://www.nrn.cs.titech.ac.jp
Department of Computer Science
Tokyo Institute of Technology
2-12-1 O-okayama, Meguro 152-8552, JAPAN
Abstract. Computers now play an important role not only in
knowledge processing but also as communication media. However, they
often cause trouble in communication, since it is hard for us to select
only the useful pieces of information. To overcome this difficulty, we
propose a new tool, WAVE (Word-of-mouth-Assisting Virtual Environment),
which helps us to communicate and spread information by relaying
a message like Chinese whispers. This paper describes its concept, an
implementation, and a preliminary evaluation.
1 Introduction
Chinese whispers: a game in which a message is distorted by being passed around in
a whisper (also called Russian scandal).
word of mouth: (a) oral communication or publicity; (b) done, given, etc., by speaking:
oral.
- New Shorter Oxford English Dictionary
WWW and e-mail are very useful tools for communication. However, we sometimes
feel uncomfortable in Computer-Mediated Communication (CMC) because of flaming or
mental barriers to participation. There are some important differences between CMC
and direct communication [5].
Another problem is that computer networks deliver so many pieces of information
that it is hard to select the useful ones. Although search engines, such as Yahoo,
Goo and Google, are very useful for finding web pages, we need another type of tool that
does not require a search keyword. Good candidates are a mailing list and a network news
system, where we need a filtering system to select only useful messages. Although
content-based filtering [6] and collaborative filtering [8] are good solutions, the current
methods have not achieved high precision and recall. This paper presents another
approach, which relays a message like Chinese whispers to gather useful information,
alleviate mental barriers, and block flames.
12 M. Numao et al. / Data Mining on the WAVEs
Figure 1: Spread of information
2 Spread of information by Chinese whispers
Fig. 1 shows the spread of information by word of mouth, where each person relays a
message as in Chinese whispers. Although a message is distorted by being passed around
in the game, in a computer-assisted environment we expect a delivered message to be
the same as its original. Such a process even has the merit that, as a result of
evaluation and selection by each person, it delivers only useful information.
Each person knows whom (s)he should ask about a current topic, and retrieves a small
amount of information that can be handled, so that only interesting information survives.
3 WAVE
To assist the spread of information by Chinese whispers, we propose a system, WAVE
(Word-of-mouth-Assisting Virtual Environment), for smooth communication and information
gathering. Compared to agent systems proposed to automate word of mouth [1, 9, 2, 7],
WAVE is a simpler tool and works as directed by the user, except for a separate
recommendation module. The authors believe that, in most situations, a simple and
intuitive tool is better than a complicated automated tool, since users can easily
construct a model of the tool.
Fig. 2 shows a diagram of WAVE. The user's operations are posting, opening and
reviewing an article. In addition, in a recommendation window, the system shows some
good articles based on the user's log.
3.1 Posting an article
The user can post an article as shown in Fig. 3, which may contain text and URLs
of web pages or photos. (S)he gives the article an evaluation from 1 to 5 (1 for the
worst and 5 for the best) and a category. The posted article is open to others as shown
in Fig. 4 and can be referred to by other users, as on the WWW or a mailing list.
The user can browse articles posted by her/his friends. Fig. 5 shows a list of friends.
Each person is identified by an address 'user_name@host:port'. If an article is
interesting, (s)he can post a review of it, by which (s)he relays the article to her/his
friends as shown in Fig. 2. Fig. 4 shows a list of articles the user has posted or reviewed.
Figure 2: Word-of-mouth-Assisting Virtual Environment
Figure 3: Posting an article
Figure 4: Articles posted or reviewed
Figure 5: Your friends
Figure 6: Reviews by your friend
3.2 Open articles
Articles posted or reviewed by the user are stored in her/his database. It is open to
people who have registered her/him as a friend. The user can register the addresses of
her/his friends, or notify another user of her/his own address. For example, if C has
registered A and B in her/his friend list, C can see the databases of A and B.
Since each user knows her/his friends, (s)he can judge their reliability, which is
very useful to select information from them. In addition, it is comfortable to join the
community because (s)he exchanges messages only with her/his friends.
3.3 Review an article
If C is interested in an article from A in Fig. 2, C can browse its body and give
an evaluation and a comment as shown in Fig. 7. After this operation, the article
is automatically retrieved and stored in C's database, which is open to C's friends.
Chaining the operation propagates an article.
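The posting, opening and reviewing operations described in Sections 3.1 to 3.3 can be sketched as a small data model. This is an illustration only: the class names, field names and the example addresses are hypothetical, and the servlet-based system described here is not reproduced.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """An article as stored in one user's database, with that user's rating."""
    article_id: str
    body: str
    evaluation: int            # 1 (worst) to 5 (best)
    comment: str = ""
    relayed_from: str = ""     # address of the friend it was retrieved from


@dataclass
class User:
    address: str                                   # 'user_name@host:port'
    friends: dict = field(default_factory=dict)    # address -> User
    database: dict = field(default_factory=dict)   # article_id -> Entry

    def register_friend(self, other):
        """Registering a friend gives this user access to the friend's database."""
        self.friends[other.address] = other

    def post(self, article_id, body, evaluation):
        self.database[article_id] = Entry(article_id, body, evaluation)

    def browse(self, friend_address):
        """List a friend's open articles; only registered friends are visible."""
        if friend_address not in self.friends:
            raise PermissionError("address is not in the friend list")
        return list(self.friends[friend_address].database.values())

    def review(self, friend_address, article_id, evaluation, comment=""):
        """Rate a friend's article; it is then stored in our own database."""
        source = self.friends[friend_address].database[article_id]
        self.database[article_id] = Entry(
            article_id, source.body, evaluation, comment, friend_address)


# A posts, C (who registered A) reviews, D (who registered C) can then see it.
a, c, d = User("a@host1:8080"), User("c@host1:8080"), User("d@host2:8080")
c.register_friend(a)
d.register_friend(c)
a.post("art1", "A new restaurant opened near the campus.", evaluation=5)
c.review(a.address, "art1", evaluation=4, comment="Worth a visit.")
print([entry.article_id for entry in d.browse(c.address)])
```

In this sketch D never contacts A directly; the article reaches D only because C chose to review it, which is the chaining behaviour described above.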
As such, WAVE seamlessly assists the opening, browsing, evaluation and retrieval of an
article. This saves us much of the time and labor of uploading, advertising, etc. In
BBSs and mailing lists, most participants feel mental barriers to posting an article. In
contrast, in WAVE a user first posts an article only to her/his friends. Mental barriers are
alleviated in this fashion. ROMs (Read Only Members) often form a bridge between
two communities. WAVE is useful for activating such a bridge.
3.4 Automatic recommendation
When a user has many friends, it might be good to order articles based on a model of
the user. Modeling a person is difficult since we cannot directly measure a mental state.
Even if we could, by using MRI or other devices, it would still be hard to clarify the relation between
Figure 7: An article
Figure 8: Recommendation
Figure 9: Modeling
Figure 10: Modeling based on communication
Figure 11: Recommending process
a brain state and its social effects, since a person has many activities and aspects
(Fig. 9). Instead, we propose to model a relation between two persons by logging their
communication.
To model a relation between two persons, we need a log of communications between
all combinations of persons. This causes trouble in analyzing the WWW, a news system
or a mailing list. In contrast, in WAVE all communications occur only among friends.
We have no combinatorial problem in analyzing communications and modeling
relations, since the number of friends of one person is usually not large.
Fig. 11 shows the process of ordering articles for recommendation, where C's history
is analyzed based on an evaluation function to order the articles in the databases of A and B.
The evaluation is based on the following factors:
• Evaluation of the article by the last reviewer.
• Evaluation of the last reviewer by the user.
• The user's preference for the category of article.
• How old the article is.
• How many people have relayed the article.
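The five factors above can be combined into a single ranking score. The form of the evaluation function is not specified here, so the following is only a sketch under stated assumptions: a weighted sum of normalized factors, with hypothetical weights, an exponential freshness decay and a capped relay count. None of these choices are taken from WAVE itself.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    article_id: str
    reviewer_rating: int        # evaluation of the article by the last reviewer (1-5)
    reviewer_trust: float       # the user's evaluation of that reviewer, scaled to 0..1
    category_preference: float  # the user's preference for the category, 0..1
    age_days: float             # how old the article is
    relay_count: int            # how many people have relayed the article


def score(c, weights=(0.3, 0.25, 0.2, 0.15, 0.1), half_life_days=7.0, relay_cap=5):
    """Weighted sum of the five factors, each normalized to 0..1 (assumed form)."""
    freshness = 0.5 ** (c.age_days / half_life_days)      # newer articles score higher
    popularity = min(c.relay_count, relay_cap) / relay_cap
    features = (
        (c.reviewer_rating - 1) / 4,
        c.reviewer_trust,
        c.category_preference,
        freshness,
        popularity,
    )
    return sum(w * f for w, f in zip(weights, features))


def recommend(candidates, top_n=3):
    """Order candidate articles from a user's friends by descending score."""
    return sorted(candidates, key=score, reverse=True)[:top_n]


candidates = [
    Candidate("old-news", 2, 0.4, 0.3, age_days=30, relay_count=1),
    Candidate("fresh-tip", 5, 0.9, 0.8, age_days=0, relay_count=3),
]
print([c.article_id for c in recommend(candidates)])
```

In practice the reviewer trust and category preference would be estimated from the user's log of past evaluations; here they are simply given as inputs.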
3.5 Distributed implementation
The system is implemented with Java servlets and runs on a web server as shown in
Fig. 12. The user first registers her/his name and password, and then accesses the system
by using a web browser.
Figure 12: Distributed implementation
Figure 14: Two example flows of an article
The system is easily distributed over several hosts. In Fig. 12, Mr. A registered on
host1 to use the system, and Ms. B registered on host2. Mr. A can see Ms. B's articles by
specifying her address. As such, the system is scalable by being distributed over many
hosts.
4 Preliminary evaluation
Thirty-three users tested the system for 20 days. The result is visualized in Fig. 13. This
map is based on one drawn by KrackPlot [4], which is a program for network visualization
designed for social network analysts.
Each node denotes a user, and its shape denotes the number of articles (s)he posted.
Here, myoshida, blankey, roy and t-sugie are opinion leaders who posted many articles.
A directed arc denotes that articles were retrieved and reviewed in that direction. Its
thickness denotes the number of articles retrieved. In the network, we can see many
triangles, each of which forms a triad of strongly connected users.
Two example flows of an article are shown in Fig. 14. One flow is drawn as a thick solid
line; the other as a thick dotted line. S denotes their origin. Each attached number denotes
the evaluation by each person. In most cases, the evaluation degrades as people relay an
article.
Each island circled in Fig. 15 shows a community the authors observed, where
people know each other in real life. An article moves mainly within a community.
Some people appear in multiple communities and play the role of gatekeeper [3],
bridging information between communities.
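Given community membership such as that circled in Fig. 15, candidate gatekeepers can be listed as the users who belong to more than one community. The sketch below assumes the communities are available as sets of user names; the community labels and user names are placeholders, not the observed data.

```python
from collections import Counter


def find_gatekeepers(communities):
    """Return users that appear in two or more communities, in sorted order.

    communities: mapping from a community label to an iterable of user names.
    """
    membership = Counter()
    for members in communities.values():
        membership.update(set(members))   # count each user once per community
    return sorted(user for user, count in membership.items() if count >= 2)


# Placeholder data for illustration only.
communities = {
    "lab": {"u1", "u2", "u3"},
    "club": {"u3", "u4"},
    "class": {"u4", "u5", "u1"},
}
print(find_gatekeepers(communities))
```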
Figure 15: Communities in the real life
5 Conclusion
We have proposed a system for information propagation and gathering by relaying a
message like Chinese whispers. The URL of the experimental system is:
http://www.mn.cs.titech.ac.jp:12581/worn/
The authors are preparing a distribution package of the system for experiments in the
distributed manner shown in Fig. 12.
References
[1] L. N. Foner. A multi-agent referral system for matchmaking. In Proceedings of the
International Conference on the Practical Applications of Intelligent Agents and Multi-Agent
Technology, 1996.
[2] L. N. Foner. Yenta: a multi-agent, referral-based matchmaking system. In AA-97, pages
301-307, 1997.
[3] S. Goto and H. Nojima. Analysis of the three-layered structure of information flow in
human societies. Journal of Japanese Society for Artificial Intelligence (in Japanese),
8(3):348-356, 1993. This paper also appears in Artificial Intelligence.
[4] KrackPlot, URL: http://www.contrib.andrew.cmu.edu/~krack/.
[5] M. Lea. Contexts of computer-mediated communication. Harvester Wheatsheaf, pages
30-65, 1992.
[6] Pattie Maes. Agents that reduce work and information overload. CACM, 37(7):30-40, 1994.
[7] Takeshi Otani and Toshiro Minami. Searching for information resources by word of mouth.
In MACC 97 (in Japanese), 1997.
http://www.kecl.ntt.co.jp/csl/msrg/events/macc97/ohtani.html.
[8] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An open
architecture for collaborative filtering of netnews. In CSCW '94, pages 175-186, 1994.
[9] U. Shardanand and P. Maes. Social information filtering: Algorithms for automating
"word of mouth". In CHI, pages 210-217, 1997.
Active Mining
H. Motoda (Ed.)
IOS Press, 2002
Immune Network-based Clustering for WWW
Information Gathering/Visualization
Yasufumi Takama 1,2 and Kaoru Hirota 1
{takama,hirota}@hrt.dis.titech.ac.jp
1 Tokyo Institute of Technology
4259 Nagatsuta, Midori-ku, Yokohama 226-8502 JAPAN
2 PREST, Japan Science and Technology Corporation, JAPAN
Abstract. A clustering method based on the immune network model is
proposed to visualize the topic distribution over a document set that
is found on the WWW. The method extracts the keywords that can
be used as the landmarks of the major topics in a document set, while
document clustering is performed with those keywords. The proposed
method employs the immune network model to calculate the activation
values of keywords as well as to improve the understandability of the web
information visualization system. Questionnaires are conducted to
compare the quality of clusters between the proposed method and the
k-means clustering method; the results show that the proposed
method obtains better results in terms of both coherence and
understandability than the k-means clustering method.
1 Introduction
A WWW information visualization method for finding the topic distribution in document
sets is proposed. When the WWW is considered as an information resource, it has
several significant characteristics, such as hugeness, dynamic nature, and hyperlinked
structure, among which we focus on the fact that information on the WWW tends to
be obtained by users as a set of documents. For example, there are many online news
sites on the WWW, which constantly release sets of news articles on various topics day
by day. As another example, a series of retrieval processes by a user also provides the user
with a sequence of document sets. Although the hugeness of the WWW and its
dynamic nature are a burden for users, they also bring a chance for business and
research if users can notice the trends or movements of the real world from the WWW,
which can be found not from a single document but from a set of documents.
Information visualization systems [6, 15, 16, 18] are promising approaches to help the
user notice the trends of topics on the WWW. The Fish View system [15] extracts the
user's viewpoint as a set of concepts, and the extracted concepts are used not only to
construct a vector space that is sensitive to the user's viewpoint, but also to present
the user's current viewpoint in an explicit manner.
In this paper, an information visualization method based on document set-wise
processing is proposed to find the topic distribution over a set of documents. One of
the characteristic features of the proposed method is that it generates a keyword map as
well as document clusters. That is, a landmark, which is a representative keyword on
a keyword map, is found, while the documents containing the same landmark form a
document cluster.
22 Y. Takama and K. Hirota / Immune Network-based Clustering
When landmark keywords are found based on the propagation of keywords' activation
values over the keyword network, a keyword should be activated by related
keywords, while keywords related to each other should not be highly activated at
the same time. To achieve this kind of nonlinear activation, the immune network model
[1, 5, 7, 8] is employed to calculate the activation values of keywords.
The understandability of an information visualization system for users can be improved
by employing an appropriate metaphor. From this viewpoint, the method based
on the immune network model is expected to improve the understandability of the
keyword map by incorporating additional information, such as a landmark and its
suppressing keywords, into the ordinary keyword map, on which the distance between
keywords is the only clue for understanding the topic distribution over a document set.
The concept of the clustering method based on the immune network model and
its algorithm are described in Section 2, followed in Section 3 by experimental results that
compare the quality of the clusters generated by the proposed method with that of
k-means clustering. An application of the proposed method to
information visualization / gathering systems is considered in Section 4.
2 Immune Network-based Clustering Method
2.1 Concept of Immune Network-based Clustering
Generally, information visualization systems designed for handling documents are
divided into 2 types: information visualization systems based on document clustering,
and keyword maps. In this paper, an information visualization system that arranges
the keywords extracted from documents in a (usually) 2-D space according to their
similarities is called a keyword map [6, 9, 16]. A keyword map is often adopted to
visualize the topic distribution over a document set.
The clustering method [11, 12, 13, 14] proposed in this paper aims to generate a keyword
map while performing document clustering. On a keyword map, the keywords
related to the same topic are assumed to gather and form a cluster. The proposed
method extracts a representative keyword, called a landmark, from each cluster. As the
borders of keyword clusters on a keyword map are usually not obvious, another constraint
for extracting a landmark is adopted from the viewpoint of document clustering. That
is, when the documents containing the same landmark are classified into the same cluster,
there should be no overlap among clusters. From the viewpoint of document
clustering, a landmark is also called a cluster identifier, because it defines the members of
a document cluster.
To extract a landmark (a cluster identifier) from a keyword map, the proposed
method calculates an activation value for each keyword based on the interaction between
keywords that are related to each other. In this paper, the immune network model,
described in Section 2.2, is employed to calculate a keyword's activation value.
2.2 Immune Network Model
The immune network model was proposed by Jerne [5] to explain the functionality of
the immune system, such as variety and memory. The model assumes that an antibody
can become active by recognizing a related antibody as well as an antigen of a specific
type. As antibodies form a network by recognizing each other, an antibody that has
once recognized an invading antigen can survive after the antigen has been removed.
Concerning the immune network model, several models have been proposed in the
field of computational biology [1, 7, 8], among which one of the simplest is employed
in this paper (Eqs. (1)-(5)).
Here Xi and Ai are the concentration (activation) values of antibody i and antigen
i, respectively. The s is a source term modeling a constant cell flux from the bone
marrow and r is the reproduction rate of the antigen, while kb and kg are the decay terms
of the antibody and antigen, respectively. The connection coefficients (each taking a value
in {0, WC, SC}) indicate the
strength of the connectivity between antibodies i and j, and that between antibody
i and antigen j, respectively. The influence on antibody i of other connected antibodies
and antigens is calculated by the proliferation function (5), which has a log-bell form
with the maximum proliferation rate p.
Eq. (5) not only activates an antibody when it recognizes other antibodies
or antigens, but also suppresses the antibody if the influence of other objects is too
strong. Characteristics of immune systems such as immune response and tolerance^1
can be explained by the model [1, 7, 10].
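The activate-then-suppress behaviour attributed to the proliferation function can be illustrated with a log-bell curve. The sketch below assumes one form that is common in immune network modeling, f(h) = p * h/(h + theta1) * theta2/(h + theta2). This form, the threshold names theta1 and theta2, and the default values are assumptions made for illustration; they are not taken from the text above.

```python
import math


def proliferation(h, p=1.0, theta1=1e3, theta2=1e6):
    """Assumed log-bell response to the total influence h (h >= 0).

    The first factor rises from 0 towards 1 as h passes theta1 (activation);
    the second falls from 1 towards 0 as h passes theta2 (suppression).
    """
    return p * (h / (h + theta1)) * (theta2 / (h + theta2))


# On a log scale the curve is bell shaped, with its peak at sqrt(theta1 * theta2).
peak = math.sqrt(1e3 * 1e6)
for h in (0.0, 1e2, peak, 1e8):
    print(f"h = {h:12.1f}  f(h) = {proliferation(h):.4f}")
```

With this shape, a weak influence leaves an antibody nearly inactive, a moderate influence activates it, and an excessive influence suppresses it again, which is the nonlinearity the landmark extraction relies on.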
The dynamics and the stability of the immune network model have been analyzed
by fixing the structure or the topology of the network [1, 7, 10]. As the structure of
the keyword network generated in the proposed method is defined by
the occurrence of keywords in a set of documents, the analysis noted above is not
applicable. However, considering the combinations of activation states
of connected antibodies leads to the following constraints [13]:
• An antibody can take one of 4 states in terms of activation value: virgin state,
suppressed state, weakly-activated state, and highly-activated state.
• It is unstable for two antibodies connected to each other to both be in the
highly-activated state at the same time.
• When several antibodies connect to the same antibody in the highly-activated
state, those with a strong connection^2 are suppressed, while
those with a weak connection become weakly-activated.
Applying such a nonlinear activation mechanism of the immune network model makes it
possible to satisfy the following contradictory conditions for a landmark.
^1 Tolerance refers to the fact that the immune system of a body does not attack the
body's own cells.
^2 As noted in Section 2.3, there are two types of connections in terms of strength.
• A landmark should form a keyword cluster with a certain number of connected
keywords.
• There should not exist any connection between landmarks.
2.3 Algorithm of Immune Network-based Clustering
In this paper, the immune network model (Eqs. (1)-(5)) is applied to the calculation of
activation values of keywords, by considering a keyword as an antibody and a document
as an antigen. The algorithm is as follows:
1. Extraction of keywords (nouns) from a document set using a morphological
analyzer^3 and a stopword list. In this paper, only the keywords contained in
more than 2 documents are extracted.
2. Construction of the keyword network by connecting each extracted keyword ki to
other keywords kj or documents dj.
(a) Connection between ki and kj (Dij indicates the number of documents
containing both keywords):
Strong connection (SC): Dij > Tk
Weak connection (WC): 0 < Dij < Tk
(b) Connection between ki and dj (TFij indicates the term frequency of ki in
dj):
SC: TFij > Td
WC: 0 < TFij < Td
3. Calculation of the keywords' activation values on the constructed network, based on
the immune network model (Eqs. (1)-(5)).
4. Extraction, as landmarks, of the keywords whose activation is much higher than that
of the others after convergence.
5. Generation of document clusters according to the landmarks.
In Step 4, convergence means that the same set of keywords always becomes
active. In most of the experiments, the same set of keywords is observed to
have much higher (about 100 times) activation values than the others [11] after 1,000
iterations of the calculation.
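Steps 1, 2 and 5 of the algorithm can be sketched as follows. The activation dynamics of Steps 3 and 4 are not reproduced, so the landmarks are passed in as given. The function names and the toy documents are hypothetical, and treating a count exactly equal to the threshold as a strong connection is an assumption, since the thresholds as printed leave that boundary case open.

```python
from collections import Counter
from itertools import combinations

SC, WC = 1.0, 1e-3   # strengths of strong and weak connections


def build_network(docs, t_k=3, t_d=3, min_docs=3):
    """docs: one list of extracted nouns per document (repeats allowed)."""
    tf = [Counter(d) for d in docs]
    doc_freq = Counter(k for counts in tf for k in counts)
    # Step 1: keep keywords contained in more than 2 documents.
    keywords = sorted(k for k, n in doc_freq.items() if n >= min_docs)
    # Step 2(a): keyword-keyword links from the co-occurrence counts Dij.
    cooc = Counter()
    for counts in tf:
        present = [k for k in keywords if k in counts]
        cooc.update(combinations(present, 2))
    # Assumption: Dij equal to the threshold counts as a strong connection.
    kk = {pair: (SC if d >= t_k else WC) for pair, d in cooc.items()}
    # Step 2(b): keyword-document links from the term frequency TFij.
    kd = {}
    for j, counts in enumerate(tf):
        for k in keywords:
            if counts[k] > 0:
                kd[(k, j)] = SC if counts[k] >= t_d else WC
    return keywords, kk, kd


def clusters_from_landmarks(docs, landmarks):
    """Step 5: documents containing the same landmark form one cluster."""
    return {lm: [j for j, d in enumerate(docs) if lm in d] for lm in landmarks}


def coverage(clusters, n_docs):
    """Fraction of documents that belong to at least one cluster."""
    covered = set().union(*clusters.values()) if clusters else set()
    return len(covered) / n_docs


docs = [["a", "a", "a", "b"], ["a", "b"], ["a", "b", "c"], ["c", "d"], ["c"]]
keywords, kk, kd = build_network(docs)
clusters = clusters_from_landmarks(docs, ["a"])
print(keywords, kk, clusters, coverage(clusters, len(docs)))
```

Because only documents containing a landmark are clustered, the coverage can be below 100%, unlike a partitioning method that assigns every document to some cluster.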
^3 As the current system is implemented to handle Japanese documents, the Japanese
morphological analyzer ChaSen (http://chasen.aist-nara.ac.jp/) is used to extract nouns.
Table 1: Parameter Settings Used in the Experiments
Parameter     Value     Parameter     Value
s             10        Xi(0)         10
r             0.01      Ai(0)         10^5
kg            10^-4     Tk            3
kb            0.4       Td            3
(unreadable)  10^3      SC            1.0
(unreadable)  10^6      WC            10^-3
p             1.0
3 Experimental Results
The quality of clusters generated by the proposed clustering method is compared with
that of k-means clustering [3], whose applicability has been widely demonstrated in many
applications.
While k-means generates clusters so that every data item (document) in a set is
covered by one of the generated clusters, the proposed method does not intend to cover
all the documents. Across many experiments, 60-80% of a document
set is observed to be covered by the generated clusters. Therefore, it is meaningless to
compare the two methods in terms of coverage. In this paper, questionnaires are conducted
to compare the clusters generated by the proposed method with those generated by k-means,
from the following viewpoints.
• Coherence: how closely the documents within a cluster relate to each other.
• Understandability: how easily the topic of a cluster can be understood by users.
The sets of documents used for the experiments are collected from the following
online news sites.
Set1 Documents in the entertainment category of the Yahoo! Japan News site^4, released on
September 18, 2001. 75 keywords are extracted from 25 documents.
Set2 Documents in the entertainment category of the Yahoo! Japan News site, released on
September 21, 2001. 62 keywords are extracted from 24 documents.
Set3 Documents in the local news category of Lycos Japan^5, released on September 28,
2001. 22 keywords are extracted from 23 documents.
The parameter values used in the experiments are shown in Table 1. These values are
determined empirically, based on the values used in the field of computational biology [1,
7, 8].
STATISTICA2000 (Statistica Soft, Inc.) is used to perform k-means clustering.
The number of clusters generated by k-means, which has to be determined in advance,
is set equal to the number of clusters generated by the proposed clustering
method. Naive k-means clustering tends to generate clusters of various sizes,
and sometimes generates a cluster containing only one document, which is excluded
from the questionnaires.
The questionnaires are answered by 9 subjects, consisting of researchers and students.
Each subject is asked to evaluate the clustering results of 2 document sets, one
Table 2: Comparison of Clustering Results between Proposed Method and K-means Clustering
Data  Item                      Proposed  K-means
Set1  Number of clusters        5         4
      Variance of cluster size  0.48      3.6
      Average score             4.33      3.90
      Score > 3.5               5         2
      2.5 < Score < 3.5         0         1
      Score < 2.5               0         1
Set2  Number of clusters        5         4
      Variance of cluster size  0.32      4.625
      Average score             3.82      3.13
      Score > 3.5               4         1
      2.5 < Score < 3.5         1         2
      Score < 2.5               0         1
Set3  Number of clusters        5         5
      Variance of cluster size  0.48      4.25
      Average score             2.3       4.00
      Score > 3.5               1         4
      2.5 < Score < 3.5         1         0
      Score < 2.5               3         1
generated by the proposed method and another by k-means. Subjects are not
told which method generated each result.
In the questionnaires, the documents in a cluster and the related keywords are presented
for each cluster. For the proposed method, the related keywords are the landmarks
and their suppressing keywords. For the k-means clustering method, the keywords
whose weights in the cluster center are higher than the others are used as the related
keywords. The number of related keywords for the proposed method is not fixed, while
5 related keywords are presented for each k-means cluster.
Subjects rate the coherence of each cluster on a 5-grade scale, from 5 (closely
related) to 1 (not related). As for understandability, subjects are asked to mark
the related keywords that seem to represent the topic of a cluster^6.
Table 2 shows the number of clusters, the variance of cluster size, the average score of
clusters, and the score distribution of the clustering results generated by both methods
from the 3 document sets.
The table shows that the proposed method (Proposed) obtains better
results than k-means clustering (K-means) for Set1 and Set2. The reason why the
proposed method does not obtain a good result for Set3 seems to be related to the fact that
the number of keywords extracted from Set3 is much smaller than that from Set1 and
Set2. That is, there seem to be fewer topical keywords in the local news category
than in the entertainment category. Extracting not only keywords but also phrases will
be required to handle this problem.
It is observed that some clusters are generated by both the proposed method
and the k-means clustering method. As k-means clustering tends to generate one large
cluster, which leads to a large variance of cluster size as shown in Table 2, it is also
observed that some clusters generated by the proposed method are subsets of a cluster
generated by k-means. Table 3 and Table 4 show the distribution of scores of the
clusters, divided into the cases where the cluster is generated by both methods (SAME),
6 Multiple keyword selection for a cluster is allowed.
Table 3: Score Distribution of Clusters Generated by Plastic Clustering Method
Type \ Score  1        2        3        4        5        Total
SAME          0(0%)    2(14%)   0(0%)    7(50%)   5(36%)   14(100%)
SUBSET        1(8%)    2(15%)   0(0%)    8(62%)   2(15%)   13(100%)
DIFFERENT     4(22%)   1(6%)    0(0%)    10(55%)  3(17%)   18(100%)
TOTAL         5(11%)   5(11%)   0(0%)    25(56%)  10(22%)  45(100%)
Table 4: Score Distribution of Clusters Generated by K-means Clustering Method
Type \ Score  1        2        3        4        5        Total
SAME          1(7%)    1(7%)    0(0%)    6(43%)   6(43%)   14(100%)
SUBSET        1(10%)   2(20%)   0(0%)    4(40%)   3(30%)   10(100%)
DIFFERENT     2(20%)   2(20%)   0(0%)    2(20%)   4(40%)   10(100%)
TOTAL         4(12%)   5(15%)   0(0%)    12(35%)  13(38%)  34(100%)
the cluster generated by the proposed method is a subset of a cluster of k-means
(SUBSET), and others (DIFFERENT). From these tables, it can be seen that the
clusters generated by both methods obtain higher scores than the others. Although
the scores of clusters in SUBSET and DIFFERENT are lower than those in SAME, the
proposed method obtains good scores (4 and 5) in these cases more often than k-means clustering does.
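The comparison of good scores can be checked directly from the counts in Table 3 and Table 4. The following is a minimal sketch of that arithmetic; the function name good_share and the dictionary layout are illustrative choices, not part of the original study.

```python
# Number of clusters per score 1..5, copied from Table 3 (proposed method)
# and Table 4 (k-means). Index 0 holds score 1, index 4 holds score 5.
proposed = {
    "SAME": [0, 2, 0, 7, 5],
    "SUBSET": [1, 2, 0, 8, 2],
    "DIFFERENT": [4, 1, 0, 10, 3],
}
kmeans = {
    "SAME": [1, 1, 0, 6, 6],
    "SUBSET": [1, 2, 0, 4, 3],
    "DIFFERENT": [2, 2, 0, 2, 4],
}


def good_share(counts):
    """Fraction of clusters that were rated 4 or 5."""
    return (counts[3] + counts[4]) / sum(counts)


for kind in ("SAME", "SUBSET", "DIFFERENT"):
    print(kind,
          round(good_share(proposed[kind]), 2),
          round(good_share(kmeans[kind]), 2))
```

With these counts, the share of clusters rated 4 or 5 is 12/14 for both methods in SAME, 10/13 against 7/10 in SUBSET, and 13/18 against 6/10 in DIFFERENT.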
As for the understandability, Table 5 shows the ratio of the related keywords that
are marked by more than one subject among the related keywords presented to them.
It is shown in Table 5 that the ratio becomes high when the clustering results obtain
high scores in terms of coherence, i.e., the results of Set1 and Set2 by the proposed
method, and the results of Set1 and Set3 by the k-means clustering method. That is, a
cluster with a high score relates to a certain, obvious topic, which can be understood by
several subjects from the same viewpoint.
4 WWW Information Visualization System with Immune Network Metaphor
An information visualization system is one of the promising approaches for handling the
growing WWW information resource. An information visualization system that aims
to support the browsing process often tries to make a link structure easy to understand by
using 3D graphics as well as by introducing interaction with the user [16]. When an
information visualization system is designed to support the information retrieval process
using WWW search engines, it often employs a document clustering method for
improving the efficiency of browsing retrieval results [4, 18, 19].
On the other hand, a keyword map [6, 9, 12, 16], which has not been so well known in
Table 5: Ratio of Keywords Extracted More Than Once
Document Set  Proposed  K-means
Set1          0.286     0.304
Set2          0.368     0.095
Set3          0.167     0.241
the field of WWW information visualization, is useful to visualize the topic distribution
over a set of documents. Visualizing the topic distribution is also expected to be suitable
for supporting the interactive information gathering process.
In the proposed method, as a landmark suppresses the related keywords on the
constructed keyword network, this relationship among keywords is also useful as a
metaphor to improve the understandability of a keyword map, as shown in Fig. 1. While
the ordinary keyword map uses only the distance information, the immune network
metaphor is used to improve the keyword map by emphasizing the keyword cluster
of which the representative is a landmark. In Fig. 1, the immune network metaphor
is incorporated into the spring model [16], so that the spring constant of a spring
connected to a landmark can be set to be stronger than the others, and the length of the
spring between landmarks can be set to be longer than the others. A landmark is indicated
in white, while a dark-colored keyword is one suppressed by a landmark. From
Fig. 1, five distinct topics represented by landmarks and their related keywords can
be seen clearly, while the suppressed keywords "Terrorism" and "Simultaneous" are
arranged near the center of the map, because the topic of the N.Y. tragedy is contained
in many documents.
Figure 1: Keyword Map Generated from Set1
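The way the metaphor enters the spring model can be illustrated with a small sketch. It assumes a plain Hooke's-law layout in two dimensions; the function names (spring_params, relax, distance), the numeric constants and the update rate are hypothetical values chosen only for illustration and are not taken from the system described here.

```python
import math


def spring_params(a, b, landmarks, k_base=1.0, k_landmark=3.0,
                  len_base=1.0, len_landmarks=3.0):
    """Return (spring constant, natural length) for the spring a-b.

    Springs attached to a landmark are stronger; springs between two
    landmarks are longer. All constants are illustrative assumptions.
    """
    k = k_landmark if (a in landmarks or b in landmarks) else k_base
    length = len_landmarks if (a in landmarks and b in landmarks) else len_base
    return k, length


def distance(p, q):
    return math.hypot(q[0] - p[0], q[1] - p[1])


def relax(pos, edges, landmarks, steps=200, rate=0.05):
    """Move keywords along the spring forces until the layout settles."""
    pos = {name: list(p) for name, p in pos.items()}
    for _ in range(steps):
        force = {name: [0.0, 0.0] for name in pos}
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            dist = math.hypot(dx, dy) or 1e-9
            k, length = spring_params(a, b, landmarks)
            f = k * (dist - length)  # positive: stretched, pulls a and b together
            fx, fy = f * dx / dist, f * dy / dist
            force[a][0] += fx
            force[a][1] += fy
            force[b][0] -= fx
            force[b][1] -= fy
        for name in pos:
            pos[name][0] += rate * force[name][0]
            pos[name][1] += rate * force[name][1]
    return pos


# Two landmarks and one suppressed keyword attached to the first landmark.
layout = relax({"L1": (0.0, 0.0), "L2": (1.0, 0.0), "a": (0.0, 0.5)},
               [("L1", "L2"), ("L1", "a")], {"L1", "L2"})
```

In this toy layout the two landmarks settle at the longer natural length, while the suppressed keyword stays close to its landmark, which is the visual effect the metaphor aims at.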
5 Conclusion
A clustering method based on the immune network model is proposed to visualize the
topic distribution over a document set found on the WWW. The method extracts
the keywords that can be used as the landmarks of the major topics in a document set,
while the document clustering is performed with the keywords. The proposed method
employs the immune network model to calculate the activation values of keywords.
Questionnaires are conducted to compare the clusters generated by the proposed
method and those generated by the k-means clustering method. The results show
that the proposed method obtains better results in terms of coherence than k-means
in two of three document sets. From the viewpoint of understandability, it is shown
that the landmarks and their related keywords can represent the topic of the cluster.
Furthermore, the immune network metaphor is incorporated into an ordinary keyword
map to improve its understandability. As future work, the ways of incorporating
the immune network model into a keyword map will be considered to further
improve the understandability of a keyword map.
References
[1] Anderson, R. W., Neumann, A. U., Perelson, A. S., "A Cayley Tree Immune Network
Model with Antibody Dynamics," Bulletin of Mathematical Biology, 55, 6, pp. 1091–1131,
1993.
[2] Cole, C., "Interaction with an Enabling Information Retrieval System: Modeling the
User's Decoding and Encoding Operations," Journal of the American Society for Information
Science, 51, 5, pp. 417–426, 2000.
[3] Duda, R. O., Hart, P. E., Stork, D. G., "10. Unsupervised Learning and Clustering," in
Pattern Classification (2nd Ed.), Wiley, New York, 2000.
[4] Hearst, M. A. and Pedersen, J. O., "Reexamining the Cluster Hypothesis:
Scatter/Gather on Retrieval Results," SIGIR '96, pp. 76–84, 1996.
[5] Jerne, N. K., "The Immune System," Sci. Am., 229, pp. 52–60, 1973.
[6] Lagus, K., Honkela, T., Kaski, S., Kohonen, T., "Self-Organizing Maps of Document
Collection: A New Approach to Interactive Exploration," 2nd Int'l Conf. on Knowledge
Discovery and Data Mining, pp. 238–243, 1996.
[7] Neumann, A. U. and Weisbuch, G., "Dynamics and Topology of Idiotypic Networks,"
Bulletin of Mathematical Biology, 54, 5, pp. 699–726, 1992.
[8] Smith, D. J., Forrest, S., Perelson, A. S., "Immunological Memory is Associative," Int'l
Workshop on the Immunity-Based Systems (IBMS'96), 1996.
[9] Sumi, Y., Nishimoto, K., Mase, K., "Facilitating Human Communication in Personalized
Information Spaces," AAAI-96 Workshop on Internet-Based Information Systems, pp.
123–129, 1996.
[10] Sulzer, B. et al., "Memory in Idiotypic Networks Due to Competition Between
Proliferation and Differentiation," Bulletin of Mathematical Biology, 55, 6, pp. 1133–1182,
1993.
[11] Takama, Y. and Hirota, K., "Application of Immune Network Model to Keyword Set
Extraction with Variety," 6th Int'l Conf. on Soft Computing (IIZUKA2000), pp. 825–830,
2000.
[12] Takama, Y. and Hirota, K., "Development of Visualization Systems for Topic
Distribution based on Query network," SIG-FAI-A003, pp. 13–18, 2000.
[13] Takama, Y. and Hirota, K., "Employing Immune Network Model for Clustering with
Plastic Structure," 2001 IEEE Int'l Symp. on Computational Intelligence in Robotics
and Automation (CIRA2001), pp. 178–183, 2001.
[14] Takama, Y. and Hirota, K., "Consideration of Memory Cell for Immune Network-based
Plastic Clustering method," InTech'2001, pp. 233–239, 2001.
[15] Takama, Y. and Ishizuka, "FISH VIEW System: A Document Ordering Support System
Employing Concept-structure-based Viewpoint Extraction," J. of Information Processing
Society of Japan (IPSJ), 42, 7, 2000 (written in Japanese).
[16] Takasugi, K. and Kunifuji, S., "A Thinking Support System for Idea Inspiration Using
Spring Model," J. of Japanese Society for Artificial Intelligence, 14, 3, pp. 495–503, 1999
(written in Japanese).
[17] Watanabe, I., "Visual Text Mining," J. of Japanese Society for Artificial Intelligence,
16, 2, pp. 226–232, 2001 (written in Japanese).
[18] Zamir, O. and Etzioni, O., "Grouper: A Dynamic Clustering Interface to Web Search
Results," Proc. 8th Int'l WWW Conference, 1999.
[19] Zamir, O. and Etzioni, O., "Web Document Clustering: A Feasibility Demonstration,"
Proc. SIGIR'98, pp. 46–54, 1998.
Active Mining
H. Motoda (Ed.)
IOS Press, 2002
Interactive Web page Retrieval with Relational
Learning based Filtering Rules
Masayuki Okabe
okabe@mm.media.kyoto-u.ac.jp
Japan Science and Technology CREST
Yoshida-Nihonmatsu-Cho, Sakyo-ku, Kyoto 606-8501, JAPAN
Seiji Yamada
yamada@ymd.dis.titech.ac.jp
CISS, IGSSE, Tokyo Institute of Technology
4259 Nagatuta-Cho, Midori-ku, Yokohama 226-8502, JAPAN
Abstract. WWW Search Engines usually return a hit-list including
many irrelevant pages because most users just input a few words
as a query, which is not enough to specify their information needs. In this
paper we propose a system which applies relevance feedback to the
interactive process between users and Web Search Engines, and improves
the effectiveness of the process by using a query-specific filter. This filter
is a set of rules which represents the characteristics of Web pages that a
user marked as relevant, and is used to find new relevant Web pages from
unidentified pages in a hit-list. Each of the rules is made of logical and
proximity relationships among keywords which exist in a certain range
of a Web page. That range is one of the areas partitioned by four kinds
of HTML tags. The filter is made by a learning algorithm which adopts
a separate-and-conquer strategy and top-down heuristic search with limited
backtracking. In experiments with 20 different kinds of retrieval
tests, we demonstrate that, as the number of feedbacks increases, our
proposed system makes it possible to get more relevant pages than
without the system. We also analyze how the filters work.
1 Introduction
With the rapid growth of the WWW, there are various information sources on the Internet
today. Search engines are indispensable tools to access useful information which might
exist somewhere on the Internet. While they have been getting higher capability to meet
various information needs and large amounts of transactions, they are still insufficient
in their ability to support users who want to collect a certain number of Web pages
which are relevant to their requirements.
When a user inputs a query, which is usually composed of a few words [1], search
engines return a "hit-list" in which many Web pages are presented in a certain order.
However, the order often does not reflect the user's intent, and thus the user wastes much
time and energy on judging Web pages in the hit-list.
To resolve this problem and to provide an efficient retrieval process, we propose a system
which mediates between users and search engines in order to select only relevant Web
pages out of a hit-list through the interactive process called "relevance feedback" [8].
Given some Web pages marked with their relevancy (relevant or non-relevant) by a user,
this system generates a set of filtering rules, each of which is a rule to decide whether
32 M. Okabe and S. Yamada / Interactive Web Page Retrieval
Figure 1: Interactive Web search
the user should look at a Web page or not. The system constructs filtering rules from the
combinations of keywords, relational operators and tags by a learning algorithm which
is well suited to learning structural patterns. We have developed this basic framework in
document retrieval [6] and found our approach promising. In this paper, we apply
this method to an intelligent interface which coordinates the hit-lists of search engines
so that individual users can easily find the information they want.
The remainder of the paper is organized as follows. Section 2 describes the
interactive process and how to apply filtering rules. Section 3 describes the
representation and the learning algorithm of filtering rules. Section 4 shows the results
of retrieval experiments to evaluate our system.
2 Interactive Web search with relevance feedback
Figure 1 shows the overview of interactive Web search with relevance feedback. In this
section, we explain the procedure of each step in this search process. The numbers
assigned to the steps correspond to the numbers in circles in Figure 1.
1. Initial search: A user inputs a query (a set of terms) to our Web search system.
Then the system passes the query to a search engine and obtains a hit-list.
2. Evaluation of results by a user: After getting a hit-list from a search engine,
the system asks the user to evaluate and mark the relevancy (relevant or
non-relevant) of a small part of the Web pages in the hit-list (usually the top 10 pages),
and stores those pages as training pages: the relevant pages as positive
training pages and the non-relevant pages as negative training pages.
3. Analyzing training pages: Then the system breaks up each positive training
page into the minimal elements which can be a part of filtering rules. The concrete
procedures are as follows.
Original hit list: No.1 page1 (x), No.2 page2 (O), No.3 page3 (x), No.4 page4 (O), No.5 page5 (O)
Modified hit list: No.1 page2, No.2 page4, No.3 page5
O : marked as relevant by a set of filtering rules
x : marked as non-relevant by a set of filtering rules
Figure 2: Filtering Web Pages
• Generating candidates for additional keywords: The extended keywords are
the terms which can be substituted for the arguments of a predicate. It is
often said that users usually input only a few terms, which are quite
insufficient not only for specifying Web pages but also for making effective filtering
rules; thus this procedure is very important to widen the variations of rule
representation. Our system uses the TFIDF method [4] to extract additional
keywords.
• Generating literals for constructing bodies of filtering rules: Using the
extended keywords, the system generates literals which can be elements
of the body of each filtering rule. These literals are called
a condition candidate set and are used to construct the body of a filtering rule.
4. Generating filtering rules by learning: Using the condition candidate set,
the system generates filtering rules by relational learning. The detailed procedure
is described in the next section.
5. Modify a query and re-searching: The system expands the query using terms
which have been extracted through the analysis of training pages. Then the
modified query is input into a search engine and the new results are obtained.
6. Select and indicate the Web pages satisfying filtering rules: As shown
in Figure 2, the system selects the Web pages satisfying the filtering rules from
the hit-list returned by the search engine, and presents them to the user. The pages
which the user has already evaluated are excluded from the presentation.
The information retrieval is done using the above procedures, and the steps from 2
to 6 are repeated until the user collects enough relevant pages.
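Step 6 can be sketched as a simple pass over the hit-list. This is a minimal sketch under stated assumptions: the function filter_hit_list, the is_relevant callback standing in for a learned rule set, and the limit parameter are hypothetical names introduced here for illustration.

```python
def filter_hit_list(hit_list, is_relevant, already_judged, limit=10):
    """Return up to `limit` pages, in hit-list order, that satisfy the
    filtering rules and have not been evaluated by the user yet."""
    selected = []
    for page in hit_list:
        if page in already_judged:
            continue  # pages the user has already evaluated are excluded
        if is_relevant(page):
            selected.append(page)
        if len(selected) == limit:
            break
    return selected


# Reproduces the situation of Figure 2 with a stand-in for the rule set.
hit_list = ["page1", "page2", "page3", "page4", "page5"]
marked_relevant = {"page2", "page4", "page5"}  # hypothetical rule outcome
modified = filter_hit_list(hit_list, lambda p: p in marked_relevant, set())
```

Keeping the hit-list order means the search engine's ranking is preserved among the pages that pass the filter.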
This system provides the following two functions, which are used for filtering the
results of simple relevance feedback.
• Modify a query and re-searching (corresponding to Step 5)
• Select and indicate the Web pages satisfying filtering rules (corresponding to
Step 6)
The search engine usually selects the candidates of relevant Web pages and ranks
them before returning a hit-list. By modifying a query and re-searching, the system is
able to modify the ranking. Also, by selecting and indicating the Web pages satisfying
filtering rules, the filter is modified.
The modification of a query is done by using the query expansion techniques which
have been well studied in information retrieval [9, 10]. Thus we omit the discussion
on the modification of a query in this paper. We develop the representation and generation
of filtering rules using the structure of an HTML file in the next section.
3 Filtering rules
This section explains the representation and the generation of filtering rules in detail.
We deal with the construction of filtering rules as inductive learning in machine
learning, in which relevant and non-relevant pages indicated by the user are used as training
examples.
3.1 Rule representation
We use Horn clauses to represent filtering rules. The body of a rule consists of the
following predicates standing for relations between terms and tags.
• ap(region_type, word) : This predicate is true iff the word word appears within a
region of region_type in a Web page.
• near(region_type, word1, word2) : This predicate is true iff both of the words word1 and
word2 appear within a sequence of 10 words somewhere in a region of region_type of
a Web page. The ordering of the two words is not considered.
The predicates ap and near represent basic relations between keyword(s) and the
position of the keyword(s). Several types of relations among keywords can be assumed;
however, we use only the neighbor relation because it has been proven to be very useful in
several studies [2, 5].
Furthermore, we can easily consider that the importance of words significantly
depends on the tags of HTML. For example, the words within <TITLE> seem to have
significant meaning because they indicate the theme of the Web page. Hence we use
the region_type to restrict the tag with which words are surrounded. We prepare the
following region_types.
• title : The region surrounded with title tags <TITLE>.
• anchor : The region surrounded with anchor tags <A>. For example, the <A
HREF=. . . >.
• head : The region surrounded with heading tags <H1> to <H4>.
• para : The region surrounded with paragraph tags <P>. This means the region
of the same paragraph.
We can represent various features of pages by combining these relations. Here is an
example set of rules.
relevant :- ap(title, mobile), ap(anchor, PDA).
relevant :- near(para, palm, os).
Filtering rules are interpreted as a disjunction. Thus if any rule is satisfied in a Web page,
the page will be considered relevant, and otherwise non-relevant. The above filtering
rules mean that a Web page is relevant if "mobile" appears in the title and "PDA"
appears in an anchor text, or "palm" and "OS" appear near each other in the same paragraph.
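The two predicates and the disjunctive reading of a rule set can be sketched as follows. The page representation (a dictionary from region type to a list of regions, each a list of lowercased words) and the function names are assumptions made for this illustration; the original system's data structures are not described at this level.

```python
WINDOW = 10  # "within a sequence of 10 words"


def ap(page, region_type, word):
    """True iff `word` appears in some region of the given type."""
    return any(word in region for region in page.get(region_type, []))


def near(page, region_type, word1, word2, window=WINDOW):
    """True iff both words fall inside one run of `window` consecutive
    words of a single region; the order of the two words is ignored."""
    for region in page.get(region_type, []):
        pos1 = [i for i, w in enumerate(region) if w == word1]
        pos2 = [i for i, w in enumerate(region) if w == word2]
        if any(i != j and abs(i - j) < window for i in pos1 for j in pos2):
            return True
    return False


# The example rule set: each rule is a conjunction of literals.
rules = [
    [lambda p: ap(p, "title", "mobile"), lambda p: ap(p, "anchor", "pda")],
    [lambda p: near(p, "para", "palm", "os")],
]


def relevant(page, rules):
    """A page is relevant if any rule has all of its literals satisfied."""
    return any(all(literal(page) for literal in rule) for rule in rules)


page_a = {"title": [["palm", "news"]],
          "para": [["the", "palm", "os", "is", "popular"]]}
page_b = {"title": [["mobile", "phones"]], "anchor": [["home"]]}
```

Here page_a satisfies the second rule, while page_b satisfies only one literal of the first rule and is therefore treated as non-relevant.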
Input: E+ : a set of positive training pages, E- : a set of negative training pages
       C : a condition candidate set, K : a set of extended keywords
Output: R : a set of filtering rules
Variables: rule : a filtering rule, S : a set of exception literals,
           l1 : an exception literal
Initialize: K <- a set of words in a query; R, S, l1 <- empty; rule <- relevant :- .
Repeat
 1: Investigate the number p of positive training pages satisfying the rule
    and the number n of negative training pages satisfying the rule.
 2: if n = 0 then
 3:   • Add rule to R.
 4:   • Remove a positive training page satisfying the rule from E+.
 5:   if E+ is empty then Finish
 6:   else Initialize rule, S, l1.
 7: else
 8:   • For all literals in C \ S, compute the information gain G.
 9:   if No literal with G > 0 then
10:     if the body of the rule is empty then
11:       • Add a keyword to K.
12:       • Update C.
13:     else
14:       • Initialize S and rule.
15:       • Add l1 to S, and initialize l1.
16:   else
17:     • Select lmax having the maximum G.
18:     if the body of the rule is empty, then l1 := lmax
19:     • Add lmax to rule, and S.
Figure 3: Learning Algorithm
3.2 Learning algorithm
Figure 3 shows the learning algorithm for making filtering rules. This algorithm is based
on the first-order learning system FOIL [7], which adopts a greedy separate-and-conquer
strategy [3]. The algorithm generates filtering rules one by one, and adds each generated
rule to R. When a rule is generated, the pages covered by the rule are removed from
the set of positive training pages E+. Thus, as the number of generated filtering rules
increases, E+ shrinks, and the algorithm finishes when E+ becomes empty (steps 3-5).
In the generation of a single filtering rule, literals are added to the body one by
one (step 19), and the rule is established if it covers no negative training page (step 2).
The added literal is selected from a condition candidate set C. This C consists of the
literals that have the region_types and keywords in K as their arguments and are
satisfied in training pages. Concretely, the following two types of literals are used.
• The ap literals having all of the region_types and keywords in K as their arguments
and being satisfied in training pages.
• The near literals having all of the region_types and keywords in K as their
arguments and being satisfied in training pages.
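The construction of C can be sketched as an enumeration over region types and keywords followed by a check against the training pages. This sketch assumes that "satisfied in training pages" means satisfied in at least one training page, that a page is a dictionary from region type to lists of lowercased words, and that literals are plain tuples; the names holds and candidate_set are hypothetical.

```python
from itertools import combinations

REGION_TYPES = ("title", "anchor", "head", "para")


def holds(literal, page, window=10):
    """Evaluate an ("ap", region, w) or ("near", region, w1, w2) literal."""
    kind, region_type, *words = literal
    for region in page.get(region_type, []):
        if kind == "ap" and words[0] in region:
            return True
        if kind == "near":
            pos1 = [i for i, w in enumerate(region) if w == words[0]]
            pos2 = [i for i, w in enumerate(region) if w == words[1]]
            if any(abs(i - j) < window for i in pos1 for j in pos2):
                return True
    return False


def candidate_set(keywords, training_pages):
    """Enumerate ap/near literals over all region types and keywords in K,
    keeping those that hold in at least one training page (an assumption)."""
    words = sorted(keywords)
    literals = [("ap", r, w) for r in REGION_TYPES for w in words]
    literals += [("near", r, w1, w2) for r in REGION_TYPES
                 for w1, w2 in combinations(words, 2)]
    return [lit for lit in literals
            if any(holds(lit, page) for page in training_pages)]


pages = [{"title": [["airport", "security"]],
          "para": [["security", "at", "the", "airport"]]}]
C = candidate_set({"airport", "security", "faa"}, pages)
```

For this single training page, C contains four ap literals and two near literals, and no literal mentions "faa" because that word does not occur in the page.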
The criterion for selecting a literal to be added to the body is based on
the information gain (step 8), which is popular in the learning of decision trees.
It is computed from the numbers of positive/negative training pages
before/after the addition of a literal. Using the information gain, the system is able to
select a literal which yields not only much information for a training page but also
many positive training pages satisfying it (step 17).
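A gain of this kind can be sketched with the standard FOIL formulation, which is an assumption here: the algorithm is described as FOIL-based, but the exact equation used by the system may differ. The names foil_gain and best_literal, and the (p1, n1) count table, are hypothetical.

```python
import math


def foil_gain(p0, n0, p1, n1):
    """FOIL-style gain of adding one literal to a rule.

    p0, n0: positive/negative training pages satisfying the rule before
    the addition; p1, n1: the same counts after the addition.
    """
    if p1 == 0:
        return 0.0  # no positive page left: treated as no gain in this sketch
    info_before = -math.log2(p0 / (p0 + n0))
    info_after = -math.log2(p1 / (p1 + n1))
    # Information gained per positive page, weighted by the positives kept.
    return p1 * (info_before - info_after)


def best_literal(counts_after, p0, n0):
    """Return (literal, gain) with the maximum gain, or None if no
    literal has a gain greater than zero (steps 9 and 17)."""
    scored = [(foil_gain(p0, n0, p1, n1), lit)
              for lit, (p1, n1) in counts_after.items()]
    scored = [(g, lit) for g, lit in scored if g > 0]
    if not scored:
        return None
    gain, literal = max(scored, key=lambda pair: pair[0])
    return literal, gain


# A rule covering 4 positive and 4 negative pages, and three candidates.
choice = best_literal({"A": (3, 0), "B": (4, 2), "C": (2, 2)}, 4, 4)
```

Candidate A keeps three positive pages and removes every negative one, so it wins over B, which keeps more positives but is less pure; C leaves the ratio unchanged and has zero gain.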
This rule construction using information gain is efficient because it is greedy.
However, it sometimes selects a bad literal and stops before completion. In such a case, if the
current rule has some literals in its body, the algorithm eliminates all the literals in its
body and restarts the rule making process. This backtracking is done for literals in C
except for the literal l1 which was first added to the body (steps 14, 15).
If the body of the current rule has no literal, a new keyword is added to K and C
is updated (steps 11, 12). The added keyword is selected from terms in the positive training
pages E+ by the following procedure.
1. Extract paragraphs from E+ using <P> tags.
2. Investigate the subset of the paragraphs including any word in the query; this
subset is called T.
3. Compute the importance for every word w_i in T by the following equation.
Importance of w_i = (average occurrence in T) × (the number of texts in which w_i occurs)
4. Select the word which has the maximum importance and is not included in the
query.
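The keyword selection procedure can be sketched as below. It assumes that paragraphs are already extracted as lists of lowercased words, that "average occurrence in T" means the total number of occurrences of a word divided by the number of paragraphs in T, and that "texts" are the paragraphs of T; select_keyword is a hypothetical name.

```python
from collections import Counter


def select_keyword(paragraphs, query):
    """Pick the most important non-query word from the paragraphs that
    contain at least one query word."""
    query = set(query)
    # Step 2: T is the subset of paragraphs including any query word.
    T = [p for p in paragraphs if query & set(p)]
    if not T:
        return None
    total = Counter(w for p in T for w in p)          # occurrences in T
    doc_freq = Counter(w for p in T for w in set(p))  # paragraphs containing w
    # Step 3: importance = average occurrence x number of texts with w.
    importance = {w: (total[w] / len(T)) * doc_freq[w]
                  for w in total if w not in query}
    if not importance:
        return None
    # Step 4: the word with the maximum importance.
    return max(importance, key=importance.get)


paragraphs = [
    ["airport", "security", "screening", "screening"],
    ["faa", "security", "screening"],
    ["hotel", "booking"],
]
keyword = select_keyword(paragraphs, ["airport", "security"])
```

In this toy input the third paragraph is ignored because it shares no word with the query, and "screening" is selected because it is both frequent and spread over both remaining paragraphs.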
Backtracking and the iterative literal making process are the main differences from the
algorithm in FOIL. They are very specific and empirical procedures. Without these
extensions, however, many useless rules would be generated.
4 Experiments and Results
To evaluate the effectiveness of filtering rules, we conducted retrieval experiments. The
question here is how many more relevant pages we can find with our proposed system
when we look over a fixed number of Web pages.
Figure 4: An example of topic
Figure 5: System Interface
4.1 Settings
We conducted two series of retrieval. The first is a retrieval from the original hit-list
returned by a search engine (retrieval1). In this retrieval, we judged 50 pages from
the top of the hit-list. The other is a retrieval using our system (retrieval2). In this
retrieval, we gave feedback after every 10 judged pages according to the procedure
described in Section 2, for a total of four feedbacks. The 10 pages after each feedback
are collected from the top of the hit-list (excluding pages we have already judged and
pages that do not satisfy the filtering rules). In both retrievals, a total of 50 pages from
the same hit-list were evaluated.
We used Google (footnote 1) as the test WWW search engine, which is recognized as one of
the most powerful search engines. For test questions, we used 20 topics (No. 401-420)
provided by the small web track in TREC-8 (footnote 2). This test collection is often used for
evaluating the performance of retrieval systems in the Information Retrieval community.
Figure 4 is an example of a topic, which is composed of four parts. The title part consists of
1 to 3 words. We used these title words as a query for the search engine. The relevance judgment
of each page is conducted by the same searcher according to the account written in the
description and the narrative part of each topic.
4.2 Interface
Figure 5 shows the system interface, which consists of a query input, a rule view, a title view
and several buttons. When users press the make rule button, filtering rules are
constructed and displayed in the rule view. We can see the rules directly, and thus we can find useful
patterns or keywords to retrieve relevant pages. Once rules are constructed, the system
starts to collect new relevant pages, and displays their titles in the title view. If the user
clicks a title, a browser opens and shows the clicked page.
4.3 Results
Figure 6 shows the relation between judged pages and relevant pages found in the
judged pages. The number of relevant pages is the average over the 20 topics. For the first
10 pages, there is no difference because both retrievals return the same pages. The
1 http://www.google.com
2 http://trec.nist.gov
Figure 6: The average number of relevant pages (horizontal axis: the number of judged pages)
Figure 7: Difference after the first feedback (total 20 pages judged; horizontal axis: topic number)
Figure 8: Difference after the second feedback (total 30 pages judged; horizontal axis: topic number)
Figure 9: Difference after the third feedback (total 40 pages judged)
Figure 10: Difference after the fourth feedback (total 50 pages judged)
difference in the number of relevant pages increases after the first feedback. As a result,
retrieval2 got about 5 more relevant pages than retrieval1 after four feedbacks. However,
the difference varies from topic to topic.
Figures 7 to 10 show the difference in relevant pages between retrieval1 and retrieval2
after each feedback. Let A be the number of relevant pages found in retrieval1 and B
be the number found in retrieval2; the difference D is calculated by D = B - A. In Figure 7, there
is little effect of our system because we judge only a small number of pages. In Figures 8
and 9, the effect gradually increases. In Figure 10, we can see the effect clearly. Our
system produces good results for most topics, except for a few topics such as no.4 and
no.11.
4.4 Effective and Ineffective filtering rules
As seen in the results, the retrieval which uses our system enhanced the effectiveness
for most topics. We show two types of examples: a good one in which our system worked
effectively, and a bad one in which our system did not work well.
Table 1: Filtering rules generated for topic no. 12
relevant :- ap(anchor,screening).
relevant :- near(para,security,system), ap(title,airport).
relevant :- near(para,security,airports), near(para,security,access).
relevant :- near(para,security,airports), near(para,faa,system).
Table 2: Filtering rules generated for topic no.11
relevant :- ap(anchor,shipwreck).
relevant :- ap(anchor,shipwreck), ap(anchor,salvaging).
Topic 12 is an example in which filtering rules worked most effectively. The objective
of topic 12 is "to identify a specific airport and describe the security measures already
in effect or proposed for use at that airport". The search engine returns many non-relevant
pages which introduce "the security which travelers must prepare". By removing such
pages with filtering rules, our system could provide proper results. Table 1 shows the
filtering rules generated for this topic. These rules represent the pages which introduce
specific security systems by using the words "faa" and "screening".
Topic 11 is an example in which filtering rules did not work well. The objective of this
topic is "To find information on shipwreck salvaging: the recovery or attempted recovery
of treasure from sunken ships". Relevant pages for this topic include various types of
pages such as links, bulletin boards, news and individual home pages. The filtering
rules generated for this topic are too general or too specific; thus they could not select
appropriate pages, which leads to the bad results. Table 2 shows the filtering rules
generated for this topic. These rules use only two keywords, and they are insufficient
to restrict relevant pages.
5 Conclusion
We described a system which enhances the effectiveness of WWW Search Engines by
using relevance feedback and relational learning. The main function of our system is
the application of filtering rules which are constructed by a relational learning technique.
We presented their representation and learning algorithm. Then we evaluated their
effectiveness through retrieval experiments. The results showed that our system enables us
to find more relevant pages, though the effect differs from question to question.
Our system needs quick response and moderate machine power. Thus it should be
a user-side application, because search engines cannot afford to attach such a function.
One of the future problems is to reduce the cost which users incur in judging pages. We
plan to apply clustering methods to this problem.
References
[1] Baeza-Yates, R. and Ribeiro-Neto, B.: Modern Information Retrieval, Addison-Wesley,
Wokingham, UK, (1999)
[2] Cohen, W.W.: Text categorization and relational learning, In Proceedings of the Twelfth
International Conference on Machine Learning, pp.124–132 (1995)
major-general in 1813, and employed in the North; but his
operations were unsuccessful, owing to a disagreement with Wade
Hampton. A court of inquiry in 1815 exonerated him, however; but
upon the reorganizing of the army, he was not retained in the
service, and retired to Mexico, where he had acquired large estates.
He died in the vicinity of the capital on the 28th of December, 1825.
CHEVALIER DE LA NEUVILLE.
Chevalier de la Neuville, born about 1740, came to this country
with his younger brother in the autumn of 1777, and tendered his
services to Congress. Having served with distinction in the French
army for twenty years, enjoying the favorable opinion of Lafayette,
and bringing with him the highest testimonials, he was appointed on
the 14th of May, 1778, inspector of the army under Gates, with the
promise of rank according to his merit at the end of three months.
He was a good officer and strict disciplinarian, but was not popular
with the army. Failing to obtain the promotion he expected, he
applied for permission to retire at the end of six months’ service. His
request was granted on the 4th of December, 1778, Congress
instructing the president that a certificate be given to Monsieur de la
Neuville in the following words:—
“Mr. de la Neuville having served with fidelity and
reputation in the army of the United States, in testimony of
his merit a brevet commission of brigadier has been granted
to him by Congress, and on his request he is permitted to
leave the service of these States and return to France.”
The brevet commission was to bear date the 14th of October,
1778. Having formed a strong attachment for General Gates, they
corresponded after De la Neuville’s return to France. In one of his
letters the chevalier writes that he wishes to return to America, “not
as a general, but as a philosopher,” and to purchase a residence near
that of his best friend, General Gates. He did not return, however,
and his subsequent history is lost amid the troubles of the French
Revolution.
JETHRO SUMNER.
Jethro Sumner, born in Virginia about 1730, was of English
parentage. Removing to North Carolina while still a youth, he took
an active part in the measures which preceded the Revolution, and
believed the struggle to be unavoidable. Having held the office of
paymaster to the Provincial troops, and also the command at Fort
Cumberland, he was appointed in 1776, by the Provincial Congress,
colonel in the Third North Carolina Regiment, and served under
Washington at the North. On the 9th of January, 1779, he was
commissioned brigadier-general, and ordered to join Gates at the
South. He took part in the battle of Camden, and served under
Greene at the battle of Eutaw Springs on the 8th of September,
1781, where he led a bayonet-charge. He served to the close of the
war, rendering much assistance in keeping the Tories in North
Carolina in check during the last years of the struggle, and died in
Warren County, North Carolina, about 1790.
JAMES HOGAN.
James Hogan of Halifax, North Carolina, was chosen to represent
his district in the Provincial Congress that assembled on the 4th of
April, 1776. Upon the organization of the North Carolina forces, he
was appointed paymaster of the Third Regiment. On the 17th of the
same month, he was transferred to the Edenton and Halifax Militia,
with the rank of major. His military services were confined to his own
State, though commissioned brigadier-general in the Continental
army on the 9th of January, 1779.
ISAAC HUGER.
Isaac Huger, born at Limerick Plantation at the head-waters of
Cooper River, South Carolina, on the 19th of March, 1742, was the
grandson of Huguenot exiles who had fled to America after the
revocation of the Edict of Nantes. Inheriting an ardent love of civil
and religious liberty, reared in a home of wealth and refinement,
thoroughly educated in Europe and trained to military service
through participation in an expedition against the Cherokee Indians,
he was selected on the 17th of June, 1775, by the Provincial
Congress, as lieutenant-colonel of the First South Carolina Regiment.
Being stationed at Fort Johnson, he had no opportunity to share in
the defeat of the British in Charleston Harbor, as Colonel Moultrie’s
victory at Sullivan’s Island prevented the premeditated attack on the city.
During the two years of peace for the South that followed, Huger
was promoted to a colonelcy, and then ordered to Georgia. His
soldiers, however, were so enfeebled by sickness, privation, and toil
that when called into action at Savannah, they could only show what
they might have accomplished under more favorable circumstances.
On the 9th of January, 1779, Congress made him a brigadier-
general; and until the capture of Charleston by the British in May,
1780, he was in constant service either in South Carolina or Georgia.
Too weak to offer any open resistance, the patriots of the South
were compelled for a time to remain in hiding, but with the
appearance of Greene as commander, active operations were
resumed.
Huger’s thorough knowledge of the different localities and his
frank fearlessness gained him the confidence of his superior officer,
and it was to his direction that Greene confided the army on several
occasions, while preparing for the series of engagements that
culminated in the evacuation of Charleston and Savannah. Huger
commanded the Virginia troops at the battle of Guilford Court-
House, where he was severely wounded; and at Hobkirk’s Hill he had
the honor of commanding the right wing of the army. He served to
the close of the war; and when Moultrie was chosen president, he
was made vice-president, of the Society of the Cincinnati of South
Carolina. Entering the war a rich man, he left it a poor one; he gave
his wealth as freely as he had risked his life, and held them both
well spent in helping to secure the blessings of liberty and
independence to his beloved country. He died on the 17th of
October, 1797, and was buried on the banks of the Ashley River,
South Carolina.
MORDECAI GIST.
Mordecai Gist, born in Baltimore, Maryland, in 1743, was
descended from some of the earliest English settlers in that State.
Though trained for a commercial life, he hastened at the beginning
of the Revolution to offer his services to his country, and in January,
1775, was elected to the command of a company of volunteers
raised in his native city, called the “Baltimore Independent
Company,”—the first company raised in Maryland for liberty. In 1776,
he rose to the rank of major, distinguishing himself whenever an
occasion offered. In 1777, he was made colonel, and on the 9th of
January, 1779, Congress recognized his worth by conferring on him
the rank of brigadier-general.
It is with the battle of Camden, South Carolina, that Gist’s name
is indissolubly linked. The British having secured the best position,
Gates divided his forces into three parts, assigning the right wing to
Gist. By a blunder in an order issued by Gates himself, the centre
and the left wing were thrown into confusion and routed. Gist and
De Kalb stood firm, and by their determined resistance made the
victory a dear one for the British. When the brave German fell, Gist
rallied about a hundred men and led them off in good order. In
1782, joining the light troops of the South, he commanded at
Combahee—the last engagement in the war—and gained a victory.
At the close of the war he retired to his plantation near Charleston,
where he died in 1792. He was married three times, and had two
sons, one of whom he named “Independent” and the other “States.”
WILLIAM IRVINE.
William Irvine, born near Enniskillen, Ireland, on the 3d of
November, 1741, was educated at Trinity College, Dublin. Though
preferring a military career, he adopted the medical profession to
gratify the wishes of his parents. During the latter part of the Seven
Years’ War between England and France, he served as surgeon on
board a British man-of-war, and shortly before the restoration of
peace, he resigned his commission, and coming to America in 1764,
settled at Carlisle, Pennsylvania, where he soon acquired a great
reputation and a large practice. Warm-hearted and impulsive, at the
opening of the Revolution he adopted the cause of the colonists as
his own, and after serving in the Pennsylvania Convention, he was
commissioned in 1776 to raise a regiment in that State. At the head
of these troops, he took part in the Canadian expedition of that year,
and being taken prisoner, was detained for many months. He was
captured a second time at the battle of Chestnut Hill, New Jersey, in
December, 1777. On the 12th of May, 1779, Congress conferred on
him the rank of brigadier-general. From 1782 until the close of the
war, he commanded at Fort Pitt,—an important post defending the
Western frontier, then threatened by British and Indians. In 1785, he
was appointed an agent to examine the public lands, and to him was
intrusted the administration of an act for distributing the donation
lands that had been promised to the troops of the Commonwealth.
Appreciating the advantage to Pennsylvania of having an outlet on
Lake Erie, he suggested the purchase of that tract of land known as
“the triangle.” From 1785 to 1795, he filled various civil and military
offices of responsibility. Being sent to treat with those connected
with the Whiskey Insurgents, and failing to quiet them by
arguments, he was given command of the Pennsylvania Militia to
carry out the vigorous measures afterward adopted to reduce them
to order. In 1795, he settled in Philadelphia, held the position of
intendant of military stores, and was president of the Pennsylvania
Society of the Cincinnati until his death on the 9th of July, 1804.
DANIEL MORGAN.
Daniel Morgan, born in New Jersey about 1736, was of Welsh
parentage. His family having an interest in some Virginia lands, he
went to that colony at seventeen years of age. When Braddock
began his march against Fort Duquesne, Morgan joined the army as
a teamster, and did good service at the rout of the English army at
Monongahela, by bringing away the wounded. Upon returning from
this disastrous campaign, he was appointed ensign in the colonial
service, and soon after was sent with important despatches to a
distant fort. Surprised by the Indians, his two companions were
instantly killed, while he received a rifle-ball in the back of his neck,
which shattered his jaw and passed through his left cheek, inflicting
the only severe wound he received during his entire military career.
Believing himself about to die, but determined that his scalp should
not fall into the hands of his assailants, he clasped his arms around
his horse’s neck and spurred him forward. An Indian followed in hot
pursuit; but finding Morgan’s steed too swift for him, he threw his
tomahawk, hoping to strike his victim. Morgan, however, escaped and
reached the fort, but was lifted fainting from the saddle and was not
restored to health for six months. In 1762, he obtained a grant of
land near Winchester, Virginia, where he devoted himself to farming
and stock-raising. Summoned again to military duty, he served
during the Pontiac War, but from 1765 to 1775 led the life of a
farmer, and acquired during this period much property.
The first call to arms in the Revolutionary struggle found Morgan
ready to respond; recruits flocked to his standard; and at the head
of a corps of riflemen destined to render brilliant service, he
marched away to Washington’s camp at Cambridge. Montgomery
was already in Canada, and when Arnold was sent to co-operate
with him, Morgan eagerly sought for service in an enterprise so
hazardous and yet so congenial. At the storming of Quebec, Morgan
and his men carried the first barrier, and could they have been
reinforced, would no doubt have captured the city. Being opposed by
overwhelming numbers, and their rifles being rendered almost
useless by the fast-falling snow, after an obstinate resistance they
were forced to surrender themselves prisoners-of-war. Morgan was
offered the rank of colonel in the British army, but rejected the offer
with scorn. Upon his exchange, Congress gave him the same
rank in the Continental army, and placed a rifle brigade of five
hundred men under his command.
For three years Morgan and his men rendered such valuable
service that even English writers have borne testimony to their
efficiency. In 1780, a severe attack of rheumatism compelled him to
return home. On the 31st of October of the same year, Congress
raised him to the rank of brigadier-general; and his health being
somewhat restored, he joined General Greene, who had assumed
command of the Southern army. Much of the success of the
American arms at the South, during this campaign, must be
attributed to General Morgan, but his old malady returning, in
March, 1781, he was forced to resign. When Cornwallis invaded
Virginia, Morgan once more joined the army, and Lafayette assigned
to him the command of the cavalry. Upon the surrender of Yorktown,
he retired once more to his home, spending his time in agricultural
pursuits and the improvement of his mind. In 1794, the duty of
quelling the “Whiskey Insurrection” in Pennsylvania was intrusted to
him, and subsequently he represented his district in Congress for
two sessions. He died in Winchester on the 6th of July, 1802, and
has been called, “The hero of Quebec, of Saratoga, and of the
Cowpens; the bravest among the brave, and the Ney of the West.”
MOSES HAZEN.
Moses Hazen, born in Haverhill, Massachusetts, in 1733, served
in the French and Indian War, and subsequently settled near St.
Johns, New Brunswick, accumulating much wealth, and retaining his
connection with the British army as a lieutenant on half-pay. In
1775, having furnished supplies and rendered other assistance to
Montgomery during the Canadian campaign, the English troops
destroyed his shops and houses and carried off his personal
property. In 1776, he offered his services to Congress, who promised
to indemnify him for all loss he had sustained, and appointed him
colonel in the Second Canadian Regiment, known by the name of
“Congress’s Own,” because “not attached to the quota of any State.”
He remained in active and efficient service during the entire war,
being promoted to the rank of brigadier-general the 29th of June,
1781. At the close of the war, with his two brothers, who had also
been in the army, he settled in Vermont upon land granted to them
for their services, and died at Troy, New York, on the 30th of
January, 1802, his widow receiving a further grant of land and a
pension for life of two hundred dollars.
OTHO HOLLAND WILLIAMS.
Otho Holland Williams, born in Prince George’s County,
Maryland, in 1749, entered the Revolutionary army in 1775, as a
lieutenant. He steadily rose in rank, holding the position of adjutant-
general under Greene. Though acting with skill and gallantry on all
occasions, his fame chiefly rests on his brilliant achievement at the
battle of Eutaw Springs, where his command gained the day for the
Americans by their irresistible charge with fixed bayonets across a
field swept by the fire of the enemy. On the 9th of May, 1782, he
was made a brigadier-general, but retired from the army on the 6th
of June, 1783, to accept the appointment of collector of customs for
the State of Maryland, which office he held until his death on the
16th of July, 1800.
JOHN GREATON.
John Greaton, born in Roxbury, Massachusetts, on the 10th of
March, 1741, was an innkeeper prior to the Revolution, and an
officer of the militia of his native town. On the 12th of July, 1775, he
was appointed colonel in the regular army. During the siege of
Boston, he led an expedition which destroyed the buildings on Long
Island in Boston Harbor. In April, 1776, he was ordered to Canada,
and in the following December he joined Washington in New Jersey,
but was subsequently transferred to Heath’s division at West Point.
He served to the end of the war, and was commissioned brigadier-
general on the 7th of January, 1783. Conscientiously performing all
the duties assigned him, though unable to boast of any brilliant
achievements, he won a reputation for sterling worth and reliability.
He died in his native town on the 16th of December, 1783, the first
of the Revolutionary generals to pass away after the conclusion of
peace.
RUFUS PUTNAM.
Rufus Putnam, born in Sutton, Massachusetts, on the 9th of
April, 1738, after serving his apprenticeship as a millwright, enlisted
as a common soldier in the Provincial army in 1757. At the close of
the French and Indian War, he returned to Massachusetts, married,
and settled in the town of New Braintree as a miller. Finding a
knowledge of mathematics necessary to his success, he devoted
much time to mastering that science. In 1773, having gone to
Florida, he was appointed deputy-surveyor of the province by the
governor. A rupture with Great Britain becoming imminent, he
returned to Massachusetts in 1775, and was appointed lieutenant in
one of the first regiments raised in that State after the battle of
Lexington. His first service was the throwing up of defences in front
of Roxbury. In 1776, he was ordered to New York and superintended
the defences in that section of the country and the construction of
the fortifications at West Point. In August, Congress appointed him
engineer with the rank of colonel. He continued in active service,
sometimes as engineer, sometimes as commander, and at others as
commissioner for the adjustment of claims growing out of the war,
until the disbanding of the army, being advanced to the rank of
brigadier-general on the 7th of January, 1783.
After the close of the war, Putnam held various civil offices in his
native State, acted as aid to General Lincoln during Shays’ Rebellion
in 1786, was superintendent of the Ohio Company, founded the
town of Marietta in 1788, was appointed in 1792 brigadier-general of
the forces sent against the Indians of the Northwest, concluded an
important treaty with them the same year, and resigned his
commission on account of illness in 1793. During the succeeding ten
years, he was Surveyor-General of the United States, when his
increasing age compelled him to withdraw from active employment,
and he retired to Marietta, where he died on the 1st of May, 1824.
ELIAS DAYTON.
Elias Dayton, born in Elizabethtown, New Jersey, in July, 1737,
began his military career by joining Braddock’s forces, and fought in
the “Jersey Blues” under Wolfe at Quebec. Subsequently he
commanded a company of militia in an expedition against the
Indians, and at the beginning of the Revolution was a member of
the Committee of Safety. In July, 1775, he was with the party under
Lord Stirling that captured a British transport off Staten Island. In
1776, he was ordered to Canada; but upon reaching Albany he was
directed to remain in that part of the country to prevent any hostile
demonstration by the Tory element. In 1777, he ranked as colonel of
the Third New Jersey Regiment, and in 1781, he materially aided in
suppressing the revolt in the New Jersey line. Serving to the end of
the war, he was promoted to be a brigadier-general the 7th of
January, 1783. Returning to New Jersey upon the disbanding of the
army, he was elected president of the Society of the Cincinnati of
that State, and died in his native town on the 17th of July, 1807.
COUNT ARMAND.
Armand Tuffin, Marquis de la Rouarie, born in the castle of
Rouarie near Rennes, France, on the 14th of April, 1756, was
admitted in 1775 to be a member of the body-guard of the French
king. A duel led to his dismissal shortly after. Angry and mortified, he
attempted suicide, but his life was saved; and in May, 1777, he came
to the United States, where he entered the Continental army under
the name of Count Armand. Being granted leave to raise a partisan
corps of Frenchmen, he served with credit and great ability under
Lafayette, Gates, and Pulaski. At the reorganization of the army in
1780, Washington proposed Armand for promotion, and
recommended the keeping intact of his corps. In 1781, he was
summoned to France by his family, but returned in time to take part
in the siege of Yorktown, bringing with him clothing, arms, and
ammunition for his corps, which had been withdrawn from active
service during his absence.
After the surrender of Cornwallis, Washington again called the
attention of Congress to Armand’s meritorious conduct, and he at
last received his promotion as brigadier-general on the 26th of
March, 1783. At the close of the war he was admitted as a member
of the Society of the Cincinnati, and with warmest recommendations
from Washington returned to his native country and lived privately
until 1788, when he was elected one of twelve deputies to intercede
with the king for the continuance of the privileges of his native
province of Brittany. For this he was confined for several weeks in
the Bastille. Upon his release he returned to Brittany, and in 1789,
denounced the principle of revolution and proposed a plan for the
union of the provinces of Brittany, Anjou, and Poitou, and the raising
of an army to co-operate with the allies. These plans being approved
by the brothers of Louis XVI., in December, 1791, Rouarie was
appointed Royal Commissioner of Brittany. In March of the year
following, the chiefs of the confederation met at his castle; and all
was ready for action when they were betrayed to the legislative
assembly, and troops were sent to arrest the marquis. He succeeded
in eluding them for several months, when he was attacked by a fatal
illness and died in the castle of La Guyomarais near Lamballe, on the
30th of January, 1793.
THADDEUS KOSCIUSKO.
Thaddeus Kosciusko, born near Novogrodek, Lithuania, on the
12th of February, 1746, was descended from a noble Polish family.
Studying at first in the military academy at Warsaw, he afterward
completed his education in France. Returning to his native country,
he entered the army and rose to the rank of captain. Soon after
coming to America, he offered his services to Washington as a
volunteer in the cause of American independence. Appreciating his
lofty character and fine military attainments, Washington made him
one of his aids, showing the high estimation in which he held the
gallant Pole.
Taking part in several great battles in the North, Kosciusko there
proved his skill and courage, and was ordered to accompany Greene
to the South when that general superseded Gates in 1781. Holding
the position of chief engineer, he planned and directed all the
besieging operations against Ninety-Six. In recognition of these
valuable services, he received from Congress the rank of brigadier-
general in the Continental army on the 13th of October, 1783.
Serving to the end of the war, he shared with Lafayette the honor of
being admitted into the Society of the Cincinnati. Returning to
Poland in 1786, he entered the Polish army upon its reorganization in
1789, and fought valiantly in behalf of his oppressed country.
Resigning his commission, he once more became an exile, when the
Russians triumphed, and the second partition of Poland was agreed
upon.
Two years later, however, when the Poles determined to resume
their struggle for freedom, Kosciusko returned, and in March, 1794,
was proclaimed director and generalissimo. With courage, patience
and skill, that justified the high esteem in which he had been held in
America, he directed his followers while they waged the unequal
strife. Successful at first, he broke the yoke of tyranny from the
necks of his down-trodden countrymen, and for a few short weeks
beheld his beloved country free. But with vastly augmented numbers
the enemy once more invaded Poland; and in a desperate conflict
Kosciusko, covered with wounds, was taken prisoner, and the
subjugation of the whole province soon followed. He remained a
prisoner for two years until the accession of Paul I. of Russia. In
token of his admiration, Paul wished to present his own sword to
Kosciusko; but the latter refused it, saying, “I have no more need of
a sword, as I have no longer a country,” and would accept nothing
but his release from captivity. He visited France and England, and in
1797 returned to the United States, from which country he received
a pension, and was everywhere warmly welcomed. The following
year he returned to France, when his countrymen in the French
army presented him with the sword of John Sobieski. Purchasing a
small estate, he devoted himself to agriculture.
In 1806, when Napoleon planned the restoration of Poland,
Kosciusko refused to join in the undertaking, because he was on his
parole never to fight against Russia. He gave one more evidence
before his death of his love of freedom and sincere devotion to her
cause, by releasing from slavery all the serfs on his own estate in his
native land. In 1816, he removed to Switzerland, where he died on
the 15th of October, 1817, at Solothurn. The following year his
remains were removed to Cracow, and buried beside Sobieski, and
the people, in loving remembrance of his patriotic devotion, raised a
mound above his grave one hundred and fifty feet high, the earth
being brought from every great battle-field in Poland. This country
paid its tribute of gratitude by erecting a monument to his memory
at West Point on the Hudson.
STEPHEN MOYLAN.
Stephen Moylan, born in Ireland in 1734, received a good
education in his native land, resided for a time in England, and then
coming to America, travelled extensively, and finally became a
merchant in Philadelphia. He was among the first to hasten to the
camp at Cambridge in 1775, and was at once placed in the
Commissariat Department. His face and manners attracting
Washington, he was selected March 5, 1776, to be aide-de-camp,
and on the 5th of June following, on recommendation of the
commander-in-chief, he was made quartermaster-general. Finding
himself unable to discharge his duties satisfactorily, he soon after
resigned to enter the ranks as a volunteer. In 1777 he commanded a
company of dragoons, was in the action at Germantown, and
wintered with the army at Valley Forge in 1777 and 1778. With
Wayne, Moylan joined the expedition to Bull’s Ferry in 1780, and was
with Greene in the South in 1781. He served to the close of the war,
being made brigadier-general by brevet the 3d of November, 1783.
After the disbanding of the army, he resumed business in
Philadelphia, where he died on the 11th of April, 1811, holding for
several years prior to his decease the office of United States
commissioner of loans.
SAMUEL ELBERT.
Samuel Elbert, born in Prince William parish, South Carolina, in
1743, was left an orphan at an early age, and going to Savannah,
engaged in commercial pursuits. In June, 1774, he was elected
captain of a company of grenadiers, and later was a member of the
local Committee of Safety. In February, 1776, he entered the
Continental army as lieutenant-colonel of Lachlan McIntosh’s
brigade, and was promoted to colonel during the ensuing
September. In May of the year following, he was intrusted with the
command of an expedition against the British in East Florida, and
captured Fort Oglethorpe in that State in April of 1778. Ordered to
Georgia, he behaved with great gallantry when an attack was made
on Savannah by Col. Archibald Campbell in December of the same
year. In 1779, after distinguishing himself at Brier Creek, he was
taken prisoner, and when exchanged joined the army under
Washington, and was present at the surrender of Lord Cornwallis.
On the 3d of November, 1783, Congress brevetted him brigadier-
general, and in 1785 he was elected Governor of Georgia. In further
acknowledgment of his services in her behalf, that State
subsequently appointed him major-general of her militia, and named
a county in his honor. He died in Savannah on the 2d of November,
1788.
CHARLES COTESWORTH PINCKNEY.
Charles Cotesworth Pinckney, born at Charleston, South
Carolina, on the 25th of February, 1746, was educated in England.
Having qualified himself for the legal profession, he returned to his
native State and began the practice of law in 1770, soon gaining an
enviable reputation and being appointed to offices of trust and great
responsibility under the crown. The battle of Lexington, however,
changed his whole career. With the first call to arms, Pinckney took
the field, was given the rank of captain, June, 1775, and entered at
once upon the recruiting service. Energetic and efficient, he gained
promotion rapidly, taking part as colonel in the battle at Fort
Sullivan. This victory securing peace to South Carolina for two years,
he left that State to join the army under Washington, who,
recognizing his ability, made him aide-de-camp and subsequently
honored him with the most distinguished military and civil
appointments. When his native State again became the theatre of
action, Pinckney hastened to her defence, and once more took
command of his regiment. In all the events that followed, he bore
his full share, displaying fine military qualities and unwavering faith
in the ultimate triumph of American arms.
At length, after a most gallant resistance, overpowered by vastly
superior numbers, and undermined by famine and disease,
Charleston capitulated in May, 1780, and Pinckney became a
prisoner-of-war and was not exchanged until 1782. On the 3d of
November of the year following, he was promoted to be brigadier-
general. Impoverished by the war, he returned to the practice of law
upon the restoration of peace; and after declining a place on the
Supreme Bench, and the secretaryship, first of War and then of
State, he accepted the mission to France in 1796, urged to this step
by the request of Washington and the conviction that it was his duty.
Arriving in Paris, he met the intimation that peace might be secured
with money by the since famous reply, “Not one cent for tribute, but
millions for defence!” The war with France appearing inevitable, he
was recalled and given a commission as major-general; peace being
restored without an appeal to arms, he once more retired to the
quiet of his home, spending the chief portion of his old age in the
pursuits of science and the pleasures of rural life, though taking part
when occasion demanded in public affairs. He died in Charleston on
the 16th of August, 1825, in the eightieth year of his age.
WILLIAM RUSSELL.
William Russell, born in Culpeper County, Virginia, in 1758,
removed in early boyhood with his father to the western frontier of
that State. When only fifteen years of age, he joined the party led by
Daniel Boone, to form a settlement on the Cumberland River. Driven
back by the Indians, Boone persevered; but Russell hastened to
enter the Continental army; and he received, young as he was, the
appointment of lieutenant. After the battle of King’s Mountain in
1780, he was promoted to a captaincy, and ordered to join an
expedition against the Cherokee Indians, with whom he succeeded
in negotiating a treaty of peace. On the 3d of November, 1783, he
received his commission as brigadier-general.
At the close of the war Russell went to Kentucky and bore an
active part in all the expeditions against the Indians, until the
settlement of the country was accomplished. In 1789, he was a
delegate to the Virginia Legislature that passed an act separating
Kentucky from that State. After the organization of the Kentucky
government Russell was annually returned to the Legislature until
1808, when he was appointed by President Madison colonel of the
Seventh United States Infantry. In 1811, he succeeded Gen. William
Henry Harrison in command of the frontier of Indiana, Illinois, and
Missouri. In 1812, he planned and commanded an expedition against
the Peoria Indians, and in 1823 was again sent to the Legislature.
The following year he declined the nomination for governor, and died
on the 3d of July, 1825, in Fayette County, Kentucky. Russell County
of that State is named in his honor.
FRANCIS MARION.
Francis Marion, born at Winyah, near Georgetown, South
Carolina, in 1732, was of Huguenot descent; his ancestors, fleeing
from persecution in France, came to this country in 1690. Small in
stature and slight in person, he possessed a power of endurance
united with remarkable activity rarely surpassed. At the age of
fifteen, yielding to a natural love of enterprise, he went to sea in a
small schooner employed in the West India trade. Being
shipwrecked, he endured such tortures from famine and thirst as to
have prevented his ever wishing to go to sea again. After thirteen
years spent in peaceful tilling of the soil, he took up arms in defence
of his State against the Cherokee Indians. So signal a victory was
gained by the whites at the town of Etchoee, June 7, 1761, that this
tribe never again seriously molested the settlers. Returning to his
home after this campaign, Marion resumed his quiet life until in 1775
he was elected a member of the Provincial Congress of South
Carolina. This Congress solemnly pledged the “people of the State to
the principles of the Revolution, authorized the seizing of arms and
ammunition, stored in various magazines belonging to the crown,
and passed a law for raising two regiments of infantry and a
company of horse.” Marion resigned his seat in Congress, and
applying for military duty, was appointed captain. He undertook the
recruiting and drilling of troops, assisted at the capture of Fort
Johnson, was promoted to the rank of major, and bore his full share
in the memorable defence of Fort Moultrie on Sullivan’s Island, which
saved Charleston and secured to South Carolina long exemption
from the horrors of war. Little was done at the South for the next
three years, when in 1779 the combined French and American forces
attempted the capture of Savannah. Marion was in the hottest of the
fight; but the attack was a failure, followed in 1780 by the loss of
Charleston. Marion escaped being taken prisoner by an accident that
placed him on sick leave just before the city was invested by the
British. The South was now overrun by the enemy; cruel outrages
were everywhere perpetrated; and the defeat of the Americans at
Camden seemed to have quenched the hopes of even the most
sanguine. Four days after the defeat of Gates, Marion began
organizing and drilling a band of troopers subsequently known as
“Marion’s Brigade.” Though too few in number to risk an open battle,
they succeeded in so harassing the enemy that several expeditions
were fitted out expressly to kill or capture Marion, who, because of
the partisan warfare he waged and the tactics he employed, gained
the sobriquet of the “Swamp Fox.” Again and again he surprised
strong parties of the British at night, capturing large stores of
ammunition and arms, and liberating many American prisoners. He
was always signally active against the Tories, for he well knew their
influence in depressing the spirit of liberty in the country. When
Gates took command of the Southern army, he neither appreciated
nor knew how to make the best use of Marion and his men. South
Carolina, recognizing how much she owed to his unwearying efforts
in her behalf, acknowledged her debt of gratitude by making him
brigadier-general of her Provincial troops, after the defeat of Gates
at Camden. Early in the year 1781, General Greene assumed
command of the Southern army, and entertaining a high opinion of
Marion, sent Lieutenant-Colonel Harry Lee, with his famous legion of
light-horse, to aid him. Acting in concert and sometimes
independently, these two noted leaders carried on the war vigorously
wherever they went, capturing Forts Watson and Motte, defeating
Major Frazier at Parker’s Ferry and joining Greene in time for the
battle of Eutaw Springs. When the surrender of Cornwallis practically
ended the war, Marion returned to his plantation in St. John’s parish
and soon after was elected to the Senate of South Carolina. On the
26th of February, 1783, the following resolutions were unanimously
adopted by that body:—
“Resolved, That the thanks of this House be given
Brigadier-General Marion in his place as a member of this
House, for his eminent and conspicuous services to his
country.
“Resolved, That a gold medal be given to Brigadier-
General Marion as a mark of public approbation for his great,
glorious, and meritorious conduct.”
In 1784, he was given command of Fort Johnson in Charleston
Harbor, and shortly after, he married Mary Videau, a lady of
Huguenot descent, who possessed considerable wealth and was a
most estimable character. On the 27th of February, 1795, Francis
Marion passed peacefully away, saying, “Thank God, I can lay my
hand on my heart and say that since I came to man’s estate I have
never intentionally done wrong to any.”
THOMAS SUMTER.
Thomas Sumter, born in Virginia in 1734, served in the French
and Indian War, and afterward on the Western frontier. Establishing
himself finally in South Carolina, he was appointed in March, 1776,
lieutenant-colonel of the Second Regiment of South Carolina
Riflemen, and sent to overawe the Tories and Loyalists in the interior
of the State. The comparative immunity from war secured to South
Carolina during the first years of the Revolution deprived Sumter of
any opportunity for distinguishing himself until after the surrender of
Charleston to the British in 1780. Taking refuge for a time in the
swamps of the Santee, he made his way after a while to North
Carolina, collected a small body of refugees, and presently returned
to carry on a partisan warfare against the British. His fearlessness
and impetuosity in battle gained for him the sobriquet of “the game-
cock;” and with a small band of undisciplined militia, armed with
ducking-guns, sabres made from old mill-saws ground to an edge,
and hunting-knives fastened to poles for lances, he effectually
checked the progress of the British regulars again and again,
weakened their numbers, cut off their communications, and
dispersed numerous bands of Tory militia.
Like Marion, whenever the enemy threatened to prove too
strong, Sumter and his followers would retreat to the swamps and
mountain fastnesses, to emerge again when least expected, and at
the right moment to take the British at a disadvantage. During one
of many severe engagements with Tarleton, he was dangerously
wounded and compelled for a time to withdraw from active service,
but learning Greene’s need of troops, Sumter again took the field.
After rendering valuable assistance toward clearing the South of the
British, the failure of his health again forced him to seek rest and
strength among the mountains, leaving his brigade to the command
of Marion. When once more fitted for duty, the British were in
Charleston, and the war was virtually at an end. Though Sumter’s
military career ended with the disbanding of the army, his country
still demanded his services. He represented South Carolina in
Congress from 1789 to 1793, and from 1797 to 1801; he served in
the United States Senate from 1801 to 1809, and was minister to
Brazil from 1809 to 1811. He died at South Mount, near Camden,
South Carolina, on the 1st of June, 1832, the last surviving general
officer of the Revolution.
ADDENDA.
Prior to the adoption of the “federal Constitution,” partisan
feeling ran high on this side of the Atlantic,—indeed, it was no
unusual thing for a man to speak of the colony in which he was born
as his country. When the struggle for American independence
began, though men were willing to fight in defence of their own
State, there was great difficulty in filling the ranks of the Continental
army,—not only because of the longer time for which they were
required to enlist, but also because once in the Continental service,
they would be ordered to any part of the country. The same difficulty
existed in respect to securing members for the Continental Congress.
With the slowness of transportation and the uncertainty of the mails,
it was no small sacrifice for a man to leave his home, his dear ones,
and his local prestige, to become one of an unpopular body directing
an unpopular war, for it was not until near the end of the struggle
that the Revolution was espoused by the majority. It was under
these circumstances, then, that three different kinds of troops
composed the American army,—the Continentals, the Provincials,
and the Militia. The first could be ordered to any point where they
were most needed; the second, though regularly organized and
disciplined, were only liable to duty in their own State; and the last
were hastily gathered together and armed in the event of any
pressing need or sudden emergency. Washington, as stated in his
commission, was commander-in-chief of all the forces. The other
subjects of the foregoing sketches were the commanding officers of
the Continental army. Marion and Warren were famous generals of
the Provincials; while Pickens and Ten Brock were noted leaders of
the militia. Dr. Joseph Warren received his commission of major-
general from the Massachusetts Assembly just before the battle of
Active Mining New Directions of Data Mining Frontiers in Artificial Intelligence and Applications 1st Edition by Hiroshi Motoda ISBN 158603264X 9781586032647

  • 6. Frontiers in Artificial Intelligence and Applications
    Series Editors: J. Breuker, R. Lopez de Mantaras, M. Mohammadian, S. Ohsuga and W. Swartout
    Volume 79
    Volume 3 in the subseries Knowledge-Based Intelligent Engineering Systems
    Editor: L.C. Jain
    Previously published in this series:
      Vol. 78, T. Vidal and P. Liberatore (Eds.), STAIRS 2002
      Vol. 77, F. van Harmelen (Ed.), ECAI 2002
      Vol. 76, P. Sincak et al. (Eds.), Intelligent Technologies - Theory and Applications
      Vol. 75, I.F. Cruz et al. (Eds.), The Emerging Semantic Web
      Vol. 74, M. Blay-Fornarino et al. (Eds.), Cooperative Systems Design
      Vol. 73, H. Kangassalo et al. (Eds.), Information Modelling and Knowledge Bases XIII
      Vol. 72, A. Namatame et al. (Eds.), Agent-Based Approaches in Economic and Social Complex Systems
      Vol. 71, J.M. Abe and J.I. da Silva Filho (Eds.), Logic, Artificial Intelligence and Robotics
      Vol. 70, B. Verheij et al. (Eds.), Legal Knowledge and Information Systems
      Vol. 69, N. Baba et al. (Eds.), Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies
      Vol. 68, J.D. Moore et al. (Eds.), Artificial Intelligence in Education
      Vol. 67, H. Jaakkola et al. (Eds.), Information Modelling and Knowledge Bases XII
      Vol. 66, H.H. Lund et al. (Eds.), Seventh Scandinavian Conference on Artificial Intelligence
      Vol. 65, In production
      Vol. 64, J. Breuker et al. (Eds.), Legal Knowledge and Information Systems
      Vol. 63, I. Gent et al. (Eds.), SAT2000
      Vol. 62, T. Hruska and M. Hashimoto (Eds.), Knowledge-Based Software Engineering
      Vol. 61, E. Kawaguchi et al. (Eds.), Information Modelling and Knowledge Bases XI
      Vol. 60, P. Hoffman and D. Lemke (Eds.), Teaching and Learning in a Network World
      Vol. 59, M. Mohammadian (Ed.), Advances in Intelligent Systems: Theory and Applications
      Vol. 58, R. Dieng et al. (Eds.), Designing Cooperative Systems
      Vol. 57, M. Mohammadian (Ed.), New Frontiers in Computational Intelligence and its Applications
      Vol. 56, M.I. Torres and A. Sanfeliu (Eds.), Pattern Recognition and Applications
      Vol. 55, G. Cumming et al. (Eds.), Advanced Research in Computers and Communications in Education
      Vol. 54, W. Horn (Ed.), ECAI 2000
      Vol. 53, E. Motta, Reusable Components for Knowledge Modelling
      Vol. 52, In production
      Vol. 51, H. Jaakkola et al. (Eds.), Information Modelling and Knowledge Bases X
      Vol. 50, S.P. Lajoie and M. Vivet (Eds.), Artificial Intelligence in Education
      Vol. 49, P. McNamara and H. Prakken (Eds.), Norms, Logics and Information Systems
      Vol. 48, P. Navrat and H. Ueno (Eds.), Knowledge-Based Software Engineering
      Vol. 47, M.T. Escrig and F. Toledo, Qualitative Spatial Reasoning: Theory and Practice
      Vol. 46, N. Guarino (Ed.), Formal Ontology in Information Systems
      Vol. 45, P.-J. Charrel et al. (Eds.), Information Modelling and Knowledge Bases IX
    ISSN: 0922-6389
  • 7. Active Mining New Directions of Data Mining
    Edited by Hiroshi Motoda, Division of Intelligent Systems Science, The Institute of Scientific and Industrial Research, Osaka University, Osaka, Japan
    IOS Press / Ohmsha
    Amsterdam • Berlin • Oxford • Tokyo • Washington, DC
  • 8. © 2002, Hiroshi Motoda
    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior written permission from the publisher.
    ISBN 1 58603 264 X (IOS Press)
    ISBN 4 274 90521 7 C3055 (Ohmsha)
    Library of Congress Control Number: 2002106944
    Publisher: IOS Press, Nieuwe Hemweg 6B, 1013 BG Amsterdam, The Netherlands, fax: +31 206203419, e-mail: order@iospress.nl
    Distributor in the UK and Ireland: IOS Press/Lavis Marketing, 73 Lime Walk, Headington, Oxford OX3 7AD, England, fax: +44 1865750079
    Distributor in the USA and Canada: IOS Press, Inc., 5795-G Burke Centre Parkway, Burke, VA 22015, USA, fax: +1 703 323 3668, e-mail: iosbooks@iospress.com
    Distributor in Germany, Austria and Switzerland: IOS Press/LSL.de, Gerichtsweg 28, D-04103 Leipzig, Germany, fax: +49 341 995 4255
    Distributor in Japan: Ohmsha, Ltd., 3-1 Kanda Nishiki-cho, Chiyoda-ku, Tokyo 101-8460, Japan, fax: +81 3 3233 2426
    LEGAL NOTICE: The publisher is not responsible for the use which might be made of the following information.
    PRINTED IN THE NETHERLANDS
  • 9. Preface
    Our ability to collect data, be it in business, government, science, or perhaps personal life, has been increasing at a dramatic rate. However, our ability to analyze and understand massive data lags far behind our ability to collect them. The value of data is no longer in "how much of it we have". Rather, the value is in how quickly and how effectively the data can be reduced, explored, manipulated and managed.
    Knowledge Discovery and Data Mining (KDD) is an emerging technique that extracts implicit, previously unknown, and potentially useful information (or patterns) from data. Recent advances made through extensive studies and real-world applications reveal that, no matter how powerful computers are now or will be in the future, KDD researchers and practitioners must consider how to manage three things: ever-growing data (which is, ironically, due to the extensive use of computers and the ease of data collection), ever-increasing forms of data that different applications require us to handle, and ever-changing requirements for new data and mining targets as new evidence is collected and new findings are made. In short, the need for 1) identifying and collecting the relevant data from a huge information search space, 2) mining useful knowledge from different forms of massive data efficiently and effectively, and 3) promptly reacting to situation changes and giving necessary feedback to both the data collection and mining steps, is ever increasing in this era of information overload.
    Active mining is a collection of activities, each solving a part of the above need, but collectively achieving the various mining objectives. By "collectively achieving" we mean that the total effect exceeds the simple sum of the effects that each individual effort can bring. Said differently, a spiral effect of these three interleaved steps is the target to be pursued. To achieve this goal, the initial action is to explore mechanisms of 1) active information collection, where necessary information is effectively searched and pre-processed, 2) user-centered active mining, where various forms of information sources are effectively mined, and 3) active user reaction, where the mined knowledge is easily assessed and prompt feedback is made possible.
    This book is a joint effort from leading and active researchers in Japan on the theme of active mining. It provides a forum for a wide variety of research work, ranging from theories, methodologies and algorithms to their applications. It is a timely report on the forefront of data mining. It offers a contemporary overview of modern solutions with real-world applications, shares hard-learned experiences, and sheds light on the future development of active mining. This collection evolved from a project on active mining, and the papers in it were selected from among over 40 submissions.
    The book consists of 3 parts. Each part corresponds to one of the three mechanisms mentioned above. Namely, part I consists of chapters on Data Collection, part II on User-centered Mining, and part III on User Reaction and Interaction. Some of the chapters overlap each other but have to be placed in one of these three parts. The topics covered in the 27 chapters include online text mining, clustering for information gathering, online monitoring of Web page updates, technical term classification, active information gathering, substructure mining from Web and graph-structured data, Web community discovery and classification, spatial data mining, automatic configuration of mining tools, worst-case analysis of exceptional rule mining, data squashing applied to boosting, outlier detection, meta-learning for evidence-based medicine, knowledge acquisition from both human experts and data, data visualization, active mining in the business application world, meta analysis and many more.
    This book is intended for a wide audience, from graduate students who wish to learn basic concepts and principles of data mining to seasoned practitioners and researchers who want to take advantage of state-of-the-art developments in active mining. The book can be used as a reference to find recent techniques and their applications, as a starting point to find other related research topics on data collection, data mining and user interaction, or as a stepping stone to develop novel theories and techniques meeting the exciting challenges ahead of us. Active mining is a new direction in the knowledge discovery process for real-world applications handling huge amounts of data with actual user needs.
    Hiroshi Motoda
  • 11. Acknowledgments As the field of data mining advances, the interest in as well as the need for integrating various components intensifies for effective and successful data mining. A lot of research ensues. This book project resulted from the active mining initiatives that started during 2001 as a grant-in-aid for scientific research on priority area by the Japanese Ministry of Education, Science, Culture, Sports and Technology. We received many suggestions and support from researchers in machine learning, data mining and database communities from the very beginning of this book project. The completion of this book is particularly due to the contributors from all areas of data mining research in Japan, their ardent and creative research work. The editorial members of this project have kindly provided their detailed and constructive comments and suggestions to help clarify terms, concepts, and writing in this truly multi-disciplinary collection. I wish to express my sincere thanks to the following members: Numao Masayuki, Yukio Ohsawa, Einoshin Suzuki, Takao Terano, Shusaku Tsumoto and Takahira Yamaguchi. We are also grateful to the editorial staff of IOS Press, especially Carry Koolbergen and Anne Marie de Rover for their swift and timely help in bringing this book to a successful conclusion. During the process of this book development, I was generously supported by our colleagues and friends at Osaka University.
  • 13. Contents
    Preface, Hiroshi Motoda
    Acknowledgments
    I. Data Collection
      Toward Active Mining from On-line Scientific Text Abstracts Using Pre-existing Sources, TuanNam Tran and Masayuki Numao
      Data Mining on the WAVEs - Word-of-mouth-Assisting Virtual Environments, Masayuki Numao, Masashi Yoshida and Yusuke Ito
      Immune Network-based Clustering for WWW Information Gathering/Visualization, Yasufumi Takama and Kaoru Hirota
      Interactive Web Page Retrieval with Relational Learning-based Filtering Rules, Masayuki Okabe and Seiji Yamada
      Monitoring Partial Update of Web Pages by Interactive Relational Learning, Seiji Yamada and Yuki Nakai
      Context-based Classification of Technical Terms Using Support Vector Machines, Masashi Shimbo, Hiroyasu Yamada and Yuji Matsumoto
      Intelligent Tickers: An Information Integration Scheme for Active Information Gathering, Yasukiro Kitamura
    II. User Centered Mining
      Discovery of Concept Relation Rules Using an Incomplete Key Concept Dictionary, Shigeaki Sakurai, Yumi Ichimura and Akihiro Suyama
      Mining Frequent Substructures from Web, Kenji Abe, Shinji Kawasoe, Tatsuya Asai, Hiroki Arimura, Hiroshi Sakamoto and Setsuo Arikawa
      Towards the Discovery of Web Communities from Input Keywords to a Search Engine, Tsuyoshi Murata
      Temporal Spatial Index Techniques for OLAP in Traffic Data Warehouse, Hiroyuki Kawano
      Knowledge Discovery from Structured Data by Beam-wise Graph-Based Induction, Takashi Matsuda, Hiroshi Motoda, Tetsuya Yoshida and Takashi Washio
      PAGA Discovery: A Worst-Case Analysis of Rule Discovery for Active Mining, Einoshin Suzuki
      Evaluating the Automatic Composition of Inductive Applications Using StatLog Repository of Data Set, Hidenao Abe and Takahira Yamaguchi
      Fast Boosting Based on Iterative Data Squashing, Yuta Choki and Einoshin Suzuki
      Reducing Crossovers in Reconciliation Graphs Using the Coupling Cluster Exchange Method with a Genetic Algorithm, Hajime Kitakami and Yasuma Mori
      Outlier Detection using Cluster Discriminant Analysis, Arata Sato, Takashi Suenaga and Hitoshi Sakano
  • 14. III. User Reaction and Interaction
      Evidence-Based Medicine and Data Mining: Developing a Causal Model via Meta-Learning Methodology, Masanori Inada and Takao Terano
      KeyGraph for Classifying Web Communities, Yukio Ohsawa, Yutaka Matsuo, Naohiro Natsumura, Hirotaka Soma and Masaki Usui
      Case Generation Method for Constructing an RDR Knowledge Base, Keisei Fujiwara, Tetsuya Yoshida, Hiroshi Motoda and Takashi Washio
      Acquiring Knowledge from Both Human Experts and Accumulated Data in an Unstable Environment, Takuya Wada, Tetsuya Yoshida, Hiroshi Motoda and Takashi Washio
      Active Participation of Users with Visualization Tools in the Knowledge Discovery Process, Tu Bao Ho, Trong Dung Nguyen, Duc Dung Nguyen and Saori Kawasaki
      The Future Direction of Active Mining in the Business World, Katsutoshi Yada
      Topographical Expression of a Rule for Active Mining, Takashi Okada
      The Effect of Spatial Representation of Information on Decision Making in Purchase, Hiroko Shoji and Koichi Hori
      A Hybrid Approach of Multiscale Matching and Rough Clustering to Knowledge Discovery in Temporal Medical Databases, Shoji Hirano and Shusaku Tsumoto
      Meta Analysis for Data Mining, Shusaku Tsumoto
    Author Index
  • 17. Active Mining, H. Motoda (Ed.), IOS Press, 2002
Toward Active Mining from On-line Scientific Text Abstracts Using Pre-existing Sources
TuanNam Tran and Masayuki Numao
tt-nam@nm.cs.titech.ac.jp, numao@cs.titech.ac.jp
Department of Computer Science, Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro-ku, Tokyo 152-8552, JAPAN
Abstract. As biomedical research enters the post-genome era and most new information relevant to biology research is still recorded as free text, there is a rapidly growing need to extract information from biological literature databases such as MEDLINE. Unlike previous work, this paper presents a framework for mining MEDLINE that makes use of a pre-existing biological database on a yeast called S.cerevisiae. Our framework takes an active mining perspective and consists of two tasks: an information retrieval task that actively selects articles in accordance with users' interests, and a text data mining task that uses association rule mining and term extraction techniques. The preliminary results indicate that the proposed method may be useful for consistency checking and error detection in the annotation of MeSH terms in MEDLINE records. The approach of combining information retrieval based on pre-existing databases with text data mining could be extended to other fields such as Web mining.
1 Introduction
Because of the rapid growth of computer hardware and network technologies, a vast amount of information can be accessed through a variety of databases and sources. Biology research plays an essential role in this century, producing a large number of papers and on-line databases in this field. However, even though the number and the size of sequence databases are growing rapidly, most new information relevant to biology research is still recorded as free text.
As biomedical research enters the post-genome era, new kinds of databases that contain information beyond simple sequences are needed, for example, information on protein-protein interactions, gene regulation etc. Most early work on literature data mining for biology concentrated on analytical tasks such as identifying protein names [5], on simple techniques such as word co-occurrence [12] and pattern matching [8], or on more general natural language parsers that could handle considerably more complex sentences [9], [15]. In this paper, a different approach is proposed for literature data mining from MEDLINE, a biomedical literature database which contains a vast amount of useful information on medicine and bioinformatics. Our approach is based on active mining, which focuses on active information gathering and data mining in accordance with the purposes and interests of the users. In detail, our current system contains two subtasks: the first exploits existing databases and machine learning techniques to select useful articles, and the second uses association rule mining and term extraction techniques to conduct text data mining on the set of documents obtained by the first task.
  • 18. T. Tran and M. Numao / Toward Active Mining
The remainder of this paper is organized as follows. Section 2 gives a brief overview of literature data mining. Section 3 describes in detail the task of making use of existing databases to retrieve relevant documents (the information retrieval task). Given the results obtained in Section 3, Section 4 introduces the text mining task using association rule mining and term extraction. Section 5 describes some directions for future work. Finally, Section 6 presents our conclusions.
2 Overview on literature data mining for biology
In this section we give a brief overview of current work on literature data mining for biology. As described above, even though the number and the size of sequence databases are growing rapidly, most new information relevant to biology research is still recorded as free text. As a result, biologists need the information contained in text to integrate information across articles and to update databases. Current automated natural language systems can be classified as information retrieval systems (which return documents relevant to a subject), information extraction systems (which identify entities or relations among entities in text) and question answering systems (which answer factual questions using large document collections). However, it should be noted that most of these systems work on newswire, and text mining for biology is considered to be harder because the syntax is more complex, new terms are introduced constantly and there is confusion between genes and proteins [6].
On the other hand, since natural language processing offers the tools to make information in text accessible, an increasing number of groups are working on natural language processing for biology. Fukuda et al. [5] attempt to identify protein names in biological papers.
Andrade and Valencia [2] also concentrate on the extraction of keywords, not on mining factual assertions. There have been many approaches to the extraction of factual assertions using natural language processing techniques such as syntactic parsing. Sekimizu et al. [11] attempt to generate database entries automatically from relations extracted from MEDLINE abstracts. Their approach is to parse, determine noun phrases, spot the frequently occurring verbs and choose the most likely subject and object from the candidate NPs in the surrounding text. Rindflesch [10] uses a stochastic part-of-speech tagger to generate an underspecified syntactic parse and then uses semantic and pragmatic information to construct its assertions. This system can only extract mentions of well-characterized genes, drugs and cell types, not the interactions among them. Thomas et al. [13] use an existing information extraction system called SRI's Highlight for gathering data on protein interactions. Their work concentrates on finding relations directly between proteins. Blaschke et al. [3] attempt to generate functional relationship maps from abstracts; however, their method requires a pre-defined list of all named entities and cannot handle syntactically complex sentences.
3 Retrieving relevant documents by making use of existing database
We describe our information retrieval task, which can be considered as a specific task for retrieving relevant documents from MEDLINE. Current systems for accessing MEDLINE such as PubMed (footnote 1: http://www.ncbi.nlm.nih.gov/PubMed/) accept keyword-based queries to text sources and return documents that are hopefully relevant to the query.
  • 19. Since MEDLINE contains an enormous number of papers and the current MEDLINE search engines are keyword-based, the number of returned documents is often large, and many of them are in fact not relevant. Our approach to this issue is to make use of existing databases of organisms such as S.cerevisiae together with supervised machine learning techniques. Figure 1 illustrates the information retrieval task. In this figure, the YPD database (Yeast Protein Database) is a biological database which contains genetic functions and other characteristics of a yeast called S.cerevisiae. Given a certain organism X, the goal of this task is to retrieve its relevant documents, i.e. documents containing useful genetic information for biological research.
[Figure 1: Outline of the information retrieval task. Labels: Collection of S.cerevisiae (MS); Negative Examples (MS-YS); Collection of target organism (MX).]
Let MX and MS be the sets of documents retrieved from MEDLINE by querying for the target organism X and for S.cerevisiae respectively (without any machine learning filtering), and let YS be the set of documents found by querying for the YPD terms for S.cerevisiae (YS is omitted from Figure 1 for simplicity). The positive examples are collected as the intersection of MS and YS, and the negative examples as the difference MS-YS. Given the training examples, OX is the output set of documents obtained by applying a Naive Bayes classifier to MX.
3.1 Naive Bayes classifier
Naive Bayes classifiers [7] are among the most successful known algorithms for learning to classify text documents. A naive Bayes classifier is constructed by using the training data to estimate the probability of each category given the document feature values of a new instance.
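The data flow of Figure 1 can be illustrated with a small sketch: positives are taken as MS ∩ YS, negatives as MS - YS, and a classifier trained on them filters MX into OX. This is a minimal illustration, not the authors' implementation. The toy corpus, the document ids, the bag-of-words features, the add-one smoothing and the label names are all assumptions introduced here.

```python
import math
from collections import Counter

def train_naive_bayes(docs_by_class):
    """Train a multinomial Naive Bayes model from {label: [token lists]}."""
    vocab = {tok for docs in docs_by_class.values() for doc in docs for tok in doc}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter(tok for doc in docs for tok in doc)
        n_tokens = sum(counts.values())
        model[label] = {
            "log_prior": math.log(len(docs) / total_docs),
            # Add-one (Laplace) smoothing; an assumption, not stated in the paper.
            "log_lik": {t: math.log((counts[t] + 1) / (n_tokens + len(vocab)))
                        for t in vocab},
        }
    return model

def classify(model, tokens):
    """Return the label with the highest log posterior; unseen tokens are skipped."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_lik"][t] for t in tokens if t in params["log_lik"])
    return max(model, key=score)

# Hypothetical toy corpus: document id -> tokens.
corpus = {
    "d1": ["gene", "kinase", "mutant"],
    "d2": ["gene", "protein", "kinase"],
    "d3": ["brewing", "wine", "fermentation"],
    "d4": ["bread", "fermentation", "industry"],
    "x1": ["kinase", "mutant", "protein"],
    "x2": ["wine", "industry"],
}
MS = {"d1", "d2", "d3", "d4"}   # MEDLINE hits for S.cerevisiae
YS = {"d1", "d2"}               # hits for the YPD terms
MX = {"x1", "x2"}               # MEDLINE hits for the target organism X

positives = MS & YS             # intersection of MS and YS
negatives = MS - YS             # difference MS-YS

model = train_naive_bayes({
    "relevant": [corpus[d] for d in positives],
    "irrelevant": [corpus[d] for d in negatives],
})
OX = {d for d in MX if classify(model, corpus[d]) == "relevant"}
print(sorted(OX))
```

On this toy data only `x1` shares vocabulary with the positive examples, so it is the only document kept in `OX`.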
The probability that an instance d belongs to a class $c_k$ is estimated by Bayes' theorem as follows:
$$P(C = c_k \mid d) = \frac{P(C = c_k)\, P(d \mid C = c_k)}{P(d)}$$
Since $P(d \mid C = c_k)$ is often impractical to compute without simplifying assumptions, the Naive Bayes classifier assumes that the features $X_1, X_2, \ldots, X_n$ are conditionally independent, given the category variable C. As a result:
$$P(d \mid C = c_k) = \prod_{i=1}^{n} P(X_i \mid C = c_k)$$
  • 20. 3.2 Experimental results of information retrieval task
Our experiments use YPD as the existing database. From this database we obtain 14572 articles pertaining to S.cerevisiae. For the target organisms, we initially collect 3073 and 8945 articles for two yeasts called Pombe and Candida respectively. After conducting the experiments as in Figure 1, we obtain outputs containing 1764 and 285 articles for Pombe and Candida respectively. A fixed number of documents (50 in this experiment) is sampled at random from each dataset and checked by hand for relevance. Figure 2 shows the recall-precision curves for Pombe and Candida. The figure shows that the machine learning approach remarkably improved the precision. The recall for Candida is lower than for Pombe because Pombe has more genetic characteristics in common with S.cerevisiae than Candida does.
[Figure 2: Recall-Precision curve for Pombe and Candida]
4 Mining MEDLINE by combining term extraction and association rule mining
In this section, we attempt to mine the set of MEDLINE documents obtained in the previous section by combining term extraction and association rule mining. The text mining task on the collected dataset consists of two main modules: the Term Extraction module and the Association-Rule Generation module. The Term Extraction module itself includes the following stages:
• XML translation: This stage translates the MEDLINE record from HTML form into an XML-like form, with some pre-processing of punctuation.
• Part-of-speech tagging: The rule-based Brill part-of-speech tagger [4] is used to tag the title and the abstract.
  • 21.
• Term Generation: Sequences of tagged words are selected as potential term candidates on the basis of relevant morpho-syntactic patterns (such as "Noun Noun", "Noun Adjective Noun", "Adjective Noun", "Noun Preposition Noun" etc). For example, "in vivo" and "saccharomyces cerevisiae" are terms extracted at this stage.
• Stemming: A stemming algorithm is used to map variations of the same word to a single form, which reduces the vocabulary size.
• Term Filtering: In order to decrease the number of "bad terms", only those sentences of the abstract that contain verbs listed in the "verbs related to biological events" table in [14] are used in the Term Generation stage.
After the necessary terms have been generated by the Term Extraction module, the Association-Rule Generation module applies the Apriori algorithm [1] to the set of generated terms to produce association rules (each line of the input file of the Apriori-based program consists of all the terms extracted from one MEDLINE record in the dataset).
Figure 3 and Figure 4 each list twenty of the obtained rules, demonstrating the relationships among extracted terms for Pombe and Candida respectively. For example, the 5th rule in Figure 4 states that the rule "if aspartyl proteinases occurs in a MEDLINE record, then the record is published in the Journal of Bacteriology" has a support of 1.3% and a confidence of 100.0%. This example shows that a relation between a journal name and terms extracted from the title and the abstract has been discovered. Figures 3 and 4 also show that using multi-word terms can produce interesting rules that cannot be obtained using single words only.
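To make the support and confidence figures attached to each rule concrete, the following sketch mines rules of the form "head <- body" from per-record term sets. It is a brute-force count over term pairs, not the level-wise Apriori algorithm [1], and it only covers single-term bodies. The toy records and thresholds are invented for illustration, and support is assumed here to be the fraction of records that contain both terms; some Apriori implementations report the support of the body alone.

```python
from itertools import permutations

def mine_pair_rules(records, min_support, min_confidence):
    """Return sorted (head, body, support, confidence) tuples for 'head <- body'."""
    n = len(records)
    item_count = {}
    pair_count = {}
    for terms in records:
        for term in terms:
            item_count[term] = item_count.get(term, 0) + 1
        # Every ordered pair (body, head) of distinct terms in the record.
        for body, head in permutations(sorted(terms), 2):
            pair_count[(body, head)] = pair_count.get((body, head), 0) + 1
    rules = []
    for (body, head), both in pair_count.items():
        support = both / n                    # assumed: records with body and head
        confidence = both / item_count[body]  # estimate of P(head | body)
        if support >= min_support and confidence >= min_confidence:
            rules.append((head, body, support, confidence))
    return sorted(rules)

# Hypothetical input: one set of extracted terms per MEDLINE record.
records = [
    {"j_bacteriol", "aspartyl_proteinas", "gene_code"},
    {"j_bacteriol", "aspartyl_proteinas"},
    {"j_bacteriol", "cell_wall"},
    {"cell_wall", "moiety"},
]
rules = mine_pair_rules(records, min_support=0.5, min_confidence=1.0)
for head, body, support, confidence in rules:
    print(f"{head} <- {body} ({support:.1%}, {confidence:.1%})")
```

With these thresholds the only surviving rule is `j_bacteriol <- aspartyl_proteinas`, because the pair occurs in two of the four records and every record containing the body also contains the head.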
5 Future Work
5.1 For the information retrieval task
Although using an existing database of S.cerevisiae yields a high precision for other yeasts and organisms, the recall is still low, especially for yeasts that differ remarkably from S.cerevisiae. Since yeasts such as Candida might have many unique attributes, we may improve the recall by feeding the documents checked by hand back to the classifier and conducting the learning process again. The negative training set still contains many positive examples, so we need to reduce this noise by making use of the learning results.
5.2 For the text mining task
By combining term extraction and association rule mining, we are able to obtain interesting rules such as relations between journal names and terms, and between terms and terms. In particular, the relations between MeSH terms and "Substances" may be useful for error detection in the annotation of MeSH terms in MEDLINE records. However, the current algorithm treats extracted terms such as "cdc37_caryogamy_defect", "cdc37_in_mitosy" and "cdc37_mutat" as mutually independent. It may be necessary to construct a term taxonomy semi-automatically, for instance by letting users choose only interesting rules or terms and feed them back to the system.
  • 22.
1: fission_yeast_schizosaccharomyc_pomb <- transcript_control (0.3%, 80.0%)
2: cell_cycle <- period (0.6%, 77.8%)
3: mutant <- other_mutant (0.4%, 83.3%)
4: essenty <- gene_disrupt_expery (0.5%, 75.0%)
5: mitosy <- passag_through_start (0.3%, 80.0%)
6: transcript <- mat2-mat3_interval (0.3%, 80.0%)
7: embo_j <- p34cdc2_kinas_activity (0.5%, 75.0%)
8: nucleu <- periphery (0.3%, 80.0%)
9: structur <- function_similar (0.3%, 80.0%)
10: meiosy <- premeiot_dna_synthesy (0.5%, 75.0%)
11: meiosy <- pair (0.3%, 80.0%)
12: s_phase <- complet_of_s_phase (0.4%, 83.3%)
13: amino_acid_sequ <- alignment (0.4%, 83.3%)
14: amino_acid_sequ <- _residu (0.3%, 80.0%)
15: human <- mous_homolog (0.3%, 80.0%)
16: open_read_frame <- uninterrupt (0.4%, 83.3%)
17: subunit <- rpb2 (0.3%, 80.0%)
18: centromer <- central_core (0.4%, 83.3%)
19: centromer <- centromer_function (0.4%, 83.3%)
20: wee1 <- mik1 (0.5%, 85.7%)
Figure 3: First twenty rules obtained for the set of Pombe documents obtained in Section 3 (minimum support = 0.003, minimum confidence = 0.75)
5.3 Mutual benefits between two tasks
Gaining mutual benefits between the two tasks is also an important issue for future work. First, by applying the text mining results, we can decrease the number of documents "leaked" by the information retrieval task, which makes it possible to improve the recall.
Conversely, since the current text mining algorithm creates many unnecessary rules (from the viewpoint of biological research), it is also possible to apply the information retrieval task first to filter relevant documents and then apply the text mining task, which decreases the number of unnecessary rules and improves the quality of the text mining results.
6 Conclusions
This paper has introduced a framework for mining MEDLINE by making use of existing biological databases. Two tasks concerning information extraction from MEDLINE have been presented. The first task retrieves useful documents for biology research with high precision. Given the obtained set of documents, the second task applies association rule mining and term extraction to mine these documents. The results obtained in this paper may be useful for consistency checking and error detection in the annotation of MeSH terms in MEDLINE records. In future work, combining the two tasks may be essential to gain mutual benefits for both.
  • 23.
1: open_read_frame <- molecular_weight (1.8%, 75.0%)
2: open_read_frame <- molecular_mass (1.8%, 75.0%)
3: open_read_frame <- cdna_clone (1.3%, 100.0%)
4: virul <- growth_rate (1.8%, 75.0%)
5: j_bacteriol <- aspartyl_proteinas (1.3%, 100.0%)
6: j_bacteriol <- gene_code (1.3%, 100.0%)
7: j_bacteriol <- sucros (1.3%, 100.0%)
8: organism <- immunoelectron_microscopy (1.3%, 100.0%)
9: resist <- transport (1.8%, 75.0%)
10: similar <- hyphal_growth (1.8%, 75.0%)
11: clone <- southern_blot (1.3%, 100.0%)
12: white <- opaqu (1.8%, 75.0%)
13: white <- opaqu_phase (1.8%, 75.0%)
14: white <- opaqu_cell (1.8%, 75.0%)
15: amino_acid_sequ <- comparison (2.7%, 83.3%)
16: amino_acid_sequ <- escherichia_coly (1.8%, 75.0%)
17: amino_acid_sequ <- alignment (1.8%, 75.0%)
18: fragment <- molecular_mass (1.8%, 75.0%)
19: cell_wall <- moiety (1.3%, 100.0%)
20: cell_wall <- immunoelectron_microscopy (1.3%, 100.0%)
Figure 4: First twenty rules obtained for the set of Candida documents obtained in Section 3 (minimum support = 0.01, minimum confidence = 0.75)
References
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases, 1994.
[2] M.A. Andrade and A. Valencia. Automatic annotation for biological sequences by extraction of keywords from medline abstracts, development of a prototype system. In Proceedings of the 5th International Conference on Intelligent Systems for Molecular Biology, 1997.
[3] C. Blaschke, M.A. Andrade, C. Ouzounis, and A. Valencia. Automatic extraction of biological information from scientific text: protein-protein interactions. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, 1999.
[4] E. Brill. A simple rule-based part of speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, 1992.
[5] K. Fukuda, A. Tamura, T. Tsunoda, and T. Takagi. Toward information extraction: identifying protein names from biological papers. In Proceedings of the Pacific Symposium on Biocomputing, 1998.
[6] L. Hirschman. Mining the biomedical literature: Creating a challenge evaluation. Technical report, The MITRE Corporation, 2001.
[7] D.D. Lewis and M. Ringuette. A comparison of two learning algorithms for text categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, 1994.
[8] S. K. Ng and M. Wong. Toward routine automatic pathway discovery from on-line scientific text abstracts. Genome Informatics, 10:104 11, December 1999.
[9] J. C. Park, H. S. Kim, and J. J. Kim. Bidirectional incremental parsing for automatic pathway identification with combinatory categorial grammar. In Proceedings of the Pacific Symposium on Biocomputing, 2001.
[10] T.C. Rindflesch. Edgar: Extraction of drugs, genes and relations from the biomedical literature. In Proceedings of the Pacific Symposium on Biocomputing, 2000.
  • 24.
[11] T. Sekimizu, H.S. Park, and J. Tsujii. Identifying the interaction between genes and gene products based on frequently seen verbs in medline abstracts. Genome Informatics, pages 62-71, 1998.
[12] B. J. Stapley and G. Benoit. Biobibliometrics: Information retrieval and visualization from co-occurrences of gene names in medline abstracts. In Proceedings of the Pacific Symposium on Biocomputing, 2000.
[13] J. Thomas, D. Milward, C. Ouzounis, S. Pulman, and M. Carroll. Automatic extraction of protein interactions from scientific abstracts. In Proceedings of the Pacific Symposium on Biocomputing, 2000.
[14] J. Tsujii. Information extraction from scientific texts. In Proceedings of the Pacific Symposium on Biocomputing, 2001.
[15] A. Yakushiji, Y. Tateisi, Y. Miyao, and J. Tsujii. Event extraction from biomedical papers using a full parser. In Proceedings of the Pacific Symposium on Biocomputing, 2001.
  • 25. Active Mining, H. Motoda (Ed.), IOS Press, 2002
Data Mining on the WAVEs - Word-of-mouth-Assisting Virtual Environments
Masayuki Numao, Masashi Yoshida and Yusuke Ito
numao@cs.titech.ac.jp
http://www.nm.cs.titech.ac.jp
Department of Computer Science, Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro 152-8552, JAPAN
Abstract. Recently, computers play an important role not only in knowledge processing but also as communication media. However, they often cause trouble in communication, since it is hard for us to select only useful pieces of information. To overcome this difficulty, we propose a new tool, WAVE (Word-of-mouth-Assisting Virtual Environment), which helps us to communicate and spread information by relaying a message like Chinese whispers. This paper describes its concept, an implementation and its preliminary evaluation.
1 Introduction
Chinese whispers: a game in which a message is distorted by being passed around in a whisper (also called Russian scandal).
word of mouth: (a) oral communication or publicity; (b) done, given, etc., by speaking: oral.
- New Shorter Oxford English Dictionary
WWW and e-mail are very useful tools for communication. However, we sometimes feel uncomfortable because of flaming or mental barriers to participating in Computer-Mediated Communication (CMC). There are some important differences between CMC and direct communication [5]. Another problem is that computer networks deliver so many pieces of information that it is hard to select the useful ones. Although search engines, such as Yahoo, Goo and Google, are very useful for finding web pages, we need another type of tool that does not require a search keyword. Good candidates are mailing lists and network news systems, where we need a filtering system to select only useful messages. Although content-based filtering [6] and collaborative filtering [8] are good solutions, the current methods have not achieved high precision and recall.
This paper presents another approach by relaying a message like Chinese whispers to gather useful information, to alleviate mental barriers and to block flames.
  • 26. M. Numao et al. / Data Mining on the WAVEs
[Figure 1: Spread of information. Labels: request.]
2 Spread of information by Chinese whispers
Fig. 1 shows the spread of information by word of mouth, where each person relays a message like Chinese whispers. Although a message is distorted by being passed around in the game, in a computer-assisted environment we expect a delivered message to be the same as its original. Such a process even has the merit that, as a result of evaluation and selection by each person, it delivers only useful information. Each person knows whom (s)he should ask about a current topic and retrieves a manageable amount of information, so that only interesting information survives.
3 WAVE
To assist the spread of information by Chinese whispers, we propose a system called WAVE (Word-of-mouth-Assisting Virtual Environment) for smooth communication and information gathering. Compared to agent systems proposed to automate word of mouth [1, 2, 7, 9], WAVE is a simpler tool and works as directed by the user, except for a separate recommendation module. The authors believe that, in most situations, a simple and intuitive tool is better than a complicated automated tool, since users can easily construct a model of the tool. Fig. 2 shows a diagram of WAVE. The user's operations are posting, opening and reviewing an article. In addition, in a recommendation window, the system shows some good articles based on the user's log.
3.1 Posting an article
The user can post an article as shown in Fig. 3, which may contain a text and URLs of web pages or photos. (S)he gives the article an evaluation from 1 to 5 (1 for the worst and 5 for the best) and a category. The posted article is open to others as shown in Fig. 4 and can be referred to by other users, as with the WWW or a mailing list. The user can browse articles posted by her/his friends. Fig. 5 shows a list of friends. Each person is identified by an address 'user_name@host:port'. If an article is interesting, (s)he can post its review, by which (s)he relays the article to her/his friends as shown in Fig. 2. Fig. 4 shows a list of articles the user has posted or reviewed.
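The posting and relaying operations described above can be pictured with a small in-memory sketch. It is not the WAVE implementation: the class names, field names, addresses and the rule that a review stores an unchanged copy of the text in the reviewer's database are assumptions made for illustration, following the description that a user's database is visible to people who registered her/him as a friend.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Entry:
    """One article as stored in a user's database (hypothetical fields)."""
    article_id: str
    text: str
    category: str
    rating: int                         # 1 (worst) to 5 (best)
    relayed_from: Optional[str] = None  # address of the friend it came from


@dataclass
class User:
    address: str                                    # 'user_name@host:port'
    friends: Set[str] = field(default_factory=set)  # addresses registered by this user
    database: Dict[str, Entry] = field(default_factory=dict)

    def post(self, article_id: str, text: str, category: str, rating: int) -> None:
        """Store a new article with the poster's own evaluation."""
        if not 1 <= rating <= 5:
            raise ValueError("rating must be between 1 and 5")
        self.database[article_id] = Entry(article_id, text, category, rating)

    def browse(self, friend: "User") -> List[Entry]:
        """List a friend's database; only allowed for registered friends."""
        if friend.address not in self.friends:
            raise PermissionError(f"{friend.address} is not a registered friend")
        return list(friend.database.values())

    def review(self, friend: "User", article_id: str, rating: int) -> None:
        """Rate a friend's article and keep an unchanged copy, relaying it onward."""
        if not 1 <= rating <= 5:
            raise ValueError("rating must be between 1 and 5")
        self.browse(friend)  # raises if the friend is not registered
        source = friend.database[article_id]
        self.database[article_id] = Entry(
            article_id, source.text, source.category, rating,
            relayed_from=friend.address)


# Hypothetical users: C registered A as a friend, D registered C.
a = User("a@host1:8080")
c = User("c@host2:8080", friends={"a@host1:8080"})
d = User("d@host2:8080", friends={"c@host2:8080"})

a.post("art1", "A note worth sharing", "news", 5)
c.review(a, "art1", 4)   # the article now also sits in C's database
relayed = d.browse(c)    # D sees it through C, without knowing A
print([(e.article_id, e.rating, e.relayed_from) for e in relayed])
```

In this sketch D never contacts A directly; the article reaches D only because C chose to review it, which mirrors the person-by-person selection the text describes.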
  • 27. [Figure 2: Word-of-mouth-Assisting Virtual Environment] [Figure 3: Posting an article] [Figure 4: Articles posted or reviewed]
  • 28. [Figure 5: Your friends] [Figure 6: Reviews by your friend]
3.2 Open articles
Articles posted or reviewed by the user are stored in her/his database. It is open to people who registered her/him as a friend. The user can register the addresses of her/his friends, or notify another user of her/his own address. For example, if C registered A and B in her/his friends list, C can see the databases of A and B. Since each user knows her/his friends, (s)he can judge their reliability, which is very useful for selecting information from them. In addition, it is comfortable to join the community because (s)he exchanges messages only with her/his friends.
3.3 Review an article
If C is interested in an article from A in Fig. 2, C can browse its body and give an evaluation and a comment as shown in Fig. 7. After this operation, the article is automatically retrieved and stored in C's database, which is open to C's friends. Chaining this operation propagates an article. In this way, WAVE seamlessly assists opening, browsing, evaluating and retrieving an article. This saves us a lot of the time and labor of uploading, advertisement, etc. In BBSs and mailing lists, most participants feel mental barriers to posting an article. In contrast, in WAVE a user first posts an article only to her/his friends, which alleviates mental barriers. ROMs (Read Only Members) often form a bridge between two communities. WAVE is useful for activating such a bridge.
3.4 Automatic recommendation
When a user has many friends, it might be good to order articles based on a model of the user. Modeling a person is difficult since we cannot directly measure a mental state. Even if we could, by using MRI or other devices, it would still be hard to clarify the relation between a brain state and its social effects, since a person has many activities and aspects (Fig. 9).
  • 29. [Figure 7: An article] [Figure 8: Recommendation] [Figure 9: Modeling] [Figure 10: Modeling based on communication]
  • 30. [Figure 11: Recommending process]
Instead, we propose to model a relation between two persons by logging their communication. To model a relation between two persons, we need a log of communications between all combinations of persons. This causes trouble in analyzing the WWW, a news system or a mailing list. In contrast, in WAVE all communications occur only among friends. We have no combinatorial problem in analyzing communications and modeling relations, since the number of friends of one person is usually not large. Fig. 11 shows the process of ordering articles for recommendation, where C's history is analyzed based on an evaluation function to order the articles in the databases of A and B. The evaluation is based on the following factors:
• Evaluation of the article by the last reviewer.
• Evaluation of the last reviewer by the user.
• The user's preference for the category of the article.
• How old is the article?
• How many people relayed the article?
3.5 Distributed implementation
The system is implemented with Java servlets and works on a web server as shown in Fig. 12. The user first registers her/his name and password, and accesses the system by using a web browser.
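The five ordering factors listed in Section 3.4 could be combined in many ways, and the text does not give the evaluation function itself. The sketch below assumes a simple weighted sum with made-up weights, assumes that newer and more widely relayed articles should rank higher, and uses hypothetical field names throughout. It is meant only to show how such factors might be turned into an ordering.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    """An article in a friend's database, with the five factors (hypothetical fields)."""
    title: str
    reviewer_rating: int        # last reviewer's evaluation of the article, 1 to 5
    reviewer_trust: float       # the user's evaluation of that reviewer, 0 to 1
    category_preference: float  # the user's preference for the category, 0 to 1
    age_days: float             # how old the article is
    relay_count: int            # how many people relayed the article


# Assumed weights; the source does not specify how the factors are combined.
WEIGHTS: Dict[str, float] = {
    "rating": 0.3, "trust": 0.3, "category": 0.2, "freshness": 0.1, "relays": 0.1,
}


def score(c: Candidate, weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the five factors, each scaled to the range 0 to 1."""
    rating = (c.reviewer_rating - 1) / 4
    freshness = 1 / (1 + c.age_days)                  # newer ranks higher (assumption)
    relays = c.relay_count / (1 + c.relay_count)      # more relays rank higher (assumption)
    return (weights["rating"] * rating
            + weights["trust"] * c.reviewer_trust
            + weights["category"] * c.category_preference
            + weights["freshness"] * freshness
            + weights["relays"] * relays)


def recommend(candidates: List[Candidate], top_n: int = 3) -> List[Candidate]:
    """Return the top_n candidates, highest score first."""
    return sorted(candidates, key=score, reverse=True)[:top_n]


candidates = [
    Candidate("old low-rated note", 2, 0.3, 0.2, age_days=30, relay_count=0),
    Candidate("fresh well-relayed note", 5, 0.9, 0.8, age_days=1, relay_count=3),
]
ranking = [c.title for c in recommend(candidates)]
print(ranking)
```

In a real system the trust and preference values would be derived from the user's log of past reviews; here they are supplied directly to keep the sketch short.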
  • 31. [Figure 12: Distributed implementation]
  • 32. [Figure 14: Two example flows of an article]
The system is easily distributed over several hosts. In Fig. 12, Mr. A registered on host1 to use the system and Ms. B registered on host2. Mr. A can see Ms. B's article by specifying her address. In this way, the system scales by being distributed over many hosts.
4 Preliminary evaluation
Thirty-three users tested the system for 20 days. The result is visualized in Fig. 13. This map is based on one drawn by KrackPlot [4], which is a program for network visualization designed for social network analysts. Each node denotes a user, and its shape denotes the number of articles (s)he posted. Here, myoshida, blankey, roy and t-sugie are opinion leaders who post many articles. A directed arc denotes that articles are retrieved and reviewed in that direction. Its thickness denotes the number of articles retrieved. In the network, we can see many triangles, each of which forms a triad whose members are strongly connected to each other. Two example flows of an article are shown in Fig. 14. One flow is drawn as a thick solid line, the other as a thick dotted line. S denotes their origin. Each attached number denotes the evaluation by each person. In most cases, the evaluation degrades as people relay an article. Each island circled in Fig. 15 shows a community the authors observed, where people know each other in real life. An article moves mainly within a community. Some people appear in multiple communities and play the role of a gatekeeper [3], who bridges information between communities.
  • 33. [Figure 15: Communities in the real life]
5 Conclusion
We have proposed a system for information propagation and gathering by relaying a message like Chinese whispers. The URL of the experimental system is: http://www.nm.cs.titech.ac.jp:12581/wom/
The authors are preparing a distribution package of the system for experiments in the distributed manner shown in Fig. 12.
References
[1] L. N. Foner. A multi-agent referral system for matchmaking. In Proceedings of the International Conference on the Practical Applications of Intelligent Agents and Multi-Agent Technology, 1996.
[2] L. N. Foner. Yenta: a multi-agent, referral-based matchmaking system. In AA-97, pages 301-307, 1997.
[3] S. Goto and H. Nojima. Analysis of the three-layered structure of information flow in human societies. Journal of Japanese Society for Artificial Intelligence (in Japanese), 8(3):348-356, 1993. This paper also appears in Artificial Intelligence.
[4] KrackPlot, URL: http://www.contrib.andrew.cmu.edu/~krack/.
[5] M. Lea. Contexts of computer-mediated communication. Harvester Wheatsheaf, pages 30-65, 1992.
  • 34.
[6] Pattie Maes. Agents that reduce work and information. CACM, 37(7):30-40, 1994.
[7] Takeshi Otani and Toshiro Minami. Searching for information resources by word of mouth. In MACC 97 (in Japanese), 1997. http://www.kecl.ntt.co.jp/csl/msrg/events/macc97/ohtani.html.
[8] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. Grouplens: An open architecture for collaborative filtering of net news. In CSCW '94, pages 175-186, 1994.
[9] U. Shardanand and P. Maes. Social information filtering: Algorithms for automating "word of mouth". In CHI, pages 210-217, 1997.
  • 35. Active Mining, H. Motoda (Ed.), IOS Press, 2002
Immune Network-based Clustering for WWW Information Gathering/Visualization
Yasufumi Takama (1,2) and Kaoru Hirota (1)
{takama,hirota}@hrt.dis.titech.ac.jp
(1) Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama 226-8502 JAPAN
(2) PREST, Japan Science and Technology Corporation, JAPAN
Abstract. A clustering method based on the immune network model is proposed to visualize the topic distribution over a document set found on the WWW. The method extracts keywords that can be used as landmarks of the major topics in a document set, and document clustering is performed with these keywords. The proposed method employs the immune network model to calculate the activation values of keywords as well as to improve the understandability of the web information visualization system. Questionnaires were used to compare the quality of the clusters produced by the proposed method with those of the k-means clustering method. The results show that the proposed method obtains better results than k-means clustering in terms of both coherence and understandability.
1 Introduction
A WWW information visualization method for finding the topic distribution in document sets is proposed. When the WWW is considered as an information resource, it has several significant characteristics, such as hugeness, dynamic nature, and hyperlinked structure, among which we focus on the fact that information on the WWW tends to be obtained by users as a set of documents. For example, there are many online news sites on the WWW, which release a set of news articles on various topics day by day. As another example, a series of retrieval processes by a user also provides the user with a sequence of document sets.
Although the hugeness of the WWW and its dynamic nature are a burden for users, they also offer a chance for business and research if users can notice trends or movements of the real world from the WWW, which cannot be found from a single document but only from a set of documents. Information visualization systems [6, 15, 16, 18] are promising approaches to help the user notice the trends of topics on the WWW. The Fish View system [15] extracts the user's viewpoint as a set of concepts, and the extracted concepts are used not only to construct a vector space that is sensitive to the user's viewpoint, but also to present the user's current viewpoint in an explicit manner. In this paper, an information visualization method based on document set-wise processing is proposed to find the topic distribution over a set of documents. One of the characteristic features of the proposed method is that it generates a keyword map while performing document clustering. That is, a landmark, which is a representative keyword on the keyword map, is found, and the documents containing the same landmark form a document cluster.
  • 36. 22 Y. Takama and K. Hirota / Immune Network-based Clustering
When landmark keywords are found based on the propagation of keywords' activation values over the keyword network, a keyword should be activated by its related keywords, while keywords related to each other should not be highly activated at the same time. To achieve this kind of nonlinear activation, the immune network model [1, 5, 7, 8] is employed to calculate the activation values of keywords.
The understandability of an information visualization system for users can be improved by employing an appropriate metaphor. From this viewpoint, the method based on the immune network model is expected to improve the understandability of the keyword map by incorporating additional information, such as a landmark and the keywords it suppresses, into the ordinary keyword map, on which only the distance between keywords is a clue to understanding the topic distribution over a document set.
The concept of the clustering method based on the immune network model and its algorithm are proposed in Section 2, followed in Section 3 by experimental results that compare the quality of the clusters generated by the proposed method with that of the clusters generated by the k-means clustering method. An application of the proposed method to information visualization / gathering systems is considered in Section 4.
2 Immune Network-based Clustering Method
2.1 Concept of Immune Network-based Clustering
Generally, information visualization systems designed for handling documents are divided into 2 types: systems based on document clustering, and keyword maps. In this paper, an information visualization system that arranges the keywords extracted from documents on a (usually) 2-D space according to their similarities is called a keyword map [6, 9, 16]. A keyword map is often adopted to visualize the topic distribution over a document set.
The clustering method [11, 12, 13, 14] proposed in this paper aims to generate a keyword map while performing document clustering. On a keyword map, the keywords relating to the same topic are assumed to gather and form a cluster. The proposed method extracts a representative keyword, called a landmark, from each cluster. As the border of keyword clusters on a keyword map is usually not obvious, another constraint for extracting a landmark is adopted from the viewpoint of document clustering. That is, when the documents containing the same landmark are classified into the same cluster, there should be no overlap among clusters. From the viewpoint of document clustering, a landmark is called a cluster identifier, because it defines the members of a document cluster. To extract a landmark (a cluster identifier) from a keyword map, the proposed method calculates an activation value for each keyword based on the interaction between keywords that relate to each other. In this paper, the immune network model is employed to calculate a keyword's activation value, as described in Section 2.2.
2.2 Immune Network Model
The immune network model was proposed by Jerne [5] to explain the functionality of an immune system, such as variety and memory. The model assumes that an antibody can become active by recognizing a related antibody as well as an antigen of a specific type. As antibodies form a network by recognizing each other, an antibody that has once recognized an invading antigen can survive after the antigen has been removed.
  • 37. Concerning the immune network model, several models have been proposed in the field of computational biology [1, 7, 8], among which one of the simplest is employed in this paper as Eqs. (1)–(5), where X_i and A_i are the concentration (activation) values of antibody i and antigen i, respectively. The s is a source term modeling a constant cell flux from the bone marrow and r is the reproduction rate of the antigen, while k_b and k_g are the decay terms of the antibody and antigen, respectively. The connection weights (each taking a value in {0, WC, SC}) indicate the strength of the connectivity between antibodies i and j, and between antibody i and antigen j, respectively. The influence on antibody i of other connected antibodies and antigens is calculated by the proliferation function (5), which has a log-bell form with the maximum proliferation rate p. Using Eq. (5) not only activates the antibody when it recognizes other antibodies or antigens, but also suppresses the antibody if the influence of other objects is too strong.
The characteristics of immune systems such as immune response and tolerance1 can be explained by the model [1, 7, 10]. The dynamics and the stability of the immune network model have been analyzed by fixing the structure or the topology of the network [1, 7, 10]. As the structure of the keyword network generated in the proposed method is defined based on the occurrence of keywords in a set of documents, the analysis noted above is not applicable. However, considering the combinations of activation states between connected antibodies leads to the following constraints [13]:
• An antibody can take one of 4 states in terms of activation value: virgin state, suppressed state, weakly-activated state, and highly-activated state.
• It is unstable for two antibodies connected to each other to both take the highly-activated state at the same time.
• When several antibodies are connected to the same highly-activated antibody, the antibodies with strong connection2 are suppressed, while those with weak connection become weakly-activated.
1 A tolerance indicates the fact that the immune system of a body does not attack the cells of oneself.
2 As noted in Section 2.3, there are two types of connections in terms of strength.
Applying such a nonlinear activation mechanism of the immune network model makes it possible to satisfy the following contradictory conditions for a landmark.
  • 38.
• A landmark should form a keyword cluster with a certain number of connected keywords.
• There should not exist any connection between landmarks.
2.3 Algorithm of Immune Network-based Clustering
In this paper, the immune network model (Eqs. (1)–(5)) is applied to the calculation of activation values of keywords, by considering a keyword as an antibody and a document as an antigen. The algorithm is as follows:
1. Extraction of keywords (nouns) from a document set using the morphological analyzer3 and the stopword list. In this paper, only the keywords contained in more than 2 documents are extracted.
2. Construction of the keyword network by connecting each extracted keyword k_i to other keywords k_j or documents d_j.
(a) Connection between k_i and k_j (D_ij indicates the number of documents containing both keywords): Strong connection (SC): D_ij > T_k. Weak connection (WC): 0 < D_ij < T_k.
(b) Connection between k_i and d_j (TF_ij indicates the term frequency of k_i in d_j): SC: TF_ij > T_d. WC: 0 < TF_ij < T_d.
3. Calculation of keywords' activation values on the constructed network, based on the immune network model (Eqs. (1)–(5)).
4. Extraction, after convergence, of the keywords whose activation is much higher than that of the others as landmarks.
5. Generation of document clusters according to the landmarks.
In Step 4, convergence means that the same set of keywords always becomes active. It is observed in most of the experiments that, after 1,000 iterations of the calculation, the same set of keywords has much higher (about 100 times) activation values than the others [11].
3 As the current system is implemented to handle Japanese documents, the Japanese morphological analyzer ChaSen (http://clia.sen.aist-nara.ac.jp/) is used to extract nouns.
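Steps 1 and 2 of the algorithm above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `build_keyword_network`, the token-list input format, and the dictionary-based network are hypothetical, and documents are assumed to be already reduced to nouns. The text as it appears here does not say how the boundary cases D_ij = T_k and TF_ij = T_d are handled; the sketch treats them as strong connections, which is an assumption. The SC and WC strengths are the values listed in Table 1.

```python
from itertools import combinations

SC, WC = 1.0, 1e-3  # connection strengths as listed in Table 1


def build_keyword_network(docs, t_k=3, t_d=3):
    """docs: list of token lists (nouns only, stopwords already removed).

    Returns (keywords, kw_links, doc_links); the structure is hypothetical.
    """
    # Step 1: keep only keywords contained in more than 2 documents.
    doc_freq = {}
    for tokens in docs:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    keywords = sorted(w for w, n in doc_freq.items() if n > 2)
    kw_set = set(keywords)

    # Step 2(a): keyword-keyword links from co-occurrence counts D_ij.
    cooc = {}
    for tokens in docs:
        present = sorted(set(tokens) & kw_set)
        for a, b in combinations(present, 2):
            cooc[(a, b)] = cooc.get((a, b), 0) + 1
    # Assumption: D_ij == T_k counts as a strong connection.
    kw_links = {pair: (SC if d >= t_k else WC) for pair, d in cooc.items()}

    # Step 2(b): keyword-document links from term frequencies TF_ij.
    doc_links = {}
    for j, tokens in enumerate(docs):
        for word in kw_set:
            tf = tokens.count(word)
            if tf > 0:
                # Assumption: TF_ij == T_d counts as a strong connection.
                doc_links[(word, j)] = SC if tf >= t_d else WC
    return keywords, kw_links, doc_links
```

Pairs that never co-occur are simply absent from `kw_links`, which corresponds to a connection strength of 0.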
  • 39.
Table 1: Parameter Settings Used in the Experiments
Parameter   Value    Parameter   Value
s           10       X_i(0)      10
r           0.01     A_i(0)      10^5
k_g         10^-4    T_k         3
k_b         0.4      T_d         3
            10^3     SC          1.0
            10^6     WC          10^-3
p           1.0
3 Experimental Results
The quality of the clusters generated by the proposed clustering method is compared with that of k-means clustering [3], whose applicability has been widely demonstrated in many applications. While k-means generates clusters so that every data item (document) in a set is covered by one of the generated clusters, the proposed method does not intend to cover all the documents. It is observed through many experiments that 60-80% of a document set is covered by the generated clusters. Therefore, it is meaningless to compare the two methods in terms of coverage. In this paper, questionnaires are conducted to compare the clusters generated by the proposed method with those generated by k-means from the following viewpoints.
• Coherence: how closely the documents within a cluster relate to each other.
• Understandability: how easily the topic of a cluster can be understood by users.
The sets of documents used for the experiments are collected from the following online news sites.
Set1 Documents in the entertainment category of the Yahoo! Japan News site4, released on September 18, 2001. The 75 keywords are extracted from 25 documents.
Set2 Documents in the entertainment category of the Yahoo! Japan News site, released on September 21, 2001. The 62 keywords are extracted from 24 documents.
Set3 Documents in the local news category of Lycos Japan5, released on September 28, 2001. The 22 keywords are extracted from 23 documents.
The parameter values used in the experiments are shown in Table 1. These values are empirically determined based on the values used in the field of computational biology [1, 7, 8]. STATISTICA2000 (Statistica Soft, Inc.) is used to perform k-means clustering.
The number of clusters generated by k-means, which has to be determined in advance, is set equal to the number of clusters generated by the proposed clustering method. Naive k-means clustering tends to generate clusters of various sizes, and sometimes a cluster containing only one document is generated; such clusters are removed from the questionnaires. The questionnaires are answered by 9 subjects, consisting of researchers and students. Each subject is asked to evaluate the clustering results of 2 document sets, one
  • 40. generated by the proposed method and another by k-means. Of course, the subjects do not know by which method each result is generated.
In the questionnaires, the documents in a cluster and the related keywords are presented for each cluster. The related keywords of the proposed method are the landmarks as well as the keywords they suppress. As for the k-means clustering method, the keywords whose weight in the cluster center is higher than that of the others are used as the related keywords. The number of related keywords of the proposed method is not fixed, while 5 related keywords are presented for each cluster in the case of k-means. Subjects rate the coherence of each cluster on 5 grades, from score 5 (closely related) to 1 (not related). As for the understandability, subjects are asked to mark the related keywords that seem to represent the topic of a cluster6.
Table 2: Comparison of Clustering Results between Proposed Method and K-means Clustering
Data  Item                      Proposed  K-means
Set1  Number of clusters        5         4
      Variance of cluster size  0.48      3.6
      Average score             4.33      3.90
      Score>3.5                 5         2
      2.5<Score<3.5             0         1
      Score<2.5                 0         1
Set2  Number of clusters        5         4
      Variance of cluster size  0.32      4.625
      Average score             3.82      3.13
      Score>3.5                 4         1
      2.5<Score<3.5             1         2
      Score<2.5                 0         1
Set3  Number of clusters        5         5
      Variance of cluster size  0.48      4.25
      Average score             2.3       4.00
      Score>3.5                 1         4
      2.5<Score<3.5             1         0
      Score<2.5                 3         1
Table 2 shows the number of clusters, the variance of cluster size, the average score of the clusters, and the score distribution of the clustering results generated by both methods from the 3 document sets. The table shows that the proposed method (Proposed) obtains better results than k-means clustering (K-means) for Set1 and Set2. The reason why the proposed method cannot obtain a good result for Set3 seems to be related to the fact that the number of keywords extracted from Set3 is much smaller than the numbers extracted from Set1 and Set2.
That is, it seems that there are fewer topical keywords in the local news category than in the entertainment category. Extracting not only keywords but also phrases will be required to handle this problem.
It is observed that some clusters are generated by both the proposed method and the k-means clustering method. As k-means clustering tends to generate one large cluster, which leads to the large variance of cluster size shown in Table 2, it is also observed that some clusters generated by the proposed method are subsets of a cluster generated by k-means.
6 Multiple keyword selection for a cluster is allowed.
Table 3 and Table 4 show the distribution of the cluster scores, divided into the case in which a cluster is generated by both methods (SAME),
  • 41. the case in which a cluster generated by the proposed method is a subset of a k-means cluster (SUBSET), and the others (DIFFERENT).
Table 3: Score Distribution of Clusters Generated by Plastic Clustering Method
Score  SAME      SUBSET    DIFFERENT  TOTAL
1      0(0%)     1(8%)     4(22%)     5(11%)
2      2(14%)    2(15%)    1(6%)      5(11%)
3      0(0%)     0(0%)     0(0%)      0(0%)
4      7(50%)    8(62%)    10(55%)    25(56%)
5      5(36%)    2(15%)    3(17%)     10(22%)
Total  14(100%)  13(100%)  18(100%)   45(100%)
Table 4: Score Distribution of Clusters Generated by K-means Clustering Method
Score  SAME      SUBSET    DIFFERENT  TOTAL
1      1(7%)     1(10%)    2(20%)     4(12%)
2      1(7%)     2(20%)    2(20%)     5(15%)
3      0(0%)     0(0%)     0(0%)      0(0%)
4      6(43%)    4(40%)    2(20%)     12(35%)
5      6(43%)    3(30%)    4(40%)     13(38%)
Total  14(100%)  10(100%)  10(100%)   34(100%)
From these tables, it can be seen that the clusters generated by both methods obtain higher scores than the others. Although the scores of the clusters in SUBSET and DIFFERENT are lower than those in SAME, the proposed method obtains a larger share of good scores (4 and 5) than k-means clustering.
As for the understandability, Table 5 shows the ratio of the related keywords that are marked by more than one subject among the related keywords presented to them. Table 5 shows that the ratio becomes high when the clustering results obtain high scores in terms of coherence, i.e., the results for Set1 and Set2 by the proposed method, and the results for Set1 and Set3 by the k-means clustering method. That is, a cluster with a high score relates to a certain, obvious topic, which can be understood by several subjects from the same viewpoint.
4 WWW Information Visualization System with Immune Network Metaphor
An information visualization system is one of the promising approaches for handling the growing WWW information resource.
An information visualization system that aims to support the browsing process often tries to make a link structure easy to understand by using 3D graphics as well as by introducing interaction with the user [16]. When an information visualization system is designed to support the information retrieval process using WWW search engines, it often employs a document clustering method to improve the efficiency of browsing retrieval results [4, 18, 19].
Table 5: Ratio of Keywords Extracted More Than Once
Document Set  Set1   Set2   Set3
Proposed      0.286  0.368  0.167
K-means       0.304  0.095  0.241
On the other hand, a keyword map [6, 9, 12, 16], which has not been so famous in
  • 42. the field of WWW information visualization, is useful for visualizing the topic distribution over a set of documents. Visualizing the topic distribution is also expected to be suitable for supporting an interactive information gathering process.
In the proposed method, as a landmark suppresses the related keywords on the constructed keyword network, this relationship among keywords is also useful as a metaphor to improve the understandability of a keyword map, as shown in Fig. 1. While the ordinary keyword map uses only the distance information, the immune network metaphor is used to improve the keyword map by emphasizing the keyword clusters whose representative is a landmark. In Fig. 1, the immune network metaphor is incorporated into the spring model [16], so that the spring constant of a spring connected to a landmark can be set to be stronger than the others, and the length of a spring between landmarks can be set to be longer than the others. A landmark is indicated in white, while a dark-colored keyword is one suppressed by a landmark. In Fig. 1, five distinct topics represented by landmarks and their related keywords are shown clearly, while the suppressed keywords "Terrorism" and "Simultaneous" are arranged near the center of the map, because the topic of the N.Y. tragedy is contained in many documents.
Figure 1: Keyword Map Generated from Set1
5 Conclusion
A clustering method based on the immune network model is proposed to visualize the topic distribution over a document set found on the WWW. The method extracts keywords that can serve as landmarks of the major topics in a document set, and document clustering is performed with these keywords. The proposed method employs the immune network model to calculate the activation values of keywords.
Questionnaires are conducted to compare the clusters generated by the proposed method with those generated by the k-means clustering method; the results show that the proposed method obtains better coherence than k-means in two of the three document sets. From the viewpoint of understandability, it is shown that the landmarks and their related keywords can represent the topic of a cluster.
  • 43. Furthermore, the immune network metaphor is incorporated into an ordinary keyword map to improve its understandability. As future work, ways of incorporating the immune network model into a keyword map will be considered to further improve the understandability of a keyword map.
References
[1] Anderson, R. W., Neumann, A. U., Perelson, A. S., "A Cayley Tree Immune Network Model with Antibody Dynamics," Bulletin of Mathematical Biology, 55, 6, pp. 1091–1131, 1993.
[2] Cole, C., "Interaction with an Enabling Information Retrieval System: Modeling the User's Decoding and Encoding Operations," Journal of the American Society for Information Science, 51, 5, pp. 417–426, 2000.
[3] Duda, R. O., Hart, P. E., Stork, D. G., "10. Unsupervised Learning and Clustering," in Pattern Classification (2nd Ed.), Wiley, New York, 2000.
[4] Hearst, M. A. and Pedersen, J. O., "Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results," SIGIR '96, pp. 76–84, 1996.
[5] Jerne, N. K., "The Immune System," Sci. Am., 229, pp. 52–60, 1973.
[6] Lagus, K., Honkela, T., Kaski, S., Kohonen, T., "Self-Organizing Maps of Document Collection: A New Approach to Interactive Exploration," 2nd Int'l Conf. on Knowledge Discovery and Data Mining, pp. 238–243, 1996.
[7] Neumann, A. U. and Weisbuch, G., "Dynamics and Topology of Idiotypic Networks," Bulletin of Mathematical Biology, 54, 5, pp. 699–726, 1992.
[8] Smith, D. J., Forrest, S., Perelson, A. S., "Immunological Memory is Associative," Int'l Workshop on the Immunity-Based Systems (IBMS'96), 1996.
[9] Sumi, Y., Nishimoto, K., Mase, K., "Facilitating Human Communication in Personalized Information Spaces," AAAI-96 Workshop on Internet-Based Information Systems, pp. 123–129, 1996.
[10] Sulzer, B. et al., "Memory in Idiotypic Networks Due to Competition Between Proliferation and Differentiation," Bulletin of Mathematical Biology, 55, 6, pp. 1133–1182,
1993.
[11] Takama, Y. and Hirota, K., "Application of Immune Network Model to Keyword Set Extraction with Variety," 6th Int'l Conf. on Soft Computing (IIZUKA2000), pp. 825–830, 2000.
[12] Takama, Y. and Hirota, K., "Development of Visualization Systems for Topic Distribution based on Query network," SIG-FAI-A003, pp. 13–18, 2000.
[13] Takama, Y. and Hirota, K., "Employing Immune Network Model for Clustering with Plastic Structure," 2001 IEEE Int'l Symp. on Computational Intelligence in Robotics and Automation (CIRA2001), pp. 178–183, 2001.
[14] Takama, Y. and Hirota, K., "Consideration of Memory Cell for Immune Network-based Plastic Clustering method," InTech'2001, pp. 233–239, 2001.
[15] Takama, Y. and Ishizuka, "FISH VIEW System: A Document Ordering Support System Employing Concept-structure-based Viewpoint Extraction," J. of Information Processing Society of Japan (IPSJ), 42, 7, 2000 (written in Japanese).
[16] Takasugi, K. and Kunifuji, S., "A Thinking Support System for Idea Inspiration Using Spring Model," J. of Japanese Society for Artificial Intelligence, 14, 3, pp. 495–503, 1999 (written in Japanese).
[17] Watanabe, I., "Visual Text Mining," J. of Japanese Society for Artificial Intelligence, 16, 2, pp. 226–232, 2001 (written in Japanese).
[18] Zamir, O. and Etzioni, O., "Grouper: A Dynamic Clustering Interface to Web Search Results," Proc. 8th Int'l WWW Conference, 1999.
[19] Zamir, O. and Etzioni, O., "Web Document Clustering: A Feasibility Demonstration," Proc. SIGIR'98, pp. 46–54, 1998.
  • 45. Active Mining H. Motoda (Ed.) IOS Press, 2002
Interactive Web page Retrieval with Relational Learning based Filtering Rules
Masayuki Okabe okabe@mm.media.kyoto-u.ac.jp Japan Science and Technology CREST, Yoshida-Nihonmatsu-Cho, Sakyo-ku, Kyoto 606-8501, JAPAN
Seiji Yamada yamada@ymd.dis.titech.ac.jp CISS, IGSSE, Tokyo Institute of Technology, 4259 Nagatuta-Cho, Midori-ku, Yokohama 226-8502, JAPAN
Abstract. WWW search engines usually return a hit-list including many irrelevant pages, because most users input just a few words as a query, which is not enough to specify their information needs. In this paper we propose a system which applies relevance feedback to the interactive process between users and Web search engines, and improves the effectiveness of the process by using a query-specific filter. This filter is a set of rules which represents the characteristics of the Web pages that a user marked as relevant, and is used to find new relevant Web pages among the unidentified pages in a hit-list. Each of the rules is made of logical and proximity relationships among keywords which exist in a certain range of a Web page. That range is one of the areas partitioned by four kinds of HTML tags. The filter is made by a learning algorithm which adopts a separate-and-conquer strategy and top-down heuristic search with limited backtracking. In experiments with 20 different kinds of retrieval tests, we demonstrate that, as the number of feedback iterations increases, our proposed system makes it possible to get more relevant pages than without the system. We also analyze how the filters work.
1 Introduction
With the rapid growth of the WWW, there are various information sources on the Internet today. Search engines are indispensable tools for accessing useful information which might exist somewhere on the Internet.
While they have become increasingly capable of meeting various information needs and handling large numbers of transactions, they are still insufficient in their ability to support users who want to collect a certain number of Web pages relevant to their requirements. When a user inputs a query, which is usually composed of a few words [1], search engines return a "hit-list" in which a great many Web pages are presented in a certain order. However, this order often does not reflect the user's intent, and thus the user wastes much time and energy on judging Web pages in the hit-list.
To resolve this problem and to provide an efficient retrieval process, we propose a system which mediates between users and search engines in order to select only relevant Web pages out of a hit-list through the interactive process called "relevance feedback" [8]. Given some Web pages marked with their relevancy (relevant or non-relevant) by a user, this system generates a set of filtering rules, each of which is a rule to decide whether
  • 46. 32 M. Okabe and S. Yamada / Interactive Web Page Retrieval
the user should look at a Web page or not. The system constructs filtering rules from combinations of keywords, relational operators and tags by a learning algorithm which is well suited to learning structural patterns. We have developed this basic framework in document retrieval [6] and found our approach promising. In this paper, we apply this method to an intelligent interface which coordinates the hit-lists of search engines so that individual users can easily find the information they want.
The remainder of the paper is organized as follows. Section 2 describes the interactive process and how filtering rules are applied. Section 3 describes the representation and the learning algorithm of filtering rules. Section 4 shows the results of retrieval experiments to evaluate our system.
Figure 1: Interactive Web search
2 Interactive Web search with relevance feedback
Figure 1 shows the overview of interactive Web search with relevance feedback. In this section, we explain the procedure of each step in this search process. The numbers assigned to the steps correspond to the numbers in circles in Figure 1.
1. Initial search: A user inputs a query (a set of terms) to our Web search system. Then the system passes the query to a search engine and obtains a hit-list.
2. Evaluation of results by a user: After getting a hit-list from a search engine, the system asks the user to evaluate and mark the relevancy (relevant or non-relevant) of a small part of the Web pages in the hit-list (usually the top 10 pages), and stores those pages as training pages: the relevant pages as positive training pages and the non-relevant pages as negative training pages.
3. Analyzing training pages: Then the system breaks up each positive training page into the minimal elements which can be a part of filtering rules. The concrete procedures are the following.
  • 47.
• Generating candidates for additional keywords: The extended keywords are the terms which can be substituted for the arguments of a predicate. It is often said that users usually input only a few terms, which are quite insufficient not only for specifying Web pages but also for making effective filtering rules; thus this procedure is very important for widening the variation of rule representation. Our system uses the TFIDF method [4] to extract additional keywords.
• Generating literals for constructing bodies of filtering rules: Using the extended keywords, the system generates literals which can be elements of the body of a filtering rule. The set of these literals is called a condition candidate set and is used to construct the body of a filtering rule.
4. Generating filtering rules by learning: Using the condition candidate set, the system generates filtering rules by relational learning. The detailed procedures are developed in the next section.
5. Modifying the query and re-searching: The system expands the query using terms which have been extracted through the analysis of training pages. Then the modified query is input into a search engine and new results are obtained.
6. Selecting and indicating the Web pages satisfying filtering rules: As shown in Figure 2, the system selects the Web pages satisfying the filtering rules from the hit-list returned by the search engine, and indicates them to the user. The pages which the user has already evaluated are eliminated from the indication.
Figure 2: Filtering Web Pages (original hit list: No.1 page1, No.2 page2, No.3 page3, No.4 page4, No.5 page5; modified hit list: No.1 page2, No.2 page4, No.3 page5; O: marked as relevant by a set of filtering rules, x: marked as non-relevant by a set of filtering rules)
The information retrieval is done using the above procedures, and Steps 2 to 6 are repeated until the user collects enough relevant pages.
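The keyword-candidate step above relies on TFIDF weighting, whose exact variant is not given in the text. The following sketch uses one common form (raw term frequency times log inverse document frequency, summed over the positive training pages); this choice, the function name `tfidf_candidates`, and the token-list input format are assumptions made for illustration, not the system's actual procedure.

```python
import math
from collections import Counter


def tfidf_candidates(positive_pages, all_pages, top_n=5):
    """Rank terms of the positive pages by their summed TF-IDF weight.

    Each page is a list of tokens. `all_pages` must include the positive
    pages, so that every scored term has a document frequency of at least 1.
    """
    n_pages = len(all_pages)
    doc_freq = Counter()
    for tokens in all_pages:
        doc_freq.update(set(tokens))  # count each term once per page

    scores = Counter()
    for tokens in positive_pages:
        for term, count in Counter(tokens).items():
            idf = math.log(n_pages / doc_freq[term])
            scores[term] += count * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]
```

Terms that occur in every training page receive a weight of zero under this variant, so they are unlikely to be proposed as additional keywords.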
This system provides the following two functions, which are used for filtering the results of simple relevance feedback.
• Modifying the query and re-searching (corresponding to Step 5)
• Selecting and indicating the Web pages satisfying filtering rules (corresponding to Step 6)
The search engine usually selects the candidates for relevant Web pages and ranks them before returning a hit-list. By modifying a query and re-searching, the system is able to modify the ranking. Also, by selecting and indicating the Web pages satisfying filtering rules, the filter is modified.
  • 48. The modification of a query is done by using query expansion techniques, which have been well studied in information retrieval [9, 10]. Thus we omit the discussion of query modification in this paper. We develop the representation and generation of filtering rules using the structure of HTML files in the next section.
3 Filtering rules
This section explains the representation and the generation of filtering rules in detail. We deal with the construction of filtering rules as inductive learning in machine learning, in which the relevant and non-relevant pages indicated by the user are used as training examples.
3.1 Rule representation
We use Horn clauses to represent filtering rules. The body of a rule consists of the following predicates standing for relations between terms and tags.
• ap(region_type, word): This predicate is true iff the word word appears within a region of region_type in a Web page.
• near(region_type, word1, word2): This predicate is true iff both words word1 and word2 appear within a sequence of 10 words somewhere in a region of region_type of a Web page. The ordering of the two words is not considered.
The predicates ap and near represent basic relations between keyword(s) and the position of the keyword(s). Several types of relations among keywords can be assumed; however, we use only the neighbor relation because it has been proven to be very useful in several studies [2, 5].
Furthermore, the importance of a word can be expected to depend significantly on the HTML tags around it. For example, the words within <TITLE> seem to have significant meaning because they indicate the theme of the Web page. Hence we use the region-type to restrict the tag with which words are surrounded. We prepare the following region-types.
• title : The region surrounded with title tags <TITLE>.
• anchor : The region surrounded with anchor tags <A>. For example, the <A HREF=. . . >.
• head : The region surrounded with heading tags <H1~4>.
• para : The region surrounded with paragraph tags <P>. This means the region of the same paragraph.
We can represent various features of pages by combining these relations. Here is an example set of rules.
relevant :- ap(title, mobile), ap(anchor, PDA).
relevant :- near(para, palm, os).
Filtering rules are interpreted as a disjunction. Thus if any rule is satisfied by a Web page, the page is considered relevant, and otherwise non-relevant. The above filtering rules mean that a Web page is relevant if "mobile" appears in the title and "PDA" appears in an anchor text, or if "palm" and "OS" appear near each other in the same paragraph.
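How such a rule set is applied to a page can be sketched as follows. The page representation (a dictionary mapping each region-type to a list of regions, each region being a list of lower-cased words), the tuple encoding of literals, and the function names are hypothetical choices for illustration; the HTML parsing that would produce the regions is omitted. "Within a sequence of 10 words" is read here as word positions differing by at most 9, which is an assumption about the intended window.

```python
def ap(page, region_type, word):
    """True iff `word` appears in some region of the given type."""
    return any(word in region for region in page.get(region_type, []))


def near(page, region_type, word1, word2, window=10):
    """True iff both words fall within `window` consecutive words of one region."""
    for region in page.get(region_type, []):
        pos1 = [i for i, w in enumerate(region) if w == word1]
        pos2 = [i for i, w in enumerate(region) if w == word2]
        # Order of the two words is ignored, as in the definition of near.
        if any(abs(i - j) < window for i in pos1 for j in pos2):
            return True
    return False


PREDICATES = {"ap": ap, "near": near}


def is_relevant(page, rules):
    """A rule set is a disjunction; each rule body is a conjunction of literals."""
    return any(
        all(PREDICATES[name](page, *args) for name, *args in body)
        for body in rules
    )


# The example rule set from the text, with keywords lower-cased.
rules = [
    [("ap", "title", "mobile"), ("ap", "anchor", "pda")],
    [("near", "para", "palm", "os")],
]
```

For instance, a page whose title region contains "mobile" and one of whose anchor texts contains "pda" satisfies the first rule, so `is_relevant` returns True without the second rule being needed.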
  • 49.

Input:
    E+ : a set of positive training pages, E− : a set of negative training pages
    C : a condition candidate set, K : a set of extended keywords
Output:
    R : a set of filtering rules
Variables:
    rule : a filtering rule, S : a set of exception literals, l1 : an exception literal
Initialize:
    K ← the set of words in a query. R, S, l1 ← empty, rule ← "relevant :-".
Repeat
     1: Count the number p of positive training pages satisfying the rule and the number n of negative training pages satisfying the rule.
     2: if n = 0 then
     3:     Add rule to R.
     4:     Remove the positive training pages satisfying the rule from E+.
     5:     if E+ is empty then Finish
     6:     else Initialize rule, S, l1.
     7: else
     8:     For all literals in C but not in S, compute the information gain G.
     9:     if no literal has G > 0 then
    10:         if the body of the rule is empty then
    11:             Add a keyword to K.
    12:             Update C.
    13:         else
    14:             Initialize S and rule.
    15:             Add l1 to S, and initialize l1.
    16:     else
    17:         Select lmax having the maximum G.
    18:         if the body of the rule is empty, then l1 := lmax
    19:         Add lmax to rule, and S.

Figure 3: Learning Algorithm

3.2 Learning algorithm

Figure 3 shows the learning algorithm for making filtering rules. The algorithm is based on the first-order learning system FOIL [7], which adopts a greedy separate-and-conquer strategy [3]. It generates filtering rules one by one and adds each generated rule to R. When a rule is generated, the pages covered by the rule are removed from the set of positive training pages E+. Thus, as the number of generated filtering rules increases, E+ shrinks, and the algorithm finishes when E+ becomes empty (steps 3-5). In the generation of a single filtering rule, literals are added to the body one at a time (step 19), and the rule is established once it covers no negative training page (step 2).

The added literal is selected from a condition candidate set C. C consists of the literals that take the region types and the keywords in K as their arguments and that are satisfied in the training pages. Concretely, the following two types of literals are used.

• The ap literals having all of the region types and keywords in K as their arguments and being satisfied in training pages.
  • 50.
• The near literals having all of the region types and keywords in K as their arguments and being satisfied in training pages.

The criterion for selecting the literal to be added to the body is based on the information gain (step 8). It is computed by equations (not legible in this copy) over the numbers of positive/negative training pages before/after the addition of a literal; this measure is popular in the learning of decision trees. Using the information gain, the system is able to select a literal which not only yields much information about the training pages but is also satisfied by many positive training pages (step 17).

This rule construction using information gain is efficient because it is greedy. However, it sometimes selects a bad literal and stops before completion. In such a case, if the current rule has some literals in its body, the algorithm eliminates all the literals in the body and restarts the rule-making process. This backtracking is done over the literals in C except for the literal l1 which was first added to the body (steps 14, 15). If the body of the current rule has no literal, a new keyword is added to K and C is updated (steps 11, 12). The added keyword is selected from the terms in the positive training pages E+ by the following procedure.

1. Extract paragraphs from E+ using <P> tags.
2. Find the subset of the paragraphs that include any word in the query; this subset is called T.
3. Compute the importance of every word wi in T by the following equation.

    Importance of wi = (average occurrence in T) × (the number of texts in which wi occurs)

4. Select the word which has the maximum importance and is not included in the query.

Backtracking and the iterative literal-making process are the main differences from the algorithm in FOIL. They are very specific, empirical procedures. Without these extensions, however, many useless rules would be generated.
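The two scoring steps of Section 3.2 can be illustrated with a short Python sketch. Because the paper's own gain equations did not survive in this copy, the gain below is the standard FOIL information gain, used here as an assumption and not as the authors' exact formula; p0/n0 and p1/n1 are the numbers of positive/negative training pages satisfying the rule before and after a literal is added. The importance score likewise assumes that "average occurrence in T" means the total count of a word divided by the number of paragraphs in T. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Set


def foil_gain(p0: int, n0: int, p1: int, n1: int) -> float:
    """Standard FOIL gain (assumed form; the source equation is missing)."""
    if p0 == 0 or p1 == 0:
        # Assumption: a literal that keeps no positive page earns no gain.
        return 0.0
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return p1 * (after - before)


def word_importance(paragraphs: List[List[str]]) -> Dict[str, float]:
    """Importance = (average occurrences per paragraph) x (paragraphs containing the word)."""
    if not paragraphs:
        return {}
    total = Counter(tok for para in paragraphs for tok in para)
    doc_freq = Counter(tok for para in paragraphs for tok in set(para))
    return {w: (total[w] / len(paragraphs)) * doc_freq[w] for w in total}


def select_keyword(paragraphs: List[List[str]], query: Set[str]) -> Optional[str]:
    """Pick the most important word of T that is not already in the query."""
    # T: the paragraphs that contain at least one query word (step 2).
    t = [para for para in paragraphs if query & set(para)]
    scores = {w: s for w, s in word_importance(t).items() if w not in query}
    return max(scores, key=scores.get) if scores else None
```

With four positive and four negative pages before a literal is added and three positive, zero negative pages afterwards, this assumed gain is 3 × (log2(1) − log2(0.5)) = 3.0, which rewards literals that both purify the rule and keep many positive pages, in line with the description of step 17.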
4 Experiments and Results

To evaluate the effectiveness of filtering rules, we conducted retrieval experiments. The question here is how many more relevant pages we can find with our proposed system when we look over a fixed number of Web pages.
  • 51.

Figure 4: An example of topic
Figure 5: System Interface

4.1 Settings

We conducted two series of retrievals. The first is a retrieval from the original hit-list returned by a search engine (retrieval1). In this retrieval, we judged the 50 pages from the top of the hit-list. The other is a retrieval using our system (retrieval2). In this retrieval, we gave feedback after every 10 judged pages, according to the procedure described in Section 2, for a total of four feedbacks. The 10 pages after each feedback are collected from the top of the hit-list, excluding the pages we have already judged and the pages that do not satisfy the filtering rules. In both retrievals, a total of 50 pages from the same hit-list were evaluated.

We used Google¹ as the test WWW search engine; it is recognized as one of the most powerful search engines. For test questions, we used 20 topics (No. 401 to 420) provided by the small web track in TREC-8². This test collection is often used for evaluating the performance of retrieval systems in the information retrieval community. Figure 4 is an example of a topic, which is composed of four parts. The title part consists of 1 to 3 words. We used these title words as the query for the search engine. The relevance judgment of each page was made by the same searcher according to the account written in the description and narrative parts of each topic.

4.2 Interface

Figure 5 shows the system interface, which consists of a query input, a rule view, a title view and several buttons. When the user presses the make rule button, filtering rules are constructed and displayed in the rule view. Because the rules can be seen directly, the user can find useful patterns or keywords for retrieving relevant pages. Once rules are constructed, the system starts to collect new relevant pages and displays their titles in the title view. If the user clicks a title, a browser opens and shows the clicked page.
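The page-collection step of retrieval2 in Section 4.1 can be sketched as follows. This is a minimal Python illustration, not the authors' system: next_batch and its parameters are hypothetical names, the hit-list is assumed to be a ranked list of page identifiers, and satisfies_rules stands in for whatever check applies the current filtering rules to a page.

```python
from typing import Callable, Iterable, List, Set


def next_batch(hit_list: Iterable[str],
               judged: Set[str],
               satisfies_rules: Callable[[str], bool],
               size: int = 10) -> List[str]:
    """Collect the next `size` pages from the top of the hit-list,
    skipping pages already judged and pages the filtering rules reject."""
    batch: List[str] = []
    for page in hit_list:
        if page in judged or not satisfies_rules(page):
            continue
        batch.append(page)
        if len(batch) == size:
            break
    return batch


def run_feedback_rounds(hit_list: List[str],
                        learn_rules: Callable[[Set[str]], Callable[[str], bool]],
                        rounds: int = 4,
                        size: int = 10) -> List[str]:
    """Judge the first `size` pages, then alternate rule learning and collection.

    `learn_rules` is a placeholder for the rule learner: it receives the set of
    pages judged so far and returns a predicate over pages.
    """
    judged_order: List[str] = list(hit_list[:size])
    for _ in range(rounds):
        judged = set(judged_order)
        rules = learn_rules(judged)
        judged_order.extend(next_batch(hit_list, judged, rules, size))
    return judged_order
```

With four rounds of 10 pages after the initial 10, at most 50 pages are judged, which matches the total evaluated in both retrievals. If the rules reject too many pages, a round may return fewer than 10; how the original system handled that case is not stated in the text.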
4.3 Results

Figure 6 shows the relation between the number of judged pages and the number of relevant pages found among them. The number of relevant pages is the average over the 20 topics. For the first 10 pages there is no difference, because both retrievals return the same pages. The difference in the number of relevant pages increases after the first feedback. As a result, retrieval2 found about 5 more relevant pages than retrieval1 after four feedbacks. However, the difference varies from topic to topic.

¹ http://www.google.com
² http://trec.nist.gov

  • 52.

Figure 6: The average number of relevant pages (horizontal axis: the number of judged pages)
Figure 7: Difference after the first feedback (total 20 pages judged; horizontal axis: topic number)
Figure 8: Difference after the second feedback (total 30 pages judged; horizontal axis: topic number)
Figure 9: Difference after the third feedback (total 40 pages judged; horizontal axis: topic number)
Figure 10: Difference after the fourth feedback (total 50 pages judged; horizontal axis: topic number)

Figures 7 to 10 show the difference in relevant pages between retrieval1 and retrieval2 after each feedback. Let A be the number of relevant pages found in retrieval1 and B the number found in retrieval2; the difference D is calculated as D = B - A. In Figure 7, there is little effect of our system because only a small number of pages have been judged. In Figures 8 and 9, the effect gradually increases. In Figure 10, we can see the effect clearly. Our system produces good results for most topics, except for a few such as no. 4 and no. 11.

4.4 Effective and ineffective filtering rules

As seen in the results, retrieval using our system enhanced the effectiveness for most topics. We show two types of examples: a good one, in which our system worked effectively, and a bad one, in which our system did not work well.
  • 53.

Table 1: Filtering rules generated for topic no. 12

    relevant :- ap(anchor, screening).
    relevant :- near(para, security, system), ap(title, airport).
    relevant :- near(para, security, airports), near(para, security, access).
    relevant :- near(para, security, airports), near(para, faa, system).

Table 2: Filtering rules generated for topic no. 11

    relevant :- ap(anchor, shipwreck).
    relevant :- ap(anchor, shipwreck), ap(anchor, salvaging).

Topic 12 is the example in which filtering rules worked most effectively. The objective of topic 12 is "to identify a specific airport and describe the security measures already in effect or proposed for use at that airport". The search engine returns many non-relevant pages which introduce "the security which travelers must prepare". By removing such pages with filtering rules, our system could provide proper results. Table 1 shows the filtering rules generated for this topic. These rules represent the pages which introduce specific security systems by using the words "faa" and "screening".

Topic 11 is an example in which filtering rules did not work well. The objective of this topic is "To find information on shipwreck salvaging: the recovery or attempted recovery of treasure from sunken ships". Relevant pages for this topic include various types of pages such as links, bulletin boards, news and individual home pages. The filtering rules generated for this topic are too general or too specific, so they could not select appropriate pages, which led to the bad results. Table 2 shows the filtering rules generated for this topic. These rules use only two keywords, which are insufficient to restrict the relevant pages.

5 Conclusion

We described a system which enhances the effectiveness of a WWW search engine by using relevance feedback and relational learning. The main function of our system is the application of filtering rules, which are constructed by a relational learning technique. We presented their representation and learning algorithm. Then we evaluated their effectiveness through retrieval experiments. The results showed that our system enables us to find more relevant pages, though the effect differs from question to question.

Our system needs quick response and moderate machine power. It should therefore be a user-side application, because search engines cannot afford to provide such a function. One future problem is to reduce the cost users incur in judging pages. We plan to apply clustering methods to this problem.

References

[1] Baeza-Yates, R. and Ribeiro-Neto, B.: Modern Information Retrieval, Addison-Wesley, Wokingham, UK (1999)
[2] Cohen, W.W.: Text categorization and relational learning, In Proceedings of the Twelfth International Conference on Machine Learning, pp. 124–132 (1995)
  • 54. Another Random Scribd Document with Unrelated Content
  • 55. major-general in 1813, and employed in the North; but his operations were unsuccessful, owing to a disagreement with Wade Hampton. A court of inquiry in 1815 exonerated him, however; but upon the reorganizing of the army, he was not retained in the service, and retired to Mexico, where he had acquired large estates. He died in the vicinity of the capital on the 28th of December, 1825.
  • 56. CHEVALIER DE LA NEUVILLE. Chevalier de la Neuville, born about 1740, came to this country with his younger brother in the autumn of 1777, and tendered his services to Congress. Having served with distinction in the French army for twenty years, enjoying the favorable opinion of Lafayette, and bringing with him the highest testimonials, he was appointed on the 14th of May, 1778, inspector of the army under Gates, with the promise of rank according to his merit at the end of three months. He was a good officer and strict disciplinarian, but was not popular with the army. Failing to obtain the promotion he expected, he applied for permission to retire at the end of six months’ service. His request was granted on the 4th of December, 1778, Congress instructing the president that a certificate be given to Monsieur de la Neuville in the following words:— “Mr. de la Neuville having served with fidelity and reputation in the army of the United States, in testimony of his merit a brevet commission of brigadier has been granted to him by Congress, and on his request he is permitted to leave the service of these States and return to France.” The brevet commission was to bear date the 14th of October, 1778. Having formed a strong attachment for General Gates, they corresponded after De la Neuville’s return to France. In one of his letters the chevalier writes that he wishes to return to America, “not as a general, but as a philosopher,” and to purchase a residence near that of his best friend, General Gates. He did not return, however, and his subsequent history is lost amid the troubles of the French Revolution.
  • 58. JETHRO SUMNER. Jethro Sumner, born in Virginia about 1730, was of English parentage. Removing to North Carolina while still a youth, he took an active part in the measures which preceded the Revolution, and believed the struggle to be unavoidable. Having held the office of paymaster to the Provincial troops, and also the command at Fort Cumberland, he was appointed in 1776, by the Provincial Congress, colonel in the Third North Carolina Regiment, and served under Washington at the North. On the 9th of January, 1779, he was commissioned brigadier-general, and ordered to join Gates at the South. He took part in the battle of Camden, and served under Greene at the battle of Eutaw Springs on the 8th of September, 1781, where he led a bayonet-charge. He served to the close of the war, rendering much assistance in keeping the Tories in North Carolina in check during the last years of the struggle, and died in Warren County, North Carolina, about 1790.
  • 59. JAMES HOGAN. James Hogan of Halifax, North Carolina, was chosen to represent his district in the Provincial Congress that assembled on the 4th of April, 1776. Upon the organization of the North Carolina forces, he was appointed paymaster of the Third Regiment. On the 17th of the same month, he was transferred to the Edenton and Halifax Militia, with the rank of major. His military services were confined to his own State, though commissioned brigadier-general in the Continental army on the 9th of January, 1779.
  • 60. ISAAC HUGER. Isaac Huger, born at Limerick Plantation at the head-waters of Cooper River, South Carolina, on the 19th of March, 1742, was the grandson of Huguenot exiles who had fled to America after the revocation of the Edict of Nantes. Inheriting an ardent love of civil and religious liberty, reared in a home of wealth and refinement, thoroughly educated in Europe and trained to military service through participation in an expedition against the Cherokee Indians, he was selected on the 17th of June, 1775, by the Provincial Congress, as lieutenant-colonel of the First South Carolina Regiment. Being stationed at Fort Johnson, he had no opportunity to share in the defeat of the British in Charleston Harbor, as Colonel Moultrie’s victory at Sullivan’s Island prevented premeditated attack on the city. During the two years of peace for the South that followed, Huger was promoted to a colonelcy, and then ordered to Georgia. His soldiers, however, were so enfeebled by sickness, privation, and toil that when called into action at Savannah, they could only show what they might have accomplished under more favorable circumstances. On the 9th of January, 1779, Congress made him a brigadier- general; and until the capture of Charleston by the British in May, 1780, he was in constant service either in South Carolina or Georgia. Too weak to offer any open resistance, the patriots of the South were compelled for a time to remain in hiding, but with the appearance of Greene as commander, active operations were resumed. Huger’s thorough knowledge of the different localities and his frank fearlessness gained him the confidence of his superior officer, and it was to his direction that Greene confided the army on several
  • 61. occasions, while preparing for the series of engagements that culminated in the evacuation of Charleston and Savannah. Huger commanded the Virginia troops at the battle of Guilford Court- House, where he was severely wounded; and at Hobkirk’s Hill he had the honor of commanding the right wing of the army. He served to the close of the war; and when Moultrie was chosen president, he was made vice-president, of the Society of the Cincinnati of South Carolina. Entering the war a rich man, he left it a poor one; he gave his wealth as freely as he had risked his life, and held them both well spent in helping to secure the blessings of liberty and independence to his beloved country. He died on the 17th of October, 1797, and was buried on the banks of the Ashley River, South Carolina.
  • 62. MORDECAI GIST. Mordecai Gist, born in Baltimore, Maryland, in 1743, was descended from some of the earliest English settlers in that State. Though trained for a commercial life, he hastened at the beginning of the Revolution to offer his services to his country, and in January, 1775, was elected to the command of a company of volunteers raised in his native city, called the “Baltimore Independent Company,”—the first company raised in Maryland for liberty. In 1776, he rose to the rank of major, distinguishing himself whenever an occasion offered. In 1777, he was made colonel, and on the 9th of January, 1779, Congress recognized his worth by conferring on him the rank of brigadier-general. It is with the battle of Camden, South Carolina, that Gist’s name is indissolubly linked. The British having secured the best position, Gates divided his forces into three parts, assigning the right wing to Gist. By a blunder in an order issued by Gates himself, the centre and the left wing were thrown into confusion and routed. Gist and De Kalb stood firm, and by their determined resistance made the victory a dear one for the British. When the brave German fell, Gist rallied about a hundred men and led them off in good order. In 1782, joining the light troops of the South, he commanded at Combahee—the last engagement in the war—and gained a victory. At the close of the war he retired to his plantation near Charleston, where he died in 1792. He was married three times, and had two sons, one of whom he named “Independent” and the other “States.”
  • 63. WILLIAM IRVINE. William Irvine, born near Enniskillen, Ireland, on the 3d of November, 1741, was educated at Trinity College, Dublin. Though preferring a military career, he adopted the medical profession to gratify the wishes of his parents. During the latter part of the Seven Years War between England and France, he served as surgeon on board a British man-of-war, and shortly before the restoration of peace, he resigned his commission, and coming to America in 1764, settled at Carlisle, Pennsylvania, where he soon acquired a great reputation and a large practice. Warm-hearted and impulsive, at the opening of the Revolution he adopted the cause of the colonists as his own, and after serving in the Pennsylvania Convention, he was commissioned in 1776 to raise a regiment in that State. At the head of these troops, he took part in the Canadian expedition of that year, and being taken prisoner, was detained for many months. He was captured a second time at the battle of Chestnut Hill, New Jersey, in December, 1777. On the 12th of May, 1779, Congress conferred on him the rank of brigadier-general. From 1782 until the close of the war, he commanded at Fort Pitt,—an important post defending the Western frontier, then threatened by British and Indians. In 1785, he was appointed an agent to examine the public lands, and to him was intrusted the administration of an act for distributing the donation lands that had been promised to the troops of the Commonwealth. Appreciating the advantage to Pennsylvania of having an outlet on Lake Erie, he suggested the purchase of that tract of land known as “the triangle.” From 1785 to 1795, he filled various civil and military offices of responsibility. Being sent to treat with those connected with the Whiskey Insurgents, and failing to quiet them by arguments, he was given command of the Pennsylvania Militia to
  • 64. carry out the vigorous measures afterward adopted to reduce them to order. In 1795, he settled in Philadelphia, held the position of intendant of military stores, and was president of the Pennsylvania Society of the Cincinnati until his death on the 9th of July, 1804.
  • 65. DANIEL MORGAN. Daniel Morgan, born in New Jersey about 1736, was of Welsh parentage. His family having an interest in some Virginia lands, he went to that colony at seventeen years of age. When Braddock began his march against Fort Duquesne, Morgan joined the army as a teamster, and did good service at the rout of the English army at Monongahela, by bringing away the wounded. Upon returning from this disastrous campaign, he was appointed ensign in the colonial service, and soon after was sent with important despatches to a distant fort. Surprised by the Indians, his two companions were instantly killed, while he received a rifle-ball in the back of his neck, which shattered his jaw and passed through his left cheek, inflicting the only severe wound he received during his entire military career. Believing himself about to die, but determined that his scalp should not fall into the hands of his assailants, he clasped his arms around his horse’s neck and spurred him forward. An Indian followed in hot pursuit; but finding Morgan’s steed too swift for him, he threw his tomahawk, hoping to strike his victim. Morgan however escaped and reached the fort, but was lifted fainting from the saddle and was not restored to health for six months. In 1762, he obtained a grant of land near Winchester, Virginia, where he devoted himself to farming and stock-raising. Summoned again to military duty, he served during the Pontiac War, but from 1765 to 1775 led the life of a farmer, and acquired during this period much property. The first call to arms in the Revolutionary struggle found Morgan ready to respond; recruits flocked to his standard; and at the head of a corps of riflemen destined to render brilliant service, he marched away to Washington’s camp at Cambridge. Montgomery
  • 66. was already in Canada, and when Arnold was sent to co-operate with him, Morgan eagerly sought for service in an enterprise so hazardous and yet so congenial. At the storming of Quebec, Morgan and his men carried the first barrier, and could they have been reinforced, would no doubt have captured the city. Being opposed by overwhelming numbers, and their rifles being rendered almost useless by the fast-falling snow, after an obstinate resistance they were forced to surrender themselves prisoners-of-war. Morgan was offered the rank of colonel in the British army, but rejected the offer with scorn. Upon being exchanged, Congress gave him the same rank in the Continental army, and placed a rifle brigade of five hundred men under his command. For three years Morgan and his men rendered such valuable service that even English writers have borne testimony to their efficiency. In 1780, a severe attack of rheumatism compelled him to return home. On the 31st of October of the same year, Congress raised him to the rank of brigadier-general; and his health being somewhat restored, he joined General Greene, who had assumed command of the Southern army. Much of the success of the American arms at the South, during this campaign, must be attributed to General Morgan, but his old malady returning, in March, 1781, he was forced to resign. When Cornwallis invaded Virginia, Morgan once more joined the army, and Lafayette assigned to him the command of the cavalry. Upon the surrender of Yorktown, he retired once more to his home, spending his time in agricultural pursuits and the improvement of his mind. In 1794, the duty of quelling the “Whiskey Insurrection” in Pennsylvania was intrusted to him, and subsequently he represented his district in Congress for two sessions. He died in Winchester on the 6th of July, 1802, and has been called, “The hero of Quebec, of Saratoga, and of the Cowpens; the bravest among the brave, and the Ney of the West.”
  • 67. MOSES HAZEN. Moses Hazen, born in Haverhill, Massachusetts, in 1733, served in the French and Indian War, and subsequently settled near St. Johns, New Brunswick, accumulating much wealth, and retaining his connection with the British army as a lieutenant on half-pay. In 1775, having furnished supplies and rendered other assistance to Montgomery during the Canadian campaign, the English troops destroyed his shops and houses and carried off his personal property. In 1776, he offered his services to Congress, who promised to indemnify him for all loss he had sustained, and appointed him colonel in the Second Canadian Regiment, known by the name of “Congress’s Own,” because “not attached to the quota of any State.” He remained in active and efficient service during the entire war, being promoted to the rank of brigadier-general the 29th of June, 1781. At the close of the war, with his two brothers, who had also been in the army, he settled in Vermont upon land granted to them for their services, and died at Troy, New York, on the 30th of January, 1802, his widow receiving a further grant of land and a pension for life of two hundred dollars.
  • 68. OTHO HOLLAND WILLIAMS. Otho Holland Williams, born in Prince George’s County, Maryland, in 1749, entered the Revolutionary army in 1775, as a lieutenant. He steadily rose in rank, holding the position of adjutant- general under Greene. Though acting with skill and gallantry on all occasions, his fame chiefly rests on his brilliant achievement at the battle of Eutaw Springs, where his command gained the day for the Americans by their irresistible charge with fixed bayonets across a field swept by the fire of the enemy. On the 9th of May, 1782, he was made a brigadier-general, but retired from the army on the 6th of June, 1783, to accept the appointment of collector of customs for the State of Maryland, which office he held until his death on the 16th of July, 1800.
  • 69. JOHN GREATON. John Greaton, born in Roxbury, Massachusetts, on the 10th of March, 1741, was an innkeeper prior to the Revolution, and an officer of the militia of his native town. On the 12th of July, 1775, he was appointed colonel in the regular army. During the siege of Boston, he led an expedition which destroyed the buildings on Long Island in Boston Harbor. In April, 1776, he was ordered to Canada, and in the following December he joined Washington in New Jersey, but was subsequently transferred to Heath’s division at West Point. He served to the end of the war, and was commissioned brigadier- general on the 7th of January, 1783. Conscientiously performing all the duties assigned him, though unable to boast of any brilliant achievements, he won a reputation for sterling worth and reliability. He died in his native town on the 16th of December, 1783, the first of the Revolutionary generals to pass away after the conclusion of peace.
  • 70. RUFUS PUTNAM. Rufus Putnam, born in Sutton, Massachusetts, on the 9th of April, 1738, after serving his apprenticeship as a millwright, enlisted as a common soldier in the Provincial army in 1757. At the close of the French and Indian War, he returned to Massachusetts, married, and settled in the town of New Braintree as a miller. Finding a knowledge of mathematics necessary to his success, he devoted much time to mastering that science. In 1773, having gone to Florida, he was appointed deputy-surveyor of the province by the governor. A rupture with Great Britain becoming imminent, he returned to Massachusetts in 1775, and was appointed lieutenant in one of the first regiments raised in that State after the battle of Lexington. His first service was the throwing up of defences in front of Roxbury. In 1776, he was ordered to New York and superintended the defences in that section of the country and the construction of the fortifications at West Point. In August, Congress appointed him engineer with the rank of colonel. He continued in active service, sometimes as engineer, sometimes as commander, and at others as commissioner for the adjustment of claims growing out of the war, until the disbanding of the army, being advanced to the rank of brigadier-general on the 7th of January, 1783. After the close of the war, Putnam held various civil offices in his native State, acted as aid to General Lincoln during Shays’ Rebellion in 1786, was superintendent of the Ohio Company, founded the town of Marietta in 1788, was appointed in 1792 brigadier-general of the forces sent against the Indians of the Northwest, concluded an important treaty with them the same year, and resigned his commission on account of illness in 1793. During the succeeding ten
  • 71. years, he was Surveyor-General of the United States, when his increasing age compelled him to withdraw from active employment, and he retired to Marietta, where he died on the 1st of May, 1824.
  • 72. ELIAS DAYTON. Elias Dayton, born in Elizabethtown, New Jersey, in July, 1737, began his military career by joining Braddock’s forces, and fought in the “Jersey Blues” under Wolfe at Quebec. Subsequently he commanded a company of militia in an expedition against the Indians, and at the beginning of the Revolution was a member of the Committee of Safety. In July, 1775, he was with the party under Lord Stirling that captured a British transport off Staten Island. In 1776, he was ordered to Canada; but upon reaching Albany he was directed to remain in that part of the country to prevent any hostile demonstration by the Tory element. In 1777, he ranked as colonel of the Third New Jersey Regiment, and in 1781, he materially aided in suppressing the revolt in the New Jersey line. Serving to the end of the war, he was promoted to be a brigadier-general the 7th of January, 1783. Returning to New Jersey upon the disbanding of the army, he was elected president of the Society of the Cincinnati of that State, and died in his native town on the 17th of July, 1807.
  • 73. COUNT ARMAND. Armand Tuffin, Marquis de la Rouarie, born in the castle of Rouarie near Rennes, France, on the 14th of April, 1756, was admitted in 1775 to be a member of the body-guard of the French king. A duel led to his dismissal shortly after. Angry and mortified, he attempted suicide, but his life was saved; and in May, 1777, he came to the United States, where he entered the Continental army under the name of Count Armand. Being granted leave to raise a partisan corps of Frenchmen, he served with credit and great ability under Lafayette, Gates, and Pulaski. At the reorganization of the army in 1780, Washington proposed Armand for promotion, and recommended the keeping intact of his corps. In 1781, he was summoned to France by his family, but returned in time to take part in the siege of Yorktown, bringing with him clothing, arms, and ammunition for his corps, which had been withdrawn from active service during his absence. After the surrender of Cornwallis, Washington again called the attention of Congress to Armand’s meritorious conduct, and he at last received his promotion as brigadier-general on the 26th of March, 1783. At the close of the war he was admitted as a member of the Society of the Cincinnati, and with warmest recommendations from Washington returned to his native country and lived privately until 1788, when he was elected one of twelve deputies to intercede with the king for the continuance of the privileges of his native province of Brittany. For this he was confined for several weeks in the Bastile. Upon his release he returned to Brittany, and in 1789, denounced the principle of revolution and proposed a plan for the union of the provinces of Brittany, Anjou, and Poitou, and the raising
  • 74. of an army to co-operate with the allies. These plans being approved by the brothers of Louis XVI., in December, 1791, Rouarie was appointed Royal Commissioner of Brittany. In March of the year following, the chiefs of the confederation met at his castle; and all was ready for action when they were betrayed to the legislative assembly, and troops were sent to arrest the marquis. He succeeded in eluding them for several months, when he was attacked by a fatal illness and died in the castle of La Guyomarais near Lamballe, on the 30th of January, 1793.
  • 75. THADDEUS KOSCIUSKO. Thaddeus Kosciusko, born near Novogrodek, Lithuania, on the 12th of February, 1746, was descended from a noble Polish family. Studying at first in the military academy at Warsaw, he afterward completed his education in France. Returning to his native country, he entered the army and rose to the rank of captain. Soon after coming to America, he offered his services to Washington as a volunteer in the cause of American independence. Appreciating his lofty character and fine military attainments, Washington made him one of his aids, showing the high estimation in which he held the gallant Pole. Taking part in several great battles in the North, Kosciusko there proved his skill and courage, and was ordered to accompany Greene to the South when that general superseded Gates in 1781. Holding the position of chief engineer, he planned and directed all the besieging operations against Ninety-Six. In recognition of these valuable services, he received from Congress the rank of brigadier-general in the Continental army on the 13th of October, 1783. Serving to the end of the war, he shared with Lafayette the honor of being admitted into the Society of the Cincinnati. Returning to Poland in 1786, he entered the Polish army upon its reorganization in 1789, and fought valiantly in behalf of his oppressed country. Resigning his commission, he once more became an exile, when the Russians triumphed, and the second partition of Poland was agreed upon. Two years later, however, when the Poles determined to resume their struggle for freedom, Kosciusko returned, and in March, 1794, was proclaimed director and generalissimo. With courage, patience
  • 76. and skill, that justified the high esteem in which he had been held in America, he directed his followers while they waged the unequal strife. Successful at first, he broke the yoke of tyranny from the necks of his down-trodden countrymen, and for a few short weeks beheld his beloved country free. But with vastly augmented numbers the enemy once more invaded Poland; and in a desperate conflict Kosciusko, covered with wounds, was taken prisoner, and the subjugation of the whole province soon followed. He remained a prisoner for two years until the accession of Paul I. of Russia. In token of his admiration, Paul wished to present his own sword to Kosciusko; but the latter refused it, saying, “I have no more need of a sword, as I have no longer a country,” and would accept nothing but his release from captivity. He visited France and England, and in 1797 returned to the United States, from which country he received a pension, and was everywhere warmly welcomed. The following year he returned to France, when his countrymen in the French army presented him with the sword of John Sobieski. Purchasing a small estate, he devoted himself to agriculture. In 1806, when Napoleon planned the restoration of Poland, Kosciusko refused to join in the undertaking, because he was on his parole never to fight against Russia. He gave one more evidence before his death of his love of freedom and sincere devotion to her cause, by releasing from slavery all the serfs on his own estate in his native land. In 1816, he removed to Switzerland, where he died on the 15th of October, 1817, at Solothurn. The following year his remains were removed to Cracow, and buried beside Sobieski, and the people, in loving remembrance of his patriotic devotion, raised a mound above his grave one hundred and fifty feet high, the earth being brought from every great battle-field in Poland. This country paid its tribute of gratitude by erecting a monument to his memory at West Point on the Hudson.
  • 77. STEPHEN MOYLAN. Stephen Moylan, born in Ireland in 1734, received a good education in his native land, resided for a time in England, and then coming to America, travelled extensively, and finally became a merchant in Philadelphia. He was among the first to hasten to the camp at Cambridge in 1775, and was at once placed in the Commissariat Department. His face and manners attracting Washington, he was selected March 5, 1776, to be aide-de-camp, and on the 5th of June following, on recommendation of the commander-in-chief, he was made quartermaster-general. Finding himself unable to discharge his duties satisfactorily, he soon after resigned to enter the ranks as a volunteer. In 1777 he commanded a company of dragoons, was in the action at Germantown, and wintered with the army at Valley Forge in 1777 and 1778. With Wayne, Moylan joined the expedition to Bull’s Ferry in 1780, and was with Greene in the South in 1781. He served to the close of the war, being made brigadier-general by brevet the 3d of November, 1783. After the disbanding of the army, he resumed business in Philadelphia, where he died on the 11th of April, 1811, holding for several years prior to his decease the office of United States commissioner of loans.
  • 78. SAMUEL ELBERT. Samuel Elbert, born in Prince William parish, South Carolina, in 1743, was left an orphan at an early age, and going to Savannah, engaged in commercial pursuits. In June, 1774, he was elected captain of a company of grenadiers, and later was a member of the local Committee of Safety. In February, 1776, he entered the Continental army as lieutenant-colonel of Lachlan McIntosh’s brigade, and was promoted to colonel during the ensuing September. In May of the year following, he was intrusted with the command of an expedition against the British in East Florida, and captured Fort Oglethorpe in that State in April of 1778. Ordered to Georgia, he behaved with great gallantry when an attack was made on Savannah by Col. Archibald Campbell in December of the same year. In 1779, after distinguishing himself at Brier Creek, he was taken prisoner, and when exchanged joined the army under Washington, and was present at the surrender of Lord Cornwallis. On the 3d of November, 1783, Congress brevetted him brigadier-general, and in 1785 he was elected Governor of Georgia. In further acknowledgment of his services in her behalf, that State subsequently appointed him major-general of her militia, and named a county in his honor. He died in Savannah on the 2d of November, 1788.
  • 79. CHARLES COTESWORTH PINCKNEY. Charles Cotesworth Pinckney, born at Charleston, South Carolina, on the 25th of February, 1746, was educated in England. Having qualified himself for the legal profession, he returned to his native State and began the practice of law in 1770, soon gaining an enviable reputation and being appointed to offices of trust and great responsibility under the crown. The battle of Lexington, however, changed his whole career. With the first call to arms, Pinckney took the field, was given the rank of captain, June, 1775, and entered at once upon the recruiting service. Energetic and efficient, he gained promotion rapidly, taking part as colonel in the battle at Fort Sullivan. This victory securing peace to South Carolina for two years, he left that State to join the army under Washington, who, recognizing his ability, made him aide-de-camp and subsequently honored him with the most distinguished military and civil appointments. When his native State again became the theatre of action, Pinckney hastened to her defence, and once more took command of his regiment. In all the events that followed, he bore his full share, displaying fine military qualities and unwavering faith in the ultimate triumph of American arms. At length, after a most gallant resistance, overpowered by vastly superior numbers, and undermined by famine and disease, Charleston capitulated in May, 1780, and Pinckney became a prisoner-of-war and was not exchanged until 1782. On the 3d of November of the year following, he was promoted to be brigadier-general. Impoverished by the war, he returned to the practice of law upon the restoration of peace; and after declining a place on the Supreme Bench, and the secretaryship, first of War and then of
  • 80. State, he accepted the mission to France in 1796, urged to this step by the request of Washington and the conviction that it was his duty. Arriving in Paris, he met the intimation that peace might be secured with money by the since famous reply, “Not one cent for tribute, but millions for defence!” The war with France appearing inevitable, he was recalled and given a commission as major-general; peace being restored without an appeal to arms, he once more retired to the quiet of his home, spending the chief portion of his old age in the pursuits of science and the pleasures of rural life, though taking part when occasion demanded in public affairs. He died in Charleston on the 16th of August, 1825, in the eightieth year of his age.
  • 81. WILLIAM RUSSELL. William Russell, born in Culpeper County, Virginia, in 1758, removed in early boyhood with his father to the western frontier of that State. When only fifteen years of age, he joined the party led by Daniel Boone, to form a settlement on the Cumberland River. Driven back by the Indians, Boone persevered; but Russell hastened to enter the Continental army; and he received, young as he was, the appointment of lieutenant. After the battle of King’s Mountain in 1780, he was promoted to a captaincy, and ordered to join an expedition against the Cherokee Indians, with whom he succeeded in negotiating a treaty of peace. On the 3d of November, 1783, he received his commission as brigadier-general. At the close of the war Russell went to Kentucky and bore an active part in all the expeditions against the Indians, until the settlement of the country was accomplished. In 1789, he was a delegate to the Virginia Legislature that passed an act separating Kentucky from that State. After the organization of the Kentucky government Russell was annually returned to the Legislature until 1808, when he was appointed by President Madison colonel of the Seventh United States Infantry. In 1811, he succeeded Gen. William Henry Harrison in command of the frontier of Indiana, Illinois, and Missouri. In 1812, he planned and commanded an expedition against the Peoria Indians, and in 1823 was again sent to the Legislature. The following year he declined the nomination for governor, and died on the 3d of July, 1825, in Fayette County, Kentucky. Russell County of that State is named in his honor.
  • 83. FRANCIS MARION. Francis Marion, born at Winyah, near Georgetown, South Carolina, in 1732, was of Huguenot descent; his ancestors, fleeing from persecution in France, came to this country in 1690. Small in stature and slight in person, he possessed a power of endurance united with remarkable activity rarely surpassed. At the age of fifteen, yielding to a natural love of enterprise, he went to sea in a small schooner employed in the West India trade. Being shipwrecked, he endured such tortures from famine and thirst as to have prevented his ever wishing to go to sea again. After thirteen years spent in peaceful tilling of the soil, he took up arms in defence of his State against the Cherokee Indians. So signal a victory was gained by the whites at the town of Etchoee, June 7, 1761, that this tribe never again seriously molested the settlers. Returning to his home after this campaign, Marion resumed his quiet life until in 1775 he was elected a member of the Provincial Congress of South Carolina. This Congress solemnly pledged the “people of the State to the principles of the Revolution, authorized the seizing of arms and ammunition, stored in various magazines belonging to the crown, and passed a law for raising two regiments of infantry and a company of horse.” Marion resigned his seat in Congress, and applying for military duty, was appointed captain. He undertook the recruiting and drilling of troops, assisted at the capture of Fort Johnson, was promoted to the rank of major, and bore his full share in the memorable defence of Fort Moultrie on Sullivan’s Island, which saved Charleston and secured to South Carolina long exemption from the horrors of war. Little was done at the South for the next three years, when in 1779 the combined French and American forces attempted the capture of Savannah. Marion was in the hottest of the
  • 84. fight; but the attack was a failure, followed in 1780 by the loss of Charleston. Marion escaped being taken prisoner by an accident that placed him on sick leave just before the city was invested by the British. The South was now overrun by the enemy; cruel outrages were everywhere perpetrated; and the defeat of the Americans at Camden seemed to have quenched the hopes of even the most sanguine. Four days after the defeat of Gates, Marion began organizing and drilling a band of troopers subsequently known as “Marion’s Brigade.” Though too few in number to risk an open battle, they succeeded in so harassing the enemy that several expeditions were fitted out expressly to kill or capture Marion, who, because of the partisan warfare he waged and the tactics he employed, gained the sobriquet of the “Swamp Fox.” Again and again he surprised strong parties of the British at night, capturing large stores of ammunition and arms, and liberating many American prisoners. He was always signally active against the Tories, for he well knew their influence in depressing the spirit of liberty in the country. When Gates took command of the Southern army, he neither appreciated nor knew how to make the best use of Marion and his men. South Carolina, recognizing how much she owed to his unwearying efforts in her behalf, acknowledged her debt of gratitude by making him brigadier-general of her Provincial troops, after the defeat of Gates at Camden. Early in the year 1781, General Greene assumed command of the Southern army, and entertaining a high opinion of Marion, sent Lieutenant-Colonel Harry Lee, with his famous legion of light-horse, to aid him. Acting in concert and sometimes independently, these two noted leaders carried on the war vigorously wherever they went, capturing Forts Watson and Motte, defeating Major Frazier at Parker’s Ferry and joining Greene in time for the battle of Eutaw Springs. 
When the surrender of Cornwallis practically ended the war, Marion returned to his plantation in St. John’s parish and soon after was elected to the Senate of South Carolina. On the 26th of February, 1783, the following resolutions were unanimously adopted by that body:—
  • 85. “Resolved, That the thanks of this House be given Brigadier-General Marion in his place as a member of this House, for his eminent and conspicuous services to his country. “Resolved, That a gold medal be given to Brigadier-General Marion as a mark of public approbation for his great, glorious, and meritorious conduct.” In 1784, he was given command of Fort Johnson in Charleston Harbor, and shortly after, he married Mary Videau, a lady of Huguenot descent, who possessed considerable wealth and was a most estimable character. On the 27th of February, 1795, Francis Marion passed peacefully away, saying, “Thank God, I can lay my hand on my heart and say that since I came to man’s estate I have never intentionally done wrong to any.”
  • 86. THOMAS SUMTER. Thomas Sumter, born in Virginia in 1734, served in the French and Indian War, and afterward on the Western frontier. Establishing himself finally in South Carolina, he was appointed in March, 1776, lieutenant-colonel of the Second Regiment of South Carolina Riflemen, and sent to overawe the Tories and Loyalists in the interior of the State. The comparative immunity from war secured to South Carolina during the first years of the Revolution deprived Sumter of any opportunity for distinguishing himself until after the surrender of Charleston to the British in 1780. Taking refuge for a time in the swamps of the Santee, he made his way after a while to North Carolina, collected a small body of refugees, and presently returned to carry on a partisan warfare against the British. His fearlessness and impetuosity in battle gained for him the sobriquet of “the game-cock;” and with a small band of undisciplined militia, armed with ducking-guns, sabres made from old mill-saws ground to an edge, and hunting-knives fastened to poles for lances, he effectually checked the progress of the British regulars again and again, weakened their numbers, cut off their communications, and dispersed numerous bands of Tory militia. Like Marion, whenever the enemy threatened to prove too strong, Sumter and his followers would retreat to the swamps and mountain fastnesses, to emerge again when least expected, and at the right moment to take the British at a disadvantage. During one of many severe engagements with Tarleton, he was dangerously wounded and compelled for a time to withdraw from active service, but learning Greene’s need of troops, Sumter again took the field. After rendering valuable assistance toward clearing the South of the
  • 87. British, the failure of his health again forced him to seek rest and strength among the mountains, leaving his brigade to the command of Marion. When once more fitted for duty, the British were in Charleston, and the war was virtually at an end. Though Sumter’s military career ended with the disbanding of the army, his country still demanded his services. He represented South Carolina in Congress from 1789 to 1793, and from 1797 to 1801; he served in the United States Senate from 1801 to 1809, and was minister to Brazil from 1809 to 1811. He died at South Mount, near Camden, South Carolina, on the 1st of June, 1832, the last surviving general officer of the Revolution.
  • 88. ADDENDA. Prior to the adoption of the “federal Constitution,” partisan feeling ran high on this side of the Atlantic,—indeed, it was no unusual thing for a man to speak of the colony in which he was born as his country. When the struggle for American independence began, though men were willing to fight in defence of their own State, there was great difficulty in filling the ranks of the Continental army,—not only because of the longer time for which they were required to enlist, but also because once in the Continental service, they would be ordered to any part of the country. The same difficulty existed in respect to securing members for the Continental Congress. With the slowness of transportation and the uncertainty of the mails, it was no small sacrifice for a man to leave his home, his dear ones, and his local prestige, to become one of an unpopular body directing an unpopular war, for it was not until near the end of the struggle that the Revolution was espoused by the majority. It was under these circumstances, then, that three different kinds of troops composed the American army,—the Continentals, the Provincials, and the Militia. The first could be ordered to any point where they were most needed; the second, though regularly organized and disciplined, were only liable to duty in their own State; and the last were hastily gathered together and armed in the event of any pressing need or sudden emergency. Washington, as stated in his commission, was commander-in-chief of all the forces. The other subjects of the foregoing sketches were the commanding officers of the Continental army. Marion and Warren were famous generals of the Provincials; while Pickens and Ten Brock were noted leaders of the militia. Dr. Joseph Warren received his commission of major-general from the Massachusetts Assembly just before the battle of