Web Data Management Practices Emerging Techniques And Technologies Athena Vakali


Web Data Management
Practices:
Emerging Techniques
and Technologies
Athena Vakali
Aristotle University of Thessaloniki, Greece
George Pallis
Aristotle University of Thessaloniki, Greece
Hershey • London • Melbourne • Singapore
IDEA GROUP PUBLISHING

Acquisition Editor: Michelle Potter
Senior Managing Editor: Jennifer Neidig
Managing Editor: Sara Reed
Development Editor: Kristin Roth
Copy Editor: Shanelle Ramelb
Typesetter: Marko Primorac
Cover Design: Lisa Tosheff
Printed at: Integrated Book Technology
Published in the United States of America by
Idea Group Publishing (an imprint of Idea Group Inc.)
701 E. Chocolate Avenue
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: cust@idea-group.com
Web site: http://www.idea-group.com
and in the United Kingdom by
Idea Group Publishing (an imprint of Idea Group Inc.)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 3313
Web site: http://www.eurospan.co.uk
Copyright © 2007 by Idea Group Inc. All rights reserved. No part of this book may be reproduced in any
form or by any means, electronic or mechanical, including photocopying, without written permission from the
publisher.
Product or company names used in this book are for identification purposes only. Inclusion of the names of the
products or companies does not indicate a claim of ownership by IGI of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
Web data management practices : emerging techniques and technologies / Athena Vakali and George Pallis,
editors.
p. cm.
Summary: “This book provides an understanding of major issues, current practices and the main ideas in the
field of Web data management, helping readers to identify current and emerging issues, as well as future trends.
The most important aspects are discussed: Web data mining, content management on the Web, Web applications
and Web services”--Provided by publisher.
ISBN 1-59904-228-2 (hardcover) -- ISBN 1-59904-229-0 (softcover) -- ISBN 1-59904-230-4 (ebook)
1. Web databases. 2. Database management. 3. Data warehousing. 4. Web services. I. Vakali, Athena, 1963-
II. Pallis, George, 1978-
QA76.9.W43W4155 2006
005.74--dc22
2006019124
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book is new, previously-unpublished material. The views expressed in this book are
those of the authors, but not necessarily of the publisher.

Table of Contents
Foreword
Preface
Section I: Web Data Mining
Chapter I
Data Clustering: From Documents to the Web
Dušan Húsek, Academy of the Sciences of the Czech Republic,
Czech Republic
Jaroslav Pokorný, Charles University, Czech Republic
Hana Řezanková, University of Economics, Prague, Czech Republic
Václav Snášel, Technical University of Ostrava, Czech Republic
Chapter II
Clustering Web Information Sources
Athena Vakali, Aristotle University of Thessaloniki, Greece
George Pallis, Aristotle University of Thessaloniki, Greece
Lefteris Angelis, Aristotle University of Thessaloniki, Greece

Chapter III
An Overview of Similarity Measures for Clustering XML Documents
Giovanna Guerrini, Università degli Studi di Genova, Italy
Marco Mesiti, Università degli Studi di Milano, Italy
Ismael Sanz, Universitat Jaume I, Spain
Chapter IV
Mining Association Rules from XML Documents
Laura Irina Rusu, La Trobe University, Australia
Wenny Rahayu, La Trobe University, Australia
David Taniar, Monash University, Australia
Section II: Content Management on the Web
Chapter V
Dynamically Generated Web Content: Research and Technology Practices
Stavros Papastavrou, University of Cyprus, Cyprus
George Samaras, University of Cyprus, Cyprus
Paraskevas Evripidou, University of Cyprus, Cyprus
Panos K. Chrysanthis, University of Pittsburgh, USA
Chapter VI
Caching on the Web
Mehregan Mahdavi, The University of New South Wales, Australia
Boualem Benatallah, The University of New South Wales, Australia
Chapter VII
Information-Theoretic Methods for Prediction in the Wireless and Wired Web
Dimitrios Katsaros, Aristotle University of Thessaloniki, Greece
Section III: Web Information Integration and Applications
Chapter VIII
Designing and Mining Web Applications: A Conceptual Modeling Approach
Rosa Meo, Università di Torino, Italy
Maristella Matera, Politecnico di Milano, Italy
Chapter IX
Integrating Heterogeneous Data Sources in the Web
Angelo Brayner, University of Fortaleza, Brazil
Marcelo Meirelles, University of Fortaleza, Brazil
José de Aguiar Moraes Filho, University of Fortaleza, Brazil

Chapter X
E-Mail Mining: Emerging Techniques for E-Mail Management
Ioannis Katakis, Aristotle University of Thessaloniki, Greece
Grigorios Tsoumakas, Aristotle University of Thessaloniki, Greece
Ioannis Vlahavas, Aristotle University of Thessaloniki, Greece
Section IV: Web Services for Data Accessing
Chapter XI
Web Services: Technology Issues and Foundations
Bernd Amann, University Paris 6, France
Salima Benbernou, University Lyon 1, France
Benjamin Nguyen, University of Versailles, France
Chapter XII
Web Services Management: Toward Efficient Web Data Access
Farhana H. Zulkernine, Queen’s University, Canada
Pat Martin, Queen’s University, Canada
About the Authors
Index

Foreword
The Web revolution currently under way is making the Internet more and more central to our
everyday lives. The amount of information available on the Web is increasing at an enormous
rate, and almost every day new services become available. The Internet is now widely used
not only for B2B (business-to-business) and B2C (business-to-consumer) communications,
but also to carry on our everyday activities, such as making a phone call, booking seats at
a theatre, searching for information, and so on.
The richness of the Web is fascinating: It is an infinite repository of up-to-date information
and services able to fulfill everybody’s needs. However, all of us have experienced that
accessing Web information and services can be very frustrating as it is very easy to get lost
in an apparently chaotic environment. Additionally, interacting with the Web is very often
a time-consuming activity resulting in an increasingly frustrating user experience. So, how
can access to such a heterogeneous and highly dynamic environment be managed best? The
explosive growth of the Web calls for new models, techniques, and technologies for Web
data management upon which efficient and effective services can be built.
The current volume is a major contribution in the field of Web data management. It presents
many of the most relevant current developments and research results in a coherent and self-
contained manner. What I really like about this book is its ability to balance theoretical and
practical aspects. Each chapter contains case studies or real-world examples that enable
the reader to understand and evaluate the illustrated techniques immediately. The book
provides exhaustive coverage of the most important fundamental issues related to Web data
management, such as Web data models and integration techniques, and Web data clustering.
It also provides interesting insights on the most relevant applications, such as Web services
and e-mail systems. Finally, the most crucial issues related to efficiency are covered, too,
such as caching and prefetching. The variety of topics found in this book makes it a valuable
reference for all professionals, researchers, and students who are interested in Web
data-management issues.
Elena Ferrari
University of Insubria at Como, Italy
Preface
The explosive growth of the Web has dramatically changed the way in which information
is managed and accessed. In particular, the Web has evolved rapidly from a
simple information-sharing environment (offering only static text and images) to a rich
framework of dynamic and interactive services (such as video and audio conferencing,
e-commerce, and distance learning). This enormous growth and diversity in terms of access
devices, bandwidth, information sources, and content has complicated Web data-management
frameworks and practices.
The Web is actually a distributed global information resource containing a large spectrum of
applications in which users interact with (or within) companies, organizations, governmental
agencies, and educational or collaborative environments. The popularity of the Web originates
from its potential to readily deliver dynamic, distributed, heterogeneous, and unstructured
data all over the world. In this context, the Web is evolving at an alarming rate and is
becoming increasingly chaotic without any specific, consistent organization. Therefore,
various Web data-management techniques and mechanisms have become necessary
for providing information (that is actually useful to users) and improving information
circulation and dissemination over the Web. Furthermore, new tools and techniques are
needed to effectively manage these data since managing Web data with conventional tools
is becoming almost impossible.
Efficient and effective Web data-management practices may form the basis for developing
intelligent, personalized, and business-optimal Web services. Such enabling practices include
Web data mining, scalable data warehousing and preprocessing, sequence discovery,
real-time processing, user and document clustering and classification, user modeling, and
evaluation models. These practices may provide valuable information about user preferences
and expectations, together with usage, content, and structural patterns as practiced over the
Web.
As the demand for data and information management increases, there is also a critical need
for effectively managing Web content. Specifically, Web data mining has a major effect on
the performance of Web data accessing and querying. Therefore, new implementations (such
as Web data clustering, Web data caching, and Web services) have emerged to manage the
(continuously growing) number of documents, their dynamic content, and services under
quality-of-service (QoS) guarantees. The term QoS refers to certain technical characteristics,
such as performance, scalability, reliability, and speed. So, current Web data-management
issues are enforced with specific capabilities to design new Web applications and improve
Web data searching and workload balancing.
In this framework, the chapters of this book provide an overview of current research and
development activities in the area of Web data management. Following our call for chapters
in 2005, we received 25 chapter proposals. All chapters underwent a rigorous, double-
blind refereeing process before final acceptance. Eventually, 12 chapters were accepted for
inclusion in this book. This book brought together academic and industrial researchers and
practitioners from many different countries, including Australia, Brazil, Canada, Cyprus,
Czech Republic, France, Greece, Italy, Spain, and the USA. The authors’ research and industrial
experience are reflected in their work and will certainly offer readers in-depth knowledge
of their areas of expertise.
Organization of This Book
The broad range of topics of the present book makes it an excellent reference on Web
data management. The book is organized so that it could cover a wide range of audiences
including undergraduate university students, postgraduate students, research engineers, and
system developers. Because each chapter is self-contained, readers can focus on the topics
that most interest them. Most of the chapters (if not all) in this book have great practical
utility. Each chapter is accompanied by examples or case studies to show the applicability
of the described techniques or methodologies. Each chapter opens with an abstract, which
is the summary of the chapter, and an introduction, and then closes with a conclusion in
order to give readers adequate background and knowledge for understanding the subject.
All chapters also include an extensive list of references to allow readers to understand the
subject thoroughly by not only studying the topic in depth, but also by referring to other
works related to their topic of interest.
Overall, this book includes work on some of the most current topics, such as Web modeling,
clustering over Web data, XML (extensible markup language) data management, Web
services, and Web data prefetching and caching. A short outline of the material presented
in each of the chapters of this book follows to help readers find the chapters that match
their interests.
The book is divided into four major sections:
I. Web Data Mining
II. Content Management on the Web
III. Web Information Integration and Applications
IV. Web Services for Data Accessing

Section I deals with the issues concerned with Web data clustering and association rules,
providing a pathway for current research and development activities in this area. This section
is further divided into four chapters:
Chapter I, entitled “Data Clustering: From Documents to the Web” by Dušan Húsek,
Jaroslav Pokorný, Hana Řezanková, and Václav Snášel, presents an overview of approaches
to clustering in the Web environment. Particularly, this chapter deals with clustering Web
search results, in which clustering search engines arrange the search results into groups around
a common theme. Furthermore, some considerations are given concerning the justification
of so many clustering algorithms and their application in the Web environment.
Chapter II, entitled “Clustering Web Information Sources” by Athena Vakali, George
Pallis, and Lefteris Angelis, focuses on the topic of clustering information over the Web in
an effort to provide an overview and survey on the theoretical background and the adopted
practices of the most popular emerging and challenging clustering research efforts. An up-
to-date survey of the existing clustering schemes is given to be of use for both researchers
and practitioners interested in the area of Web data mining.
Chapter III, entitled “An Overview of Similarity Measures for Clustering XML Documents”
by Giovanna Guerrini, Marco Mesiti, and Ismael Sanz, presents the most indicative research
efforts for clustering XML documents relying on contents, structures, and link-related
properties of XML documents. Specifically, the most relevant similarity measures are
evaluated, providing a systematic comparison of all the presented measures that allows one
to determine which measure applies in a particular context.
Chapter IV, entitled “Mining Association Rules from XML Documents” by Laura Irina
Rusu, Wenny Rahayu, and David Taniar, presents some of the existing mining techniques
for extracting association rules out of XML documents in the context of rapid changes in the
Web knowledge-discovery area. Specifically, it presents the latest discoveries in the area of
mining association rules from XML documents, both static and dynamic, in a well-structured
manner, with examples and explanations so the reader will be able to easily identify the
appropriate technique for his or her needs and replicate the algorithm in a development
environment. At the same time, this chapter includes research work with a high level of
usability, in which concepts and models are easy to apply in real situations without
requiring knowledge of any high-level mathematics concepts.
Section II presents interesting techniques for effectively managing the Web content. There
are three chapters in this section:
Chapter V, entitled “Dynamically Generated Web Content: Research and Technology
Practices” by Stavros Papastavrou, George Samaras, Paraskevas Evripidou, and Panos
Chrysanthis, deals with dynamic Web content technology, one of the
fastest-emerging research areas due to the exponential increase in information circulation
and dissemination over the Web. This chapter covers past and present research approaches,
practices, and available technologies that facilitate the extraction of information from Web
databases and its dissemination to Web users.
Chapter VI, entitled “Caching on the Web” by Mehregan Mahdavi and Boualem Benatallah,
studies Web caching techniques with a focus on dynamic content. Caching is a key technique
that addresses some of the performance issues in today’s Web-enabled applications.
Deploying dynamic data, especially in an emerging class of Web applications, called Web
portals, makes caching even more interesting. This chapter discusses the limitations of
caching in Web portals and studies a solution that addresses these limitations. The solution
is based on the collaboration between the portal and its providers.
Chapter VII, entitled “Information-Theoretic Methods for Prediction in the Wireless and
Wired Web” by Dimitrios Katsaros, presents information-theoretic techniques for discrete
sequence prediction. It surveys, classifies, and compares the state-of-the-art solutions,
suggesting routes for further research by discussing the critical issues and challenges of
prediction in wired and wireless networks.
Section III is about information integration and Web-based applications and includes three
chapters:
Chapter VIII, entitled “Designing and Mining Web Applications: A Conceptual Modeling
Approach” by Rosa Meo and Maristella Matera, presents the usage of a modeling language,
WebML, for the design of Web applications. It discusses the advantages of adopting
conceptual modeling for the design and maintenance of a data-intensive Web application.
Furthermore, it presents a case study on the analysis of conceptual logs that demonstrates
the effectiveness of WebML and its conceptual modeling methods. The methodology of
the analysis of Web logs is based on the data-mining paradigm of item sets and frequent
patterns and makes full use of constraints on the conceptual logs’ content. Many interesting
patterns are obtained, such as recurrent navigation paths, the most frequently visited page
contents, and anomalies.
Chapter IX, entitled “Integrating Heterogeneous Data Sources in the Web” by Angelo
Brayner, Marcelo Meirelles, and José de Aguiar Moraes Filho, describes an extension to
the XQuery language, called MXQuery, which supports queries over several data sources
and solves integration problems such as semantic heterogeneity and incomplete information. The
proposed language provides the necessary support for integrating a variable number of data
sources with different degrees of autonomy. Furthermore, this
chapter presents an architecture to process MXQuery queries over multiple heterogeneous
databases available on the Web.
Chapter X, entitled “E-Mail Mining: Emerging Techniques for E-Mail Management” by
Ioannis Katakis, Grigorios Tsoumakas, and Ioannis Vlahavas, deals with e-mail mining. In
particular, this chapter discusses how disciplines like machine learning and data mining can
contribute to the solution of the problem by constructing intelligent techniques that automate
e-mail managing tasks, and what advantages they hold over other conventional solutions.
It also discusses the particularity of e-mail data and what special treatment e-mail requires.
Some interesting e-mail mining applications like mail categorization, summarization,
automatic answering, and spam filtering are also presented.
Section IV presents insights and perspectives for Web services and contains two chapters:
Chapter XI, entitled “Web Services: Technology Issues and Foundations” by Bernd Amann,
Salima Benbernou, and Benjamin Nguyen, introduces the concept of service-oriented
computing (SOC) on the Web and the current standards enabling the definition and publication
of Web services. Moreover, this chapter illustrates the complexity of the Web-service
composition problem and provides a representative overview of the existing approaches.
The chapter concludes with a short presentation of two research projects exploiting and
extending the Web-service paradigm.
Chapter XII, entitled “Web Services Management: Toward Efficient Web Data Access” by
Farhana Zulkernine and Pat Martin, presents an overview and the state of the art of various
management approaches, models, and architectures for Web-services systems toward
achieving QoS in Web data access. Moreover, it discusses the importance of autonomic or
self-managing systems and provides an outline of the current research on autonomic Web
services.
What Makes This Book Different
Several research efforts have already appeared in the area of Web data management, and
this field seems to be of high importance for a wide academic and technical group due to
the difficulties raised by the diversity of Web data structure and representation, information
distribution, and communication and accessing costs. However, a dedicated book on
important issues in Web data-management systems is still difficult to find. Most books are
about either Web technology focusing on developing Web warehouses, or very specific areas
such as Web modeling, Web mining, and Web replication.
This book provides a complete overview on important aspects in the Web data-management
practice in order to be used either as a class textbook or as a complementary course text
in a Web data-management course; in that case, its level is suitable for undergraduate- or
graduate-level courses.
This book is, therefore, different in that it covers an extensive range of topics, including related
issues about Web modeling, Web mining, Web caching and replication, Web semantics, and
the XML standard. Furthermore, the main advantage of this book is the integration of both
theoretical and practical aspects in the Web data-management research area.
Intended Audience
Web Data Management Practices: Emerging Techniques and Technologies is intended for
academic institutions and for working professionals, and for technical and non-technical
readers. The broad range of topics in this book makes it a pathway for current research
and development activities in the area of Web data management. The book is organized so
that it could cover a wide range of audiences including undergraduate university students,
postgraduate students, research engineers, and system developers.
Computer science instructors could use this book to teach Web data-management issues to
senior undergraduate or postgraduate students. The chapters are organized such that they
provide a great deal of flexibility; emphasis can be given to different chapters depending on
the scope of the course and the instructor’s interests. Equivalently, computer-science students
could use it in the context of a course or as a supplementary book for their independent
study.
Computer-science researchers could also benefit from this book because it surveys a vast
content of recent research in the area of Web data management. The research coverage is
likely to benefit researchers and students from academia as well as industry. Moreover,
this book is also ideal for researchers from other computer-science disciplines who wish to
get acquainted with this area and integrate it with their own fields. The general computer
community will benefit from this book through its technical as well as practical overview
of the area.
Finally, the chapters in this book can be used by Web application developers as a reference
to use the correct techniques for modeling and designing Web services, as well as efficiently
handling a huge amount of Web information.
How to Read This Book
The book as a whole is meant for anyone professionally interested in Web data-management
techniques and who in some way wants to gain an understanding of how data-mining
applications are implemented on the Web. The organization of the book has been carefully
selected to help the reader. Each chapter may be studied separately or in conjunction with
other chapters. Thus, it is not mandatory to study the topics in their order of appearance. If
the reader wishes to perform an in-depth study of a particular subject, then he or she could
focus on the corresponding section.
Although how the book is read largely depends on the personal interests of the reader, two
possible paths are recommended. For readers who are interested in the Web applications and
implementations, it is recommended to read Sections II, III, and IV. For the reader who wants
to acquire a theoretical knowledge about Web data management issues, it is recommended
to read Sections I and II, and at least “scan” Section IV.
A Closing Remark
The authors have made significant efforts to provide high-quality chapters despite space
restrictions. The authors are well-known researchers in the area of Web data management,
and they have already offered significant contributions to the literature. We hope that the
reader will benefit from the works presented in this book.

Acknowledgments
The editors would like to acknowledge the help of all people involved in the collation
and review process of the book, without whose support this project could not have been
completed.
We thank all the chapter reviewers for their dedicated effort to review chapters in their
areas of expertise in a timely manner. Special thanks go to all the staff at Idea Group Inc.,
whose contributions throughout the whole process from inception of the initial idea to final
publication have been invaluable. In particular, the editors are thankful to Mehdi Khosrow-
Pour, senior academic editor of Idea Group Inc., for his support and encouragement in
publishing this book. The authors are grateful to Kristin Roth, development editor, for her
timely and professional cooperation, and for her decisive and kind support of this project.
In closing, we wish to thank all of the authors for their insights and excellent contributions
to this book. We also want to thank Professor Elena Ferrari for kindly accepting to write a
foreword for this book. Finally, we thank our families for their support and understanding
throughout this project.
We hope that the readers will find these chapters informative and enlightening. Comments
from readers will be greatly appreciated. Please contact us at avakali@csd.auth.gr and
gpallis@ccf.auth.gr.
Athena Vakali and George Pallis
Thessaloniki, Greece
June 2006

Special Thanks to Reviewers
This project depends on the efforts of our volunteers to help us deliver high-quality peer-
reviewed chapters. We thank the following reviewers for their gracious efforts.
• Lefteris Angelis, Aristotle University, Greece
• Nikos Bassiliades, Aristotle University, Greece
• Salima Benbernou, University of Lyon 1, France
• Barbara Catania, University of Genova, Italy
• Elena Ferrari, University of Insubria, Italy
• Fang Li, Shanghai Jiao Tong University, China
• Rosa Meo, University of Torino, Italy
• Marco Mesiti, University of Milano, Italy
• Giovanna Guerrini, University of Pisa, Italy
• Carlos Hurtado, University of Chile, Chile
• Dimitrios Katsaros, Aristotle University, Greece
• Vassiliki Koutsonikola, Aristotle University, Greece
• Jaroslav Pokorny, Charles University, Czech Republic
• Yucel Saygin, Sabanci University, Turkey

Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission
of Idea Group Inc. is prohibited.
Chapter I
Data Clustering:
From Documents to the Web
Dušan Húsek, Academy of the Sciences of the Czech Republic, Czech Republic
Jaroslav Pokorný, Charles University, Czech Republic
Hana Řezanková, University of Economics, Prague, Czech Republic
Václav Snášel, Technical University of Ostrava, Czech Republic
Abstract
This chapter provides a survey of some clustering methods relevant to clustering document
collections and, in consequence, Web data. We start with classical methods of cluster
analysis that seem to be relevant in approaching the clustering of Web data. Graph clustering
is also described since its methods contribute significantly to clustering Web data. The use
of artificial neural networks for clustering has the same motivation. Based on previously
presented material, the core of the chapter provides an overview of approaches to clustering
in the Web environment. Particularly, we focus on clustering Web search results, in which
clustering search engines arrange the search results into groups around a common theme.
We conclude with some general considerations concerning the justification of so many
clustering algorithms and their application in the Web environment.
Introduction
Document and information retrieval (IR) is an important task for Web communities. In this
chapter, we introduce some clustering methods and focus on their use for the clustering,
classification, and retrieval of Web documents.
The aim of clustering is either to create groups of similar objects or to create a hierarchy of
such groups (Jain & Dubes, 1988). Clustering is often confused with classification, but the
two techniques differ. In classification, the objects are assigned to
predefined classes, whereas in clustering, the classes themselves must also be found. We focus here
mainly on document clustering, in which the objects are texts, Web pages, phrases,
and so forth. Any clustering technique relies on four concepts:
1. the model of the data to be clustered,
2. the similarity measure,
3. the cluster model, and
4. the clustering algorithm that builds the clusters using the data model and the similarity
measure.
By a data model, we mean the common notion used in IR. For example, in the Boolean
model, text is represented by a set of significant terms; in the vector space model, documents
are modeled by vectors of term weights. The way objects are clustered is called the cluster
model. This approach is in accordance with Jain and Dubes (1988), in which objects are
called patterns and the following steps are considered:
• pattern representation (optionally including feature extraction and/or selection),
• the definition of a pattern proximity measure appropriate to the data domain,
• clustering or grouping,
• data abstraction (if needed), and
• the assessment of output (if needed).
The last two steps concern the application of clustering. Data abstraction influences the
description of clusters, for example, labels of folders when clustering with snippets in the
Web environment. A difficult task is the assessment of output, that is, evaluating the quality
of the clustering. Various statistical approaches are used in this context, while in IR we make
this evaluation using the usual measures such as precision and recall. In the past, clustering
was mainly addressed in exploratory data analysis. In consequence, most data clustering
methods come from statistics. The other application area is the fast retrieval of relevant
information from databases, particularly from huge text collections. In this chapter, we
will present clustering from this perspective. As texts become more and more multimedia
oriented, a lot of special clustering techniques can be applied in this context (e.g., image
clustering). Consider the Web or a set of Web search results as a text collection. Web pages
are modeled from various points of view. In a Web model, we can combine:
• textual information,
• the hyperlink structure,
• co-citation,
• metadata,
• pictures, and
• the HTML (HyperText Markup Language) or XML (eXtensible Markup Language)
structure of Web pages.
We can observe that the hyperlink structure, or the combination of data and metadata in XML
documents, for example, extends the usual assumptions about the texts to be clustered.
Consequently, new issues appear.
As different communities use clustering, the associated terminology varies widely. We will
freely take up the taxonomy presented in Jain, Murty, and Flynn (1999).
• Hierarchical vs. Flat: In the former case, a hierarchy of clusters is found and objects
can be assigned to a different number of clusters. The result of flat clustering is an
assignment of objects to a certain number of clusters determined before analysis. These
methods are sometimes divided into partitioning methods, in which classes are mutually
exclusive, and clumping methods, in which an overlap is allowed.
• Agglomerative vs. Divisive (Hierarchical Clustering): Agglomerative methods
start with each object in a separate group and then merge groups until all objects are in
a single group. Divisive methods start with all objects in a single group and split groups
until each object is in a separate group.
• Monothetic vs. Polythetic (Hierarchical Clustering): Monothetic methods use
single-feature-based assignment to divide objects into clusters. Polythetic algorithms
consider multiple-feature-based assignment.
• Hard vs. Fuzzy: In nonfuzzy or hard clustering, objects are divided into crisp clusters,
where each object belongs to exactly one cluster. In fuzzy clustering, the object can
belong to more than one cluster, and associated with each of the objects are membership
grades that indicate the degree to which the objects belong to the different clusters.
• Deterministic vs. Stochastic: Deterministic clustering methods, given a data set,
always arrive at the same clustering. Stochastic clustering methods employ stochastic
elements (e.g., random numbers) to find a good clustering.
• Incremental vs. Nonincremental: Nonincremental clustering methods mainly rely
on having the whole data set ready before applying the algorithm. For example,
hierarchical agglomerative clustering belongs to this class. Incremental clustering
algorithms work by assigning objects to their respective clusters as they arrive.
Besides flat and hierarchical methods, some authors (e.g., Han & Kamber, 2001; Mercer,
2003) distinguish three to four further categories. These are density-based approaches,
grid-based approaches, model-based approaches, and also hybrid approaches, which combine
the approaches mentioned.
This chapter provides a survey of some clustering methods relevant to clustering document
collections. We start with classical methods of cluster analysis. In general, the choice of
methods has been influenced by recent progress in clustering Web data. Graph clustering
contributes to this issue. The section about artificial neural
networks (ANNs) is built with the same motivation. Based on previously presented material,
the chapter then tries to provide an overview of approaches to clustering in the Web
environment. Particularly, we focus on clustering Web search results, in which clustering
search engines arrange the search results into groups around a common theme. Finally, the
chapter concludes with some considerations about clustering in the Web environment.
Methods of Cluster Analysis
The following terms and notation will be used throughout this chapter.
• An object (or pattern, feature vector, or Web page) $\mathbf{x}$ is a single data item used by the
clustering algorithm. It typically consists of a vector of $p$ components, $\mathbf{x} = (x_1, \ldots, x_p)$.
• The individual scalar components $x_i$ of an object $\mathbf{x}$ are called features (or attributes or
values of variables).
• $p$ is the dimensionality of the objects or of the feature space.
• An object set will be denoted by $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. The $i$th object in $X$ will be denoted by
$\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})$. In many cases, an object set to be clustered is viewed as an $n \times p$ object
matrix.
Clustering is a division of the object set into subsets (groups) of similar objects. Each group,
called a cluster, consists of objects that are similar between themselves and dissimilar to
objects of other groups.
Clustering can be realized by means of such techniques as multivariate statistical methods,
neural networks, genetic algorithms, and formal concept analysis. In the terminology of
machine learning, we can talk about unsupervised learning. Statistical methods for clustering
can be classified into groups like cluster analysis, multidimensional scaling (Gordon,
1999; Jain & Dubes, 1988), factor analysis, and correspondence analysis.
The following tasks should be solved in connection with the clustering of documents: the
clustering of large data sets, clustering in high-dimensional spaces, a sparse matrix approach,
and outlier detection and handling.

We can start with methods for the vector space model (VSM; Salton & Buckley, 1988),
which represents a document as a vector of the terms that appear in the document set. Each
feature vector contains term weights of the terms appearing in that document. The term
weighting scheme is usually based on the tf×idf method in IR.
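The weighting described above can be sketched in a few lines. This is a minimal illustration only: the chapter does not fix a particular tf×idf variant, so the formula used here (raw term frequency times log(n / document frequency)), the function name `tfidf_vectors`, and the toy documents are all assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, tf-idf vectors) for a list of tokenized documents.

    Assumed weighting: tf * log(n / df), one common variant among several.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

# Hypothetical toy collection of three tokenized documents.
docs = [["web", "data", "clustering"], ["web", "search", "search"], ["data", "mining"]]
vocab, vectors = tfidf_vectors(docs)
for doc_vector in vectors:
    print([round(w, 3) for w in doc_vector])
```

Each row printed is one document's feature vector over the shared vocabulary; stacking the rows gives the document-by-term matrix used by the methods that follow.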
A collection of documents can be represented by a term-document matrix. Similarity between
documents is measured using one of several similarity measures that are based on the rela-
tions of feature vectors, for example, by the cosine of feature vectors or, equivalently, by a
distance measure (generally, we will use the term proximity measure). We can consider both
documents (Web pages) and terms (topics) as objects of clustering. In the latter case, the
searching of clusters is very close to the reduction of dimensionality. For example, factor
analysis can be used both for the reduction of dimensionality and for clustering (Hartigan,
1975).
We can mention the following basic requirements for clustering techniques for large data
sets (Han & Kamber, 2001): scalability (clustering techniques must be scalable, both in
terms of computing time and memory requirements), the independence of the order of input
(i.e., the order of objects that enter into analysis), and the ability to evaluate the validity of
the produced clusters. The user usually wants to have a robust clustering technique that is
strong in the following areas: dimensionality (the distance between two objects must be
distinguishable in a high-dimensional space), noise and outliers (an algorithm must be able
to detect noise and outliers and eliminate their negative effects), statistical distribution,
cluster shape, cluster size, cluster density, and cluster separation (an algorithm must be able
to detect overlapping clusters).
Particular attention is paid to the problem of high-dimensional data. Clustering algorithms
based on proximity measures work effectively for dimensions below 16. Therefore, Berkhin
(2002) claims that data with more than 16 attributes are high-dimensional. Two general
techniques are used in the case of high dimensionality: attribute transformation and domain
decomposition.
In the former case, for certain types of data, aggregated attributes can be used. If this is
not possible, principal component analysis (PCA) can be applied. However, this approach is
problematic since it leads to clusters with poor interpretability. In the area of IR, the singular
value decomposition (SVD) technique is used to reduce dimensionality. Domain decomposition,
in turn, divides the data into subsets (canopies) using some inexpensive similarity
measure. The dimension stays the same, but the costs are reduced. Some algorithms were
designed for subspace clustering, for example, CLIQUE (CLustering In QUEst) and MAFIA
(Merging of Adaptive Finite Intervals, And more than a CLIQUE).
For large data sets, hybrid methods, which combine different techniques, are often suggested.
In the past 10 years, new approaches to clustering large data sets were suggested and some
surveys of clustering methods were prepared, for example, in Berkhin (2002), Jain et al.
(1999), Mercer (2003), and Řezanková, Húsek, and Snášel (2004).
Several approaches are used for clustering large data sets by means of traditional methods
of cluster analysis. In the first approach, only the objects of a sample (either random or
representative) are clustered into the desired number of clusters. The other objects are then
assigned to these clusters. In the second approach, the data set is
divided into blocks (their size is determined by the capability of the software product used),
and in each block, objects are clustered. As a result, we obtain centroids, which characterize
the created clusters (a centroid is a vector of average values of object features computed on the
basis of the objects assigned to the cluster). At the final stage, the centroids are clustered to
obtain the desired number of clusters. The centroids can also be obtained in other ways,
for example, by incremental clustering.
To make searching for document clusters easier, we can find groups of similar terms (topics). We
can repeat the clustering of terms and documents to reveal interesting co-occurrences.
We can also find second-order co-occurrences of documents.
In the following text, we will focus only on the clustering of documents (Web pages) and
subspace clustering. When clusters of documents (Web pages) are found, each cluster can
be characterized in a certain way, for example, by a centroid or medoid (an object of the
cluster that was chosen as a representative). In the process of IR, we can calculate similar-
ity coefficients between the query and the centroid or medoid, and search for clusters of
documents that best correspond to the query. This way of calculation is less time consuming
for searching documents with high similarity than the calculation of similarity coefficients
between the query and individual documents.
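The retrieval shortcut described above can be sketched as follows: the query is compared with one representative per cluster rather than with every document. The helper names (`cosine`, `centroid`, `best_cluster`) and the toy vectors are illustrative assumptions, and the cosine measure is only one possible choice of similarity coefficient.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Vector of average feature values of the objects in one cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def best_cluster(query, clusters):
    """Index of the cluster whose centroid is most similar to the query."""
    centroids = [centroid(c) for c in clusters]
    return max(range(len(clusters)), key=lambda v: cosine(query, centroids[v]))

# Hypothetical clusters of document vectors and a query vector.
clusters = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    [[0.0, 1.0, 1.0], [0.0, 0.8, 1.2]],
]
query = [0.0, 1.0, 0.5]
print(best_cluster(query, clusters))
```

With k clusters, only k similarity computations are needed to pick the candidate cluster, after which individual documents inside it can be ranked if required.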
Dissimilarity and Similarity Measures
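A dissimilarity (or distance measure) between objects $\mathbf{x}$ and $\mathbf{y}$ is represented by
the function $d(\mathbf{x}, \mathbf{y}) : X \times X \to \mathbb{R}$, which satisfies the following conditions:
\[ d(\mathbf{x}, \mathbf{x}) = 0, \]
\[ d(\mathbf{x}, \mathbf{y}) \geq 0, \]
\[ d(\mathbf{x}, \mathbf{y}) = d(\mathbf{y}, \mathbf{x}). \]
For a distance, we also require the triangle inequality to be satisfied; that is, for any objects $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$:
\[ d(\mathbf{x}, \mathbf{z}) \leq d(\mathbf{x}, \mathbf{y}) + d(\mathbf{y}, \mathbf{z}). \]
A similarity $s(\mathbf{x}, \mathbf{y})$ between objects $\mathbf{x}$ and $\mathbf{y}$ is the function $s(\mathbf{x}, \mathbf{y}) : X \times X \to \mathbb{R}$, which satisfies
the following conditions:
\[ s(\mathbf{x}, \mathbf{x}) = 1, \]
\[ s(\mathbf{x}, \mathbf{y}) \geq 0, \]
\[ s(\mathbf{x}, \mathbf{y}) = s(\mathbf{y}, \mathbf{x}). \]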
Both dissimilarity and similarity functions are often defined by a matrix.
Some clustering algorithms operate on a dissimilarity matrix (they are called distance-space
methods in Mercer, 2003). How the dissimilarity between two objects is computed depends
on the type of the original objects.

Here are some of the most frequently used dissimilarity measures for continuous data:
• Minkowski $L_q$ distance (for $q \geq 1$)
\[ d(\mathbf{x}_i, \mathbf{x}_j) = \left( \sum_{l=1}^{p} |x_{il} - x_{jl}|^q \right)^{1/q} \]
• City-block (or Manhattan) distance ($L_1$)
\[ d(\mathbf{x}_i, \mathbf{x}_j) = \sum_{l=1}^{p} |x_{il} - x_{jl}| \]
• Euclidean distance ($L_2$)
\[ d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{l=1}^{p} (x_{il} - x_{jl})^2} \]
• Chebychev distance metric (or maximum or $L_\infty$)
\[ d(\mathbf{x}_i, \mathbf{x}_j) = \max_{l = 1, \ldots, p} |x_{il} - x_{jl}| \]
In the case of Chebychev distance, the variables with the largest dispersion will have the
largest impact on the clustering. If all variables are considered equally important, the data need
to be standardized first. If continuous measurements are on an unknown scale (continuous
ordinal variables), each value $x_{il}$ must be replaced by its rank, and the rank scale must be
transformed to the interval $[0, 1]$. Then dissimilarities, as for interval-scaled variables, can be used.
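The four distance measures listed above translate directly into code. The following is a minimal sketch for equal-length numeric feature vectors; the function names and the sample vectors are illustrative choices, not part of the chapter.

```python
def minkowski(x, y, q):
    """Minkowski L_q distance between two equal-length feature vectors (q >= 1)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def city_block(x, y):
    """City-block (Manhattan, L_1) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Euclidean (L_2) distance: square root of the sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def chebychev(x, y):
    """Chebychev (maximum, L_infinity) distance: largest absolute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical two-dimensional objects.
x_i, x_j = (0.0, 0.0), (3.0, 4.0)
print(city_block(x_i, x_j), euclidean(x_i, x_j), chebychev(x_i, x_j))
```

City-block and Euclidean distance are the special cases q = 1 and q = 2 of the Minkowski distance, and the Chebychev distance is its limit as q grows.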
A relation between two objects can also be expressed as a similarity (Berry & Browne, 1999). It
can be measured as a correlation between feature vectors. For interval-scaled data, the Pearson
correlation coefficient is used, but its values are from the interval $[-1, 1]$. A further possibility is a
cosine measure. The cosine of feature vectors is calculated according to the following formula:
\[ s(\mathbf{x}_i, \mathbf{x}_j) = \frac{\sum_{l=1}^{p} x_{il} x_{jl}}{\sqrt{\sum_{l=1}^{p} x_{il}^2 \sum_{l=1}^{p} x_{jl}^2}}. \]
Furthermore, we can use the Jaccard coefficient or Dice coefficient. The former can be
expressed as:
\[ s(\mathbf{x}_i, \mathbf{x}_j) = \frac{\sum_{l=1}^{p} x_{il} x_{jl}}{\sum_{l=1}^{p} x_{il}^2 + \sum_{l=1}^{p} x_{jl}^2 - \sum_{l=1}^{p} x_{il} x_{jl}} \]

and the latter as:
\[ s(\mathbf{x}_i, \mathbf{x}_j) = \frac{2 \times \sum_{l=1}^{p} x_{il} x_{jl}}{\sum_{l=1}^{p} x_{il}^2 + \sum_{l=1}^{p} x_{jl}^2}. \]
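The cosine, Jaccard, and Dice coefficients for continuous feature vectors share the same building block, the inner product. A minimal sketch follows; the function names and sample vectors are illustrative, and the code assumes neither vector is all zeros (otherwise the denominators vanish).

```python
import math

def dot(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine of the angle between the feature vectors."""
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

def jaccard(x, y):
    """Jaccard coefficient generalized to real-valued vectors."""
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))

def dice(x, y):
    """Dice coefficient generalized to real-valued vectors."""
    return 2 * dot(x, y) / (dot(x, x) + dot(y, y))

# Hypothetical term-weight vectors of two documents.
x_i, x_j = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]
print(cosine(x_i, x_j), jaccard(x_i, x_j), dice(x_i, x_j))
```

All three coefficients equal 1 when the two vectors are identical, in line with the condition s(x, x) = 1.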
Concerning binary variables, we distinguish symmetric ones (both categories are equally
important, e.g., male and female) and asymmetric ones (one category carries more importance
than the other). For document clustering, the latter has to be considered. Let us consider the
following contingency table:
\[
\begin{array}{c|cc}
\mathbf{x}_i \,/\, \mathbf{x}_j & 1 & 0 \\ \hline
1 & a & b \\
0 & c & d
\end{array}
\]
with frequencies a, b, c, and d. For asymmetric variables, we can use, for example, the Jac-
card coefficient or Dice coefficient. The former can be expressed as:
\[ s(\mathbf{x}_i, \mathbf{x}_j) = \frac{a}{a + b + c} \]
and the latter as:
\[ s(\mathbf{x}_i, \mathbf{x}_j) = \frac{2a}{2a + b + c}. \]
We can also use the cosine of feature vectors, that is, the Ochiai coefficient:
\[ s(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\frac{a}{a + b} \times \frac{a}{a + c}}. \]
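For binary vectors, the same coefficients can be computed from the contingency-table frequencies a, b, c, and d. The sketch below counts these frequencies and applies the three formulas; the function names and the sample vectors are illustrative, and the code assumes a + b + c > 0.

```python
import math

def contingency(x, y):
    """Return (a, b, c, d) for two binary vectors: a counts positions where both
    are 1, b where only x is 1, c where only y is 1, d where both are 0."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d

def jaccard_binary(x, y):
    a, b, c, _ = contingency(x, y)
    return a / (a + b + c)

def dice_binary(x, y):
    a, b, c, _ = contingency(x, y)
    return 2 * a / (2 * a + b + c)

def ochiai(x, y):
    a, b, c, _ = contingency(x, y)
    return a / math.sqrt((a + b) * (a + c))

# Hypothetical term-occurrence vectors of two documents.
x_i, x_j = [1, 1, 0, 1, 0], [1, 0, 0, 1, 1]
print(contingency(x_i, x_j), jaccard_binary(x_i, x_j), dice_binary(x_i, x_j), ochiai(x_i, x_j))
```

None of the three coefficients uses d, the number of joint absences, which is why they suit asymmetric variables such as term occurrence in documents.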
If the data set is a contingency table with frequencies of categories, we can use dissimilarity
measures based on the chi-square test of equality for two sets of frequencies:
\[ d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{l=1}^{p} \frac{(x_{il} - E(x_{il}))^2}{E(x_{il})} + \sum_{l=1}^{p} \frac{(x_{jl} - E(x_{jl}))^2}{E(x_{jl})}}, \]
where $E(x_{il})$ and $E(x_{jl})$ are expected values on the assumption of independence in the
contingency table:
\[ E(x_{il}) = \frac{\sum_{m=1}^{p} x_{im} \times (x_{il} + x_{jl})}{\sum_{m=1}^{p} x_{im} + \sum_{m=1}^{p} x_{jm}}. \]
We can also use phi-square between sets of frequencies: The chi-square statistic is divided
by the total number of cases, and the square root of this value is computed.
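A minimal sketch of the chi-square and phi-square measures for two frequency profiles follows. It rests on two assumptions that should be checked against the cited sources: that the chi-square dissimilarity is the square root of the chi-square statistic, and that every category has a nonzero combined frequency (so no expected value is zero). Function names and data are illustrative.

```python
import math

def expected(x, y):
    """Expected frequencies E(x_l) for profile x under independence from y."""
    total_x, total_y = sum(x), sum(y)
    return [total_x * (xl + yl) / (total_x + total_y) for xl, yl in zip(x, y)]

def chi_square_statistic(x, y):
    """Chi-square statistic for the two-row table formed by x and y."""
    ex, ey = expected(x, y), expected(y, x)
    return (sum((xl - e) ** 2 / e for xl, e in zip(x, ex))
            + sum((yl - e) ** 2 / e for yl, e in zip(y, ey)))

def chi_square_distance(x, y):
    # Assumption: the dissimilarity is the square root of the statistic.
    return math.sqrt(chi_square_statistic(x, y))

def phi_square_distance(x, y):
    """Chi-square statistic divided by the total number of cases, then square-rooted."""
    return math.sqrt(chi_square_statistic(x, y) / (sum(x) + sum(y)))

# Hypothetical category frequencies of two objects.
print(chi_square_distance([10, 20], [20, 10]), phi_square_distance([10, 20], [20, 10]))
```

Two profiles with the same proportions, such as [10, 20] and [20, 40], have distance 0 under both measures, because observed and expected frequencies coincide.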
There are many measures for clustering. We will also mention how the distance between clusters
can be measured. The log-likelihood distance between clusters $a$ and $b$ is:
\[ d(a, b) = z_a + z_b - z_{\langle a, b \rangle}, \]
where $\langle a, b \rangle$ denotes a cluster created by joining objects from clusters $a$ and $b$, and:
\[ z_v = -n_v \sum_{l=1}^{p} \frac{1}{2} \log\left(s_l^2 + s_{vl}^2\right), \]
where $n_v$ is the number of objects in the $v$th cluster, $p$ is the number of variables, $s_l^2$ is the
sample variance of the $l$th continuous variable, and $s_{vl}^2$ is the sample variance of the $l$th
continuous variable in the $v$th cluster. This measure can also be used for investigating the distance
between objects.
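The log-likelihood distance can be sketched as below for continuous variables. Two choices here are assumptions made for illustration: the variances are computed as population variances (`statistics.pvariance`) so that single-object clusters are allowed, and the overall variances are taken over the whole data set passed in as `data`.

```python
import math
from statistics import pvariance

def z(cluster, overall_var):
    """z_v = -n_v * sum_l 0.5 * log(s_l^2 + s_vl^2) for one cluster."""
    n_v = len(cluster)
    total = 0.0
    for l, s2_l in enumerate(overall_var):
        s2_vl = pvariance([obj[l] for obj in cluster])  # variance inside the cluster
        total += 0.5 * math.log(s2_l + s2_vl)
    return -n_v * total

def loglik_distance(a, b, data):
    """d(a, b) = z_a + z_b - z_<a,b>, where <a,b> is the union of a and b."""
    overall_var = [pvariance(column) for column in zip(*data)]
    return z(a, overall_var) + z(b, overall_var) - z(a + b, overall_var)

# Hypothetical one-dimensional clusters: a and b are close, c is far away.
a = [[0.0], [1.0]]
b = [[0.5], [1.5]]
c = [[10.0], [11.0]]
data = a + b + c
print(loglik_distance(a, b, data), loglik_distance(a, c, data))
```

Merging two clusters that overlap barely increases the within-cluster variance, so their distance is small; merging distant clusters inflates the variance of the union and yields a larger distance.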
Partitioning Algorithms
These methods divide the data set into k clusters, where the integer k needs to be specified
by the user. An initial classification is modified by moving objects from one group to another
if this will reduce the sum of the squares. The algorithm of k-means is very often described
in the literature. For large data sets, some algorithms are based on the PAM (partitioning
around medoids) algorithm. The algorithms k-means and k-medoids belong to methods of
hard clustering. However, we have to consider also the possibility of overlapping clusters.
One approach to solve this task is fuzzy clustering.
Partitioning around Medoids. The algorithm proceeds in two steps. First, for a given cluster
assignment, centrally located objects (medoids) are selected by minimizing the total distance
to other objects in the cluster. At the second step, each object is assigned to the closest medoid.
Object $\mathbf{x}_i$ is put into cluster $v$ when medoid $m_v$ is nearer than any other medoid $m_w$; that is:
\[ d(\mathbf{x}_i, m_v) \leq d(\mathbf{x}_i, m_w) \quad \text{for all } w = 1, 2, \ldots, k. \]
These two steps are repeated until assignments do not change.
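The two alternating steps described above can be sketched as follows. This is an illustration of that alternating scheme only, not the original swap-based PAM procedure of Kaufman and Rousseeuw; the function name, the initial medoids, the `max_iter` guard, and the toy data are assumptions.

```python
def k_medoids(points, initial_medoids, dist, max_iter=100):
    """Alternate between assigning objects to the closest medoid and re-selecting
    the most central object of each cluster. Medoids are indices into points."""
    medoids = list(initial_medoids)
    assign = []
    for _ in range(max_iter):
        # Assignment step: each object goes to the closest medoid.
        assign = [min(range(len(medoids)), key=lambda v: dist(p, points[medoids[v]]))
                  for p in points]
        # Update step: per cluster, pick the member with the smallest total
        # distance to the other members.
        new_medoids = []
        for v in range(len(medoids)):
            members = [i for i, a in enumerate(assign) if a == v]
            if not members:  # keep the old medoid if a cluster became empty
                new_medoids.append(medoids[v])
                continue
            new_medoids.append(min(
                members,
                key=lambda c: sum(dist(points[c], points[i]) for i in members)))
        if new_medoids == medoids:  # nothing changed: assignments are stable
            break
        medoids = new_medoids
    return medoids, assign

# Hypothetical one-dimensional data with two obvious groups.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
medoids, assign = k_medoids(points, [0, 1], lambda p, q: abs(p - q))
print(medoids, assign)
```

Because medoids are actual objects, the procedure needs only a dissimilarity function (or matrix), not coordinates, which is what distinguishes it from k-means.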
The PAM algorithm was extended to the CLARA (Clustering LARge Applications) method
(Kaufman & Rousseeuw, 1990). CLARA clusters a sample from the data set and then assigns
all objects in the data set to these clusters. The process is repeated several times, and
then the clustering with the smallest average distance is selected.
An improvement of the CLARA algorithm is CLARANS (Clustering Large Applications
based on a RANdomized Search; Ng & Han, 1994). It proceeds by searching a random
subset of the neighbours of a particular solution. Thus, the search for the best representation
is not confined to a local area of the data.
Fuzzy Cluster Analysis. The aim of these methods is to compute memberships $u_{iv}$ for each
object $\mathbf{x}_i$ and each cluster $v$. Memberships have to satisfy the following conditions (Gordon,
1999; Höppner, Klawon, Kruse, & Runkler, 2000):
\[ u_{iv} \geq 0 \quad \text{for all } i = 1, \ldots, n \text{ and all } v = 1, \ldots, k, \]
\[ \sum_{v=1}^{k} u_{iv} = 1 \quad \text{for all } i = 1, \ldots, n. \]
The memberships are defined through the minimization of function $f$:
\[ f = \sum_{v=1}^{k} \frac{\sum_{i,j=1}^{n} u_{iv}^2 u_{jv}^2 \, d(\mathbf{x}_i, \mathbf{x}_j)}{2 \times \sum_{j=1}^{n} u_{jv}^2}, \]
where dissimilarities $d(\mathbf{x}_i, \mathbf{x}_j)$ are known and memberships $u_{iv}$ and $u_{jv}$ are unknown.
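To make the objective concrete, the sketch below only evaluates f for a given membership matrix; it does not perform the minimization. The function name, the layout of `u` (one row per object, one column per cluster), and the toy dissimilarity matrix are assumptions.

```python
def fuzzy_objective(u, d):
    """Evaluate f for memberships u[i][v] and a dissimilarity matrix d[i][j]."""
    n, k = len(u), len(u[0])
    total = 0.0
    for v in range(k):
        numerator = sum(u[i][v] ** 2 * u[j][v] ** 2 * d[i][j]
                        for i in range(n) for j in range(n))
        denominator = 2.0 * sum(u[j][v] ** 2 for j in range(n))
        if denominator > 0:  # a cluster with all-zero memberships contributes nothing
            total += numerator / denominator
    return total

# Hypothetical data: objects 0,1 are close, objects 2,3 are close, pairs are far apart.
d = [[0, 1, 10, 10],
     [1, 0, 10, 10],
     [10, 10, 0, 1],
     [10, 10, 1, 0]]
crisp = [[1, 0], [1, 0], [0, 1], [0, 1]]    # memberships matching the two pairs
uniform = [[0.5, 0.5]] * 4                  # every object split evenly
print(fuzzy_objective(crisp, d), fuzzy_objective(uniform, d))
```

On this toy input, memberships that follow the two natural pairs give a smaller value of f than uniform memberships, which is the behaviour the minimization exploits.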
Hierarchical Algorithms
A hierarchical agglomerative algorithm starts with each object in a group of its own. Then
the algorithm merges clusters until only one large cluster remains, which is the whole data
set. The user must select variables, and choose dissimilarity or similarity measures and the
agglomerative procedure. At the first step, when each object represents its own cluster, the
dissimilarity $d(\mathbf{x}_i, \mathbf{x}_j)$ between objects $\mathbf{x}_i$ and $\mathbf{x}_j$ is defined by the chosen dissimilarity measure.
However, once several objects have been linked together, we need a linkage or amalgamation
rule to determine when two clusters are sufficiently similar to be linked together. Numerous
linkage rules have been proposed.
The distance between two different clusters can be determined by the distance of the two
closest objects in the clusters (single-linkage method), the greatest distance between two
objects in the clusters (complete-linkage method), or the average distance between all pairs
of objects in the two clusters (unweighted pair-group average method). Furthermore, this
distance can be determined by the weighted average distance between all pairs of objects
in the two clusters (the number of objects in a cluster is used as a weight), or the distance
between centroids (unweighted or weighted). Moreover, we can use the method that attempts
to minimize the sum of squares of differences of individual values from their average in the
cluster (Ward’s method).
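Several of the linkage rules above differ only in how the pairwise object distances between two clusters are aggregated. The sketch below covers the single-linkage, complete-linkage, unweighted pair-group average, and unweighted centroid rules with Euclidean distance; the function names and sample clusters are illustrative, and `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import product

def single_linkage(a, b):
    """Distance of the two closest objects, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Greatest distance between two objects, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(a, b))

def average_linkage(a, b):
    """Average distance over all pairs of objects from the two clusters."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid_linkage(a, b):
    """Distance between the cluster centroids (unweighted variant)."""
    centroid_a = [sum(column) / len(a) for column in zip(*a)]
    centroid_b = [sum(column) / len(b) for column in zip(*b)]
    return math.dist(centroid_a, centroid_b)

# Hypothetical one-dimensional clusters.
A = [(0.0,), (1.0,)]
B = [(3.0,), (5.0,)]
print(single_linkage(A, B), complete_linkage(A, B), average_linkage(A, B), centroid_linkage(A, B))
```

An agglomerative procedure would repeatedly merge the pair of clusters with the smallest value of the chosen rule, so the choice of rule determines the shape of the resulting hierarchy.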
The hierarchical approach is used in some algorithms proposed for clustering large data sets.
We can mention the BIRCH (balanced iterative reducing and clustering using hierarchies;
Zhang, Ramakrishnan, & Livny, 1996) method as an example. Objects in the data set are
arranged into subclusters, known as cluster features (CFs). These cluster features are then
clustered into k groups using a traditional hierarchical clustering procedure. A CF represents
a set of summary statistics on a subset of the data. The algorithm consists of two phases.
In the first one, an initial CF tree is built (a multilevel compression of the data that tries to
preserve the inherent clustering structure of the data). In the second one, an arbitrary
clustering algorithm is used to cluster the leaf nodes of the CF tree. The disadvantage of this
method is its sensitivity to the order of the objects.
Two-Way Joining Algorithm
Two-way joining is useful in (the relatively rare) circumstances when one expects that
both objects and variables (documents and features) will simultaneously contribute to the
uncovering of meaningful patterns of clusters. The difficulty with interpreting these results
may arise from the fact that the similarities between different clusters may pertain to (or be
caused by) somewhat different subsets of variables. Thus, the resulting structure (clusters)
is by nature not homogeneous. However, this method offers a powerful exploratory data
analysis tool (the detailed description of this method is in Hartigan, 1975).
We can explain the use of this method by a simple example. Let us suppose that we have
three variables. Two of them are categorical. We know only one value of the third variable
for each combination of categories of the categorical variables. This value is
0 or 1. We investigate the similarity of categories for each categorical variable on the basis
of values of the third variable. If values of the third variable are written into a cross-table,
where categories of one variable are situated in rows and categories of the second one in
columns, both row clusters and column clusters can be distinguished.
At each step of the algorithm, such pairs of rows or columns that are closest in a certain
distance measure are joined. The closest pair of rows (columns) makes a new row (column)
by using a certain linkage rule. This algorithm can be generalized to many-way tables.
Subspace Clustering
In high-dimensional spaces, clusters often lie in a subspace. To handle this situation, some
algorithms are suggested. Instead of the creation of reduced matrices based on new features
(obtained, for example, by linear combinations of original features), subspaces of the origi-
nal data space are investigated. The task is based on the original features, which have real
meaning while linear combinations of many dimensions may be sometimes hard to interpret.
Subspace clustering is based on a density-based approach. The aim is to find subsets of
features for which projections of the input data include high-density regions. The principle
is the partitioning of each dimension into the same number of equal-length intervals. The
clusters are unions of connected high-density units within a subspace.
CLIQUE, suggested for numerical variables by Agrawal, Gehrke, Gunopulos, and Raghavan
(1998), is a clustering algorithm that finds high-density regions by partitioning the data space
into cells (hyperrectangles) and finding the dense cells. Clusters are found by taking the
union of all high-density cells. For simplicity, clusters are described by expressing a cluster
as a DNF (disjunctive normal form) expression and then simplifying the expression.
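The grid idea behind this family of methods can be sketched for a single, fixed subspace. The code below is a simplification and not the CLIQUE algorithm itself: it does not search over subspaces or produce DNF descriptions. The function names, the common range [lo, hi) for all dimensions, and the density threshold `min_count` are assumptions.

```python
from collections import Counter, deque

def dense_units(points, n_intervals, lo, hi, min_count):
    """Partition each dimension of [lo, hi) into equal-length intervals and
    return the grid cells that contain at least min_count points."""
    width = (hi - lo) / n_intervals
    def cell(p):
        return tuple(min(int((x - lo) / width), n_intervals - 1) for x in p)
    counts = Counter(cell(p) for p in points)
    return {c for c, count in counts.items() if count >= min_count}

def connected_clusters(units):
    """Group dense cells that share a face into clusters (unions of cells)."""
    remaining = set(units)
    clusters = []
    while remaining:
        start = remaining.pop()
        component, queue = {start}, deque([start])
        while queue:
            c = queue.popleft()
            for dim in range(len(c)):
                for step in (-1, 1):
                    neighbour = c[:dim] + (c[dim] + step,) + c[dim + 1:]
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        component.add(neighbour)
                        queue.append(neighbour)
        clusters.append(component)
    return clusters

# Hypothetical two-dimensional points in [0, 10) x [0, 10).
points = [(0.5, 0.5), (1.0, 1.5), (2.5, 0.5), (3.0, 1.0),
          (9.0, 9.0), (9.5, 8.5), (5.0, 5.0)]
units = dense_units(points, n_intervals=5, lo=0.0, hi=10.0, min_count=2)
print(connected_clusters(units))
```

The isolated point at (5.0, 5.0) falls into a cell below the density threshold and is therefore left out of every cluster, which is how grid-based methods treat noise.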
MAFIA is a modification of CLIQUE that runs faster and finds better quality clusters. pMAFIA
is the parallel version. MAFIA was presented by Goil, Nagesh, and Choudhary (1999),
and Nagesh, Goil, and Choudhary (2001). The main modification is the use of an adaptive
grid. Initially, each dimension is partitioned into a fixed number of cells.
Moreover, we can mention the algorithm ENCLUS (entropy-based clustering) suggested
by Cheng, Fu, and Zhang (1999). In comparison with CLIQUE, it uses a different criterion
for subspace selection.
Graph Clustering
Networks arising from real life are concerned with relations between real objects and are
an important part of modern life. Important examples include links between Web pages,
citations of references in scientific papers, social networks of acquaintance or other
connections between individuals, electric power grids, and so forth. The word
network is usually used for what mathematicians and a few computer scientists call graphs
(Newman, 2003). A graph (network) is a set of items called nodes (vertices) with connections
between them called edges (links). The study of graph theory is one of the fundamental
pillars of discrete mathematics.
A social network is a set of people or groups of people with some pattern of contacts or
interactions between them. Social networks have been studied extensively since the beginning
of the 20th century, when sociologists realized the importance of understanding how
human society functions. The traditional way to analyze a graph is to look at its picture,
but for large networks, this is not useful. A new approach to examine properties of graphs
has been driven largely by the availability of computers and communication networks that
allow us to analyze data on a scale far larger than before (Guimerà, Danon, Díaz-Guilera,
Giralt, & Arenas, 2003; Newman, Balthrop, Forrest, & Williamson, 2004).
An interesting source of reliable data about personal connections between people is com-
munication records of certain kinds. For example, one could construct a network in which
each node represents an e-mail address and directed edges represent messages passing from
one address to another.
Complex networks such as the Web, social networks, or e-mail often do not have an engineered
architecture, but instead are self-organized by the actions of a large number of individuals.
From these local interactions, nontrivial global phenomena can emerge, for example,
small-world properties or a scale-free distribution of the degree (Newman et al., 2003). In
small-world networks, short paths between almost any two sites exist even though nodes
are highly clustered. Scale-free networks are characterized by a power-law distribution of a
node’s degree, defined as the number of its immediate neighbours, meaning that the structure and
dynamics of the network are strongly affected by nodes with a great number of connections.
It is reported in Ebel, Mielsch, and Bornholdt (2002) that networks composed of persons
connected by exchanged e-mails show both the characteristics of small-world networks and
scale-free networks.

The Web can be considered a graph where nodes are HTML pages and edges are hyperlinks
between these pages. This graph is called the Web graph. It has been the subject of a variety
of recent works aimed at understanding the structure of the Web (Huang & Lai, 2003).
A directed graph $G = (V, E)$ consists of a set of nodes, denoted as $V$, and a set of edges,
denoted as $E$. Each edge is an ordered pair of nodes $(u, v)$ representing a directed connection
from $u$ to $v$. The graph $G = (V, E)$ is often represented by the $|V| \times |V|$ adjacency matrix $W$,
where $w_{ij} = 1$ if $(v_i, v_j) \in E$ and $w_{ij} = 0$ in other cases. The out-degree of a node $u$
is the number of distinct edges $(u, v_1), \ldots, (u, v_k)$ (i.e., the number of links from $u$), and the
in-degree is the number of distinct edges $(v_1, u), \ldots, (v_k, u)$ (i.e., the number of links to $u$). A path
from node $u$ to node $v$ is a sequence of edges $(u, u_1), (u_1, u_2), \ldots, (u_k, v)$. One can follow such a
sequence of edges to walk through the graph from $u$ to $v$. Note that a path from $u$ to $v$ does
not imply a path from $v$ to $u$. The distance from $u$ to $v$ is one more than the smallest $k$ for
which such a path exists. If no path exists, the distance from $u$ to $v$ is defined to be infinity.
If $(u, v)$ is an edge, then the distance from $u$ to $v$ is 1.
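These definitions map directly onto a small adjacency-matrix sketch. Nodes are numbered 0 to n-1 here, and the distance is found by breadth-first search; the function names and the example graph are illustrative assumptions.

```python
import math
from collections import deque

def adjacency_matrix(n, edges):
    """Build W with w[i][j] = 1 if (i, j) is a directed edge and 0 otherwise."""
    w = [[0] * n for _ in range(n)]
    for i, j in edges:
        w[i][j] = 1
    return w

def out_degree(w, u):
    """Number of links from u (row sum)."""
    return sum(w[u])

def in_degree(w, u):
    """Number of links to u (column sum)."""
    return sum(row[u] for row in w)

def distance(w, u, v):
    """Number of edges on a shortest directed path from u to v (infinity if none)."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return dist[x]
        for y, linked in enumerate(w[x]):
            if linked and y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return math.inf

# Hypothetical graph with four pages and four hyperlinks.
w = adjacency_matrix(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
print(out_degree(w, 0), in_degree(w, 2), distance(w, 0, 3), distance(w, 3, 0))
```

In this example node 3 is reachable from node 0, but not the other way around, which illustrates that a path from u to v does not imply a path from v to u.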
Given a directed graph, a strongly connected component (strong component for brevity) of
this graph is a set of nodes such that for any pair of nodes u and v in the set, there is a path
from u to v. In general, a directed graph may have one or many strong components. The
strong components of a graph consist of disjoint sets of nodes. One focus of our studies will
be in understanding the distribution of the sizes of strong components on the Web graph.
An undirected graph consists of a set of nodes and a set of edges, each of which is an unor-
dered pair {u,v} of nodes. In our context, we say there is an edge between u and v if there
is a link between u and v, without regard to whether the link points from u to v or the other
way around. The degree deg(u) of a node u is the number of edges incident to u. A path is
defined as in directed graphs, except that now the existence of a path from u to v implies
a path from v to u. A component of an undirected graph is a set of nodes such that for any
pair of nodes u and v in the set, there is a path from u to v. We refer to the components of
the undirected graph obtained from a directed graph by ignoring the directions of its edges
as the weak components of the directed graph. Thus, two nodes on the Web may be in the
same weak component even though there is no directed path between them (consider, for
instance, a node u that points to two other nodes v and w; then v and w are in the same
weak component even though there may be no sequence of links leading from v to w or vice
versa). The interplay of strong and weak components on the (directed) Web graph turns out
to reveal some unexpected properties of the Web’s connectivity.
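The difference between strong and weak components can be checked on the three-node example from the text (u points to v and to w). The sketch below uses plain reachability searches, which is simple but slower than the linear-time algorithms normally used on large graphs; the function names are illustrative, and every edge endpoint is assumed to appear in `nodes`.

```python
def reachable(adj, start):
    """Set of nodes reachable from start by following edges in adj."""
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def strong_components(nodes, edges):
    """Sets of nodes that are mutually reachable along directed paths."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    reach = {u: reachable(adj, u) for u in nodes}
    components, assigned = [], set()
    for u in nodes:
        if u not in assigned:
            component = {v for v in reach[u] if u in reach[v]}
            components.append(component)
            assigned |= component
    return components

def weak_components(nodes, edges):
    """Components of the undirected graph obtained by ignoring edge directions."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    components, assigned = [], set()
    for u in nodes:
        if u not in assigned:
            component = reachable(adj, u)
            components.append(component)
            assigned |= component
    return components

nodes = ["u", "v", "w"]
edges = [("u", "v"), ("u", "w")]
print(strong_components(nodes, edges), weak_components(nodes, edges))
```

Here v and w end up in the same weak component although neither can reach the other, so each node forms a strong component of its own.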
Informally, we can say that two nodes are considered similar if there are many short paths
connecting them. By contrast, the shortest path distance does not necessarily decrease
when connections between nodes are added, and thus it does not capture the fact that
strongly connected nodes are at a smaller distance from each other than weakly connected
nodes.
The main findings about the Web structure are as follows.
• A power-law distribution of degrees (Kumar, Raghavan, Rajagopalan, & Tomkins,
1999): The in-degree and out-degree distributions of the nodes of the Web graph follow
a power law.
• A bow-tie shape (Broder et al., 2000): the Web’s macroscopic structure.
• The average path length between two Web pages: 16 (Broder et al., 2000) and 19
(Barabasi & Albert, 1999).
• Small-world phenomenon (Adamic, 1999): Six degrees of separation between any
two Web pages.
• Cybercommunities (Kumar et al., 1999): groups of individuals who share a common
interest, together with the most popular Web pages among them.
• Self-similarity structure (Dill, Kumar, McCurley, Rajagopalan, Sivakumar, & Tomkins,
2002): The Web shows a fractal structure in many different ways.
Link analysis plays an important role in understanding the Web’s structure. There are three well-known algorithms for ranking pages: HITS (hypertext induced topic selection), PageRank, and SALSA (stochastic approach for link-structure analysis; Schenker, Kandel, Bunke, & Last, 2005).
Schenker et al. (2005) describe exciting new opportunities for utilizing robust graph repre-
sentations of data with common machine-learning algorithms. Graphs can model additional
information that is often not present in commonly used data representations, such as vec-
tors. Through the use of graph distance, a relatively new approach for determining graph
similarity, the authors show how well-known algorithms, such as k-means clustering and
k-nearest-neighbours classification, can be easily extended to work with graphs instead of
vectors. This allows for the utilization of additional information found in graph representa-
tions, while at the same time employing well-known, proven algorithms.
Linear Algebra Background
Any m×n matrix A can be expressed as:
$$A = \sum_{t=1}^{r} \sigma_t(A)\, u(t)\, v(t)^T,$$
where r is the rank of A, $\sigma_1(A) \ge \sigma_2(A) \ge \dots \ge \sigma_r(A) > 0$ are its singular values, and $u(t) \in R^m$, $v(t) \in R^n$, $t = 1, \dots, r$, are its left and right singular vectors, respectively. The u(t) and the v(t) are orthonormal sets of vectors; namely, $u(i)^T u(j)$ is 1 if i = j, and 0 otherwise. We also remind the reader that:
• Frobenius norm: $\|A\|_F^2 = \sum_{i,j} A_{i,j}^2 = \sum_{i=1}^{r} \sigma_i^2(A)$
• 2-norm: $\|A\|_2 = \max_{x \in R^n : \|x\| = 1} \|Ax\| = \max_{x \in R^m : \|x\| = 1} \|x^T A\| = \sigma_1(A)$.
In matrix notation, SVD is defined as $A = U \Sigma V^T$, where U and V are orthogonal matrices (thus, $U^T U = I$ and $V^T V = I$, where $I = \{\delta_{ij}\}$ is the identity matrix and $\delta_{ij}$ is the Kronecker symbol) of dimensions m×r and n×r respectively, containing the left and right singular vectors of A. $\Sigma = \mathrm{diag}(\sigma_1(A), \dots, \sigma_r(A))$ is an r×r diagonal matrix containing the singular values of A. If we define $A_l = \sum_{t=1}^{l} \sigma_t(A)\, u(t)\, v(t)^T$, then $A_l$ is the best rank-l approximation to A with respect to the 2-norm and the Frobenius norm. Thus, for any matrix D of rank at most l, $\|A - A_l\|_2 \le \|A - D\|_2$ and $\|A - A_l\|_F \le \|A - D\|_F$. A matrix A has a good rank-l approximation if $A - A_l$ is small with respect to the 2-norm and the Frobenius norm. It is well known that $\|A - A_l\|_F^2 = \sum_{t=l+1}^{r} \sigma_t^2(A)$ and $\|A - A_l\|_2 = \sigma_{l+1}(A)$. From basic linear algebra, $A_l = U_l \Sigma_l V_l^T = A V_l V_l^T = U_l U_l^T A$, where $U_l$ and $V_l$ are submatrices of U and V containing only the top l left and right singular vectors of A respectively; for a detailed treatment of SVD, see Golub and Van Loan (1989).
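As a small worked check of these identities, consider a diagonal matrix whose singular value decomposition can be read off directly. The matrix is an illustrative choice and does not come from the text; $e_1$ and $e_2$ denote the standard basis vectors.

```latex
A = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
\sigma_1(A) = 3,\ \sigma_2(A) = 1,\ u(1) = v(1) = e_1,\ u(2) = v(2) = e_2

\|A\|_F^2 = 3^2 + 1^2 = 10 = \sigma_1^2(A) + \sigma_2^2(A), \qquad \|A\|_2 = 3 = \sigma_1(A)

A_1 = \sigma_1(A)\, u(1)\, v(1)^T = \begin{pmatrix} 3 & 0 \\ 0 & 0 \end{pmatrix}, \qquad
A - A_1 = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}

\|A - A_1\|_F^2 = 1 = \sigma_2^2(A), \qquad \|A - A_1\|_2 = 1 = \sigma_2(A)
```

Both error formulas for the best rank-1 approximation therefore hold in this case.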
Eigenvector Clustering of Graphs
Donath and Hoffman (1973) introduced the use of eigenvectors for the purpose of parti-
tioning an undirected graph in a balanced way. Since then, there has been a lot of work on
spectral approaches for graph partitioning. See Chung (1997) for an excellent overview of
the field. Shi and Malik (2000) showed that the eigenvectors of different matrices based on
the adjacency matrix of a graph are related to different kinds of balanced cuts in a graph.
Let W be the adjacency matrix of an undirected graph G = (V,E) with nodes 1,2,...,n, and let D be a diagonal matrix with $d_i = \deg(i)$. Let A and B be sets of nodes and let $E(A, B)$ be the set of edges $(u, v)$ with $u \in A$ and $v \in B$. Two subsets A and B of V, such that $A \cup B = V$ and $A \cap B = \emptyset$, define a cut in G, which we denote as (A, B).
The average association of a set A is:
$$|E(A, A)| \,/\, |A|.$$
The average cut of a set A is:
$$|E(A, V - A)| \,/\, |A| + |E(A, V - A)| \,/\, |V - A|.$$
The normalized cut of a set A is:
$$|E(A, V - A)| \,/\, |E(A, V)| + |E(A, V - A)| \,/\, |E(V - A, V)|.$$
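The three measures can be computed directly for a small graph. The sketch below assumes that |E(A, B)| counts ordered pairs (u, v) of adjacent nodes with u in A and v in B, so that an undirected edge inside A is counted twice and |E(A, V)| equals the sum of the degrees in A; this counting convention, the helper names, and the example graph are assumptions made for illustration.

```python
def count_pairs(edges, a, b):
    """|E(a, b)|: ordered pairs (u, v) of adjacent nodes with u in a and v in b."""
    total = 0
    for u, v in edges:
        # Each undirected edge contributes both of its orientations.
        total += (u in a and v in b) + (v in a and u in b)
    return total


def average_association(edges, a):
    return count_pairs(edges, a, a) / len(a)


def average_cut(edges, nodes, a):
    rest = nodes - a
    cut = count_pairs(edges, a, rest)
    return cut / len(a) + cut / len(rest)


def normalized_cut(edges, nodes, a):
    rest = nodes - a
    cut = count_pairs(edges, a, rest)
    return cut / count_pairs(edges, a, nodes) + cut / count_pairs(edges, rest, nodes)


# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
nodes = set(range(6))
a = {0, 1, 2}

assoc = average_association(edges, a)
avg_cut = average_cut(edges, nodes, a)
ncut = normalized_cut(edges, nodes, a)
```

Splitting the graph at the bridge gives a single crossing edge, so both cut measures are small while the average association of each triangle stays high.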

Then Shi and Malik (2000) show that:
• the eigenvector of the second largest eigenvalue of W is related to a set that maximizes the average association,
• the eigenvector of the second smallest eigenvalue of D − W (the Fiedler vector; the eigenvalue itself is known as the algebraic connectivity or Fiedler value; Fiedler, 1975) is related to a set that minimizes the average cut, and
• the eigenvector of the second smallest eigenvalue of the generalized eigenvector problem $(D - W)x = \lambda Dx$ gives an approximation of the smallest normalized cut.
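A rough sketch of the second result: the eigenvector of the second smallest eigenvalue of D − W can be approximated by power iteration on a shifted matrix and then split by sign. Power iteration is a stand-in chosen here for brevity, not the method used by Shi and Malik; the sketch assumes a connected graph whose second smallest eigenvalue is simple, and the function name and example graph are illustrative.

```python
def fiedler_vector(adjacency, iterations=500):
    """Approximate the eigenvector of the second smallest eigenvalue of D - W."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    # 2 * max degree bounds the largest eigenvalue of D - W, so the shifted
    # matrix shift*I - (D - W) has non-negative eigenvalues in reversed order.
    shift = 2 * max(degrees)

    x = [float(i) for i in range(n)]  # any non-constant starting vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [xi - mean for xi in x]   # stay orthogonal to the all-ones vector
        # y = (shift*I - D + W) x
        y = [
            (shift - degrees[i]) * x[i]
            + sum(adjacency[i][j] * x[j] for j in range(n))
            for i in range(n)
        ]
        norm = sum(yi * yi for yi in y) ** 0.5
        x = [yi / norm for yi in y]
    return x


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the edge between 2 and 3.
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
vector = fiedler_vector(adjacency)
side = {i for i, value in enumerate(vector) if value > 0}
```

The sign pattern of the vector separates the two triangles, which is the cut with a single crossing edge.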
These results hold for undirected graphs, but the Web graph is a directed graph. Thus, it
would be interesting to understand what the above relationships are for directed graphs,
that is, whether the eigenvectors of the corresponding matrices of a directed graph are also
related to balanced decompositions of the directed graph. It is possible that this would lead
to an interesting clustering of the Web graph or to a topic-specific subgraph. The first step
in this direction was taken by Gibson, Kleinberg, and Raghavan (1998). They used the
eigenvectors of the matrix $AA^T$ and the matrix $A^TA$, where A is the adjacency matrix of a topic-specific subgraph, to decompose topic-specific subgraphs. They show that the principal eigenvector and the top few nonprincipal eigenvectors decompose the topic graphs into multiple hyperlinked communities, that is, clusters of pages on the same subtopic (Henzinger, 2003). Lots of examples of eigenvector computations can be found in the survey paper of Langville and Meyer (2005).
Roughly speaking, spectral analysis yields a decomposition of a graph into high-order connected components (Fiedler, 1973, 1975). He, Ding, Zha, and Simon (2001) compared clustering based on the Fiedler vector with the k-means clustering method and found that the results of spectral partitioning are usually much better.
Connectivity Clustering of Graphs
Although there are numerous algorithms for cluster analysis in the literature, we briefly
review the approaches that are closely related to the structure of a graph.
Matula (1970, 1972, 1977) uses high connectivity in similarity graphs for cluster analysis, which is based on the cohesiveness function. The function assigns to every node and edge of a graph the maximum edge connectivity of any subgraph containing that element. The k-connected subgraphs of the graph are obtained by deleting all elements with cohesiveness less than k in the graph, where k is a constant value. With this approach, it is hard to determine suitable connectivity values in real clustering applications.
There are approaches using biconnected components (2-connected subgraphs). The work of Canutescu, Shelenkov, and Dunbrack (2003) introduces a new algorithm for protein structure prediction based on biconnected components. Henzinger (1997) presents fully dynamic algorithms for maintaining the biconnected components.

More recent work also addresses the clustering of a graph. The HCS (highly connected subgraphs) algorithm (Hartuv & Shamir, 2000) uses a similarity graph as the input data. The algorithm recursively partitions the current set of elements into two subsets. It then identifies as kernels the highly connected subgraphs, that is, subgraphs whose edge connectivity exceeds half the number of their nodes. A kernel is considered a cluster. Unfortunately, the result of the clustering is not uniquely determined.
The CLICK (CLuster Identification via Connectivity Kernels) algorithm (Sharan & Shamir, 2000) builds on a statistical model. It uses the same basic scheme as HCS to form kernels, and includes the following processing: singleton adoption, a recursive clustering process on the set of remaining singletons, and an iterative merging step.
CAST (cluster affinity search technique; Ben-Dor, Shamir, & Yakhini, 1999) uses a single parameter t and starts with a single object. Objects are added to or removed from the cluster if their affinities are higher or lower than t, respectively, until the process stabilizes.
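The add-and-remove loop of CAST can be sketched as follows. This is a simplified reading, not the published algorithm: affinity is taken here as the average similarity of an object to the other members of the open cluster, one object is added and one removed per round, and a step cap guards against oscillation. The function names, the cap, and the similarity matrix are assumptions for illustration.

```python
def average_affinity(similarity, item, cluster):
    """Mean similarity of item to the other members of cluster (non-empty)."""
    others = [j for j in cluster if j != item]
    return sum(similarity[item][j] for j in others) / len(others)


def cast(similarity, t, max_steps=1000):
    unassigned = set(range(len(similarity)))
    clusters = []
    while unassigned:
        cluster = {min(unassigned)}        # start with a single object
        for _ in range(max_steps):         # cap guards against oscillation
            changed = False
            # Add the outside object with the highest affinity if it reaches t.
            outside = unassigned - cluster
            if outside:
                best = max(outside, key=lambda i: average_affinity(similarity, i, cluster))
                if average_affinity(similarity, best, cluster) >= t:
                    cluster.add(best)
                    changed = True
            # Remove the member with the lowest affinity if it falls below t.
            if len(cluster) > 1:
                worst = min(cluster, key=lambda i: average_affinity(similarity, i, cluster))
                if average_affinity(similarity, worst, cluster) < t:
                    cluster.remove(worst)
                    changed = True
            if not changed:                # the cluster has stabilized
                break
        clusters.append(cluster)
        unassigned -= cluster
    return clusters


# Objects 0 and 1 are similar, as are 2 and 3; cross similarities are low.
similarity = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
clusters = cast(similarity, t=0.5)
```

With t = 0.5 the two similar pairs end up as separate clusters; raising t above 0.9 would leave every object in its own cluster.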
In Huang and Lai (2003), definitions of homogeneity and separation to measure the quality
of a graph clustering are introduced.
In Newman (2006), Newman’s Q function is used for graph embedding into Euclidean space. This representation is used for fast geometric clustering.
Combined Methods
Cai, He, Li, Ma, and Wen (2004) described a method to organize Web image search results.
Based on the Web context, they proposed three representations for Web images, that is,
representation based on a visual feature, representation based on a textual feature, and rep-
resentation induced from the image link graph. Spectral techniques were applied to cluster
the search results into different semantic categories. They show that the combination of
textual-feature-based representation and graph-based representation actually reflects the
semantic relationships between Web images.
In Lian, Cheung, Mamoulis, and Yiu (2004), the algorithm S-GRACE (s-graph-based
clustering algorithm for query performance enhancement) is presented. S-GRACE is a
hierarchical clustering algorithm on XML documents that applies a categorical clustering
algorithm ROCK (RObust Clustering using linKs; Guha, Rastogi, Shim, 1999) on the
s-graphs (structure graphs) extracted from the XML documents.
For two XML documents $x_i$ and $x_j$, the distance between them is defined by:
$$d(x_i, x_j) = 1 - \frac{|sg(x_i) \cap sg(x_j)|}{\max(|sg(x_i)|, |sg(x_j)|)},$$
where $sg(x_i)$ is a structure graph (i = 1, 2), $|sg(x_i)|$ is the number of edges in $sg(x_i)$, and $sg(x_i) \cap sg(x_j)$ is the set of common edges of $sg(x_i)$ and $sg(x_j)$.
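The distance is easy to evaluate once each structure graph is represented as a set of edges. The sketch below assumes edges are (parent element, child element) pairs; the function name and the two sample s-graphs are hypothetical, and the case of two empty graphs is left undefined.

```python
def sgraph_distance(edges_i, edges_j):
    """Distance between two structure graphs given as collections of edges."""
    edges_i, edges_j = set(edges_i), set(edges_j)
    common = edges_i & edges_j  # edges shared by both s-graphs
    return 1 - len(common) / max(len(edges_i), len(edges_j))


# Hypothetical s-graphs of two XML documents, as (parent, child) element pairs.
sg_1 = {("book", "title"), ("book", "author"), ("author", "name")}
sg_2 = {("book", "title"), ("book", "author"), ("book", "price"), ("price", "currency")}

distance = sgraph_distance(sg_1, sg_2)
```

The two graphs share two edges and the larger one has four, so the distance is 1 − 2/4 = 0.5; identical graphs have distance 0 and graphs without common edges have distance 1.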

Artificial Neural Networks
Artificial neural networks (ANNs) belong to the adaptive class of techniques in the machine-learning area. ANNs try to mimic the biological neural network (the brain) to solve basic, computationally hard problems of AI (artificial intelligence).
There are three important and attractive features of ANNs:
1. their capability of learning from example (extracting knowledge from data),
2. they are naturally parallel and thus should be computationally effective, and
3. they work incrementally — the whole data set is not necessary at once.
These features make ANNs a very interesting and promising clustering choice for large data sets including multimedia and text files.
Most models of ANNs are organized in the form of a number of processing units (also called artificial neurons or simply neurons; McCulloch, 1943) and a number of weighted connections (artificial synapses) between the neurons. The process of building an ANN, similar to its
biological inspiration, involves a learning episode (also called training). During the learning
episode, the network observes a sequence of recorded data and adjusts the strength of its
synapses according to a learning algorithm and the observed data. The process of adjusting
the synaptic strengths in order to be able to accomplish a certain task, much like the brain,
is called learning. Learning algorithms are generally divided into two types: supervised and
unsupervised. The supervised algorithms require labeled training data. In other words, they
require more a priori knowledge about the training set.
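A minimal sketch of supervised learning as the adjustment of synaptic strengths: a single sigmoid neuron observes labeled examples one at a time and nudges its weights after each one. The task (the logical OR function), the learning rate, the number of epochs, and all names are illustrative assumptions; real networks have many neurons and layers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, epochs=2000, rate=0.5):
    """Train one sigmoid neuron with per-example gradient steps."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
            error = target - output  # the label supplies the teaching signal
            # Strengthen or weaken each synapse in proportion to its input.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


def predict(weights, bias, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


# Labeled training data for the logical OR function.
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(samples)
predictions = [round(predict(weights, bias, inputs)) for inputs, _ in samples]
```

Because the update uses one example at a time, the whole data set never has to be held in memory at once, which matches the incremental property listed above.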
There is a very large body of research that has resulted in a large number of ANN designs.
For a more complete review of the various ANN types, see Hassoun (1995) and Rumelhart
and McClelland (1988). In this chapter, we discuss only some of the types that have been
used in the data mining area.
Layered, Feed-Forward, and Back-Propagation Neural Networks
These are a class of ANNs whose neurons are organized in layers. The layers are normally
fully connected, meaning that each element (neuron) of a layer is connected to each ele-
ment of the next layer. However, self-organizing varieties also exist in which a network
either starts with a minimal number of synaptic connections between the layers and adds
new ones as training progresses (constructive), or starts as a fully connected network and
prunes connections based on the data observed in training (destructive; Hassoun, 1995; Rumelhart & McClelland, 1988).
Back propagation (Rumelhart & McClelland, 1988) is a learning algorithm that, in its original version, belongs to the gradient-descent optimization methods (Wu, 1996b). The combination of the back-propagation learning algorithm and the feed-forward, layered networks

provides the most popular type of ANNs. These ANNs have been applied to virtually all
pattern-recognition problems and are typically the first networks tried on a new problem.
The reason for this is the simplicity of the algorithm and the vast body of research that has
studied these networks. As such, in sequencing, many researchers have also used this type
of network as a first line of attack. Examples can be found in Wu, Zhao, Chen, Lo, and
McLarty (1996) and Wu (1995). In Wu (1995), the author has developed a system called the
gene-classification artificial neural system (GenCANS), which is based on a three-layered,
feed-forward, back-propagation network.
Self-Organizing Neural Networks
These networks are a very large class of neural networks whose structure (number of neu-
rons, number of synaptic connections, number of modules, or number of layers) changes
during learning based on the observed data. There are two classes of this type of networks:
destructive and constructive. Destructive networks are initially a fully connected topology,
and the learning algorithm prunes synapses (sometimes entire neurons, modules, or layers)
based on the observed data. The final remaining network after learning is complete usually
is a sparsely connected network. Constructive algorithms start with a minimally connected
network and gradually add synapses (neurons, modules, or layers) as training progresses in
order to accommodate for the complexity of the task at hand.
Self-Organizing Map. A self-organizing map (SOM; Kohonen, 2001) is a neural-network
paradigm first proposed by Kohonen (1991). SOMs have been used as a divisive clustering
approach in many areas. Several groups have used SOMs to discover pattern clusters in Web
pages or in textual documents (Anonymous, 2005a, b). A special version of this paradigm,
WEBSOM, was developed for Web-page clustering (Kaski, Honkela, Lagus, Kohonen,
1998; Kohonen et al., 2000). With the WEBSOM method, a textual document collection
is organized onto a graphical map display that provides an overview of the collection and
facilitates interactive browsing. Interesting documents can be located on the map using a
content-directed search. Each document is encoded as a histogram of term categories that
are formed by the SOM algorithm based on the similarities in the contexts of the terms. The
encoded documents are organized on another self-organizing map — a document map — on
which nearby locations contain similar documents. Special consideration is given to the computation of very large document maps, which is possible with general-purpose computers if
the dimensionality of the term-category histograms is first reduced with a random mapping
method and if computationally efficient algorithms are used in computing the SOMs.
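The core SOM update can be illustrated with a toy map. The sketch assumes scalar inputs and a one-dimensional map of two units, which is far smaller than the term-category histograms and document maps used by WEBSOM; the decay schedules, parameter values, and function names are illustrative choices.

```python
import math


def best_matching_unit(weights, x):
    """Index of the map unit whose weight is closest to the input x."""
    return min(range(len(weights)), key=lambda k: abs(x - weights[k]))


def train_som(data, weights, epochs=50, rate0=0.5, sigma0=1.0):
    weights = list(weights)
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = rate0 * (1.0 - frac)                 # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.1 * frac  # neighbourhood shrinks to 0.1
        for x in data:
            winner = best_matching_unit(weights, x)
            for k in range(len(weights)):
                # Gaussian neighbourhood on the map's 1-D grid of units.
                h = math.exp(-((k - winner) ** 2) / (2.0 * sigma ** 2))
                weights[k] += rate * h * (x - weights[k])
    return weights


# Two groups of scalar inputs, one near 0 and one near 1.
data = [0.0, 0.9, 0.1, 1.0]
weights = train_som(data, [0.2, 0.8])
low_unit = best_matching_unit(weights, 0.0)
high_unit = best_matching_unit(weights, 1.0)
```

Early in training the wide neighbourhood moves both units together; as it shrinks, each unit settles on the inputs it wins, so nearby inputs map to the same unit.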
SOM as a clustering method has some disadvantages. One of them is the necessity for the
introduction of a decay coefficient that stops the learning (clustering) phase. If the map is
allowed to grow indefinitely, the size of the SOM is gradually increased to a point at which
clearly different sets of expression patterns are identified. Therefore, as with k-means clus-
tering, the user has to rely on some other source of information, such as PCA, to determine
the number of clusters that best represents the available data. For this reason, Sásik, Hwa,
Iranfar, and Loomis (2001) believe that SOM, as implemented by Tamayo et al. (1999), is
essentially a restricted version of k-means: Here, the k clusters are linked by some arbitrary
user-imposed topological constraints (e.g., a 3×2 grid), and as such, the SOM suffers from
all of the problems mentioned above for k-means (and more), except that the constraints

expedite the optimization process. There are many varieties of SOM, among which the
self-organizing feature maps (SOFM) should be mentioned (Kohonen, 1991, 2001). The
growing cell structure (GCS; Fritzke, 1974) is a derivative of SOFM. It is a self-organizing
and incremental (constructive) neural learning approach.
Self-Organizing Trees. Self-organizing trees are normally constructive neural network methods that develop into a tree (usually a binary tree) topology during learning. Among examples of these networks, the work of Dopazo and Carazo (1997), Wang, Dopazo, and Carazo (1998), and Herrero, Valencia, and Dopazo (2001) can be mentioned. Dopazo and Carazo introduce the self-organizing tree algorithm (SOTA). SOTA is a hierarchical neural network that grows into a binary tree topology. For this reason, SOTA can be considered a hierarchical clustering algorithm. SOTA is based on Kohonen’s SOM discussed above and Fritzke’s growing cell (Fritzke, 1974). SOTA’s performance is superior to that of classical hierarchical clustering techniques. Among the advantages of SOTA as compared to hierarchical cluster algorithms are its lower time complexity and its top-to-bottom hierarchical approach. SOTA’s runtimes are approximately linear with the number of items to be classified, making it suitable for large data sets. Also, because SOTA forms higher clusters in the hierarchy before forming the lower clusters, it can be stopped at any level of the hierarchy and still produce meaningful intermediate results. There are many other types of self-organizing trees.
Recurrent ANNs
ART and its Derivatives. Adaptive resonance theory (ART) was introduced by Stephen
Grossberg (1976a, b). Networks based on ART are unsupervised and self-organizing, and
they only learn in the so-called resonant state. ART can form (stable) clusters of arbitrary
sequences of input patterns by learning (entering resonant states) and self-organizing. Since
its inception, many derivatives of ART have emerged. Among these, ART-1 (the binary version of ART that forms clusters of binary input data; Carpenter & Grossberg, 1987b), ART-2 (analog version of ART; Carpenter & Grossberg, 1987a), ART-2A (fast version of ART-2; Carpenter, Grossberg, & Rosen, 1991a), ART-3 (includes chemical transmitters to control the search process in a hierarchical ART structure; Carpenter & Grossberg, 1990), and ARTMAP (supervised version of ART; Carpenter, Grossberg, & Reynolds, 1991) can be mentioned. Many hybrid varieties such as fuzzy ART (Carpenter, Grossberg, & Rosen, 1991b), fuzzy ARTMAP (supervised fuzzy ART; Carpenter, Grossberg, Markuzon, Reynolds, & Rosen, 1992; Carpenter, Grossberg, & Reynolds, 1995), and simplified fuzzy ARTMAP (SFAM; Kasuba, 1993) have also been developed.
These networks have a broad application in virtually all areas of clustering. In general, in problem settings where the number of clusters is not known a priori, researchers tend to use unsupervised ART; when the number of clusters is known a priori, usually the supervised version, ARTMAP, is used. Among the unsupervised implementations, the
work of Tomida, Hanai, Honda, and Kobayashi (2001) should be mentioned. Here the
authors used fuzzy ART for expression-level data analysis. Fuzzy ART incorporates the
basic features of all ART systems, notably, pattern matching between bottom-up input and
top-down learned prototype vectors. This matching process leads either to a resonant state
that focuses attention and triggers stable prototype learning or to a self-regulating parallel
memory search. If the search ends by selecting an established category, then the category’s

prototype may be refined to incorporate new information in the input pattern. If the search
ends by selecting a previously untrained node, then learning of a new category takes place.
Fuzzy ART performs best in noisy data. Although ART has been used in several research
works as a text clustering tool, the level of the quality of the resulting document clusters
has not been clearly established. In Massey (2003), the author presents experimental results
with binary ART that address this issue by determining how close the clustering quality is
to an upper bound on the quality.
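The search-and-resonance cycle of fuzzy ART described above can be sketched as follows. This is a compact illustration under common textbook conventions (complement coding, a choice function with a small parameter alpha, a vigilance test, and fast learning with beta = 1), not the implementation used in any of the cited studies; all names and parameter values are assumptions.

```python
def fuzzy_and(a, b):
    """Component-wise minimum, the fuzzy AND operator."""
    return [min(x, y) for x, y in zip(a, b)]


def fuzzy_art(inputs, vigilance=0.8, alpha=0.001, beta=1.0):
    prototypes = []
    labels = []
    for raw in inputs:
        # Complement coding keeps the total activity of every input constant.
        pattern = list(raw) + [1.0 - v for v in raw]
        # Search order: categories ranked by the bottom-up choice function.
        order = sorted(
            range(len(prototypes)),
            key=lambda j: sum(fuzzy_and(pattern, prototypes[j]))
            / (alpha + sum(prototypes[j])),
            reverse=True,
        )
        chosen = None
        for j in order:
            match = sum(fuzzy_and(pattern, prototypes[j])) / sum(pattern)
            if match >= vigilance:  # resonance: the top-down prototype fits
                chosen = j
                break
        if chosen is None:
            prototypes.append(pattern)  # an untrained node learns a new category
            chosen = len(prototypes) - 1
        else:
            # Refine the established category's prototype with the new input.
            overlap = fuzzy_and(pattern, prototypes[chosen])
            prototypes[chosen] = [
                beta * o + (1.0 - beta) * w
                for o, w in zip(overlap, prototypes[chosen])
            ]
        labels.append(chosen)
    return labels, prototypes


inputs = [[0.1, 0.1], [0.12, 0.1], [0.9, 0.9]]
labels, prototypes = fuzzy_art(inputs)
```

The first two inputs resonate with the same category, while the third fails the vigilance test and is assigned to a new one. A higher vigilance value yields more, tighter categories.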
Associative clustering neural networks. Since the introduction of the concept of auto-associative memory by Hopfield (1982), there have been many associative-memory models built with neural networks (Kawamura, Okada, & Hirai, 1999; Kosko, 1987). Most of them can be considered store-recall models, and the correlation $s(x_i, x_j)$ between any two D-bit bipolar patterns, with $x_{il} \in \{-1, 1\}$ for all $l = 1, \dots, p$, is often determined by a static measurement such as:
$$s(x_i, x_j) = \frac{1}{p} \sum_{l=1}^{p} x_{il}\, x_{jl}.$$
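The static correlation measure is simply the mean of the component-wise products of two bipolar patterns. A short sketch, with the function name and sample patterns chosen for illustration:

```python
def correlation(x_i, x_j):
    """Static correlation of two equal-length bipolar (+1/-1) patterns."""
    p = len(x_i)
    return sum(a * b for a, b in zip(x_i, x_j)) / p


same = correlation([1, -1, 1, 1], [1, -1, 1, 1])
opposite = correlation([1, -1, 1, 1], [-1, 1, -1, -1])
unrelated = correlation([1, 1, -1, -1], [1, -1, 1, -1])
```

Identical patterns give 1, inverted patterns give −1, and patterns that agree on exactly half of their components give 0. The measure looks only at the two patterns involved, which is why it is called static.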
The human mind, however, associates one pattern in memory to others in a much more
sophisticated way than merely attempting to homogeneously link vectors. Such associations
would interfere with each other (Hinton & Anderson, 1989). To mimic the formation of such associations, Yao, Chen, and Chen (2001) built a recurrent neural network that dynamically evaluates the association of any pair of patterns through the interaction among a group of patterns, and incorporates the results of the interaction into data clustering. A novel rule based on the characteristics of clusters was proposed to determine the number of clusters with a reject option. The hybrid model was named associative clustering neural network (ACNN). The authors studied the performance of ACNN on simulated data only, but the results indicate that ACNN can cluster data with a reject option and label the data robustly.
Bayesian neural networks. There are a number of recent networks that have been suggested
as solutions to clustering. For instance, Bayesian neural networks (BNNs) are another tech-
nique that has been recently used for Web clustering. Her, Jun, Choi, and Lee (1999) have
used BNNs for clustering Web query results. Their BNN is based on SOM and differs in
the last step when n documents are assigned under each cluster by the Bayesian rule. The
BNNs are an important addition to the host of ANN solutions that have been offered to the
problem at hand as they represent a large group of hybrid ANNs that combine classical
ANNs with statistical classification and prediction theories.
Web Clustering
The Web has undergone exponential growth since its birth, which is the cause of a number
of problems with its usage. Particularly, the quality of Web search and the corresponding
interpretation of search results are often far from satisfying due to various reasons like the
huge volume of information and diverse requirements for search results.

The lack of a central structure and freedom from a strict syntax allow the availability of a vast amount of information on the Web, but they often make retrieval difficult and return data that is not meaningful. Although ranked lists of search results returned by a search engine
are still popular, this method is highly inefficient since the number of retrieved search re-
sults can be high for a typical query. Most users just view the top 10 results and therefore
might miss relevant information. Moreover, the criteria used for ranking may not reflect the
needs of the user. A majority of the queries tend to be short and, consequently, nonspecific
or imprecise. Moreover, as terms or phrases are ambiguous in the absence of their contexts,
a large amount of search results is irrelevant to the user.
In an effort to keep up with the tremendous growth of the Web, many research projects were
targeted on how to deal with its content and structure to make it easier for the users to find
the information they want more efficiently and accurately. In the last years, mainly data-
mining methods applied in the Web environment created new possibilities and challenges.
Methods of Web data mining can be divided into a number of categories according to the
kind of mined information and the goals that particular categories set. In Pal, Talwar, and
Mitra (2002), three categories are distinguished: Web structure mining (WSM), Web usage
mining (WUM), and Web content mining (WCM). Particularly, WCM refers broadly to the
process of uncovering interesting and potentially useful knowledge from Web documents.
WCM shares many concepts with traditional text-mining techniques. One of these, cluster-
ing, groups similar documents together to make information retrieval more effective. When
applied to Web pages, clustering methods try to identify inherent groupings of pages so that
a set of clusters is produced in which clusters contain relevant pages (to a specific topic); ir-
relevant pages are separated. Generally, text-document clustering methods attempt to collect
the documents into groups, where each group represents some topic that is different than
those topics represented by the other groups. Such clustering is expected to be helpful for
discrimination, summarization, organization, and navigation for unstructured Web pages.
In a more general approach, we can consider Web documents as collections of Web pages
including not only HTML files, but also XML files, images, and so forth. An important research direction in Web clustering is Web XML data clustering, stating the clustering problem with two dimensions: content and structure (Vakali, Pokorný, & Dalamagas, 2004).
WUM techniques use the Web log data coming from users’ sessions. In this framework,
Web log data provide information about activities performed by a user from the moment
the user enters a Web site to the moment the same user leaves it. In WUM, the clustering
tries to group together users’ navigation sessions having similar characteristics (Vakali et
al., 2004). Concerning WSM techniques, graph-oriented methods can be used.
Considering Web clustering techniques, it is important to be aware of two main categories
of approaches:
1. clustering Web pages in a space of resources to facilitate some search services and
2. clustering Web search results.
In Boley et al. (1999), these categories are called off-line clustering and online clustering,
respectively. We mention approaches of both categories, although the main accent is put
on the latter.

Application of Web Clustering
Web clustering is currently one of the crucial IR problems related to the Web. It is used by
many intelligent software agents in order to retrieve, filter, and categorize Web documents.
Various forms of clustering are required in a wide range of applications: efficient information
retrieval by focusing on relevant subsets (clusters) rather than whole collections, clustering
documents in collections of digital libraries, clustering search results to present them in an
organized and understandable form, finding mirrored Web pages, and detecting copyright
violations, among others.
Clustering techniques are immensely important for Web applications to assist the automated
(or semiautomated) generation of proper categories of documents and organize repositories
of search engines. The hierarchical categorization of documents is often used (see Google,
Yahoo, Open Directory, and LookSmart as examples). The reason for this is that the search
results are not summarized in terms of topics; they are not well suited for browsing tasks.
One possible solution is to create manually a static hierarchical categorization of a reason-
able part of the Web and use these categories to organize the search results of a particular
query. However, this solution is feasible only for small collections. To categorize the entire
Web either manually or automatically is, unfortunately, not realistic.
In Pierrakos, Paliouras, Papatheodorou, Karkaletsis, and Dikaiakos (2003), document clus-
tering and a WUM technique are used for the construction of Web community directories
as a means of personalizing Web services. Also, the effective summarization of Web page
collections becomes more and more critical as the amount of information continues to grow
on the Web. The significance of Web collection clustering for automatic Web collection
summarization is investigated in Zamir and Etzioni (1999).
Clustering is also useful in extracting salient features of related Web documents to automati-
cally formulate queries and search for other similar documents on the Web.
Principles of Web Clustering Methods
Most of the document-clustering methods that are in use today are based on the VSM. The
similarity between documents is measured using one of several similarity measures that are
based on relations of feature vectors, for example, the cosine of feature vectors. Many of
the traditional algorithms based on VSM, however, falter when the dimensionality of the
feature space becomes high relative to the size of the document space. In a high-dimensional
space, the distance between any two documents tends to be constant, making clustering on
the basis of distance ill defined. This phenomenon is called the curse of dimensionality.
Therefore, the issue of reducing the dimensionality of the space is critical. The methods
presented earlier are often used.
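The cosine measure mentioned above can be sketched for documents represented as raw term-frequency vectors. The whitespace tokenization, the absence of term weighting such as tf-idf, and the function name are simplifying assumptions for illustration.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


identical = cosine_similarity("web data clustering", "web data clustering")
partial = cosine_similarity("web data", "web mining")
disjoint = cosine_similarity("web graph", "neural network")
```

Every distinct term adds a dimension to the feature space, so a large vocabulary quickly produces the high-dimensional, sparse vectors for which distance-based clustering becomes ill defined.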
Traditional clustering algorithms either use a priori knowledge of document structures to
define a distance or similarity among these documents, or use probabilistic techniques such
as Bayesian classification.
Taxonomies are generated using document clustering algorithms that typically result in
topic or concept hierarchies. These classification and clustering techniques are combined.

Concept hierarchies expose the different concepts presented in the Web page (or search-
result) collection. The user can choose the concept he or she is interested in and can browse
it in detail.
Classification of Web Clustering Methods
Generally, clustering approaches could be classified in two broad categories (Y. Wang & Kitsuregawa, 2002): term-based clustering and link-based clustering. Recent work in online
clustering has included both link-based and term-based methods.
Term-based clustering. We start with methods in which each term is a single word. Zamir, Etzioni, Madani, and Karp (1997) mention a very simple word-intersection clustering method in which words that are shared by documents are used to produce clusters. Let n denote the number of documents to be clustered. The method runs in $O(n^2)$ time and produces good results for Web documents originating rather from a corpus of texts. We point out that
standard methods such as k-means are also in this category since they usually exploit single
words as features. Most of the methods based on VSM belong to this category. They do not
make use of any word proximity or phrase-based approach.
Word-based clustering that is used on common words shared among documents does not
adapt well to the Web environment since it ignores the availability of hyperlinks between
Web pages and is susceptible to spam. Also, the curse of dimensionality restricts the usabil-
ity of these methods. A more successful clustering in this case (also ignoring links among
documents) is based on multiword terms (phrases or sentences). Here we speak about
term-based clustering (Zamir & Etzioni, 1999). Extracting terms significantly reduces the
high dimensionality. Zamir and Etzioni show that this reduction is almost on an order of
magnitude while maintaining comparable performance with the word-based model.
Among the first works using phrases in clustering, we find the approach based on suffix-tree clustering (STC; Zamir & Etzioni, 1998). STC first transforms the string of text representing each document into a sequence of stems. Second, it identifies the sets of documents that share a common phrase as base clusters by a suffix tree. Finally, these base clusters are combined into clusters. Tree building often requires $O(n \log n)$ time and produces high-quality clusters. On the other hand, the suffix-tree model can have a high number of redundancies in terms of the suffixes stored in the tree. However, STC based on phrases shared between documents generates inferior results to those based on the full text of the document.
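The base-cluster step of STC can be approximated without a suffix tree. The sketch below swaps in a plain dictionary that maps every word n-gram up to a fixed length to the documents containing it, which finds the same kind of shared phrases but without the efficiency of a suffix tree. Stemming and the final merging of base clusters are omitted, and the function name, parameters, and sample documents are illustrative assumptions.

```python
from collections import defaultdict


def base_clusters(documents, max_len=3, min_docs=2):
    """Map each phrase shared by at least min_docs documents to those documents."""
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(documents):
        words = text.lower().split()
        # Index every phrase (word n-gram) of up to max_len words.
        for n in range(1, max_len + 1):
            for start in range(len(words) - n + 1):
                phrase = " ".join(words[start:start + n])
                phrase_docs[phrase].add(doc_id)
    return {p: docs for p, docs in phrase_docs.items() if len(docs) >= min_docs}


documents = [
    "web data clustering methods",
    "clustering web data quickly",
    "neural networks learn",
]
clusters = base_clusters(documents)
```

The first two documents share the phrase "web data" as well as three single words, so each of these phrases defines a base cluster containing documents 0 and 1; the third document shares nothing and stays outside.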
In Hammouda and Kamel (2004), a system for Web clustering is based on two key concepts.
The first is the use of weighted phrases as an essential constituent of documents. Similarity
between documents will be based on matching phrases and their weights. The similarity
calculation between documents combines single-word similarity and phrase-based similar-
ity. The latter is proven to have a more significant effect on the clustering quality due to
its insensitivity to noisy terms that could lead to incorrect similarity measures. The second
concept is the incremental clustering of documents using a histogram-based method to maxi-
mize the tightness of clusters by carefully watching the similarity distribution inside each
cluster. In the system, a novel phrase-based document-index model is used — the document
index graph (DIG) — that captures the structure of sentences in the document set rather than
single words only. The DIG model is based on graph theory and utilizes graph properties to
match any-length phrases from a document to any number of previously seen documents
in a time nearly proportional to the number of words of the document. Improvement over
traditional clustering methods was 10 to 29%.
• Link-based clustering: Links among Web pages could provide valuable information
to determine the related pages since they give objective opinions for the topics of the
pages they point to. Many works tried to explore link analysis to improve the term-
based methods. In general, these methods belong to the category of graph clustering.
Kleinberg (1999) suggested that there are two kinds of pages on the Web for a specific
query topic, hubs and authorities, and that they reinforce each other. The HITS algorithm,
which was used to locate hubs and authorities from the search results given a query
topic, provided a possible way to alleviate the problems. However, sometimes one’s
most authoritative pages are not useful for others. It is also observable that many
authority pages contain very little text. The work of Y. Wang and Kitsuregawa (2002)
successfully combines link-based features (co-citations and bibliographic coupling)
and content information in clustering. Bibliographic coupling measures the number of citations
(out-links) in common between two documents, and co-citation measures the number
of documents (in-links) that cite both of the two documents under consideration.
• Structure of clusters: Two clustering algorithms that can effectively cluster documents,
even in the presence of a very high-dimensional feature space, are described in
Haveliwala, Gionis, and Indyk (2000). These clustering techniques, which are based
on generalizations of graph partitioning, do not require prespecified ad hoc distance
functions and are capable of automatically discovering document similarities or as-
sociations.
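The two link-based features mentioned above can be computed directly from a link graph. The sketch below follows the usual bibliometric definitions (bibliographic coupling counts shared out-links, co-citation counts shared in-links). The dictionary representation of the graph and the page names are assumptions made for illustration.

```python
def out_links(graph, page):
    """Pages that `page` links to."""
    return set(graph.get(page, ()))


def in_links(graph, page):
    """Pages that link to `page`."""
    return {src for src, targets in graph.items() if page in targets}


def bibliographic_coupling(graph, a, b):
    """Number of pages that both a and b link to (shared out-links)."""
    return len(out_links(graph, a) & out_links(graph, b))


def co_citation(graph, a, b):
    """Number of pages that link to both a and b (shared in-links)."""
    return len(in_links(graph, a) & in_links(graph, b))


# Hypothetical link graph: each key links to the pages in its list.
graph = {
    "p1": ["a", "b"],
    "p2": ["a", "b", "c"],
    "a": ["x", "y"],
    "b": ["y", "z"],
}
print(co_citation(graph, "a", "b"))             # 2, both p1 and p2 link to a and b
print(bibliographic_coupling(graph, "a", "b"))  # 1, a and b both link to y
```

Either count can be normalized and combined with a content-based similarity before clustering; how the features are weighted is a design choice that this sketch leaves open.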
As we mentioned in the introduction, most clustering methods can be divided into two categories:
hierarchical clustering and flat clustering. Hierarchical clustering is exceedingly slow when
it is used online for very large n. Its running time can range from O(n^2) up to O(n^3).
The flat clustering algorithms are model-based algorithms that search for the model parameters
given the data and prior expectation. For example, k-means is an O(nkT) algorithm,
where k is the number of clusters and T is the number of iterations, but the task of determining
a model describing the data complicates its use for large collections, particularly in a Web
environment.
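The O(nkT) bound for k-means follows from its loop structure, as the following minimal sketch shows. It assumes points given as numeric tuples, squared Euclidean distance, random initial centroids drawn from the data, and a fixed number of iterations instead of a convergence test.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations, seed=0):
    """Plain k-means: T iterations, each comparing n points with k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):                       # T iterations
        for i, p in enumerate(points):                # n points
            labels[i] = min(range(k),                 # k centroids
                            key=lambda c: squared_distance(p, centroids[c]))
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                               # keep old centroid if cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = k_means(points, k=2, iterations=5)
```

The sketch also makes the practical difficulty visible: k must be supplied in advance, which is rarely possible for an unknown set of Web documents.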
Clustering with Snippets
Today, search engines return, along with a ranked list of search results, some contextual
information in the form of a Web-page excerpt, the so-called snippet. Web snippet clustering is an
innovative approach to help users in searching the Web. It consists of clustering the snippets
returned by a (meta-) search engine into a hierarchy of folders each labeled with a term. The
term expresses latent semantics of the folder and of the corresponding Web pages contained
in the folder. The folder labels vary from a bag of words to variable-length sentences.
Web snippet clustering methods are classified in Ferragina and Gulli (2005) according to
two dimensions: words vs. terms, and flat vs. hierarchical. Four categories of approaches
are distinguished.

1. Word-based and flat clustering: This category includes systems like SCATTER-GATHER
and WEBCAT. Other systems use, for example, fuzzy relations (Joshi & Jiang, 2002) or
take into account in-linking and out-linking pages to improve precision.
2. Term-based and flat clustering: Zamir and Etzioni (1999) used sentences of variable
length to label the folders, but these sentences were drawn as contiguous portions of
the snippets by means of a suffix-tree data structure. Other systems use SVD on a
term-document matrix to find meaningful long labels. This approach is restricted by
the time complexity of SVD applied to a large number of snippets. In addition,
similar snippets can lead to very high overlap.
3. Word-based and hierarchical clustering: These are approaches based on the frequent
item-sets problem and a concept lattice on single words (Rice & Siff, 2001) in order
to construct the folder hierarchy.
4. Term-based and hierarchical clustering: This class includes the best metasearch
engines of the years 2000 to 2003: Vivisimo and Dogpile. These tools add to the flat
list of search results a hierarchy of clusters built on the fly over snippets. They improve
precision over recall by using a snippet representation made of pairs of words
(not necessarily contiguous) linked by a lexical affinity, that is, a correlation of their
common appearance. Among older approaches, there is a simple extension of Grouper
(Zamir & Etzioni, 1998) to hierarchical clustering based on the size of folder overlaps.
A hierarchical engine, SNAKET, introduced in Ferragina and Gulli (2005), organizes on
the fly the search results from 16 commodity search engines and offers folder labeling
with variable-length sentences. Hierarchies are overlapping because snippets might
cover multiple themes.
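The idea of representing snippets by pairs of not necessarily contiguous words can be sketched as follows. This is only a rough illustration and not the method of any of the systems named above: lexical affinity is approximated by co-occurrence within a small window, and the parameters `window` and `min_support` as well as the function names are assumptions.

```python
from collections import defaultdict


def word_pairs(snippet, window=3):
    """Unordered pairs of distinct words at most `window` positions apart."""
    words = snippet.lower().split()
    pairs = set()
    for i, w in enumerate(words):
        for v in words[i + 1:i + 1 + window]:
            if v != w:
                pairs.add(tuple(sorted((w, v))))
    return pairs


def folders(snippets, min_support=2, window=3):
    """Group snippet ids under every word pair found in at least min_support snippets."""
    support = defaultdict(set)
    for sid, s in enumerate(snippets):
        for pair in word_pairs(s, window):
            support[pair].add(sid)
    return {" ".join(pair): ids
            for pair, ids in support.items() if len(ids) >= min_support}


snippets = [
    "jaguar car dealer",
    "jaguar sports car review",
    "jaguar cat habitat",
]
result = folders(snippets)
print(result)  # {'car jaguar': {0, 1}}
```

A snippet may land in several folders, which mirrors the overlapping hierarchies described above. Building an actual hierarchy and ranking the labels would require further steps that are not shown here.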
Conclusion
Clustering is currently one of the most crucial techniques for:
• dealing with a massive amount of heterogeneous information on the Web, and
• organizing Web search results.
Unlike clustering in other fields, Web clustering separates unrelated pages and clusters
related pages (to a specific topic) into semantically meaningful groups, which is useful
for the discrimination, summarization, organization, and navigation of unstructured Web
pages. In this chapter, we have presented many general approaches to clustering as well
as various classifications of clustering algorithms. Consequently, two important
questions arise:
1. Why are there so many clustering algorithms, and
2. Which of them are usable for Web clustering?

In his paper, Estivill-Castro (2002) tries to answer the first question in terms of the model of
the data to be clustered and the cluster model (inductive principle, in his terminology). For
a single model of data and a cluster model, there are many clustering algorithms. Why are
there so many clustering models? There are so many because clustering is in part beholder
dependent. Cluster models are just formal models of what researchers believe is the defini-
tion of a cluster. Thus, it is very hard to compare particular approaches.
To answer the second question, we can first consider the techniques that are not usable for
Web clustering. Observe that clustering in a Web environment naturally rules out the use
of some general clustering techniques. The reason is easy to see. Since clustering translates
into an optimization problem, its computational complexity is typically intractable in the
case of huge Web data collections.
Another reason for the inapplicability of some classical techniques is associated with the
usability of the clustering achieved. Given a large document collection, it is difficult to pro-
vide the number of real categories for users when they attempt to categorize the documents.
Organizing Web search results into a hierarchy of topics and subtopics facilitates browsing
the collection and locating results of interest. Traditional clustering techniques are inadequate
for the Web since they do not generate clusters with highly readable names. It seems that
Web snippet clustering methods deal successfully with this issue. We have also mentioned
how link information can be used to improve classification results for Web collections. In
practice, it is desirable to combine term-based clustering and link-based clustering.
This survey represents only a small part of the research being conducted in the area. Furthermore,
because new techniques and algorithms are continually being proposed for Web data sets,
a survey such as this is highly time dependent.
Acknowledgements
This work was partly supported by the project 1ET100300419 of the Program Information
Society of the Thematic Program II of the National Research Program of the Czech Republic,
and the project 201/05/0079 of the Grant Agency of the Czech Republic.
References
Adamic, L. A. (1999). The small world Web. In S. Abiteboul & A. M. Vercoustre (Eds.),
Lecture notes in computer science: Vol. 1696. ECDL’99 (pp. 443-452). Springer.
Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspace clustering
of high dimensional data for data mining applications. ACM SIGMOD Record,
27(2), 94-105.
Anonymous. (2005a). 5384 works that have been based on the self-organizing map (SOM)
method developed by Kohonen: Part I. Retrieved from
http://www.cis.hut.fi/research/som-bibl/references_a-k.ps

Anonymous. (2005b). 5384 works that have been based on the self-organizing map (SOM)
method developed by Kohonen: Part II. Retrieved from
http://www.cis.hut.fi/research/som-bibl/references_l-z.ps
Barabasi, A. L., & Albert, R. (1999). Emergence of scaling in random networks. Science,
286(5439), 509-512.
Ben-Dor, A., Shamir, R., & Yakhini, Z. (1999). Clustering gene expression patterns. Journal
of Computational Biology, 6(3/4), 281-297.
Berkhin, P. (2002). Survey of clustering data mining techniques. Accrue Software, Inc.
Retrieved from http://www.ee.ucr.edu/barth/EE242/clustering_survey.pdf
Berry, M. W., & Browne, M. (1999). Understanding search engines: Mathematical modeling
and text retrieval: Software, environments, tools. Society for Industrial & Applied
Mathematics.
Boley, D., Gini, M., Gross, R., Han, E.-H., Hastings, K., Karypis, G., et al. (1999).
Partitioning-based clustering for Web document categorization. Journal of Decision Support
Systems, 27(3), 329-341.
Broder, A., Kumar, R., Maghoul, R., Raghavan, P., Rajagopalan, P., Stata, R., et al. (2000).
Graph structure in the Web. The 9th International WWW Conference (2000). Retrieved
from http://www9.org/w9cdrom/160/160.html
Cai, D., He, X., Li, Z., Ma, W., & Wen, J. (2004). Hierarchical clustering of WWW image
search results using visual, textual and link information. Proceedings of ACM
MM’04, New York.
Canutescu, A. A., Shelenkov, A. A., & Dunbrack, R. L. (2003). A graph-theory algorithm
for rapid protein side-chain prediction. Protein Science, 12, 2001-2014.
Carpenter, G. A., & Grossberg, S. (1987a). Art 2: Self organisation of stable category
recognition codes for analog input patterns. Applied Optics, 26, 4919-4930.
Carpenter, G. A., & Grossberg, S. (1987b). Invariant pattern recognition and recall by an
attentive self-organizing art architecture in a nonstationary world. Proceedings of the
IEEE 1st International Conference on Neural Networks (pp. 737-745).
Carpenter, G. A., & Grossberg, S. (1990). Art 3: Hierarchical search using chemical transmitters
in self-organizing pattern recognition architectures. Neural Networks, 3, 129-152.
Carpenter, G. A., Grossberg, S., Markuzon, N., Reynolds, J. H., & Rosen, D. B. (1992). Fuzzy
artmap: A neural network architecture for incremental supervised learning of analog
multidimensional maps. IEEE Transactions on Neural Networks, 3(5), 698-713.
Carpenter, G. A., Grossberg, S., & Reynolds, J. H. (1991). Artmap: Supervised real-time
learning and classification of nonstationary data by a self-organizing neural network.
Neural Networks, 4, 565-588.
Carpenter, G. A., Grossberg, S., & Reynolds, J. H. (1995). A fuzzy artmap nonparametric
probability estimator for nonstationary pattern recognition problems. IEEE Transactions
on Neural Networks, 6(6), 1330-1336.
Carpenter, G. A., Grossberg, S., & Rosen, D. B. (1991a). Art2-a: An adaptive resonance
algorithm for rapid category learning and recognition. Neural Networks, 4, 493-504.

Carpenter, G. A., Grossberg, S., & Rosen, D. B. (1991b). Fuzzy art: Fast stable learning and
categorization of analog patterns by an adaptive resonance system. Neural Networks,
4, 759-771.
Cheng, C., Fu, A. W., & Zhang, Y. (1999). Entropy-based subspace clustering for mining
numerical data. Proceedings of 5th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (pp. 84-93).
Chung, F. R. K. (1997). Spectral graph theory. CBMS Regional Conference Series in
Mathematics, 92.
Dill, S., Kumar, R., McCurley, K., Rajagopalan, S., Sivakumar, D., & Tomkins, A. (2002).
Self-similarity in the Web. ACM Transactions on Internet Technology, 2(3), 205-223.
Donath, W. E., & Hoffman, A. J. (1973). Lower bounds for the partitioning of graphs. IBM
Journal of Research and Development, 17, 420-425.
Dopazo, J., & Carazo, J. M. (1997). Phylogenetic reconstruction using an unsupervised
growing neural network that adopts the topology of a phylogenetic tree. Journal of
Molecular Evolution, 44, 226-233.
Ebel, H., Mielsch, L. I., & Bornholdt, S. (2002). Scale-free topology of e-mail networks.
Phys. Rev. E, 66.
Estivill-Castro, V. (2002). Why so many clustering algorithms: A position paper. ACM
SIGKDD Explorations Newsletter, 4(1), 65-75.
Ferragina, P., & Gulli, A. (2005). A personalized search engine based on Web-snippet
hierarchical clustering. Proceedings of 14th International Conference on World Wide
Web 2005 (pp. 801-810).
Fiedler, M. (1973). Algebraic connectivity of graphs. Czech. Math. J., 23, 298-305.
Fiedler, M. (1975). A property of eigenvectors of non-negative symmetric matrices and its
applications to graph theory. Czech. Math. J., 25(100), 619-633.
Fritzke, B. (1974). Growing cell structures: A self-organizing network for unsupervised and
supervised learning. Neural Networks, 7, 1141-1160.
Gibson, D., Kleinberg, J., & Raghavan, P. (1998). Inferring Web communities from link
topology. Proceedings of the 9th ACM Conference on Hypertext and Hypermedia
(pp. 225-234).
Goil, S., Nagesh, H., & Choudhary, A. (1999). MAFIA: Efficient and scalable subspace
clustering for very large data sets (Tech. Rep. No. CPDC-TR-9906-010). Northwestern
University.
Golub, G., & Van Loan, C. (1989). Matrix computations. Johns Hopkins University
Press.
Gordon, A. D. (1999). Classification (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.
Grossberg, S. (1976). Adaptive pattern recognition and universal recoding: II. Feedback,
expectation, olfaction, and illusions. Biological Cybernetics, 23, 187-202.
Grossberg, S. (1988). Adaptive pattern classification and universal recoding: I. Parallel
development and coding of neural feature detectors. In Anderson & Rosenfeld (Eds.),
(pp. 121-134). (Reprinted from Biological Cybernetics, 23)

Guha, S., Rastogi, R., & Shim, K. (1999). ROCK: A robust clustering algorithm for
categorical attributes. Proceedings of 15th International Conference on Data Engineering
(pp. 512-521).
Guimerà, R., Danon, L., Díaz-Guilera, A., Giralt, F., & Arenas, A. (2003). Self-similar
community structure in a network of human interactions. Physical Review, 68.
Hammouda, K. M., & Kamel, M. S. (2004). Efficient phrase-based document indexing for
Web document clustering. IEEE Transactions on Knowledge and Data Engineering,
16(10), 1279-1296.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco:
Morgan Kaufmann Publishers.
Hartigan, J. A. (1975). Clustering algorithms. New York: John Wiley & Sons.
Hartuv, E., & Shamir, R. (2000). A clustering algorithm based on graph connectivity.
Information Processing Letters, 76(4-6), 175-181.
Hassoun, M. H. (1995). Fundamentals of artificial neural networks. MIT Press.
Haveliwala, T., Gionis, A., & Indyk, P. (2000). Scalable techniques for clustering the Web.
Proceedings of WebDB.
He, X., Ding, C. H. Q., Zha, H., & Simon, H. D. (2001). Automatic topic identification using
Webpage clustering. Proceedings of the 2001 IEEE International Conference on
Data Mining (ICDM 01) (pp. 195-203).
Henzinger, M. R. (1997). Improved data structures for fully dynamic biconnectivity. Digital
Equipment Corporation.
Henzinger, M. R. (2003). Algorithmic challenges in Web search engines. Internet
Mathematics, 1(1), 115-126.
Her, J. H., Jun, S. H., Choi, J. H., & Lee, J. H. (1999). A Bayesian neural network model for
dynamic Web document clustering. Proceedings of the IEEE Region 10 Conference
(TENCON 99) (vol. 2, pp. 1415-1418).
Herrero, J., Valencia, A., & Dopazo, J. (2001). A hierarchical unsupervised growing neural
network for clustering gene expression patterns. Bioinformatics, 17, 126-136.
Hinton, G. E., & Anderson, J. A. (1989). Parallel models of associative memory. Hillsdale,
NJ.
Hopfield, J. J. (1982). Neural network and physical systems with emergent collective
computational abilities. Proceedings of the National Acad. Sci. USA (vol. 79, pp. 2554-2558).
Höppner, F., Klawon, F., Kruse, R., & Runkler, T. (2000). Fuzzy cluster analysis: Methods
for classification, data analysis and image recognition. New York: Wiley.
Huang, X., & Lai, W. (2003). Identification of clusters in the Web graph based on link
topology. Proceedings of the 7th International Database Engineering and Applications
Symposium (IDEAS’03) (pp. 123-130).
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. NJ: Prentice Hall.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing
Surveys, 31(3), 264-323.

Joshi, A., & Jiang, Z. (2002). Retriever: Improving Web search engine results using clustering.
Idea Group Publishing.
Kaski, S., Honkela, T., Lagus, K., & Kohonen, T. (1998). WEBSOM: Self-organizing maps
of document collections. Neurocomputing, 21, 101-117.
Kasuba, T. (1993). Simplified fuzzy ARTMAP. AI Expert, 18-25.
Kaufman, L., & Rousseeuw, P. (1990). Finding groups in data: An introduction to cluster
analysis. New York: Wiley.
Kawamura, M., Okada, M., & Hirai, Y. (1999). Dynamics of selective recall in an associative
memory model with one-to-many associations. IEEE Transactions on Neural
Networks, 10(3), 704-713.
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. JACM, 46(5),
604-632.
Kohonen, T. (1991). Self-organizing maps. Proceedings of IEEE 78 (pp. 1464-1480).
Kohonen, T. (2001). Self-organizing maps (3rd extended ed.). Berlin, Germany: Springer.
Kohonen, T., Kaski, S., Lagus, K., Salogärui, J., Honkela, J., Paatero, V., et al. (2000).
Self organization of a massive document collection. IEEE Transactions on Neural
Networks, 11, 574-585.
Kosko, B. (1987). Adaptive bidirectional associative memories. Appl. Opt., 26(23),
4947-4960.
Kumar, S. R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (1999). Trawling the Web for
emerging cyber communities. Proceedings of the 8th WWW Conference (pp. 403-416).
Langville, A. N., & Meyer, C. D. (2005). A survey of eigenvector methods for Web information
retrieval. SIAM Review, 47(1), 135-161.
Lian, W., Cheung, D. W. L., Mamoulis, N., & Yiu, S. M. (2004). An efficient and scalable
algorithm for clustering XML documents by structure. IEEE Transactions on Knowledge
and Data Engineering, 16(1), 82-96.
Massey, L. (2003). On the quality of ART1 text clustering. Neural Networks, 16(5-6),
771-778.
Matula, D. W. (1970). Cluster analysis via graph theoretic techniques. Proceedings of the
Louisiana Conference on Combinatorics: Graph Theory and Computing (pp. 199-212).
Matula, D. W. (1972). K-components, clusters and slicings in graphs. SIAM Journal of
Applied Mathematics, 22(3), 459-480.
Matula, D. W. (1987). Graph theoretic techniques for cluster analysis algorithms. In J. Van
Ryzin (Ed.), Classification and clustering (pp. 95-129).
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics, 5, 115-133.
Mercer, D. P. (2003). Clustering large datasets. Linacre College.
Nagesh, H., Goil, S., & Choudhary, A. (2001). Adaptive grids for clustering massive data
sets. Proceedings of the 1st SIAM ICDM (p. 477).

Newman, M. E. J. (2003). The structure and function of complex networks. SIAM Review,
45, 167-256.
Newman, M. E. J., Balthrop, J., Forrest, S., & Williamson, M. M. (2004). Technological
networks and the spread of computer viruses. Science, 304, 527-529.
Ng, R. T., & Han, J. (1994). Efficient and effective clustering methods for spatial data
mining. Proceedings of the 20th International Conference on Very Large Data Bases
(pp. 144-155).
Pal, S. K., Talwar, V., & Mitra, P. (2002). Web mining in soft computing framework:
Relevance, state of the art and future directions. IEEE Transactions on Neural Networks,
13(5), 1163-1177.
Pierrakos, D., Paliouras, G., Papatheodorou, C., Karkaletsis, V., & Dikaiakos, M. (2003).
Construction of Web community directories using document clustering and Web usage
mining. ECML/PKDD 2003: First European Web Mining Forum.
Řezanková, H., Húsek, D., & Snášel, V. (2004). Clustering as a tool for data mining. In M.
Klíma (Ed.), Applications of mathematics and statistics in economy (pp. 203-208).
Praha: Professional Publishing.
Rice, M. D., & Siff, M. (2001). Clusters, concepts, and pseudo-metrics. Electronic Notes
in Theoretical Computer Science, 40, 323-346.
Rumelhart, D. E., & McClelland, J. L. (1988). Parallel distributed processing: Explorations
in the microstructure of cognition (Vols. 1 & 2). Cambridge, MA: MIT Press.
Salton, G., & Buckley, C. (1988). Term weighting approaches in automatic text retrieval.
Information Processing and Management, 24(5), 513-523.
Sásik, R., Hwa, T., Iranfar, N., & Loomis, W. F. (2001). Percolation clustering: A novel
approach to the clustering of gene expression patterns. Dictyostelium Development
PSB Proceedings (vol. 6, pp. 335-347).
Schenker, A., Kande, A., Bunke, H., & Last, M. (2005). Graph-theoretic techniques for Web
content mining. World Scientific Publishing.
Sharan, R., & Shamir, R. (2000). CLICK: A clustering algorithm for gene expression analysis.
In S. Miyano, R. Shamir, & T. Takagi (Eds.), Currents in computational molecular
biology (pp. 6-7). Universal Academy Press.
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 22(8), 888-905.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., et al. (1999).
Interpreting patterns of gene expression with self-organizing maps: Methods and
application to hematopoietic differentiation. Proceedings of the National Acad. Sci.
USA (vol. 96, pp. 2907-2912).
Tomida, S., Hanai, T., Honda, H., & Kobayashi, T. (2001). Gene expression analysis using
fuzzy ART. Genome Informatics, 12, 245-246.
Vakali, A., Pokorný, J., & Dalamagas, T. (2004). An overview of Web clustering practices. In
Lecture notes in computer science: Vol. 3268. Current Trends in Database Technology:
International Workshop on Database Technologies for Handling XML Information on
the Web, DataX, EDBT 2004 (pp. 597-606). Springer Verlag.

Wang, H. C., Dopazo, J., & Carazo, J. M. (1998). Self-organizing tree growing network for
classifying amino acids. Bioinformatics, 14(4), 376-377.
Wang, Y., & Kitsuregawa, M. (2002). Evaluating contents-link Web page clustering for Web
search results. CIKM’02 (pp. 499-506).
White, S., & Smyth, P. (2005). A spectral clustering approach to finding communities in
graph. SDM.
Wu, C., Zhao, S., Chen, H. L., Lo, C. J., & McLarty, J. (1996). Motif identification neural
design for rapid and sensitive protein family search. CABIOS, 12(2), 109-118.
Wu, C. H. (1995). Gene classification artificial neural system. In R. F. Doolittle (Ed.),
Methods in enzymology: Computer methods for macromolecular sequence analysis.
New York: Academic Press.
Yao, Y., Chen, L., & Chen, Y. Q. (2001). Associative clustering for clusters of arbitrary
distribution shapes. Neural Processing Letters, 14, 169-177.
Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration.
Proceedings of the 21st International ACM SIGIR Conference on Research and
Development in Information Retrieval (pp. 46-54).
Zamir, O., & Etzioni, O. (1999). Grouper: A dynamic clustering interface to Web search
results. Computer Networks: The International Journal of Computer and
Telecommunications Networking Archive, 31(11-16), 1361-1374.
Zamir, O., Etzioni, O., Madanim, O., & Karp, R. M. (1997). Fast and intuitive clustering
of Web documents. Proceedings of the 3rd International Conference on Knowledge
Discovery and Data Mining (pp. 287-290).
Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering
method for very large databases. ACM SIGMOD Record, 25(2), 103-114.


The Project Gutenberg eBook of History of the
Pirates Who Infested the China Sea From 1807
to 1810

This ebook is for the use of anyone anywhere in the United States
and most other parts of the world at no cost and with almost no
restrictions whatsoever. You may copy it, give it away or re-use it
under the terms of the Project Gutenberg License included with this
ebook or online at www.gutenberg.org. If you are not located in the
United States, you will have to check the laws of the country where
you are located before using this eBook.
Title: History of the Pirates Who Infested the China Sea From 1807
to 1810
Author: Yung-lun Yüan
Translator: Karl Friedrich Neumann
Release date: November 23, 2013 [eBook #44261]
Most recently updated: October 23, 2024
Language: English
Credits: Produced by Charlene Taylor and the Online Distributed
Proofreading Team at http://www.pgdp.net (This file was
produced from images generously made available by The
Internet Archive/American Libraries.)
*** START OF THE PROJECT GUTENBERG EBOOK HISTORY OF THE
PIRATES WHO INFESTED THE CHINA SEA FROM 1807 TO 1810 ***

HISTORY
OF
THE PIRATES
WHO
INFESTED THE CHINA SEA,
From 1807 to 1810.
TRANSLATED FROM THE CHINESE ORIGINAL,
WITH
NOTES AND ILLUSTRATIONS,
BY
CHARLES FRIED. NEUMANN.

LONDON:
PRINTED FOR THE ORIENTAL TRANSLATION FUND,
And Sold by
J. MURRAY, ALBEMARLE STREET;
PARBURY, ALLEN, & CO., LEADENHALL STREET;
THACKER & CO., CALCUTTA; TREUTTEL & WÜRTZ, PARIS;
AND E. FLEISCHER, LEIPSIG.
1831.
LONDON
Printed by J. L. Cox, Great Queen Street,
Lincoln's-Inn Fields

CONTENTS
TRANSLATOR'S PREFACE.
YING HING SOO's PREFACE.
KING CHUNG HO's PREFACE.
BOOK FIRST.
BOOK SECOND.
APPENDIX.

TRANSLATOR'S PREFACE.
Conquerors are deemed successful robbers, while robbers are
unsuccessful conquerors. If the founder of the dynasty of the Ming
had failed in his rebellion against the Moguls, history would have
called him a robber; and if any one of the various robber-chiefs, who
in the course of the two last centuries made war against the reigning
Manchow, had overthrown the government of the foreigners, the
official historiographers of the Middle empire would have called
him the far-famed, illustrious elder father of the new dynasty.
Robbers or pirates are usually ignorant of the principles concerning
human society. They are not aware that power is derived from the
people for the general advantage, and that when it is abused to a
certain extent, all means of redress resorted to are legitimate. But
they feel most violently the abuse of power. The fruit of labour is too
often taken out of their hands, justice sold for money, and nothing is
safe from their rapacious and luxurious masters. People arise to
oppose, and act according to the philosophical principles of human
society, without having any clear idea about them. Robbers and
pirates are, in fact, the opposition party in the despotical empires of
the East; and their history is far more interesting than that of the
reigning despot.[1] The sameness which is to be observed in the
history of all Asiatic governments, presents a great difficulty to any
historian who wishes to write a history of any nation in Asia for the
general reader.
The history of the transactions between Europeans and the Chinese
is intimately connected with that of the pirate chiefs who appeared
from time to time in the Chinese Sea, or Southern Ocean. The
Europeans themselves, at their first appearance in the middle
empire, only became known as pirates. Simon de Andrada, the first
Portuguese who (1521) tried to establish any regular trade with

China, committed violence against the merchants, and bought young
Chinese to use them as slaves; and it is known that it was the policy
of the civilized foreigners from the Great Western Ocean (which is
the Chinese name for Europe) to decry their competitors in trade as
pirates and outlaws.
The footing which Europeans and Americans now enjoy in China,
originated from the assistance given by the Portuguese to the
Manchow against the Patriots, otherwise called pirates, who would
not submit to the sway of foreigners. Macao, the only residence (or
large prison) in which foreigners are shut up, is not considered by
the Chinese Government as belonging exclusively to the Portuguese.
The Dutch, on not being allowed to remain in Macao, complained to
the Chinese Government, and the authorities of the middle empire
commanded the Portuguese to grant houses to the newly arrived
Holan or Hollander, since Macao was to be considered as the abode
of all foreigners trading with China. The edicts concerning this
transaction are stated to be now in the archives of the Dutch factory
at Macao.
It is one of the most interesting facts in the history of the Chinese
empire, that the various barbarous tribes, who subdued either the
whole or a part of this singular country, were themselves ultimately
subdued by the peculiar civilization of their subjects. The Kitans,
Moguls, and Manchow, became, in the course of time, Chinese
people; like the Ostro, and Visigoths, and Longobards—Romans. But
we may remark, that both the Chinese and the Roman civilization
under the Emperors recommended itself to the conquerors, as
connected with a despotism which particularly suited the views of
the conquerors. Though this large division of the human race, which
we are accustomed to call Tatars, never felt a spark of that liberty
which everywhere animated the various German nations and tribes,
and the Khakhans, in consequence of this, were not in need of any
foreign policy to enslave their compatriots; yet it may be said, that
neither Moguls nor Manchow were able to establish a despotic form
of government which worked so well for a large nation as that of the
Chinese.

The extremes of both despotism and democracy acknowledge no
intermediary power or rank. The sovereign is the vice-regent of
heaven, and all in all; he is the only rule of right and wrong, and
commands both what shall be done in this world and thought of
concerning the next. It may be easily imagined, that the Jesuits, on
their first arrival in China, were delighted with such a perfect
specimen of government according to their political sentiments. They
tried all that human power could command to succeed in the
conversion of this worldly paradise. The fathers disguised themselves
as astronomers, watchmakers, painters, musicians, and engineers.[2]
They forged inscriptions[3] and invented miracles, and almost went
to the extent of canonizing Confucius. But this cunning deference to
Chinese customs involved the Jesuits in a dispute with their more
pious but less prudent competitors; and notwithstanding all the
cleverness of the Jesuits, the Chinese saw at last, that in becoming
Roman Catholic Christians they must cease to be Chinese, and obey
a foreign sovereign in the Great Western Ocean. Toland affirms, that
the Chinese and the Irish, in the time of their heathen monarch
Laogirius, were the only nations in which religious persecutions
never existed;[4] this praise now refers exclusively to Ireland. Roman
Catholicism is at this moment nearly extinguished in China. To
become a Christian is considered high-treason, and the only Roman
Catholic priest at Canton at the present time, is compelled to hide
himself under the mask of a shopkeeper. In their successful times,
during the seventeenth century, the Roman Catholic Missionaries
published in Europe, that no nation was more virtuous, nor any
government more enlightened than that of the Chinese; these false
eulogies were the source of that high opinion in which the Chinese
were formerly held in Europe.
The merchants and adventurers who came to China to make
money found both the government and people widely different from
descriptions given by the Jesuits. They found that the Chinese
officers of government, commonly called Mandarins, would think
themselves defiled by the least intercourse with foreigners,
particularly merchants; and that the laws are often interpreted quite
differently before and after receiving bribes. The Europeans were
proud of their civilization and cleverness in mercantile transactions,
and considered the inhabitants of all the other parts of the world as
barbarians; but they found, to their astonishment and
disappointment, the Chinese still more proud and cunning. We may
easily presume that these deluded merchants became very irritated,
and in their anger they reported to their countrymen in Europe that
the Chinese were the most treacherous and abandoned people in
the world,[5] that they were only a peculiar race of savages, and
required to be chastised in one way or another; which would
certainly be very easy. Commodore Anson, with a single weather-
beaten sixty-gun ship, in fact, set the whole power of the Chinese
Government at defiance.
The Translator of the History of the Pirates ventures to affirm, that
the Chinese system of government is by far the best that ever
existed in Asia; not excepting any of the different monarchies
founded by the followers of Alexander, the government of the
Roman Prætors and of Byzantine Dukes, or that of Christian Kings
and Barons who reigned in various parts of the East during the
middle ages. The principles of Chinese government are those of
virtue and justice; but they are greatly corrupted by the passions
and vices of men. The greater part of their laws are good and just,
though the practice is often bad; but unfortunately this is generally
not known to the Son of Heaven. It is the interest of the Emperor
to deal out justice to the lowest of his subjects; but, supposing it
were possible that one man could manage the government of such
an immense empire, who either could or would dare to denounce
every vicious or unjust act of the officers employed by government?
The Chinese themselves are a clever shrewd sort of people; deceit
and falsehood are, perhaps, more generally found in the flowery
empire than any where else; but take them all in all, they rank high
in the scale of nations, and the generality of the people seem to be
quite satisfied with their government; they may wish for a change of
masters, but certainly not for an entire change of the system of
government.
There has existed for a long period, and still exists, a powerful party
in the Chinese Empire, which is against the dominion of the
Manchow; the different mountainous tribes maintain, even now, in
the interior of China, a certain independence of the Tay tsing
dynasty. The Meao tsze, who were in Canton some years ago,
stated, with a proud feeling, that they were Ming jin, people of Ming;
the title of the native sovereigns of China before the conquest of the
Manchow. It is said, that the whole disaffected party is united in a
society—generally called the Triade-Union—and that they aimed at
the overthrow of the Tatars, particularly under the weak government
of the late Emperor; but the rebels totally failed in their object both
by sea and land.
It has been falsely reported in Europe, that it is not allowed by the
laws of China to publish the transactions of the reigning dynasty. It
is true that the history written by the official or imperial historians is
not published; but there is no statute which prohibits other persons
from writing the occurrences of their times. It may be easily
imagined that such authors will take especial care not to state any
thing which may be offensive to persons in power. There is, however,
no official court in China to regulate the course of the human
understanding, there is nothing like that tribunal which in the
greater part of the Continent of Europe is called the Censorship. Fear
alone is quite sufficient to check the rising spirits of the liberals in
the middle empire. The reader, therefore, should not expect that
either the author of the History of the Rebellions in the Interior of
China, or the writer of the Pacification of the Pirates, would
presume to state that persons whom government is pleased to style
robbers and pirates, are in reality enemies of the present dynasty;
neither would they state that government, not being able to quell
these rebellions, are compelled to give large recompenses to the
different chiefs who submit. These facts are scarcely hinted at in the
Chinese histories. The government officers are usually delineated as
the most excellent men in the world. When they run away, they
know before-hand that fighting will avail nothing; and when they
pardon, they are not said to be compelled by necessity, but it is
described as an act of heavenly virtue! From what we learn by the
statements of a Chinese executioner, we should be led to form a bad
opinion of the veracity of these historians, and the heavenly virtue of
their government; for it is said, that one Chinese executioner
beheaded a thousand pirates in one year.[6]
The author of the following work is a certain Yung lun yuen, called
Jang sëen,[7] a native of the city or market town Shun tih, eighty le
southerly from Canton. The great number of proper names, of
persons and places, to be found in the History of the Pacification of
the Pirates, together with the nicknames and thieves' slang
employed by the followers of Ching yĭh, presented peculiar
difficulties in the translation of Yuen's publication. The work was
published in November 1830 at Canton; and it is to be regretted, for
the fame of the author in the Great Western Ocean, that he used
provincial and abbreviated characters. I will not complain that by so
doing he caused many difficulties to his translator, for a native of
Shun tih would not trouble himself on that point; but I have reason
to believe that the head schoolmaster of Kwang tung will think it an
abomination that Yung lun yuen should dare take such liberties in a
historical composition. Schoolmasters have a greater sway in China
than any where else, and they like not to be trifled with. These are
particularly the men, who, above all others, oppose any innovation
or reform; scholars, who presume to know every thing between
heaven and earth: and they may certainly satisfy every man, who
will rest satisfied by mere words. These learned gentlemen are too
much occupied with their own philosophical and literary disquisitions,
to have any time, or to think it worth their notice, to pay attention to
surrounding empires or nations. If we consider the scanty and
foolish notices which are found in recent Chinese publications
regarding those nations with which the Chinese should be well
acquainted, we cannot but form a very low estimate of the present
state of Chinese literature. How far otherwise are the accounts of
foreign nations, which are to be found in the great work of
Matuanlin! It will, perhaps, be interesting to the European reader to
learn, what the Chinese know and report concerning the nations of
Ta se yang, or the Great Western Ocean. I therefore take an
opportunity here to give some extracts from a Chinese publication
relative to European nations, printed last year at Canton.
The fifty-seventh book of the Memoirs concerning the South of the
Mei ling Mountains, contains a history of all the Southern barbarians
(or foreigners); and here are mentioned—with the Tanka people and
other barbarous tribes of Kwang tung and Kwang se—the Siamese,
the Mahometans, the French, Dutch, English, Portuguese, Austrians,
Prussians, and Americans. The work was published by the command
of Yuen, the ex-Governor-General of Canton, who is considered one
of the principal living literary characters of China, and it consists
chiefly of extracts from the voluminous history of the province
Kwang tung, published by his Excellency:—
The Religion of the Hwy hwy, or Mahometans.
This religion is professed by various sorts of barbarians who
live southerly beyond Chen ching (Tséamba, or Zeampa), to the
Se yu. Their doctrines originated in the kingdom of Me tih no
(Medina). They say that heaven is the origin of all things; they
do not use any images. Their country is close to Tëen choo
(India); their customs are quite different from those of the
Buddhists; they kill living creatures, but they do not eat
indiscriminately all that is killed; they eat not hog's flesh, and
this is the essence of the doctrine of Hwy hwy. They have now a
foreign pagoda (fan tă), near the temple of the compassionate
saint (in Canton), which exists since the time of the Tang. It is
of a spiral form, and 163 cubits high.[8] They go every day
therein to say prayers.
By the kindness of Dr. Morrison, the translator had the pleasure to
converse with a member of the Mahometan clergy at Canton. He
stated, that in the Mosque at Canton is a tablet, whereon it is
written, that the religion of the Prophet of Mecca was brought to
China, Tang ching yuen san nëen, that is, in the third year of the
period called Ching yuen, under the Tang dynasty, i.e. 787 of our
era.[9] The compilers of the Memoirs, &c. have taken their extract
from the historical work of Ho (4051, M.); they seem not to have any
knowledge of Matuanlin, where the Arabs are spoken of under the
name of Ta she. See the notes to my translation of the Chronicle of
Vahram, p. 76. During the time the translator was at Canton, there
arrived a pilgrim from Pekin on his way to Mecca.
The Fa lan se, Francs and Frenchmen.
The Fa lan se are also called Fo lang se, and now Fo lang ke. In
the beginning they adopted the religion of Buddha, but
afterwards they received the religion of the Lord of Heaven.
They are assembled together and stay in Leu song (Spain?);
they strive now very hard with the Hung maou or red-haired
people (the Dutch), and the Ying keih le (English); but the Fa
lan se have rather the worst of it. These foreigners, or
barbarians (e jin) wear white caps and black woollen hats; they
salute one another by taking off the hat. Regarding their
garments and eating and drinking, they have the same customs
as the people of Great Leu song and Small Leu song (Spain and
Manilla).
This extract is taken from the Hwang tsing chĭh kung too, or the
Register of the Tribute as recorded under the present dynasty
(Memoirs, l. c. p. 10 v., p. 11 r.). I am not sure if Ke tsew (10,869)
keu (6,063) Leu song, can really be translated by the words—they
are assembled together and stay in Leu song. The use of tsew in the
place of tseu (10,826) is confirmed by the authorities in Kang he;
but does Leu song really mean Spain? The Philippinas are called Leu
song (Luzon), from the island whereon Manilla is, and in opposition
to Spain (Ta Leu song, the great L. s.), Seao Leu song, the small Leu
song. It may be doubted whether Leu song without Ta, great, can be
taken for Spain. The Chinese have moreover learned from Matthæus
Ricci the proper name of Spain, and write it She pan ya. The Dutch,
the English, and the Germans, are, from a reddish colour of their
hair, called Hung maou. This peculiar colour of the hair found among
people of German origin, is often spoken of by the ancient Roman
authors; as for instance in Tacitus, Germania, c. 4. Juvenal says, Sat.
XIII. v. 164,
Cærula quis stupuit Germani lumina? flavam
Cæsariem, et madido torquentem cornua cirro?
It would carry us too far at present to translate the statements of
the Chinese concerning the Portuguese and Dutch. Under the head
of Se yang, or Portugal, may be read an extract of the account of
Europe (Gow lo pa) the Chinese received by Paulus Matthæus Ricci
(Le ma paou). The Chinese know that the European Universities are
divided into four faculties; and his Excellency Yuen is aware of the
great similarity between the ceremonies of the Buddhists and those
of the Roman Catholic church (l. c. 17 v). The present Translator of
the History of the Pirates intends to translate the whole of the 57th
book of the often-quoted Memoirs, and to subjoin copious extracts
of other works, particularly from the Hae kwŏ hëen këen lăh, or
Memoirs concerning the Empires surrounded by the Ocean. This
very interesting small work is divided into two books; one containing
the text, and the other the maps. The text consists of eight
chapters, including a description of the sea-coast of China, with a
map, constructed on a large scale, of the nations to the east, the
south-east, and the south; then follows a topography of Portugal
and Europe generally. Concerning England we find:—
The Kingdom of the Ying keih le, or English.
The kingdom of the Ying keih le is a dependent or tributary
state[10] to Ho lan (Holland). Their garments and manners in
eating and drinking are the same. This kingdom is rather rich.
The males use much cloth and like to drink wine. The females,
before marriage, bind the waist, being desirous to look slender;
their hair hangs in curls over the neck; they use a short
garment and petticoats, but dress in a larger cloth when they go
out. They take snuff out of boxes made from gold and threads.
This extract is taken from the Register of the Tribute as recorded
under the present dynasty.
Ying keih le is a kingdom composed of three islands: it is in the
middle of four kingdoms, called Lin yin:[11] Hwang ke, the
yellow flag (Denmark), Ho lan, and Fo lang se. The Great
Western Ocean (Europe) worships the Lord of Heaven; and
there are, firstly, She pan ya (Spain), Poo keăh ya (Portugal),
the yellow flag, &c.; but there are too many kingdoms to
nominate them one by one. Ying keih le is a kingdom which
produces silver, woollen cloths,[12] camlets, peih ke, or English
cloth, called long ells,[13] glass, and other things of this kind.
This extract is taken from the Hae kwŏ hëen këen lăh, book i. p. 34
v. 35 r; and I am sorry to see that in the Memoirs it is abbreviated
in such a manner that the sense is materially changed.
Ying keih le, says the author of the Hae kwo hëen këen lăh (l.
c.), is a realm composed out of three islands. To the west and
the north of the four kingdoms of Lin yin, the Yellow flag, Ho lan,
and Fo lang se, is the ocean. From Lin yin the ocean takes its
direction to the east, and surrounds Go lo sse (Russia); and
from Go lo sse, yet more to the east, Se me le (Siberia?).
Through the northern sea you cannot sail; the sea is frozen, and
does not thaw, and for this reason it is called the Frozen Ocean.
From Lin yin, to the south, are the various empires of the Woo
and Kwei (Crows and Demons), and they all belong to the red-
haired people of the Great Western Ocean. On the west and on
the north there are different barbarians under various names;
* * * * * *
but they are, in one word, similar to the Go lo sse (Russians),
who stay in the metropolis (Pekin). It is said that the Kaou chun
peih mow (?) are similar to the inhabitants of the Middle
Empire; they are of a vigorous body and an ingenious mind. All
that they produce is fine and strong; their attention is directed
to making fire-arms. They make researches in astronomy and
geography, and generally they do not marry. Every kingdom has
a particular language, and they greet one another by taking off
the hat. They worship, &c. (The same as p. xxx.)
My copy of the Hae kwŏ hëen këen lăh was printed in the province
Che keang, in the year 1794.
In the narrative regarding foreign countries, and forming part
of the history of the Ming, the English are called Yen go le; in
the Hae kwŏ hëen këen lŭh, Ying ke le (5272, 6950); but in the
maps the name is now always written Ying keih le (5018, 6947).
In expressing the sound of words we sometimes use different
characters. This kingdom lies to the west of Gow lo pa (Europa),
and was originally a tributary state to Ho lan (Holland); but in
the course of time it became richer and more powerful than Ho
lan, and revolted. These kingdoms are, therefore, enemies. It is
not known at what time the Ying keih le grasped the country of
North O mŏ le kea (America), which is called Kea no (Canada).
Great Ying keih le is a kingdom of Gow lo pa (Europe.)[14] In
the twelfth year of Yung ching (1735), they came the first time
to Canton for trade. Their country produces wheat, with which
they trade to all the neighboring countries. They are generally
called Keang heŏ (that is, English ships from India, or country
ships), and there arrive many vessels.
This extract is taken from the Tan chay hëen këen lăh, and it is all
that we find regarding England in the Memoirs concerning the south
of the Mei ling Mountains (p. 18 r. v.). In the latter extract, the
author appears to confound the country trade of India and China
with that of the mother country. England is again mentioned in the
notice regarding Me le keih (America), taken out of Yuen's History of
Canton. It is there said, that the Me le keih passed, in the 52d year
of Këen lung (1788), the Bocca Tigris, and that they then separated
from the Ying keih le (p. 19 r.) At the end of the extract concerning
the Americans (p. 190) we read the following words:
The characters which are used in the writings of these realms
are, according to the statements of Ma lo ko, twenty-six; all
sounds can be sufficiently expressed by these characters. Every
realm has large and small characters; they are called La ting
characters, and La te na (Latin) characters.
It is pleasing to observe that his Excellency Yuen had some
knowledge of Dr. Morrison's Dictionary. In the third part of his
Dictionary, Dr. Morrison has given, in Chinese, a short and clear
notice concerning the European alphabet. Yuen seems to have taken
his statements from this notice, and to have written the name of the
author, by a mistake, Ma lo ko, for Ma le so, as Dr. Morrison is
generally called by the Chinese.
The Man ying, the Double Eagle, or Austrians.
The Man ying passed the Bocca Tigris the first time in the 45th
year of Këen lung (1781), and are called Ta chen (Teutchen).
They have accepted the religion of the Lord of Heaven. In
customs and manners they are similar to the Se yang, or
Portuguese; they are the brethren of the Tan ying, or Single
eagle kingdom (Prussia); in difficulties and distress they help
one another. Their ships which came to Canton had a white flag,
on which an eagle was painted with two heads.
This extract is taken from the History of Yuen. I take the liberty to
observe, that the Chinese scholar must be careful not to take the Sui
chen, or Chen kwŏ (the Swedes), for the Ta chen (the Teutchen). In
the Memoirs, l. c. p. 19 v., we read the following notice on the Chen
kwŏ (the Swedes):

The Chen realm is also called Tan (Denmark) realm, and now
the yellow flag. This country is opposite to that of the Ho lan,
and a little farther off from the sea. There are two realms called
Sui chen, and they border both on the Go lo sse, or Russia.
They passed the Bocca Tigris the first year of Këen lung
(1765).
The Tan ying, the Single Eagle or Prussians.
The Tan ying passed the Bocca Tigris the 52d year of Këen
lung (1788.) They live to the west and north of the Man ying
(Austrians). In customs and manners they are similar to them.
On their ships flies a white flag, on which an eagle is painted.
This last extract is also taken from the History of Canton, published
by his Excellency Yuen.
If we consider how easily the Chinese could procure information
regarding foreign countries during the course of the two last
centuries, and then see how shamefully they let pass all such
opportunities to inform and improve themselves, we can only look
upon these proud slaves of hereditary customs with the utmost
disgust and contempt. The ancient Britons and Germans had no
books; yet what perfect descriptions of those barbarian nations have
been handed down to us by the immortal genius of Tacitus!
Montesquieu says, that in Cæsar and Tacitus we read the code of
barbarian laws; and in the code we read Cæsar and Tacitus. In the
statement of the modern Chinese regarding foreign nations, we see,
on the contrary, both the want of enquiry, and the childish remarks
of unenlightened and uncultivated minds.[15]
