payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax 978-750-4470, or on the web at
http://www.copyright.com. Requests to the Publisher for permission should be addressed to the
Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201)
748-6011, fax (201) 748-6008, e-mail: permcoordinator@wiley.com.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best
efforts in preparing this book, they make no representations or warranties with respect to the
accuracy or completeness of the contents of this book and specifically disclaim any implied
warranties of merchantability or fitness for a particular purpose. No warranty may be created
or extended by sales representatives or written sales materials. The advice and strategies
contained herein may not be suitable for your situation. You should consult with a professional
where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any
other commercial damages, including but not limited to special, incidental, consequential, or
other damages.
For general information on our other products and services please contact our Customer Care
Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-
4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in
print, however, may not be available in electronic format.
Library of Congress Cataloging-in-Publication Data is available
0-471-22852-4
10 9 8 7 6 5 4 3 2 1
To Belma and Nermin
About the Author
Mehmed Kantardzic received B.S., M.S., and Ph.D. degrees in computer science from the
University of Sarajevo, Bosnia. Until 1994, he was an Associate Professor at the University of
Sarajevo, Faculty of Electrical Engineering. In 1995 Dr. Kantardzic joined the University of
Louisville, and since 2001 he has been an Associate Professor at the Computer Engineering
and Computer Science Department, University of Louisville, and the Director of the Data
Mining Lab. His research interests include data mining and knowledge discovery, soft computing, visualization of neural network generalizations, multimedia technologies on the Internet, and distributed intelligent systems. Dr. Kantardzic has recently focused his work on applications of data-mining technologies in biomedical research. Dr. Kantardzic is the author of five books,
and he initiated and led more than 30 research and development projects. He has also published
more than 120 articles in refereed journals and conference proceedings. Dr. Kantardzic is a
member of IEEE, ISCA, and SPIA, and he was Program Chair for ISCA'99 Conference in
Denver, CO, and General Chair for the International Conference on Intelligent Systems 2001
in Arlington, VA.
Preface
Traditionally, analysts have performed the task of extracting useful information from recorded
data. But, the increasing volume of data in modern business and science calls for computer-
based approaches. As data sets have grown in size and complexity, there has been an inevitable
shift away from direct hands-on data analysis toward indirect, automatic data analysis using
more complex and sophisticated tools. The modern technologies of computers, networks, and
sensors have made data collection and organization an almost effortless task. However, the captured data must be converted into information and knowledge to become useful. Data mining is the entire process of applying a computer-based methodology, including new techniques, for knowledge discovery from data.
The modern world is a data-driven one. We are surrounded by data, numerical and otherwise,
which must be analyzed and processed to convert it into information that informs, instructs,
answers, or otherwise aids understanding and decision-making. This is the age of the Internet,
intranets, data warehouses and data marts, and the fundamental paradigms of classical data
analysis are ripe for change. Very large collections of data, sometimes hundreds of millions of individual records, are being stored in centralized data warehouses, allowing analysts to use
more comprehensive, powerful data mining methods. While the quantity of data is huge, and
growing, the number of sources is unlimited, and the range of areas covered is vast: industrial,
commercial, financial, and scientific.
In recent years there has been an explosive growth of methods for discovering new knowledge
from raw data. In response to this, a new discipline of data mining has been specially developed
to extract valuable information from such huge data sets. Given the proliferation of low-cost
computers (for software implementation), low-cost sensors, communications, database
technology (to collect and store data), and computer-literate application experts who can pose
"interesting" and "useful" application problems, this is not surprising.
Data mining technology has recently become a hot topic for decision-makers because it
provides valuable, hidden business and scientific "intelligence" from a large amount of
historical data. Fundamentally, however, data mining is not a new technology. Extracting
information and knowledge from recorded data is a well-established concept in scientific and
medical studies. What is new is the convergence of several disciplines and corresponding
technologies that have created a unique opportunity for data mining in a scientific and corporate
world.
Originally, this book was intended to fulfill a wish for a single, introductory source to direct
students to. However, it soon became apparent that people from a wide variety of backgrounds,
and positions, confronted by the need to make sense of large amounts of raw data, would also
appreciate a compilation of some of the most important methods, tools, and algorithms in data
mining. Thus, this book was written for a range of readers, from students wishing to learn about
basic processes and techniques in data mining, to analysts and programmers who will be
engaged directly in interdisciplinary teams for selected data mining applications. This book
reviews state-of-the-art techniques for analyzing enormous quantities of raw data in high-
dimensional data spaces to extract new information useful to the decision-making process.
Most of the definitions, classifications, and explanations of the techniques covered in this book
are not new, and they are presented in references at the end of the book. One of my main goals
was to concentrate on a systematic and balanced approach to all phases of a data-mining process, and to present them with enough illustrative examples. I expect that carefully prepared examples will give the reader additional arguments and guidelines for selecting and structuring techniques and tools for their own data-mining applications. A better understanding of the implementation details of most of the introduced techniques will challenge readers to build their own tools or to improve the applied methods and techniques.
To teach data mining, one has to emphasize the concepts and properties of the applied methods,
rather than the mechanical details of applying different data mining tools. Despite all of their
attractive bells and whistles, computer-based tools alone will never replace the practitioner who
makes important decisions on how the process will be designed, and how and what tools will
be employed. A deeper understanding of methods and models, how they behave, and why, is a
prerequisite for efficient and successful application of data mining technology. Any researcher
or practitioner in this field needs to be aware of these issues in order to successfully apply a
particular methodology, understand a method's limitations, or develop new techniques. This is
an attempt to present and discuss such issues and principles, and then describe representative
and popular methods originating from statistics, machine learning, computer graphics,
databases, information retrieval, neural networks, fuzzy logic, and evolutionary computation.
It discusses approaches that have proven critical in revealing important patterns, trends, and
models in large data sets.
Although it is easy to focus on technologies, as you read through the book keep in mind that
technology alone does not provide the entire solution. One of my goals in writing this book
was to minimize the hype associated with data mining, rather than making false promises of
what can reasonably be expected. I have tried to take a more objective approach. I describe the
processes and algorithms that are necessary to produce reliable, useful results in data mining
applications.
I do not advocate the use of any particular product or technique over another; the designer of a data-mining process has to have enough background to select the appropriate methodologies
and software tools. I expect that once a reader has completed this text, he or she will be able to
initiate and perform basic activities in all phases of a data mining process successfully and
effectively.
Louisville, KY Mehmed Kantardzic
August, 2002
Chapter 1: Data-Mining Concepts
1.1 INTRODUCTION
Modern science and engineering are based on using first-principle models to describe physical,
biological, and social systems. Such an approach starts with a basic scientific model, such as
Newton's laws of motion or Maxwell's equations in electromagnetism, and then builds upon
them various applications in mechanical engineering or electrical engineering. In this approach,
experimental data are used to verify the underlying first-principle models and to estimate some
of the parameters that are difficult or sometimes impossible to measure directly. However, in
many domains the underlying first principles are unknown, or the systems under study are too
complex to be mathematically formalized. With the growing use of computers, there is a great
amount of data being generated by such systems. In the absence of first-principle models, such
readily available data can be used to derive models by estimating useful relationships between
a system's variables (i.e., unknown input-output dependencies). Thus there is currently a
paradigm shift from classical modeling and analyses based on first principles to developing
models and the corresponding analyses directly from data.
We have grown accustomed gradually to the fact that there are tremendous volumes of data
filling our computers, networks, and lives. Government agencies, scientific institutions, and
businesses have all dedicated enormous resources to collecting and storing data. In reality, only
a small amount of these data will ever be used because, in many cases, the volumes are simply
too large to manage, or the data structures themselves are too complicated to be analyzed
effectively. How could this happen? The primary reason is that the original effort to create a
data set is often focused on issues such as storage efficiency; it does not include a plan for how
the data will eventually be used and analyzed.
The need to understand large, complex, information-rich data sets is common to virtually all
fields of business, science, and engineering. In the business world, corporate and customer data
are becoming recognized as a strategic asset. The ability to extract useful knowledge hidden in
these data and to act on that knowledge is becoming increasingly important in today's
competitive world. The entire process of applying a computer-based methodology, including
new techniques, for discovering knowledge from data is called data mining.
Data mining is an iterative process within which progress is defined by discovery, through
either automatic or manual methods. Data mining is most useful in an exploratory analysis
scenario in which there are no predetermined notions about what will constitute an "interesting"
outcome. Data mining is the search for new, valuable, and nontrivial information in large
volumes of data. It is a cooperative effort of humans and computers. Best results are achieved
by balancing the knowledge of human experts in describing problems and goals with the search
capabilities of computers.
In practice, the two primary goals of data mining tend to be prediction and description.
Prediction involves using some variables or fields in the data set to predict unknown or future
values of other variables of interest. Description, on the other hand, focuses on finding patterns
describing the data that can be interpreted by humans. Therefore, it is possible to put data-
mining activities into one of two categories:
1. Predictive data mining, which produces the model of the system described by the given data set, or
2. Descriptive data mining, which produces new, nontrivial information based on the available data set.
On the predictive end of the spectrum, the goal of data mining is to produce a model, expressed
as an executable code, which can be used to perform classification, prediction, estimation, or
other similar tasks. On the other, descriptive, end of the spectrum, the goal is to gain an
understanding of the analyzed system by uncovering patterns and relationships in large data
sets. The relative importance of prediction and description for particular data-mining
applications can vary considerably. The goals of prediction and description are achieved by
using data-mining techniques, explained later in this book, for the following primary data-
mining tasks:
1. Classification – discovery of a predictive learning function that classifies a data item into one
of several predefined classes.
2. Regression – discovery of a predictive learning function that maps a data item to a real-valued prediction variable.
3. Clustering – a common descriptive task in which one seeks to identify a finite set of categories
or clusters to describe the data.
4. Summarization – an additional descriptive task that involves methods for finding a compact
description for a set (or subset) of data.
5. Dependency Modeling – finding a local model that describes significant dependencies between
variables or between the values of a feature in a data set or in a part of a data set.
6. Change and Deviation Detection – discovering the most significant changes in the data set.
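For illustration only, the following minimal Python sketch contrasts the two categories on an invented toy data set, using a decision-tree classifier for the predictive task and k-means clustering for the descriptive task; the features, labels, and scikit-learn models are illustrative choices, not prescriptions from the text.

# Illustrative sketch: a predictive task (classification) versus a descriptive task (clustering).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Invented toy samples: [age, annual income in $1000]
X = np.array([[25, 30], [32, 45], [47, 80], [51, 95], [38, 60], [29, 38]], dtype=float)

# Predictive data mining: class labels are known (1 = responded to an offer),
# and the goal is a model that predicts the class of new samples.
y = np.array([0, 0, 1, 1, 1, 0])
model = DecisionTreeClassifier(max_depth=2).fit(X, y)
print("Predicted class for a new sample:", model.predict([[45, 70]]))

# Descriptive data mining: no labels are used; the goal is human-interpretable
# structure, here a grouping of the samples into two clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Cluster assignment for each sample:", labels)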
The more formal approach, with graphical interpretation of data-mining tasks for complex and
large data sets and illustrative examples, is given in Chapter 4. Current introductory
classifications and definitions are given here only to give the reader a feeling for the wide spectrum of problems and tasks that may be solved using data-mining technology.
The success of a data-mining engagement depends largely on the amount of energy,
knowledge, and creativity that the designer puts into it. In essence, data mining is like solving
a puzzle. The individual pieces of the puzzle are not complex structures in and of themselves.
Taken as a collective whole, however, they can constitute very elaborate systems. As you try
to unravel these systems, you will probably get frustrated, start forcing parts together, and
generally become annoyed at the entire process; but once you know how to work with the
pieces, you realize that it was not really that hard in the first place. The same analogy can be
applied to data mining. In the beginning, the designers of the data-mining process probably do
not know much about the data sources; if they did, they would most likely not be interested in
performing data mining. Individually, the data seem simple, complete, and explainable. But
collectively, they take on a whole new appearance that is intimidating and difficult to
comprehend, like the puzzle. Therefore, being an analyst and designer in a data-mining process
requires, besides thorough professional knowledge, creative thinking and a willingness to see
problems in a different light.
Data mining is one of the fastest growing fields in the computer industry. Once a small interest
area within computer science and statistics, it has quickly expanded into a field of its own. One
of the greatest strengths of data mining is reflected in its wide range of methodologies and
techniques that can be applied to a host of problem sets. Since data mining is a natural activity to be performed on large data sets, one of its largest target markets is the entire data-warehousing, data-mart, and decision-support community, encompassing professionals from such industries as retail, manufacturing, telecommunications, healthcare, insurance, and transportation. In the business community, data mining can be used to discover new purchasing
trends, plan investment strategies, and detect unauthorized expenditures in the accounting
system. It can improve marketing campaigns and the outcomes can be used to provide
customers with more focused support and attention. Data-mining techniques can be applied to
problems of business process reengineering, in which the goal is to understand interactions and
relationships among business practices and organizations.
Many law enforcement and special investigative units, whose mission is to identify fraudulent
activities and discover crime trends, have also used data mining successfully. For example,
these methodologies can aid analysts in the identification of critical behavior patterns in the
communication interactions of narcotics organizations, the monetary transactions of money
laundering and insider trading operations, the movements of serial killers, and the targeting of
smugglers at border crossings. Data-mining techniques have also been employed by people in
the intelligence community who maintain many large data sources as a part of the activities
relating to matters of national security. Appendix B of the book gives a brief overview of
typical commercial applications of data-mining technology today.
Chapter 2: Preparing the Data
2.1 REPRESENTATION OF RAW DATA
Data samples introduced as rows in Figure 1.4 are basic components in a data-mining process.
Every sample is described with several features and there are different types of values for every
feature. We will start with the two most common types: numeric and categorical. Numeric
values include real-value variables or integer variables such as age, speed, or length. A feature
with numeric values has two important properties: its values have an order relation (2 < 5 and
5 < 7) and a distance relation (d(2.3, 4.2) = 1.9).
In contrast, categorical (often called symbolic) variables have neither of these two relations.
The two values of a categorical variable can be either equal or not equal: they only support an
equality relation (Blue = Blue, or Red ≠ Black). Examples of variables of this type are eye
color, sex, or country of citizenship. A categorical variable with two values can be converted,
in principle, to a numeric binary variable with two values: 0 or 1. A categorical variable with
N values can be converted into N binary numeric variables, namely, one binary variable for
each categorical value. These coded categorical variables are known as "dummy variables" in
statistics. For example, if the variable eye-color has four values: black, blue, green, and brown,
they can be coded with four binary digits.
Feature value Code
Black 1000
Blue 0100
Green 0010
Brown 0001
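For illustration, a minimal Python sketch of this dummy coding, with an invented helper dummy_code and the four eye-color values listed above:

# Illustrative sketch: coding a categorical feature as binary "dummy variables".
EYE_COLORS = ["black", "blue", "green", "brown"]

def dummy_code(value: str) -> list[int]:
    """Return one binary indicator per possible categorical value."""
    return [1 if value == color else 0 for color in EYE_COLORS]

print(dummy_code("blue"))   # [0, 1, 0, 0]
print(dummy_code("brown"))  # [0, 0, 0, 1]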
Another way of classifying a variable, based on its values, is to look at it as either a continuous variable or a discrete variable.
Continuous variables are also known as quantitative or metric variables. They are measured
using either an interval scale or a ratio scale. Both scales allow the underlying variable to be
defined or measured theoretically with infinite precision. The difference between these two
scales lies in how the zero point is defined in the scale. The zero point in the interval scale is
placed arbitrarily and thus it does not indicate the complete absence of whatever is being
measured. The best example of the interval scale is the temperature scale, where zero degrees
Fahrenheit does not mean a total absence of temperature. Because of the arbitrary placement
of the zero point, the ratio relation does not hold true for variables measured using interval
scales. For example, 80 degrees Fahrenheit does not imply twice as much heat as 40 degrees
Fahrenheit. In contrast, a ratio scale has an absolute zero point and, consequently, the ratio
relation holds true for variables measured using this scale. Quantities such as height, length,
and salary use this type of scale. Continuous variables are represented in large data sets with values that are numbers, either real or integer.
Discrete variables are also called qualitative variables. Such variables are measured, or their values defined, using one of two kinds of nonmetric scales: nominal or ordinal. A nominal scale is an order-less scale, which uses different symbols, characters, and numbers to represent the different states (values) of the variable being measured. An example of a nominal variable is a utility customer-type identifier with possible values residential, commercial, and industrial. These values can be coded alphabetically as A, B, and C, or numerically as 1, 2, or 3, but they do not have the metric characteristics that other numeric data have. Another example of a
nominal attribute is the zip-code field available in many data sets. In both examples, the
numbers used to designate different attribute values have no particular order and no necessary
relation to one another.
An ordinal scale consists of ordered, discrete gradations, e.g., rankings. An ordinal variable is
a categorical variable for which an order relation is defined but not a distance relation. Some
examples of an ordinal attribute are the rank of a student in a class and the gold, silver, and bronze medal positions in a sports competition. The ordered scale need not necessarily be linear; e.g., the difference between the students ranked 4th and 5th need not be identical to the difference between the students ranked 15th and 16th. All that can be established from an ordered scale for ordinal attributes is greater-than, equal-to, or less-than relations. Typically,
ordinal variables encode a numeric variable onto a small set of overlapping intervals
corresponding to the values of an ordinal variable. These ordinal variables are closely related
to the linguistic or fuzzy variables commonly used in spoken English; e.g., AGE (with values
young, middle-aged, and old) and INCOME (with values low, middle-class, upper-middle-
class, and rich). More about the formalization and use of fuzzy values in a data-mining process
has been given in Chapter 11.
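For illustration, a minimal Python sketch that encodes a numeric AGE variable onto the ordinal values young, middle-aged, and old; the cut points at 30 and 60 are invented purely to show the idea:

# Illustrative sketch: encoding a numeric variable as an ordinal variable.
def age_to_ordinal(age: float) -> str:
    if age < 30:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "old"

print([age_to_ordinal(a) for a in (12, 45, 71)])  # ['young', 'middle-aged', 'old']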
A special class of discrete variables is periodic variables. A periodic variable is a feature for
which the distance relation exists but there is no order relation. Examples are days of the week,
days of the month, or year. Monday and Tuesday, as the values of a feature, are closer than
Monday and Thursday, but Monday can come before or after Friday.
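For illustration, a minimal Python sketch of a distance measure for such a periodic feature, using a wrap-around (circular) distance over the days of the week; under this measure Monday and Thursday are farther apart than Monday and Tuesday, yet no order relation is imposed:

# Illustrative sketch: a wrap-around distance for a periodic feature.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def day_distance(a: str, b: str) -> int:
    """Distance on a 7-day cycle; there is no 'before/after' order."""
    diff = abs(DAYS.index(a) - DAYS.index(b))
    return min(diff, len(DAYS) - diff)

print(day_distance("Mon", "Tue"))  # 1
print(day_distance("Mon", "Thu"))  # 3
print(day_distance("Mon", "Sun"))  # 1 (wraps around)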
Finally, one additional dimension of classification of data is based on its behavior with respect
to time. Some data do not change with time and we consider them static data. On the other
hand, there are attribute values that change with time and this type of data we call dynamic or
temporal data. The majority of the data-mining methods are more suitable for static data, and
special consideration and some preprocessing are often required to mine dynamic data.
Most data-mining problems arise because there are large amounts of samples with different
types of features. Besides, these samples are very often high dimensional, which means they have an extremely large number of measurable features. This additional dimension of large data
sets causes the problem known in data-mining terminology as "the curse of dimensionality".
The "curse of dimensionality" is produced because of the geometry of high-dimensional spaces,
and these kinds of data spaces are typical for data-mining problems. The properties of high-
dimensional spaces often appear counterintuitive because our experience with the physical
world is in a low-dimensional space, such as a space with two or three dimensions.
Conceptually, objects in high-dimensional spaces have a larger surface area for a given volume
than objects in low-dimensional spaces. For example, a high-dimensional hypercube, if it could
be visualized, would look like a porcupine, as in Figure 2.1. As the dimensionality grows larger,
the edges grow longer relative to the size of the central part of the hypercube. Four important
properties of high-dimensional data are often the guidelines in the interpretation of input data
and data-mining results.
Figure 2.1: High-dimensional data looks conceptually like a porcupine
1. The size of a data set yielding the same density of data points in an n-dimensional space
increases exponentially with dimensions. For example, if a one-dimensional sample containing
n data points has a satisfactory level of density, then to achieve the same density of points in k
dimensions, we need n^k data points. If the integers 1 to 100 are the values of one-dimensional samples, where the domain of the dimension is [0, 100], then to obtain the same density of samples in a 5-dimensional space we will need 100^5 = 10^10 different samples. This is true even for the largest
real-world data sets; because of their large dimensionality, the density of samples is still
relatively low and, very often, unsatisfactory for data-mining purposes.
2. A larger radius is needed to enclose a fraction of the data points in a high-dimensional space.
For a given fraction of samples, it is possible to determine the edge length e of the hypercube
using the formula e_d(p) = p^(1/d), where p is the prespecified fraction of samples and d is the number of dimensions. For example,
if one wishes to enclose 10% of the samples (p = 0.1), the corresponding edge for a two-
dimensional space will be e2(0.1) = 0.32, for a three-dimensional space e3(0.1) = 0.46, and for
a 10-dimensional space e10(0.1) = 0.80. Graphical interpretation of these edges is given in
Figure 2.2.
Figure 2.2: Regions enclose 10% of the samples for 1-, 2-, and 3-dimensional spaces
This shows that a very large neighborhood is required to capture even a small portion of the
data in a high-dimensional space.
3. Almost every point is closer to an edge than to another sample point in a high-dimensional
space. For a sample of size n, the expected distance D between data points in a d-dimensional space is
For example, for a two-dimensional space with 10000 points the expected distance is
D(2,10000) = 0.0005 and for a 10-dimensional space with the same number of sample points
D(10,10000) = 0.4. Keep in mind that the maximum distance from any point to the edge occurs
at the center of the distribution, and it is 0.5 for normalized values of all dimensions.
4. Almost every point is an outlier. As the dimension of the input space increases, the distance
between the prediction point and the center of the classified points increases. For example,
when d = 10, the expected value of the prediction point is 3.1 standard deviations away from
the center of the data belonging to one class. When d = 20, the distance is 4.4 standard
deviations. From this standpoint, the prediction of every new point looks like an outlier of the
initially classified data. This is illustrated conceptually in Figure 2.1, where predicted points
are mostly in the edges of the porcupine, far from the central part.
These rules of the "curse of dimensionality" most often have serious consequences when
dealing with a finite number of samples in a high-dimensional space. From properties (1) and
(2) we see the difficulty in making local estimates for high-dimensional samples; we need more
and more samples to establish the required data density for performing planned mining activities. Properties (3) and (4) indicate the difficulty of predicting a response at a given point, since any new point will on average be closer to an edge than to the training examples in the central part. A small numerical illustration of properties (1) and (2) is sketched below.
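For illustration, a minimal Python sketch of the arithmetic behind properties (1) and (2): the required sample size grows as n^k, and the enclosing edge length is computed as p^(1/d).

# Illustrative sketch: arithmetic behind two "curse of dimensionality" properties.

# Property (1): to keep the density of n points from one dimension in k dimensions,
# n**k points are needed.
n, k = 100, 5
print("Samples needed in 5 dimensions:", n ** k)   # 10**10

# Property (2): edge length of the hypercube enclosing a fraction p of the samples.
def edge_length(p: float, d: int) -> float:
    return p ** (1.0 / d)

for d in (2, 3, 10):
    print("d =", d, " edge enclosing 10% of samples =", round(edge_length(0.1, d), 2))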
Chapter 3: Data Reduction
OVERVIEW
For small or moderate data sets, the preprocessing steps mentioned in the previous chapter in
preparation for data mining are usually enough. For really large data sets, there is an increased
likelihood that an intermediate, additional step, data reduction, should be performed prior to
applying the data-mining techniques. While large data sets have the potential for better mining
results, there is no guarantee that they will yield better knowledge than small data sets. Given
multidimensional data, a central question is whether it can be determined, prior to searching
for all data-mining solutions in all dimensions, that the method has exhausted its potential for
mining and discovery in a reduced data set. More commonly, a general solution is deduced
from a subset of available features or cases, and it will remain the same even when the search
space is enlarged.
The main theme for simplifying the data in this step is dimension reduction, and the main
question is whether some of these prepared and preprocessed data can be discarded without
sacrificing the quality of results. There is one additional question about techniques for data
reduction: Can the prepared data be reviewed and a subset found in a reasonable amount of
time and space? If the complexity of algorithms for data reduction increases exponentially,
then there is little to gain in reducing dimensions in big data. In this chapter, we will present
basic and relatively efficient techniques for dimension reduction applicable to different data-
mining problems.
Chapter 4: Learning from Data
OVERVIEW
Many recent approaches to developing models from data have been inspired by the learning
capabilities of biological systems and, in particular, those of humans. In fact, biological systems
learn to cope with the unknown, statistical nature of the environment in a data-driven fashion.
Babies are not aware of the laws of mechanics when they learn how to walk, and most adults
drive a car without knowledge of the underlying laws of physics. Humans as well as animals
also have superior pattern-recognition capabilities for such tasks as identifying faces, voices,
or smells. People are not born with such capabilities, but learn them through data-driven
interaction with the environment.
It is possible to relate the problem of learning from data samples to the general notion of
inference in classical philosophy. Every predictive-learning process consists of two main
phases:
1. Learning or estimating unknown dependencies in the system from a given set of samples, and
2. Using estimated dependencies to predict new outputs for future input values of the system.
These two steps correspond to the two classical types of inference known as induction
(progressing from particular cases, i.e., training data, to a general mapping or model) and deduction
(progressing from a general model and given input values to particular cases of output values).
These two phases are shown graphically in Figure 4.1.
Figure 4.1: Types of inference: induction, deduction, and transduction
A unique estimated model implies that a learned function can be applied everywhere, i.e., for
all possible input values. Such global-function estimation can be overkill, because many
practical problems require one to deduce estimated outputs only for a few given input values.
In that case, a better approach may be to estimate the outputs of the unknown function for
several points of interest directly from the training data without building a global model. Such
an approach is called transductive inference in which a local estimation is more important than
a global one. An important application of the transductive approach is a process of mining
association rules, which has been described in detail in Chapter 8. It is very important to
mention that the standard formalization of machine learning does not apply to this type of
inference.
The process of inductive learning and estimating the model may be described, formalized, and
implemented using different learning methods. A learning method is an algorithm (usually
implemented in software) that estimates an unknown mapping (dependency) between a
system's inputs and outputs from the available data set, namely, from known samples. Once
such a dependency has been accurately estimated, it can be used to predict the future outputs
of the system from the known input values. Learning from data has been traditionally explored
in such diverse fields as statistics, engineering, and computer science. Formalization of the
learning process and a precise, mathematically correct description of different inductive-
learning methods were the primary tasks of disciplines such as statistical learning theory and
artificial intelligence. In this chapter, we will introduce the basics of these theoretical
fundamentals for inductive learning.
Chapter 5: Statistical Methods
OVERVIEW
Statistics is the science of collecting and organizing data and drawing conclusions from data
sets. The organization and description of the general characteristics of data sets is the subject
area of descriptive statistics. How to draw conclusions from data is the subject of statistical
inference. In this chapter, the emphasis is on the basic principles of statistical inference; other
related topics will be described briefly, only enough to understand the basic concepts.
Statistical data analysis is the most well-established set of methodologies for data mining.
Historically, the first computer-based applications of data analysis were developed with the
support of statisticians. Ranging from one-dimensional data analysis to multivariate data
analysis, statistics offered a variety of methods for data mining, including different types of
regression and discriminant analysis. In this short overview of statistical methods that support
the data-mining process, we will not cover all approaches and methodologies; a selection has
been made of the techniques used most often in real-world data-mining applications.
Chapter 6: Cluster Analysis
Cluster analysis is a set of methodologies for automatic classification of samples into a number
of groups using a measure of association, so that the samples in one group are similar and
samples belonging to different groups are not similar. The input for a system of cluster analysis
is a set of samples and a measure of similarity (or dissimilarity) between two samples. The
output from cluster analysis is a number of groups (clusters) that form a partition, or a structure
of partitions, of the data set. One additional result of cluster analysis is a generalized description
of every cluster, and this is especially important for a deeper analysis of the data set's
characteristics.
6.1 CLUSTERING CONCEPTS
Samples for clustering are represented as a vector of measurements, or more formally, as a
point in a multidimensional space. Samples within a valid cluster are more similar to each other
than they are to a sample belonging to a different cluster. Clustering methodology is
particularly appropriate for the exploration of interrelationships among samples to make a
preliminary assessment of the sample structure. Humans perform competitively with
automatic-clustering procedures in one, two, or three dimensions, but most real problems
involve clustering in higher dimensions. It is very difficult for humans to intuitively interpret
data embedded in a high-dimensional space.
Table 6.1 shows a simple example of clustering information for nine customers, distributed
across three clusters. Two features describe customers: the first feature is the number of items
the customers bought, and the second feature shows the price they paid for each.
Table 6.1: Sample set of clusters consisting of similar objects
            # of items   Price
Cluster 1        2        1700
                 3        2000
                 4        2300
Cluster 2       10        1800
                12        2100
                11        2500
Cluster 3        2         100
                 3         200
                 3         350
Customers in Cluster 1 purchase few high-priced items; customers in Cluster 2 purchase many
high-priced items, and customers in Cluster 3 purchase few low-priced items. Even this simple
example and interpretation of a cluster's characteristics shows that clustering analysis (in some
references also called unsupervised classification) refers to situations in which the objective is
to construct decision boundaries (classification surfaces) based on an unlabeled training data set.
The samples in these data sets have only input dimensions, and the learning process is classified
as unsupervised.
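For illustration, a minimal Python sketch that clusters the nine customers from Table 6.1 with the k-means algorithm from scikit-learn; standardizing the two features first is an assumption of this sketch (it keeps the price feature from dominating the distance measure), not a step prescribed by the table.

# Illustrative sketch: clustering the Table 6.1 customers with k-means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# [number of items bought, price paid]
customers = np.array([[2, 1700], [3, 2000], [4, 2300],
                      [10, 1800], [12, 2100], [11, 2500],
                      [2, 100], [3, 200], [3, 350]], dtype=float)

# Standardize features so that "price" does not dominate the distance measure.
X = StandardScaler().fit_transform(customers)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Cluster label for each customer:", labels)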
Clustering is a very difficult problem because data can reveal clusters with different shapes and
sizes in an n-dimensional data space. To compound the problem further, the number of clusters
in the data often depends on the resolution (fine vs. coarse) with which we view the data. The
next example illustrates these problems through the process of clustering points in the
Euclidean 2D space. Figure 6.1a shows a set of points (samples in a two-dimensional space)
scattered on a 2D plane. Let us analyze the problem of dividing the points into a number of
groups. The number of groups N is not given beforehand. Figure 6.1b shows the natural clusters G1, G2, and G3 bordered by broken curves. Since the number of clusters is not given, we have
another partition of four clusters in Figure 6.1c that is as natural as the groups in Figure 6.1b.
This kind of arbitrariness for the number of clusters is a major problem in clustering.
Figure 6.1: Cluster analysis of points in a 2D-space
Note that the above clusters can be recognized by sight. For a set of points in a higher-
dimensional Euclidean space, we cannot recognize clusters visually.
Accordingly, we need an objective criterion for clustering. To describe this criterion, we have
to introduce a more formalized approach in describing the basic concepts and the clustering
process.
An input to a cluster analysis can be described as an ordered pair (X, s), or (X, d), where X is
a set of descriptions of samples, and s and d are measures for similarity or dissimilarity
(distance) between samples, respectively. Output from the clustering system is a partition Λ =
{G1, G2, …, GN} where Gk, k = 1, …, N is a crisp subset of X such that G1 ∪ G2 ∪ … ∪ GN = X and Gi ∩ Gj = ∅ for i ≠ j.
The members G1, G2, …, GN of Λ are called clusters. Every cluster may be described with
some characteristics. In discovery-based clustering, both the cluster (a separate set of points in
X) and its descriptions or characterizations are generated as a result of a clustering procedure.
There are several schemata for a formal description of discovered clusters:
1. Represent a cluster of points in an n-dimensional space (samples) by their centroid or by a set
of distant (border) points in a cluster.
2. Represent a cluster graphically using nodes in a clustering tree.
3. Represent clusters by using logical expression on sample attributes.
Figure 6.2 illustrates these ideas. Using the centroid to represent a cluster is the most popular
schema. It works well when the clusters are compact or isotropic. When the clusters are
elongated or non-isotropic, however, this schema fails to represent them properly.
Figure 6.2: Different schemata for cluster representation
The availability of a vast collection of clustering algorithms in the literature and also in
different software environments can easily confound a user attempting to select an approach
suitable for the problem at hand. It is important to mention that there is no clustering technique
that is universally applicable in uncovering the variety of structures present in
multidimensional data sets. The user's understanding of the problem and the corresponding
data types will be the best criteria to select the appropriate method. Most clustering algorithms
are based on the following two popular approaches:
1. Hierarchical clustering
2. Iterative square-error partitional clustering
Hierarchical techniques organize data in a nested sequence of groups, which can be displayed in the form of a dendrogram or a tree structure. Square-error partitional algorithms attempt to
obtain that partition which minimizes the within-cluster scatter or maximizes the between-
cluster scatter. These methods are nonhierarchical because all resulting clusters are groups of
samples at the same level of partition. To guarantee that an optimum solution has been
obtained, one has to examine all possible partitions of N samples of n-dimensions into K
clusters (for a given K), but that retrieval process is not computationally feasible. Notice that the number of all possible partitions of a set of N objects into K clusters is given by the Stirling number of the second kind: S(N, K) = (1/K!) Σ_{i=0..K} (-1)^i C(K, i) (K - i)^N.
So various heuristics are used to reduce the search space, but then there is no guarantee that the
optimal solution will be found.
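For illustration, a minimal Python sketch that evaluates this count and shows how quickly the number of possible partitions explodes, which is why heuristic search is unavoidable:

# Illustrative sketch: counting all possible partitions of N objects into K clusters
# (Stirling number of the second kind), showing the combinatorial explosion.
from math import comb, factorial

def stirling2(N: int, K: int) -> int:
    """S(N, K) = (1/K!) * sum_{i=0..K} (-1)^i * C(K, i) * (K - i)^N"""
    return sum((-1) ** i * comb(K, i) * (K - i) ** N for i in range(K + 1)) // factorial(K)

print(stirling2(10, 3))   # 9330 possible partitions for only ten samples
print(stirling2(25, 5))   # already more than 10**15 possible partitions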
Hierarchical methods that produce a nested series of partitions are explained in Section 6.3,
while partitional methods that produce only one level of data grouping are given with more
details in Section 6.4. The next section introduces different measures of similarity between
samples; these measures are the core component of every clustering algorithm.
Chapter 7: Decision Trees and Decision
Rules
OVERVIEW
Decision trees and decision rules are data-mining methodologies applied in many real-world
applications as a powerful solution to classification problems. Therefore, at the beginning, let
us briefly summarize the basic principles of classification. In general, classification is a process
of learning a function that maps a data item into one of several predefined classes. Every
classification based on inductive-learning algorithms is given as input a set of samples that
consist of vectors of attribute values (also called feature vectors) and a corresponding class.
The goal of learning is to create a classification model, known as a classifier, which will
predict, with the values of its available input attributes, the class for some entity (a given
sample). In other words, classification is the process of assigning a discrete label value (class)
to an unlabeled record, and a classifier is a model (a result of classification) that predicts one
attribute, the class of a sample, when the other attributes are given. In doing so, samples are divided into predefined groups. For example, a simple classification might group customer billing records into two specific classes: those who pay their bills within thirty days and those who take longer than thirty days to pay. Different classification methodologies are applied today
in almost every discipline where the task of classification, because of the large amount of data,
requires automation of the process. Examples of classification methods used as a part of data-
mining applications include classifying trends in financial markets and identifying objects in
large image databases.
A more formalized approach to classification problems is given through its graphical
interpretation. A data set with N features may be thought of as a collection of discrete points
(one per example) in an N-dimensional space. A classification rule is a hypercube in this space,
and the hypercube contains one or more of these points. When there is more than one cube for
a given class, all the cubes are OR-ed to provide a complete classification for the class, such as
the example of two 2D classes in Figure 7.1. Within a cube the conditions for each part are
AND-ed. The size of a cube indicates its generality, i.e., the larger the cube, the more vertices it contains and the more sample points it potentially covers.
Figure 7.1: Classification of samples in a 2D space
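For illustration, a minimal Python sketch of such a logical model: each rule is a "cube" whose conditions on the features are AND-ed, and the rules for a class are OR-ed; the thresholds and the billing-record features are invented for this example.

# Illustrative sketch: a logical classification model as OR-ed hypercube rules,
# each rule being an AND of simple conditions on the features.
def classify_billing_record(days_to_pay: int, amount_due: float) -> str:
    # Rule 1 (one "cube"): pays quickly regardless of amount.
    if days_to_pay <= 30:
        return "pays on time"
    # Rule 2 (another "cube"): small amounts paid within 45 days still count.
    if days_to_pay <= 45 and amount_due < 100.0:
        return "pays on time"
    # Everything outside the OR-ed cubes falls into the other class.
    return "pays late"

print(classify_billing_record(20, 500.0))  # pays on time
print(classify_billing_record(40, 50.0))   # pays on time
print(classify_billing_record(60, 500.0))  # pays late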
In a classification model, the connection between classes and other properties of the samples
can be defined by something as simple as a flowchart or as complex and unstructured as a
procedure manual. Data-mining methodologies restrict discussion to formalized, "executable"
models of classification, and there are two very different ways in which they can be
constructed. On the one hand, the model might be obtained by interviewing the relevant expert
or experts, and most knowledge-based systems have been built this way despite the well-known
difficulties attendant on taking this approach. Alternatively, numerous recorded classifications
might be examined and a model constructed inductively by generalizing from specific
examples that are of primary interest for data-mining applications.
The statistical approach to classification explained in Chapter 5 gives one type of model for
classification problems: summarizing the statistical characteristics of the set of samples. The
other approach is based on logic. Instead of using math operations like addition and
multiplication, the logical model is based on expressions that are evaluated as true or false by
applying Boolean and comparative operators to the feature values. These methods of modeling
give accurate classification results compared to other nonlogical methods, and they have
superior explanatory characteristics. Decision trees and decision rules are typical data-mining
techniques that belong to a class of methodologies that give the output in the form of logical
models.
Chapter 8: Association Rules
OVERVIEW
Association rules are one of the major techniques of data mining, and association-rule mining is perhaps the most common form of local-pattern discovery in unsupervised learning systems. It is a form of data
mining that most closely resembles the process that most people think about when they try to
understand the data-mining process; namely, "mining" for gold through a vast database. The
gold in this case would be a rule that is interesting, that tells you something about your database
that you didn't already know and probably weren't able to explicitly articulate. These
methodologies retrieve all possible interesting patterns in the database. This is a strength in the
sense that it leaves no stone unturned, but it can be viewed also as a weakness because the user
can easily become overwhelmed with a large amount of new information, and an analysis of
their usability is difficult and time-consuming.
Besides the standard methodologies such as the Apriori technique for association-rule mining,
we will explain some data-mining methods related to web mining and text mining in this
chapter. The reason we have included these techniques in this chapter is their local-modeling
nature and, therefore, their fundamental similarity with the association-rules approach,
although the techniques are different.
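For illustration, a minimal Python sketch of the underlying idea, counting frequent itemsets by brute force on an invented transaction database; the Apriori technique described later prunes this search far more efficiently, but the notion of patterns with sufficient support is the same.

# Illustrative sketch: brute-force frequent-itemset counting on an invented
# transaction database (Apriori prunes this search much more efficiently).
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "beer", "eggs"},
                {"milk", "beer", "bread"}, {"bread", "milk", "beer"}]
MIN_SUPPORT = 3   # an itemset is "frequent" if it appears in at least 3 transactions

items = sorted(set().union(*transactions))
for size in (1, 2):
    for candidate in combinations(items, size):
        support = sum(set(candidate) <= t for t in transactions)
        if support >= MIN_SUPPORT:
            print(set(candidate), "support =", support)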
Chapter 9: Artificial Neural Networks
OVERVIEW
Work on artificial neural networks (ANNs) has been motivated by the recognition that the
human brain computes in an entirely different way from the conventional digital computer. It
was a great challenge for many researchers in different disciplines to model the brain's
computational processes. The brain is a highly complex, nonlinear, and parallel information-
processing system. It has the capability to organize its components so as to perform certain
computations with a higher quality and many times faster than the fastest computer in existence
today. Examples of these processes are pattern recognition, perception, and motor control.
Artificial neural networks have been studied for more than four decades, since Rosenblatt first applied single-layer perceptrons to pattern-classification learning in the late 1950s.
An artificial neural network is an abstract computational model of the human brain. The human
brain has an estimated 10^11 tiny units called neurons. These neurons are interconnected with an estimated 10^15 links. Similar to the brain, an ANN is composed of artificial neurons (or
processing units) and interconnections. When we view such a network as a graph, neurons can
be represented as nodes (or vertices) and interconnections as edges. Although the term artificial
neural network is most commonly used, other names include "neural network", parallel
distributed-processing system (PDP), connectionist model, and distributed adaptive system.
ANNs are also referred to in the literature as neurocomputers.
A neural network, as the name indicates, is a network structure consisting of a number of nodes
connected through directional links. Each node represents a processing unit, and the links
between nodes specify the causal relationship between connected nodes. All nodes are
adaptive, which means that the outputs of these nodes depend on modifiable parameters
pertaining to these nodes. Although there are several definitions and several approaches to the
ANN concept, we may accept the following definition, which views the ANN as a formalized
adaptive machine:
DEF: An artificial neural network is a massive parallel distributed processor made up of simple
processing units. It has the ability to learn from experiential knowledge expressed through
interunit connection strengths, and can make such knowledge available for use.
It is apparent that an ANN derives its computing power through, first, its massive parallel
distributed structure and, second, its ability to learn and therefore to generalize. Generalization
refers to the ANN producing reasonable outputs for new inputs not encountered during a
learning process. The use of artificial neural networks offers several useful properties and
capabilities:
1. Nonlinearity - An artificial neuron as a basic unit can be a linear- or nonlinear-processing
element, but the entire ANN is highly nonlinear. It is a special kind of nonlinearity in the sense
that it is distributed throughout the network. This characteristic is especially important because an ANN models the inherently nonlinear, real-world mechanisms responsible for generating the data used for learning.
2. Learning from examples - An ANN modifies its interconnection weights by applying a set of
training or learning samples. The final effects of a learning process are tuned parameters of a
network (the parameters are distributed through the main components of the established
model), and they represent implicitly stored knowledge for the problem at hand.
3. Adaptivity: An ANN has a built-in capability to adapt its interconnection weights to changes in
the surrounding environment. In particular, an ANN trained to operate in a specific
environment can be easily retrained to deal with changes in its environmental conditions.
Moreover, when it is operating in a nonstationary environment, an ANN can be designed to
adapt its parameters in real time.
4. Evidential Response: In the context of data classification, an ANN can be designed to provide information not only about which particular class to select for a given sample, but also about the confidence in the decision made. This latter information may be used to reject ambiguous data,
should they arise, and thereby improve the classification performance or performances of the
other tasks modeled by the network.
5. Fault Tolerance: An ANN has the potential to be inherently fault-tolerant, or capable of robust
computation. Its performance does not degrade significantly under adverse operating conditions
such as disconnection of neurons, and noisy or missing data. There is some empirical evidence
for robust computation, but usually it is uncontrolled.
6. Uniformity of Analysis and Design: Basically, artificial neural networks enjoy universality as
information processors. The same principles, notation, and steps in methodology are used in all domains involving the application of artificial neural networks.
To explain a classification of different types of ANNs and their basic principles it is necessary
to introduce an elementary component of every ANN. This simple processing unit is called an
artificial neuron.
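For illustration, a minimal Python sketch of one such artificial neuron, computing a weighted sum of its inputs plus a bias and passing the result through a sigmoid activation; the weights and the choice of activation are illustrative assumptions.

# Illustrative sketch: a single artificial neuron (weighted sum + nonlinear activation).
import math

def neuron_output(inputs: list[float], weights: list[float], bias: float) -> float:
    """Compute sigmoid(w . x + b) for one processing unit."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net))   # sigmoid activation

# Example with invented weights for a neuron with two inputs.
print(neuron_output([0.5, -1.0], weights=[0.8, 0.2], bias=0.1))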
Chapter 10: Genetic Algorithms
OVERVIEW
There is a large class of interesting problems for which no reasonably fast algorithms have been
developed. Many of these problems are optimization problems that arise frequently in
applications. The fundamental approach to optimization is to formulate a single standard of
measurement, a cost function, that summarizes the performance or value of a decision, and then to iteratively improve this performance by selecting from among the available alternatives. Most
classical methods of optimization generate a deterministic sequence of trial solutions based on
the gradient or higher-order statistics of the cost function. In general, any abstract task to be
accomplished can be thought of as solving a problem, which can be perceived as a search
through a space of potential solutions. Since we are looking for "the best" solution, we can
view this task as an optimization process. For small data spaces, classical, exhaustive search
methods usually suffice; for large spaces special techniques must be employed. Under regular
conditions, the techniques can be shown to generate sequences that asymptotically converge to
optimal solutions, and in certain cases they converge exponentially fast. But the methods often
fail to perform adequately when random perturbations are imposed on the function that is
optimized. Further, locally optimal solutions often prove insufficient in real-world situations.
Despite such problems, which we call hard-optimization problems, it is often possible to find
an effective algorithm whose solution is approximately optimal. One of the approaches is based
on genetic algorithms, which are developed on the principles of natural evolution.
Natural evolution is a population-based optimization process. Simulating this process on a
computer results in stochastic-optimization techniques that can often outperform classical
methods of optimization when applied to difficult, real-world problems. The problems that the
biological species have solved are typified by chaos, chance, temporality, and nonlinear
interactivity. These are the characteristics of the problems that have proved to be especially
intractable to classical methods of optimization. Therefore, the main avenue of research in
simulated evolution is the genetic algorithm (GA), which is a new, iterative optimization method
that emphasizes some facets of natural evolution. GAs approximate an optimal solution to the
problem at hand; they are by nature stochastic algorithms whose search methods model some
natural phenomena such as genetic inheritance and the Darwinian strife for survival.
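For illustration, a minimal Python sketch of these facets on a toy problem: a population of bit strings evolves through selection, single-point crossover, and mutation toward a simple fitness function; the population size, mutation rate, and fitness function are all invented for the example.

# Illustrative sketch: a minimal genetic algorithm maximizing the number of 1-bits.
import random

GENOME_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.02

def fitness(genome: list[int]) -> int:
    return sum(genome)                      # toy objective: count of 1-bits

def crossover(a: list[int], b: list[int]) -> list[int]:
    point = random.randint(1, GENOME_LEN - 1)
    return a[:point] + b[point:]            # single-point crossover

def mutate(genome: list[int]) -> list[int]:
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in genome]

random.seed(0)
population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    # Selection: keep the fitter half of the population as parents.
    parents = sorted(population, key=fitness, reverse=True)[:POP_SIZE // 2]
    # Reproduction: crossover plus mutation produce the next generation.
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(POP_SIZE)]

print("Best fitness found:", max(fitness(g) for g in population))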
Chapter 11: Fuzzy Sets and Fuzzy Logic
In the previous chapters, a number of different methodologies for the analysis of large data sets
have been discussed. Most of the approaches presented, however, assume that the data is
precise. That is, they assume that we deal with exact measurements for further analysis.
Historically, as reflected in classical mathematics, we commonly seek a precise and crisp
description of things or events. This precision is accomplished by expressing phenomena in
numeric or categorical values. But in most, if not all, real-world scenarios, we will never have
totally precise values. There is always going to be a degree of uncertainty. However, classical
mathematics can encounter substantial difficulties because of this fuzziness. In many real-
world situations, we may say that fuzziness is reality whereas crispness or precision is
simplification and idealization. The polarity between fuzziness and precision is quite a striking
contradiction in the development of modern information-processing systems. One effective
means of resolving the contradiction is the fuzzy-set theory, a bridge between high precision
and the high complexity of fuzziness.
11.1 FUZZY SETS
Fuzzy concepts derive from fuzzy phenomena that commonly occur in the real world. For
example, rain is a common natural phenomenon that is difficult to describe precisely since it
can rain with varying intensity, anywhere from a light shower to a torrential downpour. Since
the word rain does not adequately or precisely describe the wide variations in the amount and
intensity of any rain event, "rain" is considered a fuzzy phenomenon.
Often, the concepts formed in the human brain for perceiving, recognizing, and categorizing
natural phenomena are also fuzzy. The boundaries of these concepts are vague. Therefore, the judgments and reasoning that emerge from them are also fuzzy. For instance, "rain" might be
classified as "light rain", "moderate rain", and "heavy rain" in order to describe the degree of
raining. Unfortunately, it is difficult to say when the rain is light, moderate, or heavy, because
the boundaries are undefined. The concepts of "light", "moderate", and "heavy" are prime
examples of fuzzy concepts themselves. To explain the principles of fuzzy sets, we will start
with the basics in classical set theory.
The notion of a set occurs frequently as we tend to organize, summarize, and generalize
knowledge about objects. We can even speculate that the fundamental nature of any human
being is to organize, arrange, and systematically classify information about the diversity of any
environment. The encapsulation of objects into a collection whose members all share some
general features naturally implies the notion of a set. Sets are used often and almost
unconsciously; we talk about a set of even numbers, positive temperatures, personal computers,
fruits, and the like. For example, a classical set A of real numbers greater than 6 is a set with a
crisp boundary, and it can be expressed as
A = {x | x > 6}
where there is a clear, unambiguous boundary 6 such that if x is greater than this number, then
x belongs to the set A; otherwise x does not belong to the set. Although classical sets have
suitable applications and have proven to be an important tool for mathematics and computer
science, they do not reflect the nature of human concepts and thoughts, which tend to be
abstract and imprecise. As an illustration, mathematically we can express a set of tall persons
as a collection of persons whose height is more than 6 ft; this is the set denoted by previous
equation, if we let A = "tall person" and x = "height". Yet, this is an unnatural and inadequate
way of representing our usual concept of "tall person". The dichotomous nature of the classical
set would classify a person 6.001 ft tall as a tall person, but not a person 5.999 ft tall. This
distinction is intuitively unreasonable. The flaw comes from the sharp transition between inclusion and exclusion in a set.
In contrast to a classical set, a fuzzy set, as the name implies, is a set without a crisp boundary.
That is, the transition from "belongs to a set" to "does not belong to a set" is gradual, and this
smooth transition is characterized by membership functions that give sets flexibility in
modeling commonly used linguistic expressions such as "the water is hot" or "the temperature
is high". Let us introduce some basic definitions and their formalizations concerning fuzzy sets.
Let X be a space of objects and x be a generic element of X. A classical set A, A ⊆ X, is defined as a collection of elements or objects x ∈ X such that each x can either belong or not belong to the set A. By defining a characteristic function for each element x in X, we can represent a classical set A by a set of ordered pairs (x, 0) or (x, 1), which indicates x ∉ A or x ∈ A, respectively.
Unlike the aforementioned conventional set, a fuzzy set expresses the degree to which an
element belongs to a set. The characteristic function of a fuzzy set is allowed to have values
between 0 and 1, which denotes the degree of membership of an element in a given set. If X is
a collection of objects denoted generically by x, then a fuzzy set A in X is defined as a set of
ordered pairs:
A = {(x, μA(x)) | x ∈ X}
where μA(x) is called the membership function (or MF for short) for the fuzzy set A. The
membership function maps each element of X to a membership grade (or membership value)
between 0 and 1.
Obviously, the definition of a fuzzy set is a simple extension of the definition of a classical set
in which the characteristic function is permitted to have any value between 0 and 1. If the value
of the membership function μA(x) is restricted to either 0 or 1, then A is reduced to a classical set and μA(x) is the characteristic function of A. For clarity, we shall also refer to classical sets as ordinary sets, crisp sets, nonfuzzy sets, or simply sets.
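As a small illustration of this difference, the following sketch contrasts the crisp characteristic function of the "tall person" set with a possible fuzzy membership function; the ramp between 5.5 ft and 6.5 ft is an assumed example, since membership grades are subjective.

def crisp_tall(height_ft):
    # classical set A = {x | x > 6}: membership is either 0 or 1
    return 1 if height_ft > 6.0 else 0

def fuzzy_tall(height_ft):
    # assumed gradual transition between 5.5 ft and 6.5 ft
    if height_ft <= 5.5:
        return 0.0
    if height_ft >= 6.5:
        return 1.0
    return (height_ft - 5.5) / 1.0   # linear ramp of membership grades

for h in (5.4, 5.9, 6.0, 6.1, 6.6):
    # crisp: 5.999 ft is "not tall", 6.001 ft is "tall"; fuzzy: grades change smoothly
    print(h, crisp_tall(h), round(fuzzy_tall(h), 2))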
Usually X is referred to as the universe of discourse, or, simply, the universe, and it may consist
of discrete (ordered or nonordered) objects or continuous space. This can be clarified by the
following examples. Let X = {San Francisco, Boston, Los Angeles} be the set of cities one
may choose to live in. The fuzzy set C = "desirable city to live in" may be described as follows:
The universe of discourse X is discrete and it contains nonordered objects: three big cities in
the United States. As one can see, the membership grades listed above are quite subjective;
anyone can come up with three different but legitimate values to reflect his or her preference.
In the next example, let X = {0, 1, 2, 3, 4, 5, 6} be a set of the number of children a family may
choose to have. Then the fuzzy set A = "sensible number of children in a family" may be
described as follows:
Or, in the notation that we will use throughout this chapter,
Here we have a discrete ordered universe X; the membership function for the fuzzy set A is shown
in Figure 11.1 (a). Again, the membership grades of this fuzzy set are obviously subjective
measures.
Figure 11.1: Discrete and continuous representation of membership functions for given fuzzy sets
Finally, let X = R+ be the set of possible ages for human beings. Then the fuzzy set B = "about 50 years old" may be expressed as
B = {(x, μB(x)) | x ∈ X}
where μB(x) is a bell-shaped membership function centered at the age of 50.
This is illustrated in Figure 11.1 (b).
As mentioned earlier, a fuzzy set is completely characterized by its membership function. Since
many fuzzy sets in use have a universe of discourse X consisting of the real line R, it would be
impractical to list all the pairs defining a membership function. A more convenient and concise
way to define a membership function is to express it as a mathematical formula. Several classes
of parametrized membership functions are introduced, and in real-world applications of fuzzy
sets the shape of membership functions is usually restricted to a certain class of functions that
can be specified with only a few parameters. The most well known are triangular, trapezoidal,
and Gaussian; Figure 11.2 shows these commonly used shapes for membership functions.
Figure 11.2: Most commonly used shapes for membership functions
A triangular membership function is specified by three parameters {a, b, c} as follows:
μ(x) = triangle(x, a, b, c) = max(min((x − a)/(b − a), (c − x)/(c − b)), 0)
The parameters {a, b, c}, with a < b < c, determine the x coordinates of the three corners of the underlying triangular membership function.
A trapezoidal membership function is specified by four parameters {a, b, c, d} as follows:
μ(x) = trapezoid(x, a, b, c, d) = max(min((x − a)/(b − a), 1, (d − x)/(d − c)), 0)
The parameters {a, b, c, d}, with a < b ≤ c < d, determine the x coordinates of the four corners of the underlying trapezoidal membership function. A triangular membership function can be seen as a special case of the trapezoidal form where b = c.
Finally, a Gaussian membership function is specified by two parameters {c, σ}:
μ(x) = gaussian(x, c, σ) = e^(−(1/2)((x − c)/σ)²)
A Gaussian membership function is determined completely by c and σ; c represents the membership-function center, and σ determines the membership-function width. Figure 11.3
illustrates the three classes of parametrized membership functions.
Figure 11.3: Examples of parametrized membership functions
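The three parametrized classes above are easy to compute directly; the following sketch implements the triangular, trapezoidal, and Gaussian membership functions with arbitrary illustrative parameters (they are not tied to any example in the text).

import math

def triangular(x, a, b, c):
    # corners at a, b, c with a < b < c; peak value 1 at x = b
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def trapezoidal(x, a, b, c, d):
    # corners at a, b, c, d with a < b <= c < d; plateau of value 1 on [b, c]
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)

def gaussian(x, c, sigma):
    # center c, width sigma
    return math.exp(-0.5 * ((x - c) / sigma) ** 2)

for x in (10, 20, 25, 30, 40):
    print(x,
          round(triangular(x, 10, 25, 40), 2),
          round(trapezoidal(x, 10, 20, 30, 40), 2),
          round(gaussian(x, 25, 5), 2))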
From the preceding examples, it is obvious that the construction of a fuzzy set depends on two
things: the identification of a suitable universe of discourse and the specification of an
appropriate membership function. The specification of membership functions is subjective,
which means that the membership functions for the same concept (say, "sensible number of
children in a family") when specified by different persons may vary considerably. This
subjectivity comes from individual differences in perceiving or expressing abstract concepts and has little to do with randomness. Therefore, the subjectivity and nonrandomness of fuzzy sets are the primary difference between the study of fuzzy sets and probability theory, which deals with the objective treatment of random phenomena.
There are several parameters and characteristics of membership functions that are used very often in fuzzy-set operations and fuzzy-set inference systems. We will define only some
of them that are, in our opinion, the most important:
1. Support - The support of a fuzzy set A is the set of all points x in the universe of discourse X
such that μA(x) > 0:
support(A) = {x ∈ X | μA(x) > 0}
2. Core - The core of a fuzzy set A is the set of all points x in X such that μA(x) = 1:
core(A) = {x ∈ X | μA(x) = 1}
3. Normalization - A fuzzy set A is normal if its core is nonempty. In other words, we can always
find a point x ∈ X such that μA(x) = 1.
4. Cardinality - Given a fuzzy set A in a finite universe X, its cardinality, denoted by Card(A), is
defined as
Card(A) = ΣμA(x), where the sum is taken over all x ∈ X
Often, Card(A) is referred to as the scalar cardinality or the count of A. For example, the fuzzy set A = 0.1/1 + 0.3/2 + 0.6/3 + 1.0/4 + 0.4/5 in the universe X = {1, 2, 3, 4, 5, 6} has a cardinality
Card(A) = 2.4.
5. α-cut - The α-cut or α-level set of a fuzzy set A is a crisp set defined by
Aα = {x ∈ X | μA(x) ≥ α}
6. Fuzzy number - Fuzzy numbers are a special type of fuzzy sets restricting the possible types of
membership functions:
a. The membership function must be normalized (i.e., the core is nonempty) and singular. This
results in precisely one point, which lies inside the core, modeling the typical value of the fuzzy
number. This point is called the modal value.
b. The membership function has to monotonically increase left of the core and monotonically
decrease on the right. This ensures that only one peak and, therefore, only one typical value
exists. The spread of the support (i.e., the nonzero area of the fuzzy set) describes the degree of imprecision expressed by the fuzzy number.
A graphical illustration of some of these basic concepts is given in Figure 11.4.
Figure 11.4: Core, support, and α-cut for fuzzy set A
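For a discrete fuzzy set, the characteristics defined above reduce to simple computations; the sketch below evaluates them for the fuzzy set used in the cardinality example, A = 0.1/1 + 0.3/2 + 0.6/3 + 1.0/4 + 0.4/5 on X = {1, 2, 3, 4, 5, 6} (element 6 is assumed to have membership 0).

A = {1: 0.1, 2: 0.3, 3: 0.6, 4: 1.0, 5: 0.4, 6: 0.0}

support = {x for x, mu in A.items() if mu > 0}        # all x with muA(x) > 0
core = {x for x, mu in A.items() if mu == 1.0}        # all x with muA(x) = 1
is_normal = bool(core)                                # normal if the core is nonempty
cardinality = round(sum(A.values()), 2)               # Card(A) = sum of membership grades = 2.4

def alpha_cut(fuzzy_set, alpha):
    # crisp set of all elements whose membership grade is at least alpha
    return {x for x, mu in fuzzy_set.items() if mu >= alpha}

print(support, core, is_normal, cardinality, alpha_cut(A, 0.5))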
Chapter 12: Visualization Methods
How are humans capable of recognizing hundreds of faces? What is our "channel capacity"
when dealing with the visual or any other of our senses? How many distinct visual icons and
orientations can humans accurately perceive? It is important to factor in all these cognitive limitations when designing a visualization technique that avoids delivering ambiguous or misleading information. Categorization lays the foundation for a well-known cognitive technique: the "chunking" phenomenon. How many chunks can you hang onto? That varies among people, but the typical range is "the magical number seven, plus or minus two". The
process of reorganizing large amounts of data into fewer chunks with more bits of information
per chunk is known in cognitive science as "recoding". We expand our comprehension abilities
by reformatting problems into multiple dimensions or sequences of chunks, or by redefining
the problem in a way that invokes relative judgment, followed by a second focus of attention.
12.1 PERCEPTION AND VISUALIZATION
Perception is our chief means of knowing and understanding the world; images are the mental
pictures produced by this understanding. In perception as well as art, a meaningful whole is
created by the relationship of the parts to each other. Our ability to see patterns in things and
pull together parts into a meaningful whole is the key to perception and thought. As we view
our environment, we are actually performing the enormously complex task of deriving meaning
out of essentially separate and disparate sensory elements. The eye, unlike the camera, is not a
mechanism for capturing images so much as it is a complex processing unit that detects
changes, forms, and features, and selectively prepares data for the brain to interpret. The image
we perceive is a mental one, the result of gleaning what remains constant while the eye scans.
As we survey our three-dimensional ambient environment, properties such as contour, texture,
and regularity allow us to discriminate objects and see them as constants.
Human beings do not normally think in terms of data; they are inspired by and think in terms of images (mental pictures of a given situation), and they assimilate information more quickly and effectively as visual images than as textual or tabular forms. Human vision is still the most
powerful means of sifting out irrelevant information and detecting significant patterns. The
effectiveness of this process is based on a picture's submodalities (shape, color, luminance,
motion, vectors, texture). They depict abstract information as a visual grammar that integrates
different aspects of represented information. Visually presenting abstract information, using
graphical metaphors in an immersive 2D or 3D environment, increases one's ability to
assimilate many dimensions of the data in a broad and immediately comprehensible form. It
converts aspects of information into experiences our senses and mind can comprehend,
analyze, and act upon.
We have heard the phrase "Seeing is believing" many times, though merely seeing is not
enough. When you understand what you see, seeing becomes believing. Recently, scientists
discovered that seeing and understanding together enable humans to discover new knowledge
with deeper insight from large amounts of data. The approach integrates the human mind's
exploratory abilities with the enormous processing power of computers to form a powerful
visualization environment that capitalizes on the best of both worlds. A computer-based
visualization technique has to incorporate the computer less as a tool and more as a
communication medium. The power of visualization to exploit human perception offers both a
challenge and an opportunity. The challenge is to avoid visualizing incorrect patterns leading
to incorrect decisions and actions. The opportunity is to use knowledge about human perception
when designing visualizations. Visualization creates a feedback loop between perceptual
stimuli and the user's cognition.
Visual data-mining technology builds on visual and analytical processes developed in various
disciplines including scientific visualization, computer graphics, data mining, statistics, and
machine learning with custom extensions that handle very large multidimensional data sets
interactively. The methodologies combine functionality that characterizes structures and displays data with human capabilities to perceive patterns, exceptions, trends, and relationships.
Chapter 13: References
CHAPTER 1
1. Adriaans, P. and D. Zantinge, Data Mining, Addison-Wesley, New York, 1996.
2. Agosta, L., The Essential Guide to Data Warehousing, Prentice Hall, Upper Saddle River, NJ, 2000.
3. An, A. et al., Applying Knowledge Discovery to Predict Water-Supply Consumption, IEEE Expert (July/Aug 1997): 72-78.
4. Barquin, R. and H. Edelstein, Building, Using, and Managing the Data Warehouse, Prentice Hall, Upper Saddle River, NJ, 1997.
5. Berson, A., S. Smith, K. Thearling, Building Data Mining Applications for CRM, McGraw-
Hill, New York, 2000.
6. Bischoff, J. and T. Alexander, Data Warehouse: Practical Advice from the Experts, Prentice Hall, Upper Saddle River, NJ, 1997.
7. Brachman, R. J. et al., Mining Business Databases, CACM 39, no. 11 (1996): 42-48.
8. Djoko, S., D. J. Cook, L. B. Holder, An Empirical Study of Domain Knowledge and its
Benefits to Substructure Discovery, IEEE Transactions on Knowledge and Data Engineering
9, no. 4 (1997): 575-585.
9. Fayyad, U. M., G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, eds., Advances in Knowledge
Discovery and Data Mining, AAAI Press/MIT Press, Cambridge, 1996.
10. Fayyad, U. M., G. P. Shapiro, P. Smyth, From Data Mining to Knowledge Discovery in
Databases, AI Magazine (Fall 1996): 37-53
11. Fayyad, U. M., G. P. Shapiro, P. Smyth, The KDD Process for Extracting Useful Knowledge from Volumes of Data, CACM 39, no. 11 (1996): 27-34.
12. Friedland, L., Accessing the Data Warehouse: Designing Tools to Facilitate Business
Understanding, Interactions (Jan/Feb 1998): 25-36.
13. Ganti, V., J. Gehrke, R. Ramakrishnan, Mining Very Large Databases, Computer 32, no. 8
(1999): 38-45.
14. Groth, R., Data Mining: A Hands-On Approach for Business Professionals, Prentice Hall, Upper Saddle River, NJ, 1998.
15. Han, J. and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San
Francisco, 2000.
16. Kandel, A., M. Last, H. Bunke, eds., Data Mining and Computational Intelligence, Physica-Verlag, Heidelberg, Germany, 2001.
17. Maxus Systems International, What is data mining, http://www.maxussystems.com/data-
mining.html.
18. Ramakrishnan, N. and A. Y. Grama, Data Mining: From Serendipity to Science, Computer
32, no. 8 (1999): 34-37.
19. Shapiro, G. P., The Data-Mining Industry Coming of Age, IEEE Intelligent Systems
(Nov/Dec 1999): 32-33.
20. Thomsen, E., OLAP Solution: Building Multidimensional Information System, John Wiley,
New York, 1997.
21. Thuraisingham, B., Data Mining: Technologies, Techniques, Tools, and Trends, CRC Press LLC, Boca Raton, FL, 1999.
22. Tsur, S., Data Mining in the Bioinformatics Domain, Proceedings of the 26th VLDB Conference, Cairo, Egypt, (2000): 711-714.
23. Waltz, D. and S. J. Hong, Data Mining: A Long Term Dream, IEEE Intelligent Systems
(Nov/Dec 1999): 30-34.
Appendix A: Data-Mining Tools
A1 COMMERCIALLY AND PUBLICLY AVAILABLE TOOLS
This summary of some commercially and publicly available data-mining products is provided
to help readers better understand what software tools can be found on the market and what
their features are. It is not intended to endorse or critique any specific product. Potential users
will need to decide for themselves the suitability of each product for their specific applications
and data-mining environments. This is primarily intended as a starting point from which users
can obtain more information. There is a constant stream of new products appearing in the
market and hence this list is by no means comprehensive. Because these changes are very
frequent, the author suggests two Web sites for information about the latest tools and their
performances: http://www.kdnuggets.com and http://www.knowledgestorm.com.
AgentBase/Marketeer
AgentBase/Marketeer is, according to its designers, the industry's first second-generation data-
mining product. It is based on emerging intelligent-agent technology. The system comes with
a group of wizards to guide a user through different stages of data mining. This makes it easy
to use. AgentBase/Marketeer is primarily aimed at marketing applications. It uses several
data-mining methodologies whose results are combined by intelligent agents. It can access data from all major sources, and it runs on Windows 95, Windows NT, and the Solaris operating
system.
ANGOSS Knowledge Miner
ANGOSS Knowledge Miner combines ANGOSS Knowledge Studio with proprietary
algorithms for clickstream analysis; it interfaces to Web log-reporting tools.
Autoclass III
Autoclass is an unsupervised Bayesian classification system for independent data. It seeks a
maximum posterior probability to provide a simple approach to problems such as classification,
clustering, and general mixture separation. It works on Unix platforms.
BusinessMiner
BusinessMiner is a single-strategy, easy-to-use tool based on decision trees. It can access data
from multiple sources including Oracle, Sybase, SQL Server, and Teradata. BusinessMiner
runs on all Windows platforms, and it can be used stand-alone or in conjunction with OLAP
tools.
CART
CART is a robust data-mining tool that automatically searches for important patterns and
relationships in large data sets and quickly uncovers hidden structures even in highly complex
data sets. It works on the Windows, Mac, and Unix platforms.
Clementine
Clementine is a comprehensive toolkit for data mining. It uses neural networks and rule-
induction methodologies. The toolkit includes data manipulation and visualization capabilities.
It runs on Windows and Unix platforms and accepts the data from Oracle, Ingres, Sybase, and
Informix databases. A recent version offers sequence association and clustering for Web-data
analyses.
Darwin (now part of Oracle)
Darwin is an integrated, multiple-strategy tool that uses neural networks, classification and
regression trees, nearest-neighbor rule, and genetic algorithms. These techniques are
implemented in an open client/server architecture with a scalable, parallel-computing
implementation. The client-side unit can work on Windows and the server on Unix. Darwin
can access data from a variety of networked data sources including all major relational
databases. It is optimized for parallel servers.
DataEngine
DataEngine is a multiple-strategy data-mining tool for data modeling, combining conventional
data-analysis methods with fuzzy technology, neural networks, and advanced statistical
techniques. It works on the Windows platform.
Data Mining Suite
Data Mining Suite is a comprehensive and integrated set of data-mining tools. The main tools
are IDIS (Information Discovery System) for finding classification rules, IDIS-PM (Predictive
Modeler) for prediction and forecasting, and IDIS-Map for finding geographical patterns. Data
Mining Suite supports client/server architecture and runs on all major platforms with different
database-management systems. It also discovers patterns of users' activities on Web sites.
Data Surveyor
Data Surveyor is a single-strategy (classification) tool. It consists of two components: a front-end and a back-end. The front-end is responsible for data mining using the tree-generation
methodology. The back-end consists of a fast, parallel, database server where the data are
loaded from a user's databases. The back-end runs on parallel Unix servers and the front-end
works with Unix and Windows platforms.
DataMind
DataMind's architecture consists of two components: DataCruncher for server-side data mining
and DataMind Professional for client-side specification and viewing results. It can implement
classification, clustering, and association-rule technologies. DataMind can be set up to mine
data locally or on a remote server, where data are organized using any of the major relational
databases.
Datasage
Datasage is a comprehensive data-mining product whose architecture incorporates a data mart
in its data-mining server. The user accesses Datasage through an interface operating as a thin
client, using either a Windows client or a Java-enabled browser client.
DBMiner
DBMiner is a publicly available tool for data mining. It is a multiple-strategy tool and it
supports methodologies such as clustering, association rules, summarization, and visualization.
DBMiner uses Microsoft SQL Server 7.0 Plato and runs on different Windows platforms.
Decision Series
Decision Series is a multiple-strategy tool that uses artificial neural networks, clustering
algorithms, and genetic algorithms to perform data mining. It can operate on scalable, parallel
platforms to provide speedy solutions. It runs on standard industry platforms such as HP, SUN,
and DEC, and it supports most of the commercial, relational database-management systems.
Decisionhouse
Decisionhouse is a suite of tightly integrated tools that primarily support classification and
visualization processes. Various aspects of data preparation and reporting are included. It
works on the Unix platform.
Delta Miner
Delta Miner is a multiple-strategy tool supporting clustering, summarization, deviation-
detection, and visualization processes. A common application is the analysis of financial
controlling data. It runs on Windows platforms and it integrates new search techniques and
"business intelligence" methodologies into an OLAP front-end.
Emerald
Emerald is a publicly available tool still used as a research system. It consists of five different
machine-learning programs supporting clustering, classification, and summarization tasks.
Evolver
Evolver is a single-strategy tool. It uses genetic-algorithm technology to solve complex
optimization problems. This tool runs on all Windows platforms and it is based on data stored
in Microsoft Excel tables.
GainSmarts
GainSmarts uses predictive-modeling technology that can analyze past purchases and
demographic and lifestyle data to predict the likelihood of response and other characteristics
of customers.
IBM Datajoiner
Datajoiner allows the user to view multivendor, relational and nonrelational, local and remote, geographically distributed databases as if they were local, and to access and join tables without knowing the source locations.
IBM Intelligent Miner
Intelligent Miner is an integrated and comprehensive set of data-mining tools. It uses decision
trees, neural networks, and clustering. The latest version includes a wide range of text-mining
tools. Most of its algorithms have been parallelized for scalability. A user can build models
using either a GUI or an API. It works only with DB2 databases.
KATE
KATE is a single-strategy, rule-based tool consisting of four components: KATE-editor,
KATE-CBR, KATE-Datamining, and KATE-Runtime. It runs on Windows and Unix
platforms, and it is applicable to several databases.
Kensington 2000
Kensington 2000 is an Internet-based knowledge-discovery and knowledge-management platform for the analysis of large and distributed data sets.
Kepler
Kepler is an extensible, multiple-strategy data-mining system. The key element of its
architecture is extensibility through a "plug-in" interface for external tools without
redeveloping the system core. The tool supports data-mining tasks such as classification,
clustering, regression, and visualization. It runs on Windows and Unix platforms.
Knowledge Seeker
Knowledge Seeker is a single-strategy desktop or client/server tool relying on a tree-based
methodology for data mining. It provides a nice GUI for model building and letting the user
explore data. It also allows users to export the discovered data model as text, SQL query, or
Prolog program. It runs on Windows and Unix platforms, and accepts data from a variety of
sources.
MATLAB NN Toolbox
A MATLAB extension implements an engineering environment (i.e., a computer-based environment that helps engineers solve their common tasks) for research in neural networks and their design, simulation, and application. It offers various network architectures
and different learning strategies. Classification and function approximations are typical data-
mining problems that can be solved using this tool. It runs on Windows, Mac, and Unix
platforms.
Marksman
Marksman is a single-methodology tool based on artificial neural networks. It provides a
number of useful data-manipulation features, which are very important in preprocessing. Its
design is optimized for the database-analysis needs of direct-marketing professionals, and it
runs on PC/Windows platforms.
MARS
MARS is a logistic-regression tool for binary classification. It automatically handles missing
values, detection of interaction between input variables, and transformation of variables.
MineSet
MineSet is a comprehensive tool for data mining. Its features include extensive data manipulation and transformation capabilities, various data-mining approaches, and powerful
visualization capabilities. MineSet supports client/server architecture and runs on Silicon
Graphics platforms.
NETMAP
NETMAP is a general-purpose information-visualization tool. It is most effective for large,
qualitative, text-based data sets. It runs on Unix workstations.
Neuro Net
NeuroNet is publicly available software for experimentation with different artificial neural-
network architectures and types.
NeuroSolutions V3.0
NeuroSolutions V3.0 combines a modular, icon-based artificial neural-network design, and it
solves data-mining problems such as classification, prediction, and function approximation. Its
implementations are based on advanced learning techniques such as recurrent backpropagation
and backpropagation through time. The tool runs on all Windows platforms.
OCI
OCI is publicly available software for data mining. It is specially designed as a decision-tree induction system for applications where the samples have continuous feature values.
OMEGA
OMEGA is a system for developing, evaluating, and implementing predictive models using the
genetic-programming approach. It is suitable for the classification and visualization of data. It
runs on all Windows platforms.
Partek
Partek is a multiple-strategy data-mining product. It is based on several methodologies
including statistical techniques, neural networks, fuzzy logic, genetic algorithms, and data
visualization. It runs on UNIX platforms.
Pattern Recognition Workbench (PRW)
Pattern Recognition Workbench (PRW) is a comprehensive multiple-strategy tool. It uses
neural networks, statistical pattern recognition, and machine-learning methodologies. It runs on
all Windows platforms using a spreadsheet-style GUI. PRW automatically generates
alternative models and searches for the best solution. It also provides a variety of visualization
tools to monitor model building and interpret results.
Readware Information Processor
Readware Information Processor is an integrated text-mining environment for intranets and the
Internet. It classifies documents by content, providing a literal and conceptual search. The tool
includes a ConceptBase with English, French, and German lexicons.
SAS (Enterprise Miner)
SAS (Enterprise Miner) represents one of the most comprehensive sets of integrated tools for
data mining. It also offers a variety of data manipulation and transformation features. In
addition to statistical methods, the SAS Data Mining Solution employs neural networks,
decision trees, and SAS Webhound that analyzes Web-site traffic. It runs on Windows and
Unix platforms, and it provides a user-friendly GUI front-end to the SEMMA (Sample, Explore, Modify, Model, Assess) process.
Scenario
Scenario is a single-strategy tool that uses the tree-based approach to data mining. The GUI
relies on wizards to guide a user through different tasks, and it is easy to use. It runs on
Windows platforms.
Sipina-W
Sipina-W is publicly available software that includes different traditional data-mining
techniques such as CART, Elisee, ID3, C4.5, and some new methods for generating decision
trees.
SNNS
SNNS is publicly available software. It is a simulation environment for research on and
application of artificial neural networks. The environment is available on Unix and Windows
platforms.
SPIRIT
SPIRIT is a tool for exploration and modeling using Bayesian techniques. The system allows
communication with the user in the rich language of conditional events. It works on Windows
platforms.
SPSS
SPSS is one of the most comprehensive integrated tools for data mining. It has data-
management and data-summarization capabilities and includes tools for both discovery and
verification. The complete suite includes statistical methods, neural networks, and visualization
techniques. It is available on a variety of commercial platforms.
S-Plus
S-Plus is an interactive, object-oriented programming language for data mining. Its commercial
version supports clustering, classification, summarization, visualization, and regression
techniques. It works on Windows and Unix platforms.
STATlab
STATlab is a single-strategy tool that relies on interactive visualization to help a user perform
exploratory data analysis. It can import data from common relational databases and it runs on
Windows, Mac, and Unix platforms.
STATISTICA-Neural Networks
STATISTICA-Neural Networks is a single-strategy tool that includes a standard backpropagation-learning algorithm and iterative procedures such as Conjugate Gradient Descent and
Levenberg-Marquardt. It runs on all Windows platforms.
Strategist
Strategist is a tool based on Bayesian-network methodology to support different dependency
analyses. It provides the methodology for integration of expert judgments and data-mining
results, which are based on modeling of uncertainties and decision-making processes. It runs
on all Windows platforms.
Syllogic
Syllogic Data Mining Tool is a toolbox that combines many data-mining methodologies and
offers a variety of approaches to uncover hidden information. It includes several data-
Appendix B: Data-Mining Applications
Many businesses and scientific communities are currently employing data-mining technology.
Their number continues to grow, as more and more data-mining success stories become known.
Here we present a small collection of real-life examples of data-mining implementations from
the business and scientific world. We also present some pitfalls of data mining to make readers
aware that this process needs to be applied with care and knowledge (both about the application
domain and about the methodology) to obtain useful results.
In the previous chapters of this book, we have studied the principles and methods of data
mining. Since data mining is a young discipline with wide and diverse applications, there is still a serious gap between the general principles of data mining and the domain-specific
knowledge required to apply it effectively. In this appendix, we examine a few application
domains illustrated by the results of data-mining systems that have been implemented.
B1 DATA MINING FOR FINANCIAL DATA ANALYSIS
Most banks and financial institutions offer a wide variety of banking services such as checking,
savings, business and individual customer transactions, investment services, credits, and loans.
Financial data, collected in the banking and financial industry, are often relatively complete,
reliable, and of a high quality, which facilitates systematic data analysis and data mining to
improve a company's competitiveness.
In the banking industry, data mining is used heavily in the areas of modeling and predicting
credit fraud, in evaluating risk, in performing trend analyses, in analyzing profitability, as well
as in helping with direct-marketing campaigns. In the financial markets, neural networks have
been used in forecasting stock prices, options trading, rating bonds, portfolio management,
commodity-price prediction, and mergers and acquisitions analyses; they have also been used in forecasting financial disasters. Daiwa Securities, NEC Corporation, Carl & Associates, LBS
Capital Management, Walkrich Investment Advisors, and O'Sallivan Brothers Investments are
only a few of the financial companies who use neural-network technology for data mining. A
wide range of successful business applications has been reported, although the retrieval of
technical details is not always easy. The number of investment companies and banks that mine data is far more extensive than the list mentioned earlier, but you will not often find them willing
to be referenced. Usually, they have policies not to discuss it. Therefore, finding articles about
banking companies who use data mining is not an easy task, unless you look at the US
Government Agency SEC reports of some of the data-mining companies who sell their tools
and services. There, you will find customers such as Bank of America, First USA Bank, Wells
Fargo Bank, and U.S. Bancorp.
The widespread use of data mining in banking has not gone unnoticed. The journal Bank
Systems & Technology commented that data mining was the most important application in
financial services in 1996. For example, fraud costs industries billions of dollars, so it is not
surprising to see that systems have been developed to combat fraudulent activities in such areas
as credit card, stock market, and other financial transactions. Fraud is an extremely serious
problem for credit-card companies. For example, Visa and MasterCard lost over $700 million
in 1995 from fraud. A neural network-based credit card fraud-detection system implemented
in Capital One has been able to cut the company's losses from fraud by more than fifty percent.
Several successful data-mining systems are described here to illustrate the importance of data-mining technology in financial institutions.
US Treasury Department
Worth particular mention is a system developed by the Financial Crimes Enforcement Network
(FINCEN) of the US Treasury Department called "FAIS". FAIS detects potential money-
laundering activities from a large number of big cash transactions. The Bank Secrecy Act of
1971 required the reporting of all cash transactions greater than $10,000, and these transactions, about fourteen million a year, are the basis for detecting suspicious financial activities. By
combining user expertise with the system's rule-based reasoner, visualization facilities, and
association-analysis module, FAIS uncovers previously unknown and potentially high-value leads for possible investigation. The reports generated by the FAIS application have helped
FINCEN uncover more than 400 cases of money-laundering activities, involving more than $1
billion in potentially laundered funds. In addition, FAIS is reported to be able to discover
criminal activities that law enforcement in the field would otherwise miss, e.g., connections in
cases involving nearly 300 individuals, more than eighty front operations, and thousands of
cash transactions.
Mellon Bank, USA
Mellon Bank has used data on existing credit-card customers to characterize their behavior and to predict what they will do next. Using IBM Intelligent Miner, Mellon developed
a credit card-attrition model to predict which customers will stop using Mellon's credit card in
the next few months. Based on the prediction results, the bank can take marketing actions to
retain these customers' loyalty.
Capital One Financial Group
Financial companies are one of the biggest users of data-mining technology. One such user is
Capital One Financial Corp., one of the nation's largest credit-card issuers. It offers 3000 financial products, including secured, joint, co-branded, and college-student cards. Using data-
mining techniques, the company tries to help market and sell the most appropriate financial
product to 150 million potential prospects residing in its over 2-terabyte Oracle-based data warehouse. Even after a customer has signed up, Capital One continues to use data mining
for tracking the ongoing profitability and other characteristics of each of its customers. The use
of data mining and other strategies has helped Capital One expand from $1 billion to $12.8
billion in managed loans over eight years. An additional successful data-mining application at
Capital One is fraud detection.
American Express
Another example of data mining is at American Express, where data warehousing and data
mining are being used to cut spending. American Express has created a single Microsoft SQL
Server database by merging its worldwide purchasing system, corporate purchasing card, and
corporate card databases. This allows American Express to find exceptions and patterns to
target for cost cutting.
MetLife, Inc.
MetLife's Intelligent Text Analyzer has been developed to help automate the underwriting of
260,000 life insurance applications received by the company every year. Automation is
difficult because the applications include many free-form text fields. The use of keywords or
simple parsing techniques to understand the text fields has proven to be inadequate, while the
application of full semantic natural-language processing was perceived to be too complex and
unnecessary. As a compromise solution, the "information-extraction" approach was used in
which the input text is skimmed for specific information relevant to the particular application.
The system currently processes 20,000 life-insurance applications a month and it is reported
that 89% of the text fields processed by the system exceed the established confidence-level
threshold.
Index
A
A posteriori distribution 96
Apriori algorithm 167
Partition-based 170
Sampling-based 171
Incremental 171
Concept hierarchy 171
A priori distribution 96
A priori knowledge 5, 58, 76
Approximating functions 73
Activation function 197
Agglomerative clustering algorithms 125
Aggregation 14
Allele 223
Alpha-cut 252
Alternation 232
Analysis of variance 104
Anchored visualization 284
Andrews's curve 282
ANOVA analysis 104
Approximate function 71
Approximate reasoning 261
Approximation by rounding 54
Artificial neural network (ANN) 82, 95
Artificial neural network, architecture 200
feedforward 200
recurrent 201
Artificial neuron 197
Association rules 82, 169
Asymptotic consistency 72
Autoassociation 205
Authorities 179
List of Figures
Chapter 1: Data-Mining Concepts
Figure 1.1: Block diagram for parameter identification
Figure 1.2: The data-mining process
Figure 1.3: Growth of Internet hosts
Figure 1.4: Tabular representation of a data set
Figure 1.5: A real system, besides input (independent) variables X and (dependent) outputs Y, often has
unobserved inputs Z
Chapter 2: Preparing the Data
Figure 2.1: High-dimensional data looks conceptually like a porcupine
Figure 2.2: Regions enclose 10% of the samples for 1-, 2-, and 3-dimensional spaces
Figure 2.3: Tabulation of time-dependent features a and b
Figure 2.4: Visualization of two-dimensional data set for outlier detection
Chapter 3: Data Reduction
Figure 3.1: A tabular representation of similarity measures S
Figure 3.2: The first principal component is an axis in the direction of maximum variance
Figure 3.3: Discretization of the age feature
Chapter 4: Learning from Data
Figure 4.1: Types of inference: induction, deduction, and transduction
Figure 4.2: A learning machine uses observations of the system to form an approximation of its output
Figure 4.3: Three hypotheses for a given data set
Figure 4.4: Asymptotic consistency of the ERM
Figure 4.5: Behavior of the growth function G(n)
Figure 4.6: Structure of a set of approximating functions
Figure 4.7: Empirical and true risk as a function of h (model complexity)
Figure 4.8: Two main types of inductive learning
Figure 4.9: Data samples = Points in n-dimensional space
Figure 4.10: Graphical interpretation of classification
Figure 4.11: Graphical interpretation of regression
Figure 4.12: Graphical interpretation of clustering
Figure 4.13: Graphical interpretation of summarization
Figure 4.14: Graphical interpretation of dependency-modeling task
Figure 4.15: Graphical interpretation of change and detection of deviation
Figure 4.16: The ROC curve shows the trade-off between sensitivity and 1-specificity values
Chapter 5: Statistical Methods
Figure 5.1: A boxplot representation of the data set T based on mean value, variance, and min and max
values.
Figure 5.2: Linear regression for the data set given in Figure 4.3
Figure 5.3: Geometric interpretation of the discriminant score
Figure 5.4: Classification process in multiple discriminant analysis
Chapter 6: Cluster Analysis
Figure 6.1: Cluster analysis of points in a 2D-space
Figure 6.2: Different schemata for cluster representation
Figure 6.3: A and B are more similar than B and C using the MND measure
Figure 6.4: After changes in the context, B and C are more similar than A and B using the MND measure
Figure 6.5: Distances for a single-link and a complete-link clustering algorithm
Figure 6.6: Five two-dimensional samples for clustering
Figure 6.7: Dendrogram by single-link method for the data set in Figure 6.6
Figure 6.8: Dendrogram by complete-link method for the data set in Figure 6.6
Chapter 7: Decision Trees and Decision Rules
Figure 7.1: Classification of samples in a 2D space
Figure 7.2: A simple decision tree with the tests on attributes X and Y
Figure 7.3: Classification of a new sample based on the decision-tree model
Figure 7.4: Initial decision tree and subset cases for a database in Table 7.1
Figure 7.5: A final decision tree for database T given in Table 7.1
Figure 7.6: A decision tree in the form of pseudocode for the database T given in Table 7.1
Figure 7.7: Results of test x1 are subsets Ti (initial set T with a missing value)
Figure 7.8: Decision tree for the database T with missing values
Figure 7.9: Pruning a subtree by replacing it with one leaf node
Figure 7.10: Transformation of a decision tree into decision rules
Figure 7.11: Grouping attribute values can reduce decision-rules set
Figure 7.12: Approximation of nonorthogonal classification with hyperrectangles
Figure 7.13: FP-tree for the database in Table 7.4
Chapter 8: Association Rules
Figure 8.1: First iteration of the Apriori algorithm for a database DB
Figure 8.2: Second iteration of the Apriori algorithm for a database DB
Figure 8.3: Third iteration of the Apriori algorithm for a database DB
Figure 8.4: An example of concept hierarchy for mining multiple-level frequent itemsets
Figure 8.5: FP-tree for the database T in Table 8.2
Figure 8.6: A processing tree using the BUC algorithm for the database in Table 8.3
Figure 8.7: Initialization of the HITS algorithm
Figure 8.8: An example of traversal patterns
Figure 8.9: A text-mining framework
Chapter 9: Artificial Neural Networks
Figure 9.1: Model of an artificial neuron
Figure 9.2: Examples of artificial neurons and their interconnections
Figure 9.3: Typical architectures of artificial neural networks
Figure 9.4: XOR problem
Figure 9.5: XOR problem solution: The two-layer ANN with the hardlimit-activation function
Figure 9.6: Error-correction learning performed through weights adjustments
Figure 9.7: Initialization of the error correction-learning process for a single neuron
Figure 9.8: Block diagram of ANN-based feedback-control system
Figure 9.9: Block diagram of an ANN-based prediction
Figure 9.10: A graph of a multilayered perceptron architecture with two hidden layers
Figure 9.11: Generalization as a curve fitting problem
Figure 9.12: A graph of a simple competitive network architecture
Figure 9.13: Geometric interpretation of competitive learning
Figure 9.14: Example of a competitive neural network
Chapter 10: Genetic Algorithms
Figure 10.1: Crossover operators
Figure 10.2: Mutation operator
Figure 10.3: Major phases of a genetic algorithm
Figure 10.4: Roulette wheel for selection of the next population
Figure 10.5: f(x) = x²: Search spaces for different schemata
Figure 10.6: Graphical representation of the traveling salesman problem with a corresponding optimal
solution
Chapter 11: Fuzzy Sets and Fuzzy Logic
Figure 11.1: Discrete and continuous representation of membership functions for given fuzzy sets
Figure 11.2: Most commonly used shapes for membership functions
Figure 11.3: Examples of parametrized membership functions
Figure 11.4: Core, support; and α-cut for fuzzy set A
Figure 11.5: Basic operations on fuzzy sets
Figure 11.6: Comparison of fuzzy sets representing linguistic terms, A = high speed and B = speed
around 80 km/h
Figure 11.7: Simple unary fuzzy operations
Figure 11.8: Typical membership functions for linguistic values "young", "middle aged", and "old"
Figure 11.9: The concept of A ⊆ B where A and B are fuzzy sets
Figure 11.10: The idea of the extension principle
Figure 11.11: Comparison of approximate reasoning result B′ with initially given fuzzy sets A′, A, and B and the fuzzy rule A → B
Figure 11.12: Fuzzy-communication channel with fuzzy encoding and decoding
Figure 11.13: Multifactorial evaluation model
Figure 11.14: A global granulation for a two-dimensional space using three membership functions for
x1 and two for x2
Figure 11.15: Granulation of a two-dimensional I/O space
Figure 11.16: Selection of characteristic points in a granulated space
Figure 11.17: Graphical representation of generated fuzzy rules and the resulting crisp approximation
Figure 11.18: The first step in automatically determining fuzzy granulation
Figure 11.19: The second step (first iteration) in automatically determining granulation
Figure 11.20: The second step (second iteration) in automatically determining granulation
Figure 11.21: The second step (third iteration) in automatically determining granulation
Chapter 12: Visualization Methods
Figure 12.1: A 4-dimensional Survey Plot
Figure 12.2: A star display for data on seven quality-of-life measures for three states
Figure 12.3: Data Constellations as a novel visual metaphor
Figure 12.4: Graphical representation of 6-dimensional samples from database given in Table 12.1 using
a parallel coordinate visualization technique
Figure 12.5: Parabox visualization of a database given in Table 12.2
Figure 12.6: Radial visualization for an 8-dimensional space
Figure 12.7: Sum of the spring forces for the given point P is equal to 0
Figure 12.8: Trained Kohonen SOM (119 samples) with nine outputs, represented as 3 × 3 grid (values
in nodes are the number of input samples that triggered the given output)
Figure 12.9: An example of the need for general projections, which are not parallel to axes, to improve
clustering process
List of Tables
Chapter 2: Preparing the Data
Table 2.1: Transformation of Time Series to standard tabular form (window = 5)
Table 2.2: Time-series samples in standard tabular form (window = 5) with postponed predictions (j =
3)
Table 2.3: Table distances for data set S
Table 2.4: Number of points p with the distance greater than d for each given point in S
Chapter 3: Data Reduction
Table 3.1: Dataset with three features
Table 3.2: The correlation matrix for Iris data
Table 3.3: The eigenvalues for Iris data
Table 3.4: A contingency table for 2 × 2 categorical data
Table 3.5: Data on the sorted continuous feature F with corresponding classes K
Table 3.6: Contingency table for intervals [7.5, 8.5] and [8.5, 10]
Table 3.7: Contingency table for intervals [0, 7.5] and [7.5, 10]
Table 3.8: Contingency table for intervals [0, 10] and [10, 42]
Chapter 4: Learning from Data
Table 4.1: Confusion matrix for three classes
Chapter 5: Statistical Methods
Table 5.1: Training data set for a classification using Naïve Bayesian Classifier
Table 5.2: A database for the application of regression methods
Table 5.3: Some useful transformations to linearize regression
Table 5.4: ANOVA analysis for a data set with three inputs x1, x2, and x3
Table 5.5: A 2 × 2 contingency table for 1100 samples surveying attitudes about abortion.
Table 5.6: 2 × 2 contingency table of expected values for the data given in Table 5.5
Table 5.7: Contingency tables for categorical attributes with three values
Chapter 6: Cluster Analysis
Table 6.1: Sample set of clusters consisting of similar objects
Table 6.2: The 2 × 2 contingency table
Chapter 7: Decision Trees and Decision Rules
Table 7.1: A simple flat database of examples for training
Table 7.2: A simple flat database of examples with one missing value
Table 7.3: Contingency table for the rule R
Table 7.4: Training database T for the CMAR algorithm
Chapter 8: Association Rules
Table 8.1: A model of a simple transaction database
Table 8.2: The transactional database T
Table 8.3: Multidimensional-transactional database DB
Table 8.4: Transactions described by a set of URLs
Table 8.5: Representing URLs as vectors of transaction group activity
Table 8.6: A typical SOM generated by a description of URLs
Chapter 9: Artificial Neural Networks
Table 9.1: A neuron's common activation functions
Table 9.2: Adjustment of weight factors with training examples in Figure 9.7b
Chapter 10: Genetic Algorithms
Table 10.1: Basic concepts in genetic algorithms
Table 10.2: Evaluation of the initial population
Table 10.3: Evaluation of the second generation of chromosomes
Table 10.4: Attributes Ai with possible values for a given data set s
Chapter 12: Visualization Methods
Table 12.1: Database with 6 numeric attributes
Table 12.2: The database for visualization