Image Mining: A New Approach for Data Mining
Carlos Ordonez and Edward Omiecinski
College of Computing
Georgia Institute of Technology
Atlanta, Georgia 30332-0280
February 20, 1998
Abstract
We introduce a new focus for data mining, which is concerned with knowledge discovery in image
databases. We expect all aspects of data mining to be relevant to image mining but in this first work
we concentrate on the problem of finding associations. To that end, we present a data mining algorithm
to find association rules in 2-dimensional color images. The algorithm has four major steps: feature
extraction, object identification, auxiliary image creation and object mining. Our algorithm is general
in that it does not rely on any type of domain knowledge. A synthetic image set containing geometric
shapes was generated to test our initial algorithm implementation. Our experimental results show that
image mining is feasible. We also suggest several directions for future work in this area.
1 Introduction
Discovering knowledge from data stored in typical alphanumeric databases, such as relational databases,
has been the focal point of most of the work in database mining. However, with advances in secondary
and tertiary storage capacity, coupled with a relatively low storage cost, more and more non-standard data
(e.g., in the form of images) is being accumulated. This vast collection of image data can also be mined to
discover new and valuable knowledge. The problem of image mining combines the areas of content-based
image retrieval, image understanding, data mining and databases. To our knowledge, no other work has
been done with regard to mining knowledge from a collection of images from a database perspective. An
initial step towards tapping into the undiscovered wealth of knowledge from mining image-bases is the
focus of this paper. This work can also be seen as a starting point for an as yet unexplored area that can provide enormous benefits.
Image mining has two main themes. The first is mining large collections of images and the second is
the combined data mining of large collections of image and associated alphanumeric data. As of now we
have concentrated on mining only images; but our algorithm can be extended in a straightforward manner
to handle images and associated alphanumeric data. An example of the first case might involve a collection
of weather satellite imagery of various cities in the United States that has been recorded over an extended
period of time. The data mining objective might be to find if there is some pattern that exists for an individual
city (over time) or if there is some pattern that exists between different cities. An example of the second case
might involve medical imagery and patient (alphanumeric data) records. To develop an accurate diagnosis
or prognosis both image data (such as X-rays, SPECT, etc.) and patient data (such as weight, prior health
conditions, family history, etc.) can be examined together to find interesting associations.
Our data mining system is built on top of a content-based image retrieval system (CBIR), the “Blobworld”
system from the University of California at Berkeley (UCB). This CBIR system supports object-based queries
and thus eliminates the need for the manual indexing of image content. This is a major advantage since
manually indexing massive collections of images is impractical.
2 Applications & Motivation
Applications such as military reconnaissance, weather forecasting, the management of earth’s resources,
criminal investigation and medical imaging all require (or will in the near future) the storage and processing
of massive collections of images. For instance, NASA’s EOS system generates 1 terabyte of image data per
day. To take full advantage of these image-bases and the knowledge they contain, data mining techniques
will have to be applied. This opens up a new area with new problems (i.e., the area of image mining).
One of the typical data mining problems is to find association rules among data items in a database. In
a retail environment such as a grocery store, an association rule might state that customers who purchase
spaghetti and Italian sausage also purchase red wine. This can be determined by examining all the customer
transactions (i.e., purchases). In this case, the data is explicit: there is a specific data item for each of the three
grocery items and an individual customer transaction would include a subset of those items and in general
a subset of all the items sold by the store. In the case of image-bases, assuming that all the images have
been manually indexed (or their contents classified) may not be feasible. This presents one major deviation
(problem) from the typical data mining approach for numerical data. If images can efficiently be labelled by
a semantic descriptor, then the mining can be done on those high level concepts. However, with hundreds
of thousands of images, this will become impossible. An alternative is to rely on automatic/semi-automatic
analysis of the image content and to do the mining on the generated descriptors. For example, color, texture,
shape and size can be determined automatically. Objects in an image can be determined by the similarity of
those attributes.
Consider the sample images shown in Figure 1. The color photographs, although shown in black and
white, are pictures of different rivers near populated areas. They contain buildings, roads, bridges and
water. Due to the different color and texture of these objects it is possible for an indexing algorithm to
determine distinct objects, although without the semantic label. With a CBIR system it would be possible
to retrieve images that contain similar objects. Hence an image mining system could determine whether
certain objects appear together in enough of the images to make that fact interesting. The association rule
and the supporting images (or a subset thereof) could be returned along with the information as to what
constitutes the derived objects from the photographs. So for the samples in Figure 1, a computed association
rule might be the following: object1 => object2. The river might be represented by object1 in the different photographs and the populated area (buildings and roads) by object2.
3 Content-based Image Retrieval
Content-based image retrieval (CBIR) systems will be needed to effectively and efficiently use large image
databases. With a CBIR system, users will be able to retrieve relevant images based on their contents. CBIR
researchers have typically followed two distinct directions [10] based on
- modelling the contents of the image as a set of attributes which is produced manually and stored, for
example in a relational database
- using an integrated feature-extraction/object-recognition system.

Figure 1: City-River Images from the California Department of Water Resources
More recent research has recognized the need to develop a solution that captures the essence of both
directions. However, there are still differences between the various current approaches. Mainly the
differences can be categorized in terms of image features extracted, their level of abstraction and the degree
of domain independence. Certainly tradeoffs must be made in building a CBIR system. For example, having
automatic feature extraction is achieved at the expense of domain independence. Having a high degree of
domain independence is achieved by having a semiautomatic (or manual) feature extraction component.
With CBIR systems [10], querying is facilitated through generic query classes. Examples of some query
classes include color, texture, shape, attributes, text and domain concepts. Color and texture queries allow
users to formulate the description of the images to be retrieved in terms of similar color and texture. Queries
can also be posed with regard to the text associated with the images. For instance, in a medical setting,
image retrieval is not only based on image content but also on the physician’s diagnosis, treatment, etc. (i.e.,
additional textual data). We should also point out that CBIR differs from traditional database systems in that
images are retrieved based on a degree of similarity, whereas records are usually retrieved from databases by exactly matching specified attribute values.
Various content-based retrieval systems (QBIC, Chabot and Photobook) have focused on material-
oriented queries and have used low-level image properties (e.g., color and texture) to implement these
queries. On the other hand, a content-based retrieval system developed at the University of California at
Berkeley [3, 5] focuses on object-oriented queries. That is, queries that search for images that contain
particular objects. The approach to object recognition at Berkeley is structured around a sequence of
increasingly specialized grouping activities that produces a “blobworld” representation of an image, which
is a transformation from the raw pixel data to a small set of localized coherent regions in color and texture
space. The “blobworld” representation is based on image segmentation using the Expectation-Maximization
algorithm on combined color and texture features.
The salient feature of the Berkeley work is their approach to object recognition. Their approach [3] is
based on the construction of a sequence of successively abstract descriptors through a hierarchy of grouping
and learning processes. The image descriptors at a low level are color, texture, contrast, polarity, etc. and
the grouping is based on spatiotemporal coherence of the local descriptors. The central notion in grouping
is coherence and four major issues have been identified in [8] which are segmenting images into coherent
regions based on integrated region and contour descriptors; fusing color, texture and shape information to
describe primitives; using learning techniques for developing the relationship between object classes and
color, texture and shape descriptors; and classifying objects based on primitive descriptors and relationships
between primitives.
4 Data Mining
Database mining, an important part of knowledge discovery, is defined as the automated discovery of
previously unknown, nontrivial, and potentially useful information from databases. The information is a
statement that describes the relationship among a set of objects contained in the database with a certain
confidence such that the statement is in some sense simpler than enumerating all the relationships between
the individual instances of objects [9]. For example, in a database of employees and their salaries, each
instance represents the relationship between an individual employee and his salary. A statement such as
“salaries of engineers are higher than the salaries of secretaries,” based on the instances of the database, conveys information that is implicit and more interesting than listing the salaries of all engineers and secretaries.
Database mining is the process of generating high-level patterns that have acceptable certainty and are also
interesting from a database of facts.
Knowledge Discovery has been applied effectively to solve many diverse problems. Examples include
discovering rules of mass spectrometry from spectrogram data [4], assessment of credit card applicants [6],
discovering rules for query optimization [17, 18], diagnosing diseases from medical data [14], inferring risk
profiles of insurance policy holders [19], and so on.
Knowledge discovery derives much of its success from reasoning techniques in artificial intelligence,
expert systems, machine learning and statistics. Many paradigms such as inductive learning [15], Bayesian
statistics [11], mathematical taxonomy [7], etc, have also been applied to knowledge discovery. In general,
knowledge discovery is an amalgamation of concepts from diverse fields.
Efficiency is, in general, important for any computational problem. However, for database mining it also
determines whether a particular technique can be applied or not. For example, the number of possible ways
to cluster n objects into k clusters in unsupervised learning is exponential in n [12]. Hence, an algorithm which uses exhaustive search for clustering is impractical for real-world databases. In general any algorithm which grows faster than O(n^2) is unlikely to be useful for large databases. Over the years, database systems,
mainly relational, have made great strides in improving efficiency. The success of relational database
systems in the business community can be attributed to these improvements. Many techniques such as,
efficient access methods, buffer management, disk management, etc, are well understood. However, most
of these techniques have been developed for on-line transaction processing (OLTP) applications. The access
patterns for OLTP applications, which typically access a few hundred records, are considerably different
from database mining applications, where entire tables may need to be scanned. One of the challenges in
database mining is developing more efficient algorithms, better access structures, optimizing disk I/O, and
so on.
5 Image Mining
There are two major issues that will affect the image data mining process. One is the notion of similarity
matching and the other is the generality of the application area, that is, the breadth of usefulness of data
mining from a practical point of view. For a specific application area, associated domain knowledge can be
used to improve the data mining task. Since data mining relies on the underlying querying capability of the
CBIR system, which is based on similarity matching, user interaction will be necessary to refine the data
mining process.
With image mining we will consider the four broad problem areas associated with data mining: finding
associations, classification, sequential patterns and time series patterns. With all of these, the essential
component in image mining is identifying similar objects in different images. With typical market-basket analysis, the data is usually constrained to a fixed set of items that are explicitly labelled. It is also quite efficient to see if a transaction contains a particular item, i.e., it only requires an examination of the item labels
associated with a transaction. In some cases the data might be pre-processed into a fixed record format
where a field exists for each item in the domain and a Boolean value is associated with it, indicating the
presence or absence of that item in the transaction. This preprocessing can be done automatically. In a
general image mining setting, having a human label every possible object in a vast collection of images is a
daunting task. However, we intend to capitalize on the recent work in CBIR, in particular, Blobworld [8].
As discussed previously, we intend to build our data mining system on top of a content-based image
retrieval system. One premise behind supporting object-based queries in a CBIR system is to eliminate the
need for manual indexing of image content. The CBIR system we will use is from Berkeley [8]. We will
refer to it as the “Blobworld” system. This system produces a “blobworld” representation of each image.
A “blob” is just a 2-D ellipse which possesses a number of attributes. An image is made up of a collection
of blobs, usually less than ten. Each blob represents a region of the image which is relatively homogeneous
with respect to color and texture. A blob is described by its color, texture and spatial descriptors. The
descriptors are represented by multidimensional vectors.
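As an illustration only, such a blob record might be held in a structure like the following Python sketch (the field names are ours, not Blobworld's):

from dataclasses import dataclass
import numpy as np

@dataclass
class Blob:
    """Illustrative blob record: the two dominant colors as 3-d points in
    the HSV color cone, plus the remaining summary descriptors."""
    cone1: np.ndarray     # first dominant color (hue, saturation, value)
    cone2: np.ndarray     # second dominant color
    features: np.ndarray  # texture, shape, area and position descriptors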
At this point, we will consider in detail the problem of finding associations. The problem of generating
association rules was first introduced in [1] and an algorithm called AIS was proposed for mining all
association rules. In [13], an algorithm called SETM was proposed to solve this problem using relational
operations. In [2], two algorithms called Apriori and AprioriTid were proposed. These algorithms achieved
significant improvements over the previous algorithms. The rule generation process was also extended
to include multiple items in the consequent and an efficient algorithm for generating the rules was also
presented. In [16], we presented an efficient algorithm for mining association rules that was fundamentally
different from prior algorithms. Compared to previous algorithms, our algorithm not only reduced the I/O
overhead significantly but also had lower CPU overhead for most cases.
These relationships will take the form of rules. An association rule is an implication of the form X => Y, where X, Y ⊂ A, and X ∩ Y = ∅. A is the set of objects, also referred to as items. X is called the antecedent and Y is called the consequent of the rule. In general, a set of items (such as the antecedent or the consequent of a rule) is called an itemset.
Each itemset has an associated measure of statistical significance called support. For an itemset X ⊆ A, support(X) = s if the fraction of records in the database containing X equals s. A rule has a measure of its strength called confidence defined as the ratio support(X ∪ Y) / support(X).
The problem of mining association rules is to generate all rules that have support and confidence greater
than some user specified minimum support and minimum confidence thresholds, respectively. This problem
can be decomposed into the following subproblems:
1. All itemsets that have support above the user specified minimum support are generated. These itemsets are called the large itemsets. All others are said to be small.

2. For each large itemset, all the rules that have minimum confidence are generated as follows: for a large itemset X and any Y ⊂ X, if support(X)/support(X − Y) ≥ minimum confidence, then the rule X − Y => Y is a valid rule.
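To make subproblem 2 concrete, here is a small Python sketch of rule generation from a single large itemset (this is our own illustration, not the Partition algorithm of [16]); each record is the set of items found in one transaction or image:

from itertools import combinations

def support(itemset, records):
    """Fraction of records containing every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= r for r in records) / len(records)

def rules_from_large_itemset(x, records, min_conf):
    """For a large itemset X and any nonempty Y < X, emit (X - Y) => Y
    whenever support(X) / support(X - Y) >= min_conf."""
    x = frozenset(x)
    for k in range(1, len(x)):
        for y in map(frozenset, combinations(x, k)):
            conf = support(x, records) / support(x - y, records)
            if conf >= min_conf:
                yield x - y, y, support(x, records), conf

# Toy run with four records:
records = [{2, 3, 8}, {2, 3, 5}, {3, 8}, {2, 3}]
for ante, cons, s, c in rules_from_large_itemset({2, 3}, records, 0.7):
    print(set(ante), "=>", set(cons), f"s={s:.0%} c={c:.0%}")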
6 Image Mining Algorithm Steps
In this section, we present the algorithms needed to perform the mining of associations within the context
of images. The four major image mining steps are as follows:
1. Feature extraction. Segment images into regions identifiable by region descriptors (blobs). Ideally
one blob represents one object. This step is also called segmentation.
2. Object identification and record creation. Compare objects in one image to objects in every other
image. Label each object with an id. We call this step the preprocessing algorithm.
3. Create auxiliary images. Generate images with identified objects to interpret the association rules
obtained from the following step (html page creation).
4. Apply data mining algorithm to produce object association rules.
Here we explain the image mining process in more detail. We keep I/O at a minimum. Images are
kept on disk. For feature extraction each image is accessed once. These features are stored in two files,
one is an image with all the blobs and the other with the blob descriptors. These blob descriptors are used
to build an array with all the features from all the images. Once features are extracted from images we
perform object identification using only their blob descriptors; this process is performed entirely in memory.
Auxiliary images are kept on disk; these images show each identified object.
Images are not indexed because it is not necessary to search their contents once they are segmented.
Arrays of records are all that are needed to mine images once we have their features. Each image is processed independently of the others for feature extraction, and this is done sequentially.
At present the association rule algorithm is done entirely in memory; this is not a limitation because
the number of records we feed to this algorithm is not large right now, and it also keeps processing time small. We store associations (which are simply sets) as bit vectors where each bit indicates the presence
or absence of an object. Identified objects, object associations and association rules are stored in sequential
text files for interpreting results but not for processing.
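For illustration, a minimal sketch of this bit-vector encoding (our own Python, not the C implementation):

# Each association (a set of object ids) is packed into an integer used as
# a bit vector: bit i is set exactly when object id i is present.
def to_bits(object_ids):
    bits = 0
    for oid in object_ids:
        bits |= 1 << oid
    return bits

def contains(record_bits, itemset_bits):
    # Subset test (itemset within record) is a single bitwise AND.
    return record_bits & itemset_bits == itemset_bits

image_record = to_bits({2, 3, 8})
print(contains(image_record, to_bits({2, 8})))  # True
print(contains(image_record, to_bits({2, 5})))  # False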
Segmentation Step
It is not our intention to describe in detail the feature extraction process from the blobworld system. We
will rather outline the main steps involved in identifying coherent image regions.
1. Estimate the scale σ for color selection.
2. Produce 6-dimensional feature vectors. These vectors contain summary information about color and texture only.
3. Produce several clusterings of feature vectors using the Expectation Maximization (EM) method. The 2 dominant colors are determined here. The number of groups in each clustering is called K.
4. Use the Minimum Description Length principle to decide which is the best K.
5. Segment the image into K regions using the spatial grouping of pixels. Each region is connected.
6. Apply a 3x3 max-vote filter to determine dominant colors.
7. Generate blobs with summary information about each region when such region has an area greater than 2% of the image area.
Each blob has the most important information about each region. This information includes color,
texture, shape, area and position, but only the first three are considered relevant.
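The segmentation code itself is Blobworld's and is not reproduced here. As a rough illustration of steps 3 and 4 only, the sketch below fits several EM clusterings and keeps the K with the lowest BIC, an MDL-style criterion; scikit-learn's GaussianMixture and the random feature vectors are stand-ins for the real pipeline:

import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder for the per-pixel 6-dimensional color/texture feature vectors.
features = np.random.default_rng(0).normal(size=(1000, 6))

best_k, best_bic, best_model = None, np.inf, None
for k in range(2, 6):
    # EM clustering with k groups (step 3).
    gmm = GaussianMixture(n_components=k, random_state=0).fit(features)
    bic = gmm.bic(features)          # MDL-style model selection (step 4)
    if bic < best_bic:
        best_k, best_bic, best_model = k, bic, gmm

labels = best_model.predict(features)  # one region label per feature vector
print("chosen K =", best_k)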
Preprocessing Algorithm
The basic algorithm for finding associations between images/blobs is similar to our association finding
algorithm Partition [16], as long as we preprocess the image data. By preprocessing the image data, we
will identify and label objects contained in the images using the image query processing algorithm [5]. The
output of the preprocessing step will be a set of records, R_1, ..., R_n, one for each image, containing the object
identifiers for the objects contained in the image. This step is quite intensive since it is a similarity search
between images, actually image descriptors. However, once this is accomplished, the actual data mining
step will not require the expensive similarity searching. Our preprocessing algorithm is shown in Figure 2.
First of all we initialize the n records, which store the object id’s for each of the n images. This
algorithm has four nested loops. These loops are required to compare every blob against the rest in the
image database. In this manner we do not miss any potential match. Note that comparisons are always
made between blobs in different images. Assuming the blob descriptor vector dimension is bounded, then
this algorithm is O(n^2). This is a reasonable assumption since the segmentation step cannot produce a high number of blobs. Nevertheless, if the number of objects in each image can also grow without bound this algorithm is O(m^2 n^2) for m the number of possible different objects and n the number of images. This can render the algorithm slow if m and n are similar; in general this algorithm behaves well if m << n.
The variable cdfe¢C © © dg ¥ is used to generate new object id’s when one blob is matched for the first time.
This is necessary because a single object in one image may be similar to many other objects in the following
images and all these objects should have the same id.
When one blob turns out to be similar to another one we add the object id to their corresponding records.
The first is the record that is being compared against the rest and the second one is the record for which
a match was found. The second object in the comparison will be discarded avoiding a future unnecessary
comparison. The similarity function to compare blobs is expensive to compute as we will see. So this has
an impact on the overall performance of the algorithm.
INPUT: n segmented images {I_1, I_2, ..., I_n},
where I_j is a record containing an image id and its blob descriptor vectors
OUTPUT: set of n records {R_1, R_2, ..., R_n} containing the object identifiers for the blobs

for i1 = 1 to n do
    R_{i1} = {}
end for
objectid = 0
for i1 = 1 to n - 1 do
    for j1 = 1 to number of blobs in I_{i1} do
        if I_{i1}.blob_{j1} is not matched yet then
            firsttime = true
            for i2 = i1 + 1 to n do
                for j2 = 1 to number of blobs in I_{i2} do
                    if I_{i2}.blob_{j2} is not matched yet and
                       similar(I_{i1}.blob_{j1}, I_{i2}.blob_{j2}, similarity threshold, standard deviation) then
                        if firsttime then
                            objectid = objectid + 1
                            firsttime = false
                        end if
                        R_{i1} = R_{i1} ∪ {objectid}
                        R_{i2} = R_{i2} ∪ {objectid}
                        mark I_{i2}.blob_{j2} as matched
                    end if
                end for
            end for
            mark I_{i1}.blob_{j1} as matched if there was at least one match
        end if
    end for
end for
filter out unwanted matched objects

Figure 2: Preprocessing Algorithm
Each segmented image is treated as a record and its transformation to a set of identified objects will also produce a record. This representation will also give us a direct way to incorporate alphanumeric information associated with the image into the image mining process. The algorithm can handle such information without modification, provided those additional attributes are treated as boolean values.
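A minimal Python sketch of this preprocessing step, following our reading of Figure 2 (the names are ours, and similar is assumed to be the blob similarity test of the next subsection with the threshold and standard deviation vector already bound in):

def preprocess(images, similar):
    """images: one list of blob descriptors per image.
    Returns one record (set of object ids) per image."""
    n = len(images)
    records = [set() for _ in range(n)]
    matched = [[False] * len(blobs) for blobs in images]
    objectid = 0
    for i1 in range(n):                       # image under comparison
        for j1 in range(len(images[i1])):     # each of its blobs
            if matched[i1][j1]:
                continue                      # already identified earlier
            firsttime = True
            for i2 in range(i1 + 1, n):       # every following image
                for j2 in range(len(images[i2])):
                    if matched[i2][j2]:
                        continue
                    if similar(images[i1][j1], images[i2][j2]):
                        if firsttime:         # new object id on first match
                            objectid += 1
                            firsttime = False
                            records[i1].add(objectid)
                        records[i2].add(objectid)
                        matched[i2][j2] = True  # skip in future comparisons
            if not firsttime:
                matched[i1][j1] = True
    return records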
Similarity Function
The similarity function [5] between two blobs is essential for our image mining program. This function takes
four parameters, the two blobs to be compared, a similarity threshold and a vector of standard deviations.
The similarity function is mathematically defined as:
sim(blob_1, blob_2) = e^(−dist(blob_1, blob_2) / 2),

where

dist(blob_1, blob_2) = ((blob_1 − blob_2)^T Σ^(−1) (blob_1 − blob_2))^(1/2).
In these formulas blob_1 and blob_2 are vectors containing summary features and Σ^(−1) represents the vector containing the standard deviations allowed for matching on each desired feature. The −1 power means we divide each distance by the corresponding entry of this vector; this is clarified in Figure 3. This similarity
measure is 1 if there is a perfect match on all desired features and approaches zero as the match becomes
worse. A low similarity threshold can mean every object is similar to any other object.
It is important to note that the distance for color is computed in a special manner. Colors are stored as
three coordinates for a point located in a color-cone, referred to as hue saturation value (HSV). The distance is computed as the minimum pairwise distance of the two dominant colors of the object. Each color is a point
in 3-dimensional space affecting its third coordinate by a weight of 0.5 and leaving the first two unchanged.
This is computed as a matrix product between the color vector and the weights. This distance constitutes
the first entry of the difference vector. For all the remaining 10 features the distance is just computed as the
difference between each blob entry.
The standard deviation vector permits adjusting parameters for the image mining process to use or
discard features in an easy way. If we want to pay close attention to one specific feature we set the standard
deviation to a value close to zero, but never zero. If we do not consider some feature to be relevant to
the mining process we set the standard deviation to a high value. For some of the features the blobworld
system requires standard deviations to be in specific ranges. More specifically, the standard deviations for color, anisotropy, and contrast have a maximum value of roughly 0.5. For area at most
0.1 is permitted. For all the remaining features any positive standard deviation is legal. If some feature is
considered as completely irrelevant an infinity value is assigned to the corresponding entry.
A more detailed description of the similarity function between two blobs is given in Figure 3.
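For concreteness, here is a Python sketch of this similarity test as we read it from Figure 3, reusing the illustrative Blob fields sketched in Section 5 (this is not the Matlab implementation):

import numpy as np

W = np.array([1.0, 1.0, 0.5])  # halve the third (value) coordinate

def similar(blob1, blob2, threshold, stddev):
    """stddev holds the eleven allowed standard deviations."""
    # Color distance: minimum pairwise distance of the two dominant colors.
    d1 = abs((blob1.cone1 - blob2.cone1) @ W) + abs((blob1.cone2 - blob2.cone2) @ W)
    d2 = abs((blob1.cone1 - blob2.cone2) @ W) + abs((blob1.cone2 - blob2.cone1) @ W)
    dist = np.empty(11)
    dist[0] = min(d1, d2)                       # first entry: color
    dist[1:] = blob1.features - blob2.features  # plain differences otherwise
    # A perfect match scores 1; the score decays toward 0 as blobs diverge.
    score = np.exp(-0.5 * np.sqrt(np.sum((dist / stddev) ** 2)))
    return score >= threshold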
In: blob_1, blob_2, similarity threshold, standard deviation vector
Out: 1 for a match, 0 otherwise

d11 = (blob_1.cone1 − blob_2.cone1) × [1 1 0.5]
d22 = (blob_1.cone2 − blob_2.cone2) × [1 1 0.5]
d1 = abs(d11) + abs(d22)
d12 = (blob_1.cone1 − blob_2.cone2) × [1 1 0.5]
d21 = (blob_1.cone2 − blob_2.cone1) × [1 1 0.5]
d2 = abs(d12) + abs(d21)
dist_1 = min(d1, d2)
dist_2..11 = blob_1.features − blob_2.features
score = exp(−0.5 × sqrt(Σ_{i=1}^{11} (dist_i / standard_deviation_i)^2))
return score ≥ similarity threshold

Figure 3: Similarity function

Auxiliary Image Creation

Here we talk about the auxiliary image creation step. It is important to mention that this step is necessary in order for the user to make sense out of the image mining results. We use a web browser as the tool to have an integrated view of images, image features (blobs), object ids, associations, and association rules.

The association rules are of the form:
{oid_1, oid_2} => {oid_3}, support = s%, confidence = c%
For each image we show the original image with all the geometric shapes and then one blob image per
matched blob. Ideally, each blob should correspond to one shape but this did not always happen as we will
discuss. Each of the blobs is labeled with the object id generated by the preprocessing algorithm. These
are the id’s that appear in the rule. Right now this step is somewhat slow because it involves generating one
image per matched blob, but this process is done only once if the image mining program is run several times
over the same image set. Moreover, this step is alleviated by the fact that unmatched blobs are not displayed
and thus no image for them is generated.
After showing all the images there are two links to view the association text file and the association rule
text file generated by the previous step. These files contain all the associations and rules as well as statistical
information that helps the user interpret and verify the correctness of the experimental results.
Example
At this point we show a simple example illustrating how our image mining algorithms work with n = 10.
The original images and their corresponding blobs are shown on Figure 4 and Figure 5. Association rules
corresponding to the identified objects are shown on Figure 6. We chose 10 representative images from the
image set we created for our experiments. This set of synthetic images is explained in detail in the next
section.
In Figures 4 and 5 we show the original image at the left with several geometric shapes and white
background. These images are labeled with an image id. These images are the only input data for our
program; no domain knowledge is used. Then we show a series of blob images, each containing one
blob. These images are labeled with the id obtained in the preprocessing algorithm. Each blob has a close
(most times equal) position to its corresponding geometric shape. There are some cases in which one blob
corresponds to several geometric shapes. For instance in image 013 object 2 corresponds to the triangle and
object 3 corresponds to the circle. In image 131 object 4 corresponds to all the shapes in the image.
As parameters for object identification we set color standard deviation to 0.29 and contrast standard
deviation to 0.5. We tuned these parameters after several experiments. The remaining parameters did not
improve object identification for this set of synthetic images so their values were set to infinity.
The data mining algorithm was run with a 20% support and 70% confidence. The output is a set of
association rules whose support and confidence are above these thresholds.
The 66 rules obtained by the program are shown on Figure 6. Let us analyze some of these rules. The first rule {3} => {2} means that if there is a circle in the image then there is also a triangle. In fact with these simple images, there was never a circle without a triangle. In this case the rule actually has a higher support and a higher confidence than that obtained by the program (50% and 83% respectively). This happened because the circle had two different object identifiers: 3 and 7. The rule {2, 3, 5} => {8} says if there is a circle, a triangle and a hexagon then there is also a square. Once again images containing the first three shapes always contained the square. Another interesting rule is {3, 11} => {2}. In this case the rule says that if there are a circle and an ellipse then there is also a triangle; once again the rule is valid; note that this rule has a low support.
It is important to note that several incorrect or useless blob matches such as 9, 10, 13, 14, 16 are filtered
out by the 30% support. That is the case for images 029, 108, 119, 131 and 144. There are no rules that
involve these identified objects (matched blobs).
IMAGE MINING RESULTS
Number of mined images: 10
Image: 013 object: 2 object: 3 object: 4
Image: 018 object: 2 object: 5 object: 6
Image: 025 object: 2 object: 7 object: 8
Image: 029 object: 6 object: 7 object: 2 object: 9
Image: 033 object: 2 object: 10 object: 7 object: 8 object: 3
object: 11
Image: 103 object: 2 object: 5 object: 8 object: 11 object: 3
object: 12
Figure 4: First part of images and blobs
Image: 108 object: 2 object: 10 object: 4 object: 13 object: 14
object: 11 object: 5 object: 8 object: 3
Image: 119 object: 9 object: 7 object: 15
Image: 131 object: 2 object: 5 object: 16 object: 8 object: 3
Image: 144 object: 13 object: 14 object: 16 object: 8 object: 15
object: 5 object: 12 object: 3 object: 11
Figure 5: Second part of images and blobs
RULES GENERATED
Parameters:
Support: 20%
Confidence: 70%
Number of records: 10
Number of associations: 30
{3} = {2} s= 50%, c= 83% {3} = {8} s= 50%, c= 83%
{5} = {2} s= 40%, c= 80% {5} = {3} s= 40%, c= 80%
{5} = {8} s= 40%, c= 80% {5} = {3,8} s= 40%, c= 80%
{7} = {2} s= 30%, c= 75% {8} = {2} s= 50%, c= 83%
{8} = {3} s= 50%, c= 83% {11} = {2} s= 30%, c= 75%
{11} = {3} s= 40%, c=100% {11} = {5} s= 30%, c= 75%
{11} = {8} s= 40%, c=100% {11} = {2,3} s= 30%, c= 75%
{11} = {2,8} s= 30%, c= 75% {11} = {3,5} s= 30%, c= 75%
{11} = {3,8} s= 40%, c=100% {11} = {5,8} s= 30%, c= 75%
{11} = {3,5,8} s= 30%, c= 75% {11} = {2,3,8} s= 30%, c= 75%
{2,3} = {8} s= 40%, c= 80% {2,5} = {3} s= 30%, c= 75%
{2,5} = {8} s= 30%, c= 75% {2,5} = {3,8} s= 30%, c= 75%
{2,8} = {3} s= 40%, c= 80% {2,11} = {3} s= 30%, c=100%
{2,11} = {8} s= 30%, c=100% {2,11} = {3,8} s= 30%, c=100%
{3,5} = {2} s= 30%, c= 75% {3,5} = {8} s= 40%, c=100%
{3,5} = {11} s= 30%, c= 75% {3,5} = {2,8} s= 30%, c= 75%
{3,5} = {8,11} s= 30%, c= 75% {3,8} = {2} s= 40%, c= 80%
{3,8} = {5} s= 40%, c= 80% {3,8} = {11} s= 40%, c= 80%
{3,11} = {2} s= 30%, c= 75% {3,11} = {5} s= 30%, c= 75%
{3,11} = {8} s= 40%, c=100% {3,11} = {2,8} s= 30%, c= 75%
{3,11} = {5,8} s= 30%, c= 75% {5,8} = {2} s= 30%, c= 75%
{5,8} = {3} s= 40%, c=100% {5,8} = {11} s= 30%, c= 75%
{5,8} = {2,3} s= 30%, c= 75% {5,8} = {3,11} s= 30%, c= 75%
{5,11} = {3} s= 30%, c=100% {5,11} = {8} s= 30%, c=100%
{5,11} = {3,8} s= 30%, c=100% {8,11} = {2} s= 30%, c= 75%
{8,11} = {3} s= 40%, c=100% {8,11} = {5} s= 30%, c= 75%
{8,11} = {2,3} s= 30%, c= 75% {8,11} = {3,5} s= 30%, c= 75%
{2,8,11} = {3} s= 30%, c=100% {3,8,11} = {2} s= 30%, c= 75%
{3,8,11} = {5} s= 30%, c= 75% {5,8,11} = {3} s= 30%, c=100%
{3,5,11} = {8} s= 30%, c=100% {2,5,8} = {3} s= 30%, c=100%
{3,5,8} = {2} s= 30%, c= 75% {3,5,8} = {11} s= 30%, c= 75%
{2,3,11} = {8} s= 30%, c=100% {2,3,8} = {5} s= 30%, c= 75%
{2,3,8} = {11} s= 30%, c= 75% {2,3,5} = {8} s= 30%, c=100%
Number of rules generated: 66
Figure 6: Association rules for identified objects
Range Content
000-009 1 shape
010-019 2 shapes
020-029 3 shapes
030-049 4 shapes
100-109 5 shapes not overlapping without rectangle and L
110-119 5 shapes with some overlapping shapes
120-129 6 shapes not overlapping without L
130-139 6 shapes with some overlapping shapes
140-149 7 shapes
Table 1: Image Content
7 Experimental Results
Synthetic Image Generation
To test our image mining algorithm we created synthetic images. We used synthetic images as a starting
point in showing the feasibility of mining images. Also, with our constrained image set, we can more readily
determine the weaknesses and strengths of our approach. These images are 192x128 color JPEG format
because the Blobworld software from UCB needed this specific size. It is important to note that this format
uses a lossy compression scheme and thus image quality is deteriorated. This was not a problem for our
synthetic images, but it may be a problem for images with rich information content, such as photographs.
Our images contain a combination of plain geometric shapes. The seven shapes for our experiments
were: triangle, circle, square, rectangle, hexagon, ellipse and an irregular shape similar to the letter L.
Each of the shapes had a different uniform color and a black border. The background was always white.
The texture for each shape was uniform with one exception, the irregular shape. For technical reasons two
additional little objects were added to two opposing corners of each image to delimit its size. These little
objects are ignored by the feature extraction step because it discards objects whose area does not represent
more than 2% of the total image area.
Each geometric shape has the same size and color in each image where it appears. All shapes have a
similar size with respect to each other. However, their position and orientation can differ between images.
To make the mining process more interesting, in some cases we overlapped shapes or placed them very
close to each other so they would seem to be part of the same object by the segmentation algorithm.
With the guidelines mentioned above we manually generated 100 basic images and we replicated or
deleted some or all of these images to obtain larger or smaller image sets for our experiments. In Table 1
we show summary information for the images we created. We partitioned images into classes according to
their content complexity. Image id’s below 100 indicate easy to mine images and image id’s greater than
100 mean difficult images, that is, images with complex content. In the easy images we have no more than
4 shapes plus the background. Shapes are in different positions but they are not close to each other and they
are not overlapping either. For difficult images we have up to 7 shapes plus the background. In this case
shapes overlap and also may be close to each other. This certainly makes the image mining process more
difficult as we shall see.
Step/# of images 50 100 150 200
feature extraction 50292 80777 127038 185080
object identification 210 338 547 856
auxiliary image creation 3847 6911 10756 13732
association rules finding 6 3 6 4
Table 2: Measured times in seconds for each Image Mining step with different image set sizes
Hardware & Software
We ran our experiments on a Sun Multiprocessor (forge) computer with 4 processors (each running at 100
MHz) and 128 MB of RAM. The image mining program was written in Matlab and C.
The feature extraction process is done in Matlab by the software we obtained from UCB. Object
identification and record creation were also done in Matlab. An html page is created in Matlab to interpret
results. The association rules were obtained by a program written in C.
Performance Results
In this section we examine the performance of the various components of the image mining process, as
shown in Table 2 for several image set sizes. These times were obtained by averaging the elapsed times of
executing the image mining program five times.
Running time
Feature extraction is slow and there are several reasons for this. Every image is a matrix and every entry
(pixel) is analyzed several times (about 9). If image size increases performance should degrade considerably
since feature extraction is quadratic in image size. Nevertheless, this step is done only once and does not
have to be repeated to run the image mining algorithm several times. So, this table shows the times assuming
we are mining a set of unprocessed images. In fact, the images will already be stored in some CBIR, like
the Blobworld system, and this cost should not be attributed to the mining process.
Our object identification step does not use images but rather the blob descriptors produced by the feature
extraction step. The number of blob descriptors per image varies from two up to about double the number of objects contained in the image. For instance it was common to have two blob descriptors for the square, two for the circle,
etc. In most cases at least one was matched. This was a disadvantage for matching objects when they were
apart from each other, but it turned out to be an advantage when objects were very close or overlapping.
It also happened that one blob represented several objects, with one being the dominant geometric shape;
for instance look at Figure 5 with image 131 and object 2. Right now for the sake of accuracy every blob
descriptor is compared against the rest not compared yet; we skip an object (blob descriptor) if it has already
been matched. If there are a few matched objects the time for the object identification step is almost linear
since the similarity function is computed fewer times as the program runs. For our experimental results time
for this step scales up well.
Auxiliary image creation is relatively slow and its time is linear since it is done on a per image basis.
However, in this step auxiliary images are created only once if we run the program several times, as in the
feature extraction step. Therefore, when the integrated view of results is created we only relabel each blob
image with its corresponding object id and running time becomes comparable to the association rule finding times, which is on the order of a few seconds rather than thousands of seconds.
Our association rules program is fast given these image set sizes. The time it takes to find rules is the
lowest among all steps. The time for 50 images is as high as the other ones because there were many more
associations found. For the remaining sizes times are similar because the number of associations found
decreases as size grows. It is important to note that given larger image set sizes we do not expect this step
to take much longer if there are a few identified objects, as is the case with our experimental images. On
the other hand if there are many identified objects and we run this step with a low support it may take longer
since the number of associations found grows in a combinatorial manner as we lower the support threshold.
In short, if the image mining program is run several times over the same image set only the times for
the second and the fourth step should be considered since image features already exist and auxiliary images
have already been created.
Quality of results
We should mention that there were no false positives. It did not happen that an object was incorrectly
identified. For instance when we found a match between two objects they were the same shape. No triangle
was mistakenly taken as a square, or a square as a rectangle.
The background represents an object itself. Since association rules with the background were not
interesting for our purposes it was eliminated from consideration by the object identification step. It is
important to note that this is done after objects have been identified.
We tuned the object identification step to find similar objects by changing the values of several parameters
in the following manner. The most important features used from each object were color and contrast. We
allowed little variance for color (0.3) and the maximum allowed variance for contrast (0.5). We ignored
shape, because objects could be partially hidden and rotated. Position was considered unimportant because
objects could be anywhere in each image. Anisotropy and polarity were ignored because almost all our
shapes had uniform texture. Area was given no weight because objects could be overlapping, and thus their
area diminished; this can be useful to make perfect matches when objects are apart from each other.
We lowered the support down to 20% and confidence was raised to 80%. A few rules had high support.
One problem that arose during our experiments was that the same shape could have two different blob
descriptors, and these blob descriptors could not be matched with two other descriptors for the same shape
in another image. This caused two problems. First, a rule could be repeated because it related the same
shapes. Second, a rule did not have enough support and/or confidence and therefore was discarded. So, the
rules found were correct and in many cases had an actual higher support and also higher confidence.
To our surprise in some cases there were no object matches because an object was very close to another
one or was located in a corner of the image. When two or more objects were overlapping or very close
they were identified as a single object. This changed the features stored in the blob. The problem was due
to the ellipsoidal shape of the blobs and the fact that when a geometric shape was located in a corner that changed its anisotropy and polarity descriptors. Generating a blob for an object very close to one corner requires determining an adequate radius for the blob (i.e., ellipse).
Regular shapes such as the triangle, square and hexagon were easily matched across images. This is a
direct consequence of the circular blob representation produced when the image is segmented. In this case
neither position nor rotation affect the mining process at all. It was surprising that in some cases there were
no matches for the circle; in these cases it was in a corner or some other shape was very close or overlapping.
Another important aspect about shape is that we do not use it as a parameter to mine images, but shape plays
an important role during the segmentation step. So, shape does affect the image mining results quality.
The rectangle and the ellipse are the next shapes that are easily matched even though we did not use the
shape feature. The most complicated shape was the L. In this case a number of factors affected matches.
When this shape was overlapped with other shapes a few matches were found because a big blob was
generated. Also, orientation changed dominant colors and contrast. When the L was close to another shape
its colors were merged making it dissimilar to other L shaped objects. This suggests that irregular shapes in
general make image mining difficult.
We worked with color images but it is also possible to use black and white images. Color and texture
were important in mining the geometric shapes we created. However, we ignored shape as mentioned above.
Shape may be more important for black and white images, but it would require more accurate shape descriptors than those provided by the blobs.
8 Future Work
Results obtained so far look promising but we need to improve several aspects in our research effort. We
are currently working on the following tasks. This list may look comprehensive but it is the result of the
assessment of our initial implementation and experimental results. These tasks are sorted by difficulty. We
start with the easiest ones, ending with longer term goals.
We need to incorporate objects in different sizes. This does not seem a difficult thing to try since we rely on color and texture to match objects. Object sizes should represent at least 2% of the image area in order to be considered by the feature extraction step.
We are in the process of improving auxiliary image generation to treat blob descriptors and blobs
themselves just as internal information and display object identifiers (oid’s) directly on each image; this
makes result interpretation easier and this process faster since only one image is generated. As explained
before we have displayed one image per blob as produced by the Berkeley system. The drawback of this
latter approach is that we require human intervention to interpret the discovered rules.
Images of different size and resolution will be analyzed. This looks like a complex problem because of
a variety of reasons. First, if we try to mine photographs information content is in general very high and it
is likely we should try to use some type of domain knowledge. Second, since images are treated as matrices
performance will always be an issue. Higher resolution means slower feature extraction and possibly more
identified objects. On the other hand, lower resolution images have faster feature extraction, but provide
less knowledge to be discovered. Right now, we are depending on the constraints of the Blobworld system.
We also need to analyze images with repeated geometric shapes. If we want to obtain simple association
rules this can make our program more general. This can be done without further modification to what is
working. However, if we want to mine for more specific rules then we would need to modify our algorithm.
For instance, we could try to get rules like the following: if there are two rectangles and one square then we
are likely to find three triangles. The issues are the combinatorial growth of all the possibilities to mine and
also a more complex type of condition. We will also study more deeply the problem of mining images with
more complex shapes such as the irregular one similar to the letter L.
We need a systematic approach to determine an optimal similarity threshold or at least a close one. A
very high threshold means only perfect matches are accepted. On the other hand, a very low similarity
threshold may mean any object is similar to any other object. Finding the right similarity threshold for each
image type looks like an interesting problem. Right now it is provided by the user but it can be changed to
be tuned by the algorithm itself. Also, there are many ways to tune the eleven parameters to match blobs
and the optimal tuning may be specific to image type.
There also exists the possibility of using other segmentation algorithms that could perform faster or better
feature extraction. It is important to note that these algorithms should give a means to compare segmented
regions and provide suitable parameters to perform object matching in order to be useful for image mining.
From our experimental results it is clear that this step is a bottleneck for the overall performance of image
mining.
We can change the object identification algorithms to generate overlapping object associations using
more features. Our algorithm currently generates partitions of objects, that is, if one object is considered
similar to another one, the latter one will not be compared again. By generating overlapping associations
we can find even more rules. For instance a red rectangular object may be considered similar to another
rectangular object and at the same time be similar to another red object. Mining by position is also possible;
for instance two objects in a certain position may imply another object to be in some other position. Since
the software we are using for feature extraction produces eleven parameters to describe blobs, we have 2^11 possibilities to match objects.
We need to minimize I/O for large collections of images. Having 200 images represents a small image
database used in real life. We intend to extend our algorithm to handle arbitrary collections of images.
Suitable buffering capabilities would be needed for our modified version of the image mining process. An
approach could be to process blob descriptors by pairs of blocks to minimize disk access. This would require
major modifications for our object identification algorithm since independent object id’s could be generated
for the same object. A straightforward extension of the same algorithm is discarded since disk access would
be O(n^2) and this number does not scale up for heavy disk access.
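A minimal sketch of the block-pairing idea (the driver names in the comments are hypothetical):

def block_pairs(num_blocks):
    """Yield a schedule that visits every pair of disk blocks of blob
    descriptors exactly once, so disk access grows with the number of
    block pairs rather than with the number of descriptor pairs."""
    for b1 in range(num_blocks):
        yield b1, b1                      # compare a block against itself
        for b2 in range(b1 + 1, num_blocks):
            yield b1, b2                  # and against every later block

# Hypothetical driver: load_block(b) would read one block of descriptors
# from disk and compare_blocks would do the in-memory matching:
#   for b1, b2 in block_pairs(num_blocks):
#       compare_blocks(load_block(b1), load_block(b2))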
If the image set grows considerably it may be advisable to use image indexing and buffering to make
retrieval faster. An image can be indexed in a variety of ways. By id, by color, by texture, by size, by shape,
etc. And also, each parameter in the standard deviation vector is a floating point number. So, there are many ways to find the best match between objects. This looks like an NP-hard problem in n, the number of images.
Other directions for future research include using parallel processing applied to some steps of the process.
This includes using partitioning or even pipelining techniques to accelerate image processing. The problem
becomes more complex if we try to mine images stored in different computers in a distributed fashion.
Parallel processing for feature extraction is straightforward since each image is processed independently.
However, parallel processing of object identification is an interesting problem if we also want to minimize
disk access as mentioned above.
We can perform incremental mining when a few images are added to an already mined image set. The
possibility of reusing work performed in one image to save work in another image when image contents are
very similar is an option. However, this could imply using domain knowledge, rendering the algorithm less
general.
We can incorporate a time dimension to associations and rules when there are historical images on the
same scenario. Other tasks include developing a multiple pass algorithm to match objects using different parameter values, using real images to perform experiments, applying other data mining techniques such as classifier trees, machine learning and advanced statistical analysis, and mining images together with the textual information associated with each image; right now we are only extracting knowledge from the image itself. In real image databases there is frequently information about each image which could improve the mining process and produce more complex rules.
Right now we have feature extraction, object identification and object data mining as separate steps. It is
not clear whether we can merge these steps to make processing faster or more accurate. The most promising research
direction would be to perform image mining directly on segmented images, that is, merging the second and
third steps.
9 Conclusions
We presented a new algorithm to perform data mining on images and an initial performance study. Our
algorithm is general because it does not use any type of domain knowledge. The algorithm performed well
with the synthetic image set we generated. The ideas we present are interesting from both a theoretical and
a practical point of view. We studied this problem in the context of data mining for databases. We used
existing results in computer vision theory at the minimum necessary. Our image mining algorithm has four
major steps: feature extraction, object identification, auxiliary image creation and identified object mining.
The slowest part of image mining is the feature extraction step, which is really a part of the process of storing
images in a CBIR system, and is done only once. The next slowest operation is creating the auxiliary blob
images which is also done once. Object identification and association rule finding are fairly fast and scale up
well with image set size. We also presented several improvements to our initial approach of image mining.
Our experimental results are promising and show a good deal of potential for future study. No false rules
are found. Rules referring to specific objects are obtained regardless of object position, object orientation,
and even object shape when one object is partially hidden. Nevertheless, these features can be used to
extract more rules as we explained before.
Images contain a great deal of information, and thus the amount of knowledge that we can extract from
them is enormous. We believe that this work will open a new research direction in the area of knowledge
discovery in databases.
Acknowledgments
We thank Chad Carson from the University of California at Berkeley for helping us set up the Blobworld
system. We also thank Sham Navathe for his comments to improve the presentation of this paper.
References
[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large
databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of
Data, pages 207–216, Washington, DC, May 26-28 1993.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In
Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile, August
29-September 1 1994.
[3] S. Belongie, C. Carson, H. Greenspan, and J. Malik. Recognition of images in large databases using a
learning framework. Technical Report TR 97-939, U.C. Berkeley, CS Division, 1997.
[4] B.G. Buchanan and T.M. Mitchell. Pattern-Directed Inference Systems, chapter Model-Directed
Learning of Production Rules, pages 297–312. Academic Press, New York, 1978.
[5] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Region-based image querying. In IEEE Workshop
on Content-Based Access of Image and Video Libraries, 1997.
[6] C. Carter and J. Catlett. Assessing credit card applications using machine learning. IEEE Expert,
2(3):71–79, Fall 1987.
[7] G. Dunn and B. S. Everitt. An Introduction to Mathematical Taxonomy. Cambridge University Press,
New York, 1982.
[8] D. Forsyth, J. Malik, M. Fleck, H. Greenspan, T. Leung, S. Belongie, C. Carson, and C. Bregler. Finding
pictures of objects in large collections of images. Technical report, U.C. Berkeley, CS Division, 1997.
[9] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus. Knowledge Discovery in Databases, chapter
Knowledge Discovery in Databases: An Overview, pages 1–27. MIT Press, 1991.
[10] V. Gudivada and V. Raghavan. Content-based image retrieval systems. IEEE Computer, 28(9):18–22,
September 1995.
[11] R. Hanson, J. Stutz, and P. Cheeseman. Bayesian classification theory. Technical Report FIA-90-12-7-
01, Artificial Intelligence Research Branch, NASA Ames Research Center, Moffett Field, CA 94035,
1990.
[12] M. Holsheimer and A. Siebes. Data mining: The search for knowledge in databases. Technical Report
CS-R9406, CWI, Amsterdam, The Netherlands, 1994.
[13] M. Houtsma and A. Swami. Set-oriented mining of association rules. Technical Report RJ 9567, IBM,
October 1993.
[14] M. McLeish, P. Yao, M. Garg, and T. Stirtzinger. Knowledge Discovery in Databases, chapter
Discovery of Medical Diagnostic Information: An Overview of Methods and Results, pages 477–490.
AAAI Press / The MIT Press, Menlo Park, California, 1991.
[15] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[16] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules. In
Proceedings of the VLDB Conference, pages 432–444, Zurich, Switzerland, September 1995.
[17] M. Siegel, E. Sciore, and S. Salveter. Knowledge Discovery in Databases, chapter Rule Discovery for
Query Optimization, pages 411–430. AAAI Press / The MIT Press, Menlo Park, California, 1991.
[18] S. Shekhar, B. Hamidzadeh, A. Kohli, and M. Coyle. Learning transformation rules for semantic query
optimization: A data-driven approach. IEEE Transactions on Knowledge and Data Engineering,
5(6):950–964, December 1993.
[19] A. Siebes. Homogeneous discoveries contain no surprises: Inferring risk-profiles from large databases.
Technical Report CS-R9430, CWI, Amsterdam, The Netherlands, 1994.
22

More Related Content

PPTX
Multimedia content based retrieval in digital libraries
DOC
CONTENT BASED IMAGE RETRIEVAL SYSTEM
PPT
PPT
Data mining techniques unit 1
PPTX
Mining single dimensional boolean association rules from transactional
PPTX
Data mining tasks
PDF
CS8592 Object Oriented Analysis & Design - UNIT IV
PPT
Codd's rules
Multimedia content based retrieval in digital libraries
CONTENT BASED IMAGE RETRIEVAL SYSTEM
Data mining techniques unit 1
Mining single dimensional boolean association rules from transactional
Data mining tasks
CS8592 Object Oriented Analysis & Design - UNIT IV
Codd's rules

What's hot (20)

PPTX
Network packet analysis -capture and Analysis
PPTX
Mobile dbms
PPT
Information Retrieval Models
PPTX
data wrangling (1).pptx kjhiukjhknjbnkjh
PPTX
Data Integration and Transformation in Data mining
PDF
The CAP Theorem
PPTX
Leaky Bucket & Tocken Bucket - Traffic shaping
PPT
Map reduce in BIG DATA
PPTX
Iso layers
PDF
Hybrid wireless protocols
PDF
Slide05 Message Passing Architecture
PDF
Cloud-forensics
PPTX
Distributed Query Processing
PDF
Lecture #8: Clark-Wilson & Chinese Wall Model for Multilevel Security
PPTX
Routing algorithm
PPTX
Secure Hash Algorithm
PPTX
Multiprocessor
PPTX
lazy learners and other classication methods
PPTX
Transport layer protocols : Simple Protocol , Stop and Wait Protocol , Go-Bac...
PPT
Chapter1 computer introduction note
Network packet analysis -capture and Analysis
Mobile dbms
Information Retrieval Models
data wrangling (1).pptx kjhiukjhknjbnkjh
Data Integration and Transformation in Data mining
The CAP Theorem
Leaky Bucket & Tocken Bucket - Traffic shaping
Map reduce in BIG DATA
Iso layers
Hybrid wireless protocols
Slide05 Message Passing Architecture
Cloud-forensics
Distributed Query Processing
Lecture #8: Clark-Wilson & Chinese Wall Model for Multilevel Security
Routing algorithm
Secure Hash Algorithm
Multiprocessor
lazy learners and other classication methods
Transport layer protocols : Simple Protocol , Stop and Wait Protocol , Go-Bac...
Chapter1 computer introduction note
Ad

Similar to Image mining (20)

PDF
Volume 2-issue-6-1974-1978
PDF
Volume 2-issue-6-1974-1978
PDF
International Journal of Computational Engineering Research(IJCER)
PDF
Design and Development of an Algorithm for Image Clustering In Textile Image ...
PDF
SIGNIFICANCE OF DIMENSIONALITY REDUCTION IN IMAGE PROCESSING
PDF
Ijaems apr-2016-16 Active Learning Method for Interactive Image Retrieval
PDF
Content-Based Image Retrieval by Multi-Featrus Extraction and K-Means Clustering
PDF
C OMPARATIVE S TUDY OF D IMENSIONALITY R EDUCTION T ECHNIQUES U SING PCA AND ...
PDF
Multivariate feature descriptor based cbir model to query large image databases
PDF
Et35839844
PDF
SEMANTIC IMAGE RETRIEVAL USING MULTIPLE FEATURES
PDF
A Comparative Study of Content Based Image Retrieval Trends and Approaches
PDF
An Unsupervised Cluster-based Image Retrieval Algorithm using Relevance Feedback
PDF
Ijarcet vol-2-issue-7-2287-2291
PDF
Ijarcet vol-2-issue-7-2287-2291
PDF
Volume 2-issue-6-2077-2080
PDF
Volume 2-issue-6-2077-2080
PDF
Global Descriptor Attributes Based Content Based Image Retrieval of Query Images
PDF
A Novel Method for Content Based Image Retrieval using Local Features and SVM...
Volume 2-issue-6-1974-1978
Volume 2-issue-6-1974-1978
International Journal of Computational Engineering Research(IJCER)
Design and Development of an Algorithm for Image Clustering In Textile Image ...
SIGNIFICANCE OF DIMENSIONALITY REDUCTION IN IMAGE PROCESSING
Ijaems apr-2016-16 Active Learning Method for Interactive Image Retrieval
Content-Based Image Retrieval by Multi-Featrus Extraction and K-Means Clustering
C OMPARATIVE S TUDY OF D IMENSIONALITY R EDUCTION T ECHNIQUES U SING PCA AND ...
Multivariate feature descriptor based cbir model to query large image databases
Et35839844
SEMANTIC IMAGE RETRIEVAL USING MULTIPLE FEATURES
A Comparative Study of Content Based Image Retrieval Trends and Approaches
An Unsupervised Cluster-based Image Retrieval Algorithm using Relevance Feedback
Ijarcet vol-2-issue-7-2287-2291
Ijarcet vol-2-issue-7-2287-2291
Volume 2-issue-6-2077-2080
Volume 2-issue-6-2077-2080
Global Descriptor Attributes Based Content Based Image Retrieval of Query Images
A Novel Method for Content Based Image Retrieval using Local Features and SVM...
Ad

Recently uploaded (20)

PPTX
GSA Content Generator Crack (2025 Latest)
PDF
AI/ML Infra Meetup | Beyond S3's Basics: Architecting for AI-Native Data Access
PDF
BoxLang Dynamic AWS Lambda - Japan Edition
PDF
Microsoft Office 365 Crack Download Free
PDF
EaseUS PDF Editor Pro 6.2.0.2 Crack with License Key 2025
PDF
MCP Security Tutorial - Beginner to Advanced
PDF
Salesforce Agentforce AI Implementation.pdf
PDF
Wondershare Recoverit Full Crack New Version (Latest 2025)
PDF
Ableton Live Suite for MacOS Crack Full Download (Latest 2025)
PPTX
Advanced SystemCare Ultimate Crack + Portable (2025)
PDF
AI/ML Infra Meetup | LLM Agents and Implementation Challenges
PPTX
Introduction to Windows Operating System
PPTX
CNN LeNet5 Architecture: Neural Networks
PPTX
Computer Software - Technology and Livelihood Education
DOCX
How to Use SharePoint as an ISO-Compliant Document Management System
PPTX
Full-Stack Developer Courses That Actually Land You Jobs
PDF
iTop VPN Crack Latest Version Full Key 2025
DOC
UTEP毕业证学历认证,宾夕法尼亚克拉里恩大学毕业证未毕业
PDF
Visual explanation of Dijkstra's Algorithm using Python
PPTX
Cybersecurity-and-Fraud-Protecting-Your-Digital-Life.pptx
GSA Content Generator Crack (2025 Latest)
AI/ML Infra Meetup | Beyond S3's Basics: Architecting for AI-Native Data Access
BoxLang Dynamic AWS Lambda - Japan Edition
Microsoft Office 365 Crack Download Free
EaseUS PDF Editor Pro 6.2.0.2 Crack with License Key 2025
MCP Security Tutorial - Beginner to Advanced
Salesforce Agentforce AI Implementation.pdf
Wondershare Recoverit Full Crack New Version (Latest 2025)
Ableton Live Suite for MacOS Crack Full Download (Latest 2025)
Advanced SystemCare Ultimate Crack + Portable (2025)
AI/ML Infra Meetup | LLM Agents and Implementation Challenges
Introduction to Windows Operating System
CNN LeNet5 Architecture: Neural Networks
Computer Software - Technology and Livelihood Education
How to Use SharePoint as an ISO-Compliant Document Management System
Full-Stack Developer Courses That Actually Land You Jobs
iTop VPN Crack Latest Version Full Key 2025
UTEP毕业证学历认证,宾夕法尼亚克拉里恩大学毕业证未毕业
Visual explanation of Dijkstra's Algorithm using Python
Cybersecurity-and-Fraud-Protecting-Your-Digital-Life.pptx

Image mining

  • 1. Image Mining: A New Approach for Data Mining Carlos Ordonez and Edward Omiecinski College of Computing Georgia Institute of Technology Atlanta, Georgia 30332-0280 February 20, 1998 Abstract We introduce a new focus for data mining, which is concerned with knowledge discovery in image databases. We expect all aspects of data mining to be relevant to image mining but in this first work we concentrate on the problem of finding associations. To that end, we present a data mining algorithm to find association rules in 2-dimensional color images. The algorithm has four major steps: feature extraction, object identification, auxiliary image creation and object mining. Our algorithm is general in that it does not rely on any type of domain knowledge. A synthetic image set containing geometric shapes was generated to test our initial algorithm implementation. Our experimental results show that image mining is feasible. We also suggest several directions for future work in this area. 1 Introduction Discovering knowledge from data stored in typical alphanumeric databases, such as relational databases, has been the focal point of most of the work in database mining. However, with advances in secondary and tertiary storage capacity, coupled with a relatively low storage cost, more and more non standard data (e.g., in the form of images) is being accumulated. This vast collection of image data can also be mined to discover new and valuable knowledge. The problem of image mining combines the areas of content-based image retrieval, image understanding, data mining and databases. To our knowledge, no other work has been done with regard to mining knowledge from a collection of images from a database perspective. An initial step towards tapping into the undiscovered wealth of knowledge from mining image-bases is the focus of this paper. This work can also be seen as a starting point for an as yet, unexplored area that that can provide enormous benefits. Image mining has two main themes. The first is mining large collections of images and the second is the combined data mining of large collections of image and associated alphanumeric data. As of now we have concentrated on mining only images; but our algorithm can be extended in a straightforward manner to handle images and associated alphanumeric data. An example of the first case might involve a collection of weather satellite imagery of various cities in the United States that has been recorded over an extended period of time. The data mining objective might be to find if there is some pattern that exists for an individual city (over time) or if there is some pattern that exists between different cities. An example of the second case might involve medical imagery and patient (alphanumeric data) records. To develop an accurate diagnosis or prognosis both image data (such as Xrays, SPECT, etc.) and patient data (such as weight, prior health conditions, family history, etc.) can be examined together to find interesting associations. 1
  • 2. Our data mining system is built on top of a content-based image retrieval system (CBIR), the “Blobworld” system from the University of California at Berkeley (UCB). This CBIR system supports object-based queries and thus eliminates the need for the manual indexing of image content. This is a major advantage since manually indexing massive collections of images is impractical. 2 Applications & Motivation Applications such as military reconnaissance, weather forcasting, the management of earth’s resources, criminal investigation and medical imaging all require (or will in the near future) the storage and processing of massive collections of images. For instance, NASA’s EOS system generates 1 terabyte of image data per day. To take full advantage of these image-bases and the knowledge they contain, data mining techniques will have to be applied. This opens up a new area with new problems (i.e., the area of image mining). One of the typical data mining problems is to find association rules among data items in a database. In a retail environment such as a grocery store, an association rule might state that customers who purchase spaghetti and Italian sausage also purchase red wine. This can be determined by examining all the customer transactions (i.e., purchases). In this case, the data is explicit, there is a specific data item for each of the three grocery items and an individual customer transaction would include a subset of those items and in general a subset of all the items sold by the store. In the case of image-bases, assumming that all the images have been manually indexed (or their contents classified) may not be feasible. This presents one major deviation (problem) from the typical data mining approach for numerical data. If images can efficiently be labelled by a semantic descriptor, then the mining can be done on those high level concepts. However, with hundred’s of thousands of images, this will become impossible. An alternative is to rely on automatic/semi-automatic analysis of the image content and to do the mining on the generated descriptors. For example, color, texture, shape and size can be determined automatically. Objects in an image can be determined by the similarity of those attributes. Consider the sample images shown in Figure 1. The color photographs, although shown in black and white, are pictures of different rivers near populated areas. They contain buildings, roads, bridges and water. Due to the different color and texture of these objects it is possible for an indexing algorithm to determine distinct objects, although without the semantic label. With a CBIR system it would be possible to retrieve images that contain similar objects. Hence an image mining system could determine whether certain objects appear together in enough of the images to make that fact interesting. The association rule and the supporting images (or a subset thereof) could be returned along with the information as to what constitutes the derived objects from the photographs. So for the samples in Figure 1, a computed association rule might be the following:  ¢¡¤£¦¥¨§© 1 determines  ¢¡¤£¥§© 2. The river might be represented by  ¢¡¤£¥§© 1 in the different photographs and the populated area (buildings and roads) by  ¢¡¤£¥§© 2. 3 Content-based Image Retrieval Content-based image retrieval (CBIR) systems will be needed to effectively and efficiently use large image databases. With a CBIR system, users will be able to retrieve relevant images based on their contents. 
CBIR researchers have typically followed two distinct directions [10] based on modelling the contents of the image as a set of attributes which is produced manually and stored, for example in a relational database 2
  • 3. Figure 1: City-River Images from the California Department of Water Resources using an integrated feature-extraction/object-recognition system. More recent research has recognized the need to develop a solution that captures the essence of both directions. However, there are still differences between the various current approaches. Mainly the differences can be categorized in terms of image features extracted, their level of abstraction and the degree of domain independence. Certainly tradeoffs must be made in building a CBIR system. For example, having automatic feature extraction is achieved at the expense of domain independence. Having a high degree of domain indepence is achieved by having a semiautomatic (or manual) feature extraction component. With CBIR systems [10], querying is facilitated through generic query classes. Examples of some query classes include color, texture, shape, attributes, text and domain concepts. Color and texture queries allow users to formulate the description of the images to be retrieved in terms of like color and texture. Queries can also be posed with regard to the text associated with the images. For instance, in a medical setting, image retrieval is not only based on image content but also on the physician’s diagnosis, treatment, etc. (i.e., additional textual data). We should also point out that CBIR differs from traditional database systems in that images are retrieved based on a degree of similarity and that records are usually retrieved from databases because of exactly matching specified attribute values. Various content-based retrieval systems (QBIC, Chabot and Photobook) have focused on material- oriented queries and have used low-level image properties (e.g., color and texture) to implement these queries. On the other hand, a content-based retrieval system developed at the University of California at Berkeley [3, 5] focuses on object-oriented queries. That is, queries that search for images that contain particular objects. The approach to object recognition at Berkeley is structured around a sequence of increasingly specialized grouping activities that produces a “blobworld” representation of an image, which is a transformation from the raw pixel data to a small set of localized coherent regions in color and textual space. The “blobworld” representation is based on image segmentation using the Expectation-Maximization 3
  • 4. algorithm on combined color and texture features. The salient feature of the Berkeley work is their approach to object recognition. Their approach [3] is based on the construction of a sequence of successively abstract descriptors through a hierarchy of grouping and learning processes. The image descriptors at a low level are color, texture, contrast, polarity, etc. and the grouping is based on spatiotemporal coherence of the local descriptors. The central notion in grouping is coherence and four major issues have been identified in [8] which are segmenting images into coherent regions based on integrated region and contour descriptors; fusing color, texture and shape information to describe primitives; using learning techniques for developing the relationship between object classes and color, texture and shape descriptors; and classifying objects based on primitive descriptors and relationships between primitives. 4 Data Mining Database mining, an important part of knowledge discovery, is defined as the automated discovery of previously unknown, nontrivial, and potentially useful information from databases. The information is a statement that describes the relationship among a set of objects contained in the database with a certain confidence such that the statement is in some sense simpler than enumerating all the relationships between the individual instances of objects [9]. For example, in a database of employees and their salaries, each instance represents the relationship between an individual employee and his salary. A statement such as “salaries of engineers is higher than the salaries of secretaries,” based on the instances of the database, conveys information that is implicit and interesting than listing the salaries of all engineers and secretaries. Database mining is the process of generating high-level patterns that have acceptable certainty and are also interesting from a database of facts. Knowledge Discovery has been applied effectively to solve many diverse problems. Examples include discovering rules of mass spectrometry from spectrogram data [4], assessment of credit card applicants [6], discovering rules for query optimization [17, 18], diagnosing diseases from medical data [14], inferring risk profiles of insurance policy holders [19], and so on. Knowledge discovery derives much of its success from reasoning techniques in artificial intelligence, expert systems, machine learning and statistics. Many paradigms such as inductive learning [15], Bayesian statistics [11], mathematical taxonomy [7], etc, have also been applied to knowledge discovery. In general, knowledge discovery is an amalgamation of concepts from diverse fields. Efficiency is, in general, important for any computational problem. However, for database mining it also determines whether a particular technique can be applied or not. For example, the number of possible ways to cluster objects into clusters in unsupervised learning is exponential in [12]. Hence, an algorithm which uses exhaustive search for clustering is impractical for real-world databases. In general any algorithm which grows faster than 2! is unlikely to be useful for large databases. Over the years, database systems, mainly relational, have made great strides in improving efficiency. The success of relational database systems in the business community can be attributed to these improvements. Many techniques such as, efficient access methods, buffer management, disk management, etc, are well understood. 
However, most of these techniques have been developed for on-line transaction processing (OLTP) applications. The access patterns for OLTP applications, which typically access a few hundred records, are considerably different from database mining applications, where entire tables may need to be scanned. One of the challenges in database mining is developing more efficient algorithms, better access structures, optimizing disk I/O, and so on. 4
  • 5. 5 Image Mining There are two major issues that will affect the image data mining process. One is the notion of similarity matching and the other is the generality of the application area, that is, the breadth of usefulness of data mining from a practical point of view. For a specific application area, associated domain knowledge can be used to improve the data mining task. Since data mining relies on the underlying querying capability of the CBIR system, which is based on similarity matching, user interaction will be necessary to refine the data mining process. With image mining we will consider the four broad problem areas associated with data mining: finding associations, classification, sequential patterns and time series patterns. With all of these, the essential component in image mining is identifying similar objects in different images. With typical basket-market analysis, the data is usually constrained to a fixed set of items that are explicitly labelled. It is also quite efficient to see if a transaction contains a particular item, i.e., requires an examination of the item labels associated with a transaction. In some cases the data might be pre-processed into a fixed record format where a field exists for each item in the domain and a Boolean value is associated with it, indicating the presence or absence of that item in the transaction. This preprocessing can be done automatically. In a general image mining setting, having a human label every possible object in a vast collection of images is a daunting task. However, we intend to capitalize on the recent work in CBIR, in particular, Blobworld [8]. As discussed previously, we intend to build our data mining system on top of a content-based image retrieval system. One premise behind supporting object-based queries in a CBIR system is to eliminate the need for manual indexing of image content. The CBIR system we will use is from Berkeley [8]. We will refer to it as the “Blobworld” system. This system produces a “blobworld” representation of each image. A “blob” is just a 2-D ellipse which possesses a number of attributes. An image is made up of a collection of blobs, usually less than ten. Each blob represents a region of the image which is relatively homogeneous with respect to color and texture. A blob is described by its color, texture and spatial descriptors. The descriptors are represented by multidimensional vectors. At this point, we will consider in detail, the problem of finding associations. The problem of generating association rules was first introduced in [1] and an algorithm called AIS was proposed for mining all association rules. In [13], an algorithm called SETM was proposed to solve this problem using relational operations. In [2], two algorithms called Apriori and AprioriTid were proposed. These algorithms achieved significant improvements over the previous algorithms. The rule generation process was also extended to include multiple items in the consequent and an efficient algorithm for generating the rules was also presented. In [16], we presented an efficient algorithm for mining association rules that was fundamentally different from prior algorithms. Compared to previous algorithms, our algorithm not only reduced the I/O overhead significantly but also had lower CPU overhead for most cases. These relationships will take the form of rules. An association rule is an implication of the form $#%(' , where 0)1'3254 , and 367'8#@9 . A is the set of objects, also referred to as items. 
is called the antecedent and ' is called the consequent of the rule. In general, a set of items (such as the antecedent or the consequent of a rule) is called an itemset. Each itemset has an associated measure of statistical significance called support. For an itemset $2B4 , support( ) = C , if the fraction of records in the database containing equals C . A rule has a measure of its strength called confidence defined as the ratio support( 8D7' ) / support( ). The problem of mining association rules is to generate all rules that have support and confidence greater than some user specified minimum support and minimum confidence thresholds, respectively. This problem 5
  • 6. can be decomposed into the following subproblems: 1. All itemsets that have support above the user specified minimum support are generated. These itemset are called the large itemsets. All others are said to be small. 2. For each large itemset, all the rules that have minimum confidence are generated as follows: for a large itemset and any 'E2F , if support( )/support( 3G5' ) H minimum confidence, then the rule IGP'@#Q%R' is a valid rule. 6 Image Mining Algorithm steps In this section, we present the algorithms needed to perform the mining of associations within the context of images. The four major image mining steps are as follows: 1. Feature extraction. Segment images into regions identifiable by region descriptors (blobs). Ideally one blob represents one object. This step is also called segmentation. 2. Object identification and record creation. Compare objects in one image to objects in every other image. Label each object with an id. We call this step the preprocessing algorithm. 3. Create auxiliary images. Generate images with identified objects to interpret the association rules obtained from the following step (html page creation). 4. Apply data mining algorithm to produce object association rules. Here we explain the Image Mining processing in more detail. We keep I/O at a minimum. Images are kept on disk. For feature extraction each image is accessed once. These features are stored in two files, one is an image with all the blobs and the other with the blob descriptors. These blob descriptors are used to build an array with all the features from all the images. Once features are extracted from images we perform object identification using only their blob descriptors; this process is performed entirely in memory. Auxiliary images are kept on disk; these images show each identified object. Images are not indexed because it is not necessary to search their contents once thay are segmented. Arrays of records are all that are needed to mine images once we have their features. Processing each image is performed independently of each other for feature extraction and this is done sequentially. At present the association rule algorithm is done entirely in memory; this is not a limitation because the number of records we feed to this algorithm is not large right now and also it makes processing time smaller. We store associations (which are simply sets) as bit vectors where each bit indicates the presence or absence of an object. Identified objects, object associations and association rules are stored in sequential text files for interpreting results but not for processing. Segmentation Step It is not our intention to describe in detail the feature extraction process from the blobworld system. We will rather outline the main steps involved in identifying coherent image regions. 6
  • 7. 1. Estimate scale color selection S . 2. Produce 6-dimensional feature vectors. These vectors contain summary information about color and texture only. 3. Produce several clusterings of feature vectors using the Expectation Maximization (EM) method. The 2 dominant colors are determined here. The number of groups in each clustering is called T . 4. Use the Minimum Description Length principle to decide which is the best T . 5. Segment the image into T regions using the spatial grouping of pixels. Each region is connected. 6. Apply a 3x3 max-vote filter to determine dominant colors. 7. Generate blobs with summary information about each region when such region has an area greater than 2% of the image area. Each blob has the most important information about each region. This information includes color, texture, shape, area and position, but only the the first three are considered relevant. Preprocessing Algorithm The basic algorithm for finding associations between images/blobs is similar to our association finding algorithm Partition [16], as long as we preprocess the image data. By preprocessing the image data, we will identify and label objects contained in the images using the image query processing algorithm [5]. The output of the preprocessing step will be a set of records, UWV , one for each image, containing the object identifiers for the objects contained in the image. This step is quite intensive since it is a similarity search between images, actually image descriptors. However, once this is accomplished, the actual data mining step will not require the expensive similarity searching. Our preprocessing algorithm is shown in Figure 2. First of all we initialize the records, which store the object id’s for each of the images. This algorithm has four nested loops. These loops are required to compare every blob against the rest in the image database. In this manner we do not miss any potential match. Note that comparisons are always made between blobs in different images. Assuming the blob descriptor vector dimension is bounded, then this algorithm is X 2! . This is a reasonable asumption since the segmentation step cannot produce a high number of blobs. Nevertheless, if the number of objects in each image can also grow without bound this algorithm is X 2 2! for the number of possible different objects and the number of images. This can render the algorithm slow if and are similar; in general this algorithm behaves well if `YaYb . The variable cdfe¢C © © dg ¥ is used to generate new object id’s when one blob is matched for the first time. This is necessary because a single object in one image may be similar to many other objects in the following images and all these objects should have the same id. When one blob turns out to be similar to another one we add the object id to their corresponding records. The first is the record that is being compared against the rest and the second one is the record for which a match was found. The second object in the comparison will be discarded avoiding a future unnecessary comparison. The similarity function to compare blobs is expensive to compute as we will see. So this has an impact on the overall performance of the algorithm. Each segmented image is treated as a record and its transformation to a set of identified objects will also produce a record. This representation will also give us a direct way to incorporate alphanumeric information 7
  • 8. INPUT: h segmented images, iqp 1 r p 2 rtsssfr p¤uv , where pxw is a record containing: an image id and a blob descriptor vector y1€ OUTPUT: Set of h records, i‚ 1 r  2rstss¤r ƒu„v containing the object identifiers for the blobs …‡† ‰ˆ 1  1 ‘“’ƒha” †  w 1  •–‡— ” …˜† 0’‚yg™edfg‘ ˆg€  0 …‡† ‰ˆ 1  1 h † hWi 1 ” † …˜† 7™ 1  1 h †Bj ˆlkd¨mgp w s yn€poq ˆlr j ‘ ‘“ˆlstd  ‘“r‚u„d…‡† 0ˆ 2  ˆ 1 v 1 h † h p … p w 2 s y1€w 2 ˆ j h’n‘xstye‘“f1z„dt€|{¨d¤‘xh~} –‡— p …Bj ˆlstˆ€gy¢r¢mgp w 1 s yn€tw 1 r p w 2 s y1€w 2 r j ˆlstˆ€gy¢r‚ˆ‘“{ ‘“z„r‚d j z‚’l€ r j ‘“yph€¢y¢r‚€ €¢dƒˆlye‘“ˆg’‚ho„h~} –‡— p … q ˆgr j ‘ ‘“ˆlstd7h~} –‡— ’‚yg™dtfg‘ ˆl€  ’‚yg™dfg‘ ˆg€ v 1q ˆlr j ‘ ‘“ˆlstd  q y¨ j d–˜— ”ap … ƒw 1  ƒw 1 … i†’‚yg™dtf“‘ ˆl€‡v ƒw 2  ƒw 2 … i†’‚yg™dtf“‘ ˆl€‡v‡ y¢r‚ˆ‡p w 2 s y1€w 2 y j s‰ye‘“f1z‚d€–‡— ”ap … –‡— ”ap … –‡— ” …‡† ‡ y¢r‚ˆ‡p w 1 s y1€w 1 y j s‰y‚‘“fnz‚d€‡ˆ q ‘“z‚dr‚d‹Š~y j ’‚hdŒstye‘“f1zaye‘gdy j ‘–˜— ” …‡† –‡— ” …˜† … ˆ€‘“dr~’‚uŽ‘Qu‚hŠƒyph‘“d€~s‰ye‘“f1z‚d€‡’‚yg™dfg‘ j Figure 2: Preprocessing Algorithm 8
  • 9. associated with the image into the image mining process. The algorithm can handle such information without modification, provided those additional attributes are treated as boolean values. Similarity Function The similarityfunction [5] between two blobs is essential for our image mining program. This function takes four parameters, the two blobs to be compared, a similarity threshold and a vector of standard deviations. The similarity function is mathematically defined as: C¨dfdg ‘‚e¢d ©¤’ # ¥¢“~”g•–g—˜¤™tšl›gœžgŸ¡   1 ¢ lŸ¡   2£ 2 ) where ¤ dfC © ‘„ §‚¥ ¡   ¢¡ 1 ) ¡   ¢¡ 2 ! #@¥ ¡   ¢¡ 1 G ¡   ¢¡ 2 !x¦ Σ“ 1 ¡   ¢¡ 1 G ¡   ¢¡ 2 !“§ 1¨ 2© In these formulas ¡   ¢¡ 1 and ¡   ¢¡ 2 are vectors containing summary features and Σ“ 1 represents the vector containing the standard deviations allowed for matching on each desired feature. The G 1 power means we divide each distance by the corresponding entry of this vector; this is clarified in Figure 3. This similarity measure is 1 if there is a perfect match on all desired features and approaches zero as the match becomes worse. A low similarity can mean every object is similar to any other object. It is important to note that the distance for color is computed in a special manner. Colors are stored as three coordinates for a point located in a color-cone, referred as the hue saturation value (hsv). The distance is computed as the minimumpairwise distance of the two dominant colors of the object. Each color is a point in 3-dimensional space affecting its third coordinate by a weight of 0.5 and leaving the first two unchanged. This is computed as a matrix product between the color vector and the weights. This distance constitutes the first entry of the difference vector. For all the remaining 10 features the distance is just computed as the difference between each blob entry. The standard deviation vector permits adjusting parameters for the image mining process to use or discard features in an easy way. If we want to pay close attention to one specific feature we set the standard deviation to a value close to zero, but never zero. If we do not consider some feature to be relevant to the mining process we set the standard deviation to a high value. For some of the features the blobworld system requires standard deviations to be in specific ranges. More specifically standard deviations for color, anisotropy, and contrast require standard deviations at a maximum value of roughly 0.5. For area at most 0.1 is permitted. For all the remaining features any positive standard deviation is legal. If some feature is considered as completely irrelevant an infinity value is assigned to the corresponding entry. A more detailed description of the similarity function between two blobs is given in Figure 3. Auxiliary Image Creation Here we talk about the auxiliary image creation step. It is important to mention that this step is necessary in order for the user to make sense out of the image mining results. We use a web browser as the tool to have an integrated view of images, image features (blobs), object ids, associations, and association rules. The association rules are of the form: 9
  • 10. p —«ª­¬ h : yn€ 1 r y1€ 2 r j ˆlstˆ€gy¢r‚ˆ‘“{ ‘“z‚r‚d j z„’g€ r j ‘“y¢h€¢y¢r‚€ €¢dtƒ¨ˆgye‘“ˆl’‚h†­¬ h ª®¬ h : 1 q ’‚rƒy¯s‰ye‘“f1z r 0 ’n‘“z„dtr‚Šƒˆ j d € 11  mly¤g’‚y 1 s f1’‚hd 1 i°y¤g’‚y 2 s f1’‚hd 1o²±~³ 1 1 0s 5´ € 22  mly¤g’‚y 1 s f1’‚hd 2 i°y¤g’‚y 2 s f1’‚hd 2o²±~³ 1 1 0s 5´ € 1 ¶µ € 11µ²v ypy j µ € 22µ € 12  mgyxg’‚y 1 s f1’‚hd 1 iXyxg’‚y 2 s f1’‚hd 2o±~³ 1 1 0s 5´ € 21  mgyxg’‚y 1 s f1’‚hd 2 iXyxg’‚y 2 s f1’‚hd 1o±~³ 1 1 0s 5´ € 2  y¢y j µ € 12µv ypy j µ € 21µ €¢ˆ j ‘ 1  stˆlhmg€ 1r € 2o €¢ˆ j ‘ 2:11  y¤g’‚y 1 s q dtye‘“u„r‚d j i8y¤g’‚y 2 s q dtye‘“u„r‚d j j f1’‚r‚d  d·¸²mi 0s 5 ± j¹ rn‘tm Σ11 w»º 1 mgmg€¢ˆ j ‘ w ¼ j ‘“yph€¢y¢r‚€ €¢dƒˆlye‘“ˆg’‚h w o 2 ogoo r‚dx‘“u‚r‚h j f1’‚r‚dB½  j ˆgs‰ˆ¾lypr‚ˆ»‘“{ ‘“z‚r‚d j z„’g€ Figure 3: Similarity function ¿ d ¤ 1 ) d ¤ 2À % ¿ d ¤ 3À ) CÁÂà  e © #¶ % ) §‚  cd ¤ ¥ §‚¥ #¶' % For each image we show the original image with all the geometric shapes and then one blob image per matched blob. Ideally, each blob should correspond to one shape but this did not always happen as we will discuss. Each of of the blobs is labeled with the object id generated by the preprocessing algorithm. These are the id’s that appear in the rule. Right now this step is somewhat slow because it involves generating one image per matched blob, but this process is done only once if the image mining program is run several times over the same image set. And also, this step is alleviated by the fact that unmatched blobs are not displayed and thus no image for them is generated. After showing all the images there are two links to view the association text file and the association rule text file generated by the previous step. These files contain all the associations and rules as well as statistical information that help the user interpret and verify the correctness of the experimental results. Example At this point we show a simple example illustrating how our image mining algorithms work with # 10. The original images and their corresponding blobs are shown on Figure 4 and Figure 5. Association rules corresponding to the identified objects are shown on Figure 6. We chose 10 representative images from the image set we created for our experiments. This set of synthetic images is explained in detail in the next section. In Figures 4 and 5 we show the original image at the left with several geoemetric shapes and white background. These images are labeled with an image id. These images are the only input data for our program; no domain knowledge is used. Then we show a series of blob images, each containing one blob. These images are labeled with the id obtained in the preprocessing algorithm. Each blob has a close (most times equal) position to its corresponding geometric shape. There are some cases in which one blob corresponds to several geometric shapes. For instance in image 013 object 2 corresponds to the triangle and object 3 corresponds to the circle. In image 131 object 4 corresponds to all the shapes in the image. As parameters for object identification we set color standard deviation to 0.29 and contrast standard 10
  • 11. deviation to 0.5. We tuned these parameters after several experiments. The remaining parameters did not improve object identification for this set of synthetic images so their values were set to infinity. The data mining algorithm was run with a 20% support and 70% confidence. The output is a set of association rules whose support and confidence are above these thresholds. The 66 rules obtained by the program are shown on Figure 6. Let us analyze some of these rules. The first rule ¿ 3À % ¿ 2À means that if there is circle in the image then there is also a triangle. In fact with these simple images, there was never a circle without a triangle. In this case the rule actually has a higher support and a higher confidence than that obtained by the program (50% and 83 % respectively). This happened because the circle had two different object identifiers: 3 and 7. The rule ¿ 2 ) 3 ) 5À % ¿ 8À says if there is a circle, a triangle and an hexagon then there is also a square. Once again images containing the first three shapes always contained the square. Another interesting rule is ¿ 3 ) 11À % ¿ 2À . In this case the rule says that if there are a circle and an ellipse then there is also a triangle; once again the rule is valid; note that this rule has a low support. It important to note that several incorrect or useless blob matches such as 9, 10, 13, 14, 16 are filtered out by the 30% support. That is the case for images 029, 108, 119, 131 and 144. There are no rules that involve these identified objects (matched blobs). 11
  • 12. IMAGE MINING RESULTS Number of mined images: 10 Image: 013 object: 2 object: 3 object: 4 Image: 018 object: 2 object: 5 object: 6 Image: 025 object: 2 object: 7 object: 8 Image: 029 object: 6 object: 7 object: 2 object: 9 Image: 033 object: 2 object: 10 object: 7 object: 8 object: 3 object: 11 Image: 103 object: 2 object: 5 object: 8 object: 11 object: 3 object: 12 Figure 4: First part of images and blobs 12
  • 13. Image: 108 object: 2 object: 10 object: 4 object: 13 object: 14 object: 11 object: 5 object: 8 object: 3 Image: 119 object: 9 object: 7 object: 15 Image: 131 object: 2 object: 5 object: 16 object: 8 object: 3 Image: 144 object: 13 object: 14 object: 16 object: 8 object: 15 object: 5 object: 12 object: 3 object: 11 Figure 5: Second part of images and blobs 13
  • 14. RULES GENERATED Parameters: Support: 20% Confidence: 70% Number of records: 10 Number of associations: 30 {3} = {2} s= 50%, c= 83% {3} = {8} s= 50%, c= 83% {5} = {2} s= 40%, c= 80% {5} = {3} s= 40%, c= 80% {5} = {8} s= 40%, c= 80% {5} = {3,8} s= 40%, c= 80% {7} = {2} s= 30%, c= 75% {8} = {2} s= 50%, c= 83% {8} = {3} s= 50%, c= 83% {11} = {2} s= 30%, c= 75% {11} = {3} s= 40%, c=100% {11} = {5} s= 30%, c= 75% {11} = {8} s= 40%, c=100% {11} = {2,3} s= 30%, c= 75% {11} = {2,8} s= 30%, c= 75% {11} = {3,5} s= 30%, c= 75% {11} = {3,8} s= 40%, c=100% {11} = {5,8} s= 30%, c= 75% {11} = {3,5,8} s= 30%, c= 75% {11} = {2,3,8} s= 30%, c= 75% {2,3} = {8} s= 40%, c= 80% {2,5} = {3} s= 30%, c= 75% {2,5} = {8} s= 30%, c= 75% {2,5} = {3,8} s= 30%, c= 75% {2,8} = {3} s= 40%, c= 80% {2,11} = {3} s= 30%, c=100% {2,11} = {8} s= 30%, c=100% {2,11} = {3,8} s= 30%, c=100% {3,5} = {2} s= 30%, c= 75% {3,5} = {8} s= 40%, c=100% {3,5} = {11} s= 30%, c= 75% {3,5} = {2,8} s= 30%, c= 75% {3,5} = {8,11} s= 30%, c= 75% {3,8} = {2} s= 40%, c= 80% {3,8} = {5} s= 40%, c= 80% {3,8} = {11} s= 40%, c= 80% {3,11} = {2} s= 30%, c= 75% {3,11} = {5} s= 30%, c= 75% {3,11} = {8} s= 40%, c=100% {3,11} = {2,8} s= 30%, c= 75% {3,11} = {5,8} s= 30%, c= 75% {5,8} = {2} s= 30%, c= 75% {5,8} = {3} s= 40%, c=100% {5,8} = {11} s= 30%, c= 75% {5,8} = {2,3} s= 30%, c= 75% {5,8} = {3,11} s= 30%, c= 75% {5,11} = {3} s= 30%, c=100% {5,11} = {8} s= 30%, c=100% {5,11} = {3,8} s= 30%, c=100% {8,11} = {2} s= 30%, c= 75% {8,11} = {3} s= 40%, c=100% {8,11} = {5} s= 30%, c= 75% {8,11} = {2,3} s= 30%, c= 75% {8,11} = {3,5} s= 30%, c= 75% {2,8,11} = {3} s= 30%, c=100% {3,8,11} = {2} s= 30%, c= 75% {3,8,11} = {5} s= 30%, c= 75% {5,8,11} = {3} s= 30%, c=100% {3,5,11} = {8} s= 30%, c=100% {2,5,8} = {3} s= 30%, c=100% {3,5,8} = {2} s= 30%, c= 75% {3,5,8} = {11} s= 30%, c= 75% {2,3,11} = {8} s= 30%, c=100% {2,3,8} = {5} s= 30%, c= 75% {2,3,8} = {11} s= 30%, c= 75% {2,3,5} = {8} s= 30%, c=100% Number of rules generated: 66 Figure 6: Association rules for identified objects 14
  • 15. Range Content 000-009 1 shape 010-019 2 shapes 020-029 3 shapes 030-049 4 shapes 100-109 5 shapes not overlapping without rectangle and L 110-119 5 shapes with some overlapping shapes 120-129 6 shapes not overlapping without L 130-139 6 shapes with some overlapping shapes 140-149 7 shapes Table 1: Image Content 7 Experimental Results Synthetic Image Generation To test our image mining algorithm we created synthetic images. We used synthetic images as a starting point in showing the feasibility of mining images. Also, with our constrained image set, we can more readily determine the weaknesses and strengths of our approach. These images are 192x128 color JPEG format because the Blobworld software from UCB needed this specific size. It is important to note that this format uses a lossy compression scheme and thus image quality is deteriorated. This was not a problem for our synthetic images, but it may be a problem for images with rich information content, such as photographs. Our images contain a combination of plain geometric shapes. The seven shapes for our experiments were: triangle, circle, square, rectangle, hexagon, ellipse and an irregular shape similar to the letter L. Each of the shapes had a different uniform color and a black border. The background was always white. The texture for each shape was uniform with one exception, the irregular shape. For technical reasons two additional little objects were added to two opposing corners of each image to delimit its size. These little objects are ignored by the feature extraction step because it discards objects whose area does not represent more than 2% of the total image area. Each geometric shape has the same size and color in each image where it appears. All shapes have a similar size with respect to each other. However, their position and orientation can differ between images. To make the mining process more interesting, in some cases we overlapped shapes or placed them very close to each other so they would seem to be part of the same object by the segmentation algorithm. With the guidelines mentioned above we manually generated 100 basic images and we replicated or deleted some or all of these images to obtain larger or smaller image sets for our experiments. In Table 1 we show summary information for the images we created. We partitioned images into classes according to their content complexity. Image id’s below 100 indicate easy to mine images and image id’s greater than 100 mean difficult images, that is, images with complex content. In the easy images we have no more than 4 shapes plus the background. Shapes are in different positions but they are not close to each other and they are not overlapping either. For difficult images we have up to 7 shapes plus the background. In this case shapes overlap and also may be close to each other. This certainly makes the image mining process more difficult as we shall see. 15
  • 16. Step/# of images 50 100 150 200 feature extraction 50292 80777 127038 185080 object identification 210 338 547 856 auxiliary image creation 3847 6911 10756 13732 association rules finding 6 3 6 4 Table 2: Measured times in seconds for each Image Mining step with different image set sizes Hardware Software We ran our experiments on a Sun Multiprocessor (forge) computer with 4 processors (each running at 100 MHz) and 128 MB of RAM. The image mining program was written in Matlab and C. The feature extraction process is done in Matlab by the software we obtained from UCB. Object identification and record creation were also done in Matlab. An html page is created in Matlab to interpret results. The asociation rules were obtained by a program written in C. Performance Results In this section we examine the performance of the various components of the image mining process, as shown in Table 2 for several image set sizes. These times were obtained by averaging the ellapsed times of executing the image mining program five times. Running time Feature extraction is slow and there are several reasons for this. Every image is a matrix and every entry (pixel) is analyzed several times (about 9). If image size increases performance should degrade considerably since feature extraction is quadratic in image size. Nevertheless, this step is done only once and does not have to be repeated to run the image mining algorithm several times. So, this table shows the times assuming we are mining a set of unprocessed images. In fact, the images will already be stored in some CBIR, like the Blobworld system, and this cost should not be attributed to the mining process. Our object identification step does not use images but rather the blob descriptors produced by the feature extraction step. These blob descriptors vary from two to about double the number of objects contained in the image. For instance it was common to have two blob descriptors for the square, two for the circle, etc. In most cases at least one was matched. This was a disadvantage to match objects when they were apart from each other, but it turned out to be an advantage when objects were very close or overlapping. It also happened that one blob represented several objects, with one being the dominant geometric shape; for instance look at Figure 5 with image 131 and object 2. Right now for the sake of accuracy every blob descriptor is compared against the rest not compared yet; we skip an object (blob descriptor) if it has already been matched. If there are a few matched objects the time for the object identification step is almost linear since the similarity function is computed fewer times as the program runs. For our experimental results time for this step scales up well. 16
  • 17. Auxiliary image creation is relatively slow and its time linear since it is done on a per image basis. However, in this step auxiliary images are created only once if we run the program several times, as in the feature extraction step. Therefore, when the integrated view of results is created we only relabel each blob image with its corresponding object id and running time becomes comparable to the associaton rules finding times, which is in the order of a few seconds rather than thousands of seconds. Our association rules program is fast given these image set sizes. The time it takes to find rules is the lowest among all steps. The time for 50 images is as high as the other ones because there were many more associations found. For the reamining sizes times are similar because the number of associations found decreases as size grows. It is important to note that given larger image set sizes we do not expect this step to take much longer if there are a few identified objects, as it is the case with our experimental images. On the other hand if there are many identified objects and we run this step with a low support it may take longer since the number of associations found grows in a combinatorial manner as we lower the support threshold. In short, if the image mining program is run several times over the same image set only the times for the second and the fourth step should be considered since image features already exist and auxiliary images have already been created. Quality of results We should mention that there were no false positives. It did not happen that an object was incorrectly identified. For instance when we found a match between two objects they were the same shape. No triangle was mistakenly taken as a square, or a square as a rectangle. The background represents an object itself. Since association rules with the background were not interesting for our purposes it was eliminated from consideration by the object identification step. It is important to note that this is done after objects have been identified. We tuned the object identification step to find similar objects changing values for several parameters in the following manner. The most important features used from each object were color and contrast. We allowed little variance for color (0.3) and the maximum allowed variance for contrast (0.5). We ignored shape, because objects could be partially hidden and rotated. Position was considered unimportant because objects could be anywhere in each image. Anisotropy and polarity were ignored because almost all our shapes had uniform texture. Area was given no weight because objects could be overlapping, and thus their area diminished; this can be useful to make perfect matches when objects are apart from each other. We lowered the support down to 20% and confidence was raised to 80%. A few rules had high support. One problem that arose during our experiments was that the same shape could have two different blob descriptors, and these blob descriptors could not be matched with two other descriptors for the same shape in another image. This caused two problems. First, a rule could be repeated because it related the same shapes. Second, a rule did not have enough support and/or confidence and therefore was discarded. So, the rules found were correct and in many cases had an actual higher support and also higher confidence. To our surprise in some cases there were no object matches because an object was very close to another one or was located in a corner of the image. 
When two or more objects were overlapping or very close they were identified as a single object. This changed the features stored in the blob. The problem was due to the ellipsoidal shape of the blobs and the fact that when a geometric shape was located in a corner thta changed its anysotropy and polarity descriptors. Given a blob for an object very close to one corner means determining an adequate radius for the blob (i.e., ellipse). 17
  • 18. Regular shapes such as the triangle, square and hexagon were easily matched across images. This is a direct consequence of the circular blob representation produced when the image is segmented. In this case neither position nor rotation affect the mining process at all. It was surprising that in some cases there were no matches for the circle; in these cases it was in a corner or some other shape was very close or overlapping. Another important aspect about shape is that we do not use it as a parameter to mine images, but shape plays an important role during the segmentation step. So, shape does affect the image mining results quality. The rectangle and the ellipse are the next shapes that are easily matched even though we did not use the shape feature. The most complicated shape was the L. In this case a number of factors affected matches. When this shape was overlapped with other shapes a few matches were found because a big blob was generated. Also, orientation changed dominant colors and contrast. When the L was close to another shape its colors were merged making it dissimilar to other L shaped objects. This suggests that irregular shapes in general make image mining difficult. We worked with color images but it is also possible to use black and white images. Color and texture were important in mining the geometric shapes we created. However, we ignored shape as mentioned above. Shape may be more important for black and white images but more accurate shape descriptors, than those provided by the blobs. 8 Future Work Results obtained so far look promising but we need to improve several aspects in our research effort. We are currently working on the following tasks. This list may look comprehensive but it is the result of the assessment of our initial implementation and experimental results. These tasks are sorted by difficulty. We start with the easiest ones, ending with longer term goals. We need to incorporate objects in different sizes. This does not seem a difficult thing to try since we rely on color and texture to match objects, Object sizes should represent at least 2% of the image are in order to be considered by the feature extraction step. We are in the process of improving auxiliary image generation to treat blob descriptors and blobs themselves just as internal information and display object identifiers (oid’s) directly on each image; this makes result interpretation easier and this process faster since only one image is generated. As explained before we have displayed one image per blob as produced by the Berkeley system. The drawback of this latter approach is that we require human intervention to interpret the discovered rules. Images of different size and resolution will be analyzed. This looks like a complex problem because of a variety of reasons. First, if we try to mine photographs information content is in general very high and it is likely we should try to use some type of domain knowledge. Second, since images are treated as matrices performance will always be an issue. Higher resolution means slower feature extraction and possibly more identified objects. On the other hand, lower resolution images have faster feature extraction, but provide less knowledge to be discovered. Right now, we re depending on the constrainst of the Blobworld system. We also need to analyze images with repeated geometric shapes. If we want to obtain simple association rules this can make our program more general. This can be done without further modification to what is working. 
We are in the process of improving auxiliary image generation to treat blob descriptors and the blobs themselves as purely internal information and to display object identifiers (oid's) directly on each image; this makes result interpretation easier and the process faster, since only one image is generated. As explained before, we have so far displayed one image per blob, as produced by the Berkeley system. The drawback of that approach is that human intervention is required to interpret the discovered rules.

Images of different sizes and resolutions will be analyzed. This looks like a complex problem for a variety of reasons. First, if we try to mine photographs, the information content is in general very high and we would likely need some type of domain knowledge. Second, since images are treated as matrices, performance will always be an issue: higher resolution means slower feature extraction and possibly more identified objects, while lower-resolution images allow faster feature extraction but provide less knowledge to be discovered. Right now we depend on the constraints of the Blobworld system.

We also need to analyze images with repeated geometric shapes. If we only want simple association rules, this makes our program more general and can be done without further modification to what is already working. However, if we want to mine more specific rules, we would need to modify our algorithm. For instance, we could try to obtain rules like the following: if there are two rectangles and one square, then we are likely to find three triangles. The issues are the combinatorial growth of the possibilities to mine and a more complex type of condition.
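As a rough illustration of such count-based rules, the toy sketch below represents each image as a multiset of object identifiers and evaluates the confidence of a rule of the form "two rectangles and one square imply three triangles". The object names and data are purely illustrative, not output of our system.

```python
# Images as multisets of identified objects, so rules can require counts.
from collections import Counter

images = [
    Counter({"rectangle": 2, "square": 1, "triangle": 3}),
    Counter({"rectangle": 2, "square": 1, "triangle": 3, "circle": 1}),
    Counter({"rectangle": 1, "triangle": 2}),
]

def contains(image, pattern):
    # True if the image holds at least the required count of each object.
    return all(image[obj] >= n for obj, n in pattern.items())

def confidence(images, antecedent, consequent):
    matches = [img for img in images if contains(img, antecedent)]
    if not matches:
        return 0.0
    both = sum(1 for img in matches if contains(img, consequent))
    return both / len(matches)

print(confidence(images,
                 Counter({"rectangle": 2, "square": 1}),
                 Counter({"triangle": 3})))   # 1.0 on this toy data
```

The combinatorial concern raised above shows up directly: the candidate patterns now range over object counts, not just object presence.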
We will also study more deeply the problem of mining images with more complex shapes, such as the irregular one similar to the letter L.

We need a systematic approach to determine an optimal similarity threshold, or at least a close one. A very high threshold means only perfect matches are accepted; a very low similarity threshold may mean that any object is similar to any other object. Finding the right similarity threshold for each image type looks like an interesting problem. Right now the threshold is provided by the user, but this could be changed so that the algorithm tunes it itself. Also, there are many ways to tune the eleven parameters used to match blobs, and the optimal tuning may be specific to the image type.
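A minimal sketch of such a threshold test follows, assuming blobs are compared with a weighted distance over their eleven parameters; the distance-to-similarity transform, the weights and the threshold are placeholders, not values from our implementation.

```python
import math

NUM_PARAMS = 11  # parameters produced per blob by the feature extractor

def blob_distance(a, b, weights):
    """Weighted Euclidean distance between two 11-parameter descriptors."""
    return math.sqrt(sum(w * (x - y) ** 2
                         for w, x, y in zip(weights, a, b)))

def blob_similarity(a, b, weights):
    # Map distance to (0, 1]; identical blobs score 1.0.
    return 1.0 / (1.0 + blob_distance(a, b, weights))

def similar(a, b, threshold, weights=None):
    # A very high threshold accepts only near-perfect matches; a very low
    # one makes almost any pair of blobs "similar", as discussed above.
    weights = weights or [1.0] * NUM_PARAMS
    return blob_similarity(a, b, weights) >= threshold
```

Tuning the eleven weights per image type could then be framed as a search over this parameter space.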
There is also the possibility of using other segmentation algorithms that could perform faster or better feature extraction. To be useful for image mining, such algorithms must provide a means to compare segmented regions and suitable parameters to perform object matching. From our experimental results it is clear that this step is a bottleneck for the overall performance of image mining.

We can change the object identification algorithm to generate overlapping object associations using more features. Our algorithm currently generates partitions of objects; that is, if one object is considered similar to another, the latter is not compared again. By generating overlapping associations we can find even more rules. For instance, a red rectangular object may be considered similar to another rectangular object and at the same time similar to another red object. Mining by position is also possible; for instance, two objects in certain positions may imply another object in some other position. Since the software we use for feature extraction produces eleven parameters to describe blobs, there are 2^11 possible combinations of features with which to match objects.

We need to minimize I/O for large collections of images. A set of 200 images is a small image database by real-life standards, and we intend to extend our algorithm to handle arbitrary collections of images. Suitable buffering capabilities would be needed for such a modified version of the image mining process. One approach would be to process blob descriptors in pairs of blocks to minimize disk access. This would require major modifications to our object identification algorithm, since independent object id's could be generated for the same object. A straightforward extension of the current algorithm is ruled out, because its disk access grows quadratically and therefore does not scale when disk access is heavy. If the image set grows considerably, it may be advisable to use image indexing and buffering to make retrieval faster. An image can be indexed in a variety of ways: by id, by color, by texture, by size, by shape, and so on. Moreover, each parameter in the standard deviation vector is a floating-point number, so there are many ways to find the best match between objects. This looks like an NP-hard problem in n, the number of images.

Other directions for future research include applying parallel processing to some steps of the process, using partitioning or even pipelining techniques to accelerate image processing. The problem becomes more complex if we try to mine images stored on different computers in a distributed fashion. Parallel processing for feature extraction is straightforward, since each image is processed independently. However, parallel processing of object identification is an interesting problem if we also want to minimize disk access, as mentioned above.

We can perform incremental mining when a few images are added to an already mined image set. Reusing work performed on one image to save work on another whose contents are very similar is also an option; however, this could imply using domain knowledge, rendering the algorithm less general.
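A hedged sketch of the incremental step is given below: blobs from newly added images are compared only against one stored representative descriptor per existing object id, so images mined earlier need not be reprocessed. Keeping a single representative per object is our simplification for illustration, not the implemented algorithm; the match predicate could be the threshold test sketched above.

```python
def assign_object_ids(new_blobs, representatives, match):
    """Label new blob descriptors, reusing existing object id's.

    new_blobs:       feature vectors extracted from the added images
    representatives: dict oid -> representative feature vector, accumulated
                     during earlier mining runs (updated in place)
    match:           predicate deciding whether two descriptors denote the
                     same object, e.g. the threshold test sketched above
    """
    next_oid = max(representatives, default=-1) + 1
    labels = []
    for feats in new_blobs:
        for oid, rep in representatives.items():
            if match(feats, rep):
                labels.append(oid)
                break
        else:
            representatives[next_oid] = feats  # genuinely new object
            labels.append(next_oid)
            next_oid += 1
    return labels

# Example, with a hypothetical threshold of 0.8 and the similarity sketch:
# labels = assign_object_ids(new_blobs, reps, lambda a, b: similar(a, b, 0.8))
```

Association rule support counts could then be updated using only the transactions contributed by the new images.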
We can incorporate a time dimension into associations and rules when there are historical images of the same scenario. Further tasks are to develop a multiple-pass algorithm that matches objects using different parameter values, to use real images in our experiments, to apply other data mining techniques such as classifier trees, machine learning and advanced statistical analysis, and to mine images together with the textual information associated with them. Right now we only extract knowledge from the image itself, although in real image databases there is frequently information about each image that could improve the mining process and produce more complex rules.

Right now, feature extraction, object identification and object data mining are separate steps. It is not clear whether these steps can be merged to make processing faster or more accurate. The most promising research direction would be to perform image mining directly on segmented images, that is, merging the second and third steps.

9 Conclusions

We presented a new algorithm to perform data mining on images and an initial performance study. Our algorithm is general in that it does not use any type of domain knowledge, and it performed well on the synthetic image set we generated. The ideas we present are interesting from both a theoretical and a practical point of view. We studied this problem in the context of data mining for databases, drawing on existing results in computer vision theory only where necessary. Our image mining algorithm has four major steps: feature extraction, object identification, auxiliary image creation and identified-object mining. The slowest part of image mining is the feature extraction step, which is really part of the process of storing images in a CBIR system and is done only once. The next slowest operation is creating the auxiliary blob images, which is also done once. Object identification and association rule finding are fairly fast and scale well with image set size. We also presented several improvements to our initial approach to image mining.

Our experimental results are promising and show a good deal of potential for future study. No false rules are found. Rules referring to specific objects are obtained regardless of object position, object orientation, and even object shape when one object is partially hidden; nevertheless, these features can be used to extract more rules, as explained before. Images contain a great deal of information, and thus the amount of knowledge that we can extract from them is enormous. We believe this work will open a new research direction in the area of knowledge discovery in databases.

Acknowledgments

We thank Chad Carson from the University of California at Berkeley for helping us set up the Blobworld system. We also thank Sham Navathe for his comments to improve the presentation of this paper.