Image Ranking with Implicit Feedback from Eye Movements

David R. Hardoon (Data Mining Department, Institute for Infocomm Research (I2R)), e-mail: drhardoon@i2r.a-star.edu.sg
Kitsuchart Pasupa (School of Electronics & Computer Science, University of Southampton), e-mail: kp2@ecs.soton.ac.uk

Abstract

In order to help users navigate an image search system, one could provide explicit information on a small set of images as to which of them are relevant or not to their task. These rankings are learned in order to present a user with a new set of images that are relevant to their task. Since requiring such explicit information may not be feasible in a number of cases, we consider the setting where the user provides implicit feedback, eye movements, to assist when performing such a task. This paper explores the idea of implicitly incorporating eye movement features in an image ranking task where only images are available during testing. Previous work demonstrated that combining eye movement and image features improved the retrieval accuracy compared to using each of the sources independently. Despite these encouraging results, the proposed approach is unrealistic as no eye movements will be available a-priori for new images (i.e. only after the ranked images are presented would one be able to measure a user's eye movements on them). We propose a novel search methodology which combines image features together with implicit feedback from users' eye movements in a tensor ranking Support Vector Machine and show that it is possible to extract the individual source-specific weight vectors. Furthermore, we demonstrate that the decomposed image weight vector is able to construct a new image-based semantic space that achieves better retrieval accuracy than using the image features alone.

CR Categories: G.3 [Probability and Statistics]: Multivariate Statistics; H.3.3 [Information Search and Retrieval]: Retrieval models—Relevance feedback

Keywords: Image Retrieval, Implicit Feedback, Tensor, Ranking, Support Vector Machine

1 Introduction

In recent years large digital image collections have been created in numerous areas; examples include the commercial, academic, and medical domains. Furthermore, these databases also include the digitisation of analogue photographs, paintings and drawings. Conventionally, the collected images are manually tagged with various descriptors to allow retrieval to be performed over the annotated words. However, the process of manually tagging images is an extremely laborious, time consuming and expensive procedure. Moreover, it is far from an ideal situation, as both formulating an initial query and navigating the large number of retrieved hits are difficult. One image retrieval methodology which attempts to address these issues, and has been a research topic since the early 1990's, is the so-called "Content-Based Image Retrieval" (CBIR). In a CBIR system the search is based on the actual content of the image, which may include colour, shape, and texture, rather than on a textual annotation associated (if at all) with the image.

Relevance feedback, which is explicitly provided by the user on the quality of the retrieved images while performing a search query, has been shown to improve the performance of CBIR systems, as it is able to handle the large variability in semantic interpretation of images across users. Relevance feedback iteratively guides the system towards retrieving images the user is genuinely interested in. Many systems rely on an explicit feedback mechanism, where the user explicitly indicates which images are relevant for their search query and which are not. One can then use a machine learning algorithm to present a new set of images to the user which are more relevant, thus helping them navigate the large number of hits. An example of such a system is PicSOM [Laaksonen et al. 2000]. However, providing explicit feedback is also a laborious process as it requires continuous user response. Alternatively, it is possible to use implicit feedback to infer the relevance of images. Examples of implicit feedback are eye movements, mouse pointer movements, blood pressure, gestures, etc.; in other words, user responses that are implicitly related to the task performed.

In this study we explore the use of eye movements as a particular source of implicit feedback to assist a user when performing such a task (i.e. image retrieval). Eye movements can be treated as implicit relevance feedback when the user is not consciously aware of their eye movements being tracked. Eye movement as implicit feedback has recently been used in the image retrieval setting [Oyekoya and Stentiford 2007; Klami et al. 2008; Pasupa et al. 2009]. Oyekoya and Stentiford [2007] and Klami et al. [2008] used eye movements to infer a binary judgement of relevance, while Pasupa et al. [2009] make the task more complex and realistic for a search-based task by asking the user to give multiple judgements of relevance. Furthermore, earlier studies by Hardoon et al. [2007] and Ajanki et al. [2009] explored the problem where an implicit information retrieval query is inferred from eye movements measured during a reading task. The result of their empirical study is that it is possible to learn the implicit query from a small set of read documents, such that relevance predictions for a large set of unseen documents are ranked better than by random guessing. More recently, Pasupa et al. [2009] demonstrated that a ranking of images can be inferred from eye movements using the Ranking Support Vector Machine (Ranking SVM). Their experiment shows that the performance of the search can be improved when simple image features, namely histograms, are fused with the eye movement features.

Despite Pasupa et al.'s [2009] encouraging results, their proposed approach is largely unrealistic as they combine image and eye features for both training and testing, whereas in a real scenario no eye movements will be available a-priori for new images. In other words, only after the ranked images are presented to a user would one be able to measure the user's eye movements on them. Therefore, we propose a novel search methodology which combines
image features together with implicit feedback from users' eye movements during training, such that we are able to rank new images using only image features. We believe it is indeed more realistic to have images and eye movements during the training phase, as these could be acquired deliberately to train up such a system.

For this purpose, we propose using tensor kernels in the ranking SVM framework. Tensors have been used in the machine learning literature as a means of predicting edges in a protein interaction or co-complex network, by using the tensor product transformation to derive a kernel on protein pairs from a kernel on individual proteins [Ben-Hur and Noble 2005; Martin et al. 2005; Qiu and Noble 2008]. In this study we use the tensor product to construct a joint semantic space by combining eye movements and image features. Furthermore, we continue to show that the combined learnt semantic space can be efficiently decomposed into its contributing sources (i.e. images and eye movements), which in turn can be used independently.

The paper is organised as follows. In Section 2 we give a brief introduction to the ranking SVM methodology, and in Sections 3 and 4 we develop our proposed tensor ranking SVM and the efficient decomposition of the joint semantic space into the individual sources. In Section 5 we give our experimental set-up, whereas in Section 6 we discuss the feature extraction and representation of the images and eye movements. In Section 7 we bring forward our experiments on page ranking for individual users as well as a feasibility study on user generalisation. Finally, we conclude our study with a discussion of our present methodology and results in Section 8.
2 Ranking SVM

The Ranking Support Vector Machine (SVM) was proposed by Joachims [2002] and adapted from ordinal regression [Herbrich et al. 2000]. It is a pair-wise approach where the solution is a binary classification problem. Let $x_i$ denote some feature vector and let $r_i$ denote the ranking assigned to $x_i$. If $r_1 \succ r_2$, it means that $x_1$ is more relevant than $x_2$. Consider a linear ranking function,
$$x_i \succ x_j \iff \langle w, x_i \rangle - \langle w, x_j \rangle > 0,$$
where $w$ is a weight vector and $\langle \cdot, \cdot \rangle$ denotes the dot product between vectors. This can be placed in a binary SVM classification framework by letting $c_k$ be the new label indicating the quality of the $k$-th rank pair,
$$\langle w, x_i - x_j \rangle = \begin{cases} c_k = +1 & \text{if } r_i \succ r_j \\ c_k = -1 & \text{if } r_j \succ r_i \end{cases}, \qquad (1)$$
which can be solved by the following optimisation problem,
$$\min_w \; \frac{1}{2}\langle w, w \rangle + C \sum_k \xi_k \qquad (2)$$
subject to the following constraints:
$$\forall (i,j) \in r^{(k)}: \; c_k\left(\langle w, x_i - x_j \rangle + b\right) \ge 1 - \xi_k, \qquad \forall (k): \; \xi_k \ge 0,$$
where $r^{(k)} = [r_1, r_2, \ldots, r_t]$ for $t$ rank values, $C$ is a hyper-parameter which allows a trade-off between margin size and training error, and $\xi_k$ is the training error. Alternatively, we can represent the ranking SVM as a vanilla SVM where we re-represent our samples as
$$\phi(x)_k = x_i - x_j$$
with label $c_k$ and $m$ being the total number of new samples. Finally, we quote from Cristianini and Shawe-Taylor [2000] the general dual SVM optimisation as
$$\max_\alpha \; W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j c_i c_j \kappa(x_i, x_j) \qquad (3)$$
subject to $\sum_{i=1}^m \alpha_i c_i = 0$ and $\alpha_i \ge 0$, $i = 1, \ldots, m$, where we again use $c_i$ to represent the label and $\kappa(x_i, x_j)$ to be the kernel function between $\phi(x)_i$ and $\phi(x)_j$, where $\phi(\cdot)$ is a mapping from $X$ (or $Y$) to an (inner product) feature space $F$.
                           φ(x)k = xi − xj                                          i=1  βi φx (xi ) and wy = m γi φy (yi ) where β t , γ t are the
                                                                                          t                t
                                                                                                                    i=1
                                                                                                                        t




                                                                            292
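A sketch of this construction with the linear kernels used in our experiments (the variable names and the use of scikit-learn's precomputed-kernel SVC are our own assumptions, not the authors' code):

```python
import numpy as np
from sklearn.svm import SVC

# X: (n, m) image features, Y: (l, m) eye-movement features for the m
# pairwise-difference samples, c: (m,) pair labels from the ranking step.
def tensor_kernel(X, Y):
    Kx = X.T @ X                  # linear kernel on image features
    Ky = Y.T @ Y                  # linear kernel on eye-movement features
    return Kx * Ky                # element-wise product = kernel of T = X ∘ Y

def fit_tensor_svm(X, Y, c, C=1.0):
    Kbar = tensor_kernel(X, Y)
    return SVC(kernel="precomputed", C=C).fit(Kbar, c)

def decision_function(svm, X, Y, Xtest, Ytest):
    # f(x, y) = sum_i alpha_i c_i k_x(x_i, x) k_y(y_i, y); note that this still
    # needs eye data for the test samples, which motivates the decomposition below.
    sv = svm.support_
    Kx_t = X[:, sv].T @ Xtest     # (n_sv, m_test) image kernel against test points
    Ky_t = Y[:, sv].T @ Ytest     # (n_sv, m_test) eye kernel against test points
    return svm.dual_coef_.ravel() @ (Kx_t * Ky_t) + svm.intercept_
```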
4 Decomposition

The resulting decision function in equation (4) requires both image and eye movement data $(x_i, y_i)$ for training and testing. We want to be able to test our model using only the image data. Therefore, we want to decompose the weight matrix (again without accessing the feature space) into a sum of tensor products of corresponding weight components for the images and eye movements,
$$W \approx W_T = \sum_{t=1}^T w_x^t \, {w_y^t}', \qquad (5)$$
such that the weights are linear combinations of the data, i.e. $w_x^t = \sum_{i=1}^m \beta_i^t \phi_x(x_i)$ and $w_y^t = \sum_{i=1}^m \gamma_i^t \phi_y(y_i)$, where $\beta^t, \gamma^t$ are the dual variables of $w_x^t, w_y^t$. We proceed to define our decomposition procedure such that we do not need to compute the (potentially non-linear) feature projection $\phi$. We compute
$$WW' = \sum_{i,j}^m \alpha_i \alpha_j c_i c_j \kappa_y(y_i, y_j)\,\phi_x(x_i)\phi_x(x_j)' \qquad (6)$$
and are able to express $K^y = (\kappa_y(y_i, y_j))_{i,j=1}^m = \sum_{k=1}^K \lambda_k u_k u_k' = U\Lambda U'$, where $U = (u_1, \ldots, u_K)$, by performing an eigenvalue decomposition of the kernel matrix $K^y$ with entries $K^y_{ij} = \kappa_y(y_i, y_j)$. Substituting back into equation (6) gives
$$WW' = \sum_{k}^K \lambda_k \sum_{i,j}^m \alpha_i \alpha_j c_i c_j u_i^k u_j^k \,\phi_x(x_i)\phi_x(x_j)'.$$
Letting $h_k = \sum_{i=1}^m \alpha_i c_i u_i^k \phi_x(x_i)$ we have $WW' = \sum_k^K \lambda_k h_k h_k' = HH'$, where $H = (\sqrt{\lambda_1}h_1, \ldots, \sqrt{\lambda_K}h_K)$. We would like to find the singular value decomposition of $H = V\Upsilon Z'$. Taking $A = \mathrm{diag}(\alpha)$ and $C = \mathrm{diag}(c)$ we have
$$(H'H)_{k\ell} = \sqrt{\lambda_k \lambda_\ell}\sum_{ij}\alpha_i\alpha_j c_i c_j u_i^k u_j^\ell \kappa_x(x_i, x_j) = \left((CAU\Lambda^{\frac{1}{2}})'\, K^x\, (CAU\Lambda^{\frac{1}{2}})\right)_{k\ell},$$
which is computable without accessing the feature space. Performing an eigenvalue decomposition on $H'H$ we have
$$H'H = Z\Upsilon V'V\Upsilon Z' = Z\Upsilon^2 Z' \qquad (7)$$
with $\Upsilon$ a matrix with $\upsilon_t$ on the diagonal, truncated after the $J$-th eigenvalue. This gives the dual representation $v_t = \frac{1}{\upsilon_t}Hz_t$ for $t = 1, \ldots, T$, and since $H'Hz_t = \upsilon_t^2 z_t$ we are able to verify that
$$WW'v_t = HH'v_t = \frac{1}{\upsilon_t}HH'Hz_t = \upsilon_t Hz_t = \upsilon_t^2 v_t.$$
Restricting to the first $T$ singular vectors allows us to express $W \approx W_T = \sum_{t=1}^T v_t(W'v_t)'$, which in turn results in
$$w_x^t = v_t = \frac{1}{\upsilon_t}Hz_t = \sum_{i=1}^m \beta_i^t \phi_x(x_i),$$
where $\beta_i^t = \frac{1}{\upsilon_t}\alpha_i c_i \sum_{k=1}^K \sqrt{\lambda_k}\, z_k^t u_i^k$. We can now also express
$$w_y^t = W'v_t = \frac{1}{\upsilon_t}W'Hz_t = \sum_{i=1}^m \gamma_i^t \phi_y(y_i),$$
where $\gamma_i^t = \sum_{j=1}^m \alpha_i c_i \beta_j^t \kappa_x(x_i, x_j)$ are the dual variables of $w_y^t$. We are therefore able to decompose $W$ into $W_x, W_y$ without accessing the feature space, giving us the desired result.

We are now able to compute, for a given $t$, the ranking scores in linear discriminant form $s = w_x^t \phi(\hat{X}) = \sum_{i=1}^m \beta_i^t \kappa_x(x_i, \hat{X})$ for new test images $\hat{X}$. These are in turn sorted in order of magnitude (importance). Equally, we can project our data into the newly defined semantic space $\beta$, where we train and test an SVM, i.e. we compute $\tilde{\phi}(x) = K^x\beta$ for the training samples and $\tilde{\phi}(x_t) = K_t^x\beta$ for our test samples. We explore both these approaches in our experiments.
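Since $H'H = (CAU\Lambda^{1/2})'K^x(CAU\Lambda^{1/2})$, the whole decomposition only needs the two kernel matrices and the SVM dual variables. The following is a sketch of our own (not the authors' code); the signed products $\alpha_i c_i$ can be read off the trained SVM's dual coefficients, with zeros for non-support vectors:

```python
import numpy as np

def decompose(Kx, Ky, alpha, c, T=None, tol=1e-10):
    """Decompose the tensor-SVM weight matrix W into per-source dual weights.

    Kx, Ky : (m, m) linear kernels on image and eye-movement features
    alpha  : (m,) SVM dual variables (zero for non-support vectors)
    c      : (m,) pair labels in {+1, -1}
    Returns B (m, T) with columns beta^t and G (m, T) with columns gamma^t,
    ordered by decreasing singular value.
    """
    lam, U = np.linalg.eigh(Ky)                    # K^y = U diag(lam) U'
    keep = lam > tol                               # drop numerically zero eigenvalues
    lam, U = lam[keep], U[:, keep]
    M = (alpha * c)[:, None] * U * np.sqrt(lam)    # M = C A U Lambda^{1/2}
    HtH = M.T @ Kx @ M                             # H'H, no feature map needed
    ups2, Z = np.linalg.eigh(HtH)                  # H'H = Z diag(upsilon^2) Z'
    order = np.argsort(ups2)[::-1]                 # largest singular values first
    if T is not None:
        order = order[:T]
    ups = np.sqrt(np.clip(ups2[order], tol, None))
    B = (M @ Z[:, order]) / ups                    # columns are beta^t (for w_x^t)
    G = (alpha * c)[:, None] * (Kx @ B)            # columns are gamma^t (for w_y^t)
    return B, G

def rank_new_images(B, Kx_test, t=0):
    """Score new images with the t-th image weight: s = beta^t' K^x_test."""
    return B[:, t] @ Kx_test                       # Kx_test: (m_train, m_test)
```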
5 Experimental Setup

Our experimental set-up is as follows. Users are shown 10 images on a single page in a five by two (5x2) grid and are asked to rank the top five images in order of relevance to the topic of "Transport". This concept is deliberately slightly ambiguous given the context of the images that were displayed. Each displayed page contained 1–3 clearly relevant images (e.g. a freight train, cargo ship or airliner), 2–3 borderline or marginally relevant images (e.g. a bicycle or baby carrier), and the rest were non-relevant images (e.g. images of people sitting at a dining room table, or a picture of a cat).

The experiment had 30 pages in total, each showing 10 images from the PASCAL Visual Objects Challenge 2007 database [Everingham et al.]. The interface consisted of selecting radio buttons (labelled 1st to 5th under each image) and then clicking "next" to retrieve the next page. This represents data for a ranking task where explicit ranks are given to complement any implicit information contained in the eye movements. An example page is shown in figure 1.

The experiment was performed by six different users, with their eye movements recorded by a Tobii X120 eye tracker connected to a PC with a 19-inch monitor (resolution of 1280x1024). The eye tracker has approximately 0.5 degrees of accuracy with a sample rate of 120 Hz and uses infrared lenses to detect pupil centres and corneal reflection. The final data collected per user is summarised in table 1. Any pages that contained fewer than five images with gaze points (for example due to the subject moving and the eye-tracker temporarily losing track of the subject's eyes) were discarded. Hence, only 29 and 20 pages were valid for users 4 and 5, respectively.

Table 1: The data collected per user. *Pages with fewer than five images with gaze points were removed; therefore users 4 and 5 only have 29 and 20 pages viewed respectively.

    User #     Pages Viewed
    User 1     30
    User 2     30
    User 3     30
    User 4*    29
    User 5*    20
    User 6     30
5.1 Performance Measure

We use the Normalised Discounted Cumulative Gain (NDCG) [Järvelin and Kekäläinen 2000] as our performance metric, as our task involves multiple ranks rather than a binary choice. NDCG measures the usefulness, or gain, of a retrieved item based on its position in the result list. NDCG is designed for tasks with more than two levels of relevance judgement, and is defined as
$$\mathrm{NDCG}_k(r) = \frac{1}{N_n}\sum_{i=1}^k D(r_i)\,\varphi(g_i)$$
with $D(r) = \frac{1}{\log_2(1+r)}$ and $\varphi(g) = 2^g - 1$, where for a given page $r$ is the rank position, $k$ is a truncation level (position), $N_n$ is a normalising constant which makes the perfect ranking (based on $g_i$) equal to one, and $g_i$ is the categorical grade; e.g. the grade is equal to 5 for the 1st rank and 0 for the 6th.
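A direct transcription of this definition (a small sketch assuming numpy; the grade convention follows the example above):

```python
import numpy as np

def ndcg_at_k(grades_in_ranked_order, k):
    """NDCG_k as defined above: D(r) = 1/log2(1+r), phi(g) = 2^g - 1.

    grades_in_ranked_order : relevance grades g_i of the retrieved items,
        listed in the order the system ranked them
        (e.g. grade 5 for the item that should be 1st, 0 for unranked items).
    """
    g = np.asarray(grades_in_ranked_order, dtype=float)
    r = np.arange(1, len(g) + 1)                       # rank positions 1, 2, ...
    gains = (2.0 ** g - 1.0) / np.log2(1.0 + r)
    ideal = (2.0 ** np.sort(g)[::-1] - 1.0) / np.log2(1.0 + r)
    N = ideal[:k].sum()                                # normaliser: perfect ranking = 1
    return gains[:k].sum() / N if N > 0 else 0.0

# e.g. a page where the ranker swapped the two most relevant images
print(ndcg_at_k([4, 5, 3, 2, 1, 0, 0, 0, 0, 0], k=5))
```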
Figure 1: An example illustrating the layout of the interface displaying the 10 images with the overlaid eye movement measurements. The circles indicate fixations.

6 Feature extraction

In the following experiments we use standard image histograms and features collected from eye-tracking. We compute a 256-bin grey scale histogram on the whole image as the feature representation. These features are intentionally kept relatively simple. A possible extension of the current representation is to segment the image and only use regions that have gaze information; we intend to explore this extension in a future study.
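A minimal sketch of this representation (assuming numpy and Pillow are available; normalising by the pixel count is our own assumption and is not stated above):

```python
import numpy as np
from PIL import Image

def grey_histogram(path, bins=256):
    """256-bin grey-scale histogram over the whole image."""
    img = np.asarray(Image.open(path).convert("L"))       # 8-bit grey scale
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    return hist / img.size                                 # assumed normalisation
```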
The eye movement features are computed using only the eye trajectory and the locations of the images on the page. This type of feature is general-purpose and easily applicable to all application scenarios. The features are divided into two categories: the first category uses the raw measurements obtained from the eye-tracker, whereas the second category is based on fixations estimated from the raw data. A fixation is a period in which a user maintains their gaze around a given point. These are important as most visual processing happens during fixations, due to blur and saccadic suppression during the rapid saccades between fixations (see, e.g. [Hammoud 2008]). Often visual attention features are based solely on fixations and the relations between them [Rayner 1998]. However, raw measurement data might be able to overcome possible problems caused by imperfect fixation detection.

In table 2 we list the candidate features we have considered. Most of the listed features are motivated by earlier studies in text retrieval [Salojärvi et al. 2005]. The features cover the three main types of information typically considered in reading studies: fixations, regressions (fixations to previously seen images), and re-fixations (multiple fixations within the same image). However, the features have been tailored to be more suitable for images, trying to include measures for things that are not relevant for text, such as the coverage of the image. Similarly to the image features, the eye movement features are intentionally kept relatively simple, with the intent that they are more likely to generalise over different users. Fixations were detected using the standard ClearView fixation filter provided with the Tobii eye-tracking software, with settings "radius 30 pixels, minimum duration 100 ms". These are also the settings recommended for media with mixed content¹.

¹Tobii Technology, Ltd. Tobii Studio Help. url: http://studiohelp.tobii.com/StudioHelp 1.2/

Some of the features are not invariant to the location of the image on the screen. For example, the typical pattern of moving from left to right means that the horizontal co-ordinate of the first fixation for the left-most image of each row typically differs from the corresponding measure on the other images. Features that were observed to be position-dependent were normalised by removing the mean of all observations sharing the same position, and are marked in Table 2. Finally, each feature was normalised to have unit variance and zero mean. A sketch of this normalisation is given below.
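The sketch below (our own; the variable names are assumptions) removes the position-wise mean for the features marked with * in Table 2 and then standardises all features:

```python
import numpy as np

def normalise_features(F, positions, position_dependent):
    """Normalise eye-movement features as described above.

    F                  : (num_images, num_features) raw feature matrix
    positions          : (num_images,) grid position (0-9) of each image on its page
    position_dependent : column indices of the features marked with * in Table 2
    """
    F = F.astype(float).copy()
    for p in np.unique(positions):                     # remove the position-wise mean
        rows = np.where(positions == p)[0]
        block = np.ix_(rows, position_dependent)
        F[block] -= F[block].mean(axis=0)
    F -= F.mean(axis=0)                                # then zero mean ...
    std = F.std(axis=0)
    F /= np.where(std > 0, std, 1.0)                   # ... and unit variance overall
    return F
```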
Figure 2: NDCG performance for predicting random rankings.

7 Experiments

We evaluate two different scenarios for learning the ranking of images based on image and eye features: 1. predicting rankings on a page given only other data from a single specific user; 2. a global model using data from other users to predict rankings for a new unseen user.

We compare our proposed tensor Ranking SVM algorithm, which combines information from eye movements and image histogram features, to a Ranking SVM using histogram features alone and to a Ranking SVM using eye movements alone.
Table 2: We list the eye movement features considered in this study. The first 16 features are computed from the raw data, whereas the remainder are based on pre-detected fixations. We point out to the reader that features number 2 and 3 use both types of data since they are based on raw measurements not belonging to fixations. All the features are computed separately for each image. Features marked with * are normalised for each image location.

    Number   Name                  Description

    Raw data features
    1        numMeasurements       total number of measurements
    2        numOutsideFix         total number of measurements outside fixations
    3        ratioInsideOutside    percentage of measurements inside/outside fixations
    4        xSpread               difference between largest and smallest x-coordinate
    5        ySpread               difference between largest and smallest y-coordinate
    6        elongation            ySpread/xSpread
    7        speed                 average distance between two consecutive measurements
    8        coverage              number of subimages covered by measurements (1)
    9        normCoverage          coverage normalised by numMeasurements
    10*      landX                 x-coordinate of the first measurement
    11*      landY                 y-coordinate of the first measurement
    12*      exitX                 x-coordinate of the last measurement
    13*      exitY                 y-coordinate of the last measurement
    14       pupil                 maximal pupil diameter during viewing
    15*      nJumps1               number of breaks longer than 60 ms (2)
    16*      nJumps2               number of breaks longer than 600 ms (2)

    Fixation features
    17       numFix                total number of fixations
    18       meanFixLen            mean length of fixations
    19       totalFixLen           total length of fixations
    20       fixPrct               percentage of time spent in fixations
    21*      nJumpsFix             number of re-visits to the image
    22       maxAngle              maximal angle between two consecutive saccades (3)
    23*      landXFix              x-coordinate of the first fixation
    24*      landYFix              y-coordinate of the first fixation
    25*      exitXFix              x-coordinate of the last fixation
    26*      exitYFix              y-coordinate of the last fixation
    27       xSpreadFix            difference between largest and smallest x-coordinate
    28       ySpreadFix            difference between largest and smallest y-coordinate
    29       elongationFix         ySpreadFix/xSpreadFix
    30       firstFixLen           length of the first fixation
    31       firstFixNum           number of fixations during the first visit
    32       distPrev              distance to the fixation before the first
    33       durPrev               duration of the fixation before the first

    (1) The image was divided into a regular grid of 4x4 subimages.
    (2) A sequence of measurements outside the image occurring between two consecutive measurements within the image.
    (3) A transition from one fixation to another.

We emphasise that training and testing a model using only eye movements is not realistic, as there are no eye movements available a-priori for new images, i.e. one cannot test. This comparison provides us with a baseline for how much it may be possible to improve on the performance using eye movements. Furthermore, we are unable to make a direct comparison to Pasupa et al. [2009] as they had used an online learning algorithm with different image features.

In the experiments we use a linear kernel function, although it is possible to use a non-linear kernel on the eye movement features, as this would not affect the decomposition for the image weights (assuming that $\phi_x(x_i)$ are taken as the image features in equation (6)). In figure 2 we give the NDCG performance for predicting random rankings.

7.1 Page Generalisation

In the following section we focus on predicting rankings on a page given only other data from a single specific user (we repeat this for all users). We employ a leave-page-out routine where at each iteration a page, from a given user, is withheld for testing and the remaining pages, from the same user, are used for training.

We evaluate the proposed approach with the following four settings (a scoring sketch follows the list):

  • T1: using the largest component of the tensor decomposition in the form of a linear discriminant. We use the weight vector corresponding to the largest eigenvalue (as we have t weights).

  • T2: we project the image features into the learnt semantic space (i.e. the decomposition on the image source) and train and test a secondary Ranking SVM within the projected space.

  • T1_all: similar to T1, although here we use all t weight vectors and take the mean value across them as the final score.

  • T1_opt: similar to T1, although here we use the n largest components of the decomposition, i.e. we select n weight vectors and take the mean value across them as the final score.

We use leave-one-out cross-validation for T1_opt to obtain the optimal model for the latter case, selected based on the maximum average NDCG across 10 positions.
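To make these settings concrete, the following sketch (our own naming, reusing the matrix B of dual weights beta^t from the decomposition sketch in Section 4) shows the T1-style scores and the T2 projection:

```python
import numpy as np

def score_T1(B, Kx_test, n=1):
    """T1 / T1_all / T1_opt: mean ranking score over the n largest components.

    B       : (m_train, T) matrix of beta^t, columns ordered by decreasing singular value
    Kx_test : (m_train, m_test) image kernel between training and test images
    n = 1 gives T1, n = B.shape[1] gives T1_all, and n chosen by leave-one-out
    cross-validation on NDCG gives T1_opt.
    """
    return (B[:, :n].T @ Kx_test).mean(axis=0)

def project_T2(B, Kx_train, Kx_test):
    """T2: project image features into the learnt semantic space, phi~(x) = K^x B,
    then train and test a secondary Ranking SVM in that projected space."""
    return Kx_train @ B, Kx_test.T @ B
```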
Figure 4: Average NDCG performance across all users for predicting rankings on a page given only other data from a single specific user.

We plot the user-specific leave-page-out NDCG performances in figure 3, where we are able to observe that T2 consistently outperforms the image feature Ranking SVM across all users, demonstrating that it is indeed possible to improve on the image ranking by incorporating eye movement features during training. Furthermore, it is interesting to observe that for certain users T1_opt improves on the ranking performance, suggesting that there is an optimal combination of the decomposed features that may further improve on the results.

In figure 4 we plot the average performance across all users. The figure shows that T1 and T1_all are slightly worse than using the image histogram alone. However, when the number of largest components in the tensor decomposition is selected using cross-validation, the performance of the classifier is improved and outperforms the Ranking SVM with eye movements. Furthermore, we are able to observe that we perform better than random (figure 2). Using classifier T2, the performance is improved above the Ranking SVM with image features and it is competitive with the Ranking SVM with eye movement features.

7.2 User Generalisation

In the following section we focus on learning a global model using data from other users to predict rankings for a new unseen user. As the experiment is set up such that all users view the same pages, we employ a leave-user-leave-page-out routine (a Python-style sketch follows the pseudocode), i.e.:

    For all users
        Withhold data from user i
        For all pages
            Withhold page j from all users
            Train on all pages except j from all users except i
            Test on page j from user i
        Endfor
    Endfor
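A sketch of this routine (our own; train_fn and test_fn are hypothetical callables standing in for the tensor training and NDCG evaluation described above):

```python
import numpy as np

def leave_user_leave_page_out(pages_by_user, train_fn, test_fn):
    """Leave-user-leave-page-out cross-validation.

    pages_by_user : dict user -> list of page datasets (all users see the same pages)
    train_fn      : callable(list of training pages) -> model
    test_fn       : callable(model, held-out page) -> NDCG scores across positions
    """
    results = {}
    users = list(pages_by_user)
    n_pages = len(next(iter(pages_by_user.values())))
    for u in users:                                   # withhold data from user u
        scores = []
        for j in range(n_pages):                      # withhold page j from all users
            train = [p for v in users if v != u
                     for k, p in enumerate(pages_by_user[v]) if k != j]
            model = train_fn(train)
            scores.append(test_fn(model, pages_by_user[u][j]))
        results[u] = np.mean(scores, axis=0)          # average NDCG for the unseen user
    return results
```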
                                                                             possible to improve in the single subject setting, we then proceeded
Therefore we only use the users from table 1 who viewed the same number of pages, i.e. users 1, 2, 3 and 6, which we refer to henceforth as users 1-4.

We evaluate the proposed approach with the following two settings:

  • T1: using the largest component of the tensor decomposition in the form of a linear discriminant. We use the weight vector corresponding to the largest eigenvalue (as we have t weights).

  • T2: we project the image features into the learnt semantic space (i.e. the decomposition on the image source) and train and test a secondary Ranking SVM within the projected space.

We plot in figure 5 the resulting NDCG performance for the leave-user-out routine. We are able to observe, with the exception of user 2 in figure 5(b), that T2 is able to outperform the Ranking SVM on image features, indicating that it is possible to generalise our proposed approach across new unseen users. Furthermore, it is interesting to observe that T2 achieves a similar performance to that of a Ranking SVM trained and tested on the eye features. Finally, even though we do not improve when testing on data from user 2, we are able to observe that we perform as well as the baselines. In figure 5(e) we plot the average NDCG performance for the leave-user-out routine, demonstrating that on average we improve on the ranking of new images for new users and that we perform better than random (figure 2).

8 Discussion

Improving search and content-based retrieval systems with implicit feedback is an attractive possibility, given that a user is not required to explicitly provide information in order to improve, and personalise, their search strategy. This, in turn, can render such a system more user-friendly and simple to use (at least from the users' perspective). However, achieving such a goal is non-trivial, as one needs to be able to combine the implicit feedback information into the search system in a manner that does not then require the implicit information for testing. In our study we focus on implicit feedback in the form of eye movements, as these are easily available and can be measured in a non-intrusive manner.

Previous studies [Hardoon et al. 2007; Ajanki et al. 2009] have shown the feasibility of such systems using eye movements for a textual search task, demonstrating that it is indeed possible to 'enrich' a textual search with eye features. Their proposed approach is computationally complex since it requires the construction of a regression function on eye measurements for each word, which was not realistic in our setting. Furthermore, Pasupa et al. [2009] extended the underlying methodology of using eye movements as implicit feedback to an image retrieval system, combining eye movements with image features to improve the ranking of retrieved images. Still, the proposed approach required eye features for the test images, which would not be practical in a real system.

In this paper we present a novel search strategy for combining eye movements and image features with a tensor product kernel used in a ranking support vector machine framework. We continue to show that the jointly learnt semantic space of eye and image features can be efficiently decomposed into its independent sources, allowing us to further test or train using only images. We explored two different search scenarios for learning the ranking of images based on image and eye features. The first was predicting rankings on a page given only other data from a single specific user. This experiment was to test the fundamental question of whether eye movements are able to improve ranking for a user. Having demonstrated that it was indeed possible to improve in the single-subject setting, we then proceeded to our second setting, where we constructed a global model across users in an attempt to generalise to data from a new user. Again, our results demonstrated that we are able to generalise our model to new users. Despite these promising results, it was also clear that using a single direction (weight vector) does not necessarily improve on the baseline result, motivating the need for a more sophisticated combination of the resulting weights. This, as well as extending our experiment to a much larger number of users, will be addressed in a future study. Finally, we would also explore the notion of image
[Figure 3 plots: panels (a)-(f), one per user (users 1-6), each showing NDCG_k on the vertical axis against position 1-10 on the horizontal axis for the Images, Eyes, T1, T2, T1all and T1opt rankings.]


Figure 3: In sub-figures 3(a)-3(f) we illustrate the NDCG performance for each user in a leave-page-out routine, i.e. here we aim to generalise over new pages rather than new users. We observe that T2 and T1opt routinely outperform the ranking that uses only image features. The 'Eyes' plot in all the figures demonstrates how the ranking (using only eye movements) would perform if eye features were indeed available a-priori for new images.
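For reference, the quantity plotted on the vertical axes can be reproduced with the short sketch below. It follows the NDCG_k definition used in the paper, with discount D(r) = 1/log2(1 + r) and gain phi(g) = 2^g - 1, normalised by the ideal ordering; the example grade vector (5 down to 1 for the five explicitly ranked images on a page, 0 otherwise) is purely illustrative.

import numpy as np

def ndcg_at_k(grades, k):
    # grades: relevance grades of the images in the order the system ranked them.
    g = np.asarray(grades, dtype=float)[:k]
    discount = 1.0 / np.log2(1.0 + np.arange(1, g.size + 1))    # D(r) = 1 / log2(1 + r)
    gain = 2.0 ** g - 1.0                                       # phi(g) = 2^g - 1
    ideal = np.sort(np.asarray(grades, dtype=float))[::-1][:k]  # perfect ordering for the normaliser
    idcg = np.sum((2.0 ** ideal - 1.0) * discount)
    return float(np.sum(gain * discount) / idcg) if idcg > 0 else 0.0

grades_in_predicted_order = [3, 5, 0, 4, 1, 0, 2, 0, 0, 0]      # one page of 10 images (illustrative)
print([round(ndcg_at_k(grades_in_predicted_order, k), 3) for k in range(1, 11)])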


Acknowledgements

The authors would like to acknowledge financial support from the European Community's Seventh Framework Programme (FP7/2007–2013) under grant agreement n° 216529, Personal Information Navigator Adapting Through Viewing (PinView) project (http://www.pinview.eu). The authors would also like to thank Craig Saunders for data collection.

References

Ajanki, A., Hardoon, D. R., Kaski, S., Puolamäki, K., and Shawe-Taylor, J. 2009. Can eyes reveal interest? Implicit queries from gaze patterns. User Modeling and User-Adapted Interaction 19, 4, 307–339.

Ben-Hur, A., and Noble, W. S. 2005. Kernel methods for predicting protein-protein interactions. Bioinformatics 21, i38–i46.

Cristianini, N., and Shawe-Taylor, J. 2000. An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press.

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

Hammoud, R. 2008. Passive Eye Monitoring: Algorithms, Applications and Experiments. Springer-Verlag.

Hardoon, D. R., Ajanki, A., Puolamäki, K., Shawe-Taylor, J., and Kaski, S. 2007. Information retrieval by inferring implicit queries from eye movements. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS), electronic proceedings at www.stat.umn.edu/~aistat/proceedings/start.htm.

Herbrich, R., Graepel, T., and Obermayer, K. 2000. Large margin rank boundaries for ordinal regression. MIT Press, Cambridge, MA.

Järvelin, K., and Kekäläinen, J. 2000. IR evaluation methods for retrieving highly relevant documents. In SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, ACM, New York, NY, USA, 41–48.


[Figure 5 plots: panels (a)-(d) show NDCG_k against position 1-10 for each withheld user (trained on the remaining three users), and panel (e) shows the average over users; curves for the Images, Eyes, T1 and T2 rankings.]


Figure 5: In sub-figures 5(a)-5(d) we illustrate the NDCG performance in a leave-user-out (leave-page-out) routine. The average NDCG performance is given in sub-figure 5(e), where we observe that T2 outperforms the ranking that uses only image features. The 'Eyes' plot in all the figures demonstrates how the ranking (using only eye movements) would perform if eye features were indeed available a-priori for new images.


Joachims, T. 2002. Optimizing search engines using clickthrough data. In KDD '02: Proceedings of the 8th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM Press, New York, NY, USA, 133–142.

Klami, A., Saunders, C., de Campos, T. E., and Kaski, S. 2008. Can relevance of images be inferred from eye movements? In MIR '08: Proceedings of the 1st ACM international conference on Multimedia information retrieval, ACM, New York, NY, USA, 134–140.

Laaksonen, J., Koskela, M., Laakso, S., and Oja, E. 2000. PicSOM – content-based image retrieval with self-organizing maps. Pattern Recognition Letters 21, 13–14, 1199–1207.

Martin, S., Roe, D., and Faulon, J.-L. 2005. Predicting protein-protein interactions using signature products. Bioinformatics 21, 218–226.

Oyekoya, O., and Stentiford, F. 2007. Perceptual image retrieval using eye movements. International Journal of Computer Mathematics 84, 9, 1379–1391.

Pasupa, K., Saunders, C., Szedmak, S., Klami, A., Kaski, S., and Gunn, S. 2009. Learning to rank images from eye movements. In HCI '09: Proceedings of the IEEE 12th International Conference on Computer Vision Workshops on Human-Computer Interaction, 2009–2016.

Pulmannová, S. 2004. Tensor products of Hilbert space effect algebras. Reports on Mathematical Physics 53, 2, 301–316.

Qiu, J., and Noble, W. S. 2008. Predicting co-complexed protein pairs from heterogeneous data. PLoS Computational Biology 4, 4, e1000054.

Rayner, K. 1998. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin 124, 3 (November), 372–422.

Salojärvi, J., Puolamäki, K., Simola, J., Kovanen, L., Kojo, I., and Kaski, S. 2005. Inferring relevance from eye movements: Feature extraction. Tech. Rep. A82, Computer and Information Science, Helsinki University of Technology.




Hardoon Image Ranking With Implicit Feedback From Eye Movements

  • 1. Image Ranking with Implicit Feedback from Eye Movements David R. Hardoon∗ Kitsuchart Pasupa† Data Mining Department School of Electronics & Computer Science Institute for Infocomm Research (I2 R) University of Southampton Abstract dure. Moreover, it is far from an ideal situation as both formulating an initial query and navigating the large number of retrieved hits In order to help users navigate an image search system, one could is a difficult. One image retrieval methodology which attempts to provide explicit information on a small set of images as to which address these issues, and has been a research topic since the early of them are relevant or not to their task. These rankings are learned 1990’s, is the so-called “Content-Based Image Retrieval”(CBIR). in order to present a user with a new set of images that are rel- The search of a CBIR system is analysed from the actual content evant to their task. Requiring such explicit information may not of the image which may includes colour, shape, and texture rather be feasible in a number of cases, we consider the setting where than using a textual annotation associated (if at all) with the image. the user provides implicit feedback, eye movements, to assist when performing such a task. This paper explores the idea of implic- Relevance feedback, which is explicitly provided by the user while itly incorporating eye movement features in an image ranking task performing a search query on the quality of the retrieved images, where only images are available during testing. Previous work had has shown to be able to improve on the performance of CBIR sys- demonstrated that combining eye movement and image features im- tems, as it is able to handle the large variability in semantic in- proved on the retrieval accuracy when compared to using each of terpretation of images across users. Relevance feedback will iter- the sources independently. Despite these encouraging results the atively guide the system to retrieve images the user is genuinely proposed approach is unrealistic as no eye movements will be pre- interested in. Many systems rely on an explicit feedback mech- sented a-priori for new images (i.e. only after the ranked images are anism, where the user explicitly indicates which images are rele- presented would one be able to measure a user’s eye movements vant for their search query and which ones are not. One can then on them). We propose a novel search methodology which com- use a machine learning algorithm to try and present a new set of bines image features together with implicit feedback from users’ images to the users which are more relevant - thus helping them eye movements in a tensor ranking Support Vector Machine and navigate the large number of hits. An example of such systems show that it is possible to extract the individual source-specific is PicSOM [Laaksonen et al. 2000]. However, providing explicit weight vectors. Furthermore, we demonstrate that the decomposed feedback is also a laborious process as it requires continues user re- image weight vector is able to construct a new image-based seman- sponse. Alternatively, it is possible to use implicit feedback to infer tic space that outperforms the retrieval accuracy than when solely relevance of images. Examples of implicit feedback are eye move- using the image-features. ments, mouse pointer movements, blood pressure, gestures, etc. In other words, user responses that are implicitly related to the task performed. 
CR Categories: G.3 [Probability and Statistics]: Multivariate Statistics—; H.3.3 [Information Search and Retrieval]: Retrieval In this study we explore the use of eye movements as a particu- models—Relevance feedback lar source of implicit feedback to assist a user when performing such a task (i.e. image retrieval). Eye movements can be treated Keywords: Image Retrieval, Implicit Feedback, Tensor, Ranking, as an implicit relevance feedback when the user is not consciously Support Vector Machine aware of their eye movements being tracked. Eye movement as implicit feedback has recently been used in the image retrieval set- ting [Oyekoya and Stentiford 2007; Klami et al. 2008; Pasupa et al. 1 Introduction 2009]. [Oyekoya and Stentiford 2007; Klami et al. 2008] used eye movements to infer a binary judgement of relevance while [Pasupa In recent years large digital image collections have been created in et al. 2009] makes the task more complex and realistic for search- numerous areas, examples of these include the commercial, aca- based task by asking the user to give multiple judgement of rele- demic, and medical domains. Furthermore, these databases also in- vance. Furthermore, earlier studies of Hardoon et al. [2007] and clude the digitisation of analogue photographs, paintings and draw- Ajanki et al. [2009] explored the problem of where an implicit in- ings. Conventionally, the images collected are manually tagged formation retrieval query is inferred from eye movements measured with various descriptors to allow retrieval to be performed over the during a reading task. The result of their empirical study is that it annotated words. However, the process of manually tagging images is possible to learn the implicit query from a small set of read doc- is an extremely laborious, time consuming and an expensive proce- uments, such that relevance predictions for a large set of unseen ∗ e-mail: documents are ranked better than by random guessing. More re- drhardoon@i2r.a-star.edu.sg † e-mail: cently, Pasupa et al. [2009] demonstrated that ranking of images kp2@ecs.soton.ac.uk can be inferred from eye movements using Ranking Support Vector Machine (Ranking SVM). Their experiment shows that the perfor- mance of the search can be improved when simple images features namely histograms are fused with the eye movement features. Copyright © 2010 by the Association for Computing Machinery, Inc. Despite Pasupa et al.’s [2009] encouraging results, their proposed Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed approach is largely unrealistic as they combine image and eye fea- for commercial advantage and that copies bear this notice and the full citation on the tures for both training and testing. Whereas in a real scenario no first page. Copyrights for components of this work owned by others than ACM must be eye movements will be presented a-priori for new images. In other honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on words, only after the ranked images are presented to a user, would servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail one be able to measure the users’ eye movements on them. There- permissions@acm.org. fore, we propose a novel search methodology which combines im- ETRA 2010, Austin, TX, March 22 – 24, 2010. 
© 2010 ACM 978-1-60558-994-7/10/0003 $10.00 291
  • 2. age features together with implicit feedback from users’ eye move- with label ck and m being the total number of new samples. Fi- ments during training, such that we are able to rank new images nally, we quote from Cristianini and Shawe-Taylor [2000] the gen- with only using image features. We believe it is indeed more real- eral dual SVM optimisation as istic to have images and eye-movements during the training phase m m as these could be acquired deliberately to train up such a system. 1 max W (α) = αi − αi αj ci cj κ(xi , xj ) (3) α 2 For this purpose, we propose using tensor kernels in the ranking i=1 i,j=1 SVM framework. Tensors have been used in the machine learning m literature as a means of predicting edges in a protein interaction or subject to i=1 αi ci = 0 and αi ≥ 0 i = 1, . . . , m, co-complex network by using the tensor product transformation to where we again use ci to represent the label and κ(xi , xj ) to be the derive a kernel on protein pairs from a kernel on individual pro- kernel function between φ(x)i and φ(x)j , where φ(·) is a mapping teins [Ben-Hur and Noble 2005; Martin et al. 2005; Qiu and No- from X (or Y ) to an (inner product) feature space F . ble 2008]. In this study we use the tensor product to constructed a joined semantics space by combining eye movements and im- age features. Furthermore, we continue to show that the combined 3 Tensor Ranking SVM learnt semantic space can be efficiently decomposed into its con- tributing sources (i.e. images and eye movements), which in turn In the following section we propose to construct a tensor kernel can be used independently. on the ranked image and eye movements features, i.e. following equation (1), to then to train an SVM. Therefore, let X ∈ Rn×m The paper is organised as follows. In Section 2 we give a brief intro- and Y ∈ Rℓ×m be the matrix of sample vectors, x and y, for the duction to the ranking SVM methodology and continue to develop image and eye movements respectively, where n is the number of in Section 3 our proposed tensor ranking SVM and the efficient de- image features and ℓ is the number of eye movement features and composition of the joint semantic space into the individual sources. m are the total number of samples. We continue to define K x , K y In Section 5 we give our experimental set up whereas in Section 6 as the kernel matrices for the ranked images and eye movements we discusses the feature extraction and representation of the images respectively. In our experiments we use linear kernels, i.e. K x = and eye movements. In section 7 we bring forward our experiments X ′ X and K y = Y ′ Y . The resulting kernel matrix of the tensor on page ranking for individual users as well as a feasibility study on T = X ◦Y can be expressed as pair-wise product (see [Pulmannov´ a user generalisation. Finally, we conclude our study with discussion 2004] for more details) on our present methodology and results in Section 8. ¯ Kij = (T ′ T )ij = Kij Kij . x y 2 Ranking SVM ¯ We use K in conjunction with the vanilla SVM formulation as given in equation (3). Whereas the set up and training are straight The Ranking Support Vector Machine (SVM) was proposed by forward, the underlying problem is that for testing we do not have Joachims [2002] which was adapted from ordinal regression [Her- the eye movements. Therefore we propose to decompose the re- brich et al. 2000]. 
It is a pair-wise approach where the solution is sulting weight matrix from its corresponding image and eye com- a binary classification problem. Let xi denote some feature vector ponents such that each can be used independently. and let ri denote the ranking assigned to xi . If r1 ≻ r2 , it means that x1 is more relevance than x2 . Consider a linear ranking func- The goal is to decompose the weight matrix W given by a dual tion, representation xi ≻ xj ⇐⇒ w, xi − w, xj > 0, m where w is a weight vector and ·, · denotes dot product between W = αi ci φx (xi ) ◦ φy (yi ) i vectors. This can be placed in a binary SVM classification frame- work where let ck be the new label indicating the quality of kth without accessing the feature space. Given the paired samples x, y rank pair, the decision function in equation is w, xi − xj = ck = +1 if ri ≻ rj , (1) f (x, y) = W ◦ φx (x)φy (y)′ (4) ck = −1 if rj ≻ ri m = αi ci κx (xi , x)κy (yi , y). which can be solved by the following optimisation problem, i=1 1 min w, w + C ξk (2) 4 Decomposition 2 k The resulting decision function in equation (4) requires both image subject to the following constrains: and eye movement (xi , yi ) data for training and testing. We want to be able to test our model only using the image data. Therefore, ∀(i, j) ∈ r (k) : ck ( w, xi − xj + b) ≥ 1 − ξk we want to decompose the weight matrix (again without accessing ∀(k) : ξk ≥ 0 the feature space) into a sum of tensor products of corresponding weight components for the images and eye movements where r (k) = [r1 , r2 , . . . , rt ] for t rank values, furthermore C is T a hyper-parameter which allows trade-off between margin size and ′ W ≈ WT = t t wx wy , (5) training error, and ξk is training error. Alternatively, we are repre- t=1 sent the ranking SVM as a vanilla SVM where we re-represent our samples as t such that the weights are a linear combination of the data, i.e. wx = m φ(x)k = xi − xj i=1 βi φx (xi ) and wy = m γi φy (yi ) where β t , γ t are the t t i=1 t 292
  • 3. t t dual variables of wx , wy . We proceed to define our decomposition 5 Experimental Setup procedure such that we do not need to compute the (potentially non- linear) feature projection φ. We compute Our experimental set-up is as follows: Users are shown 10 images m on a single page as a five by two (5x2) grid and are asked to rank WW′ = αi αj ci cj κy (yi , yj )φx (xi )φx (xj )′ (6) the top five images in order of relevance to the topic of “Transport”. i,j This concept is deliberately slightly ambiguous given the context of images that were displayed. Each displayed page contained 1–3 and are able to express K y = (κy (yi , yj ))m = clearly relevant images (e.g. a freight train, cargo ship or airliner), i,j=1 K k′ 2–3 either borderline or marginally relevant images (e.g. bicycle or k=1 λk uk u = U ΛU ′ , where U = (u1 , . . . , uK ) by perform- baby carrier), and the rest are non-relevant images (e.g. images of ing an eigenvalue decomposition of the kernel matrix K y with en- people sitting at a dining room table, or a picture of a cat). y tries Kij = κy (yi , yj ). Substituting back into equation (6) gives The experiment had 30 pages in total, each showing 10 images from K m the PASCAL Visual Objects Challenge 2007 database [Everingham ′ ′ WW = λk αi αj ci cj uk uk φx (xi )φx (xj )′ . i j et al. ]. The interface consisted of selecting radio buttons (labelled k i,j 1st to 5th under each image) then clicking on next to retrieve the next page. This represents data for a ranking task where explicit m k ′ Letting hk = i=1 αi ci ui φx (xi ) we have W W = ranks are given to compliment any implicit information contained in K √ √ ′ ′ λk hk hk = HH where H = λ1 h1 , . . . , λK hK . We the eye movements. An example of each page is shown in figure 1. k would like to find the singular value decomposition of H = V ΥZ ′ . Consider for A = diag(α) and C = diag(c) we have The experiment was performed by six different users, with their eye movements recorded by a Tobii X120 eye tracker which was con- nected to a PC using a 19-inch monitor (resolution of 1280x1024). H ′H kℓ = λk λℓ αi αj ci cj uk uℓ κx (xi , xj ) i j The eye tracker has approximately 0.5 degrees of accuracy with a ij sample rate of 120 Hz and used infrared lens to detect pupil centres 1 ′ 1 = CAU Λ 2 K x CAU Λ 2 , and corneal reflection. The final data collected per user is illus- kℓ trated in table 1. Any pages that contained less than five images with gaze points (for example due to the subject moving and the which is computable without accessing the feature space. Perform- eye-tracker temporarily losing track of the subject’s eyes) were dis- ing an eigenvalue decomposition on H ′ H we have carded. Hence, only 29 and 20 pages were valid for users 4 and 5, respectively. H ′H = ZΥV ′ V ΥZ ′ = ZΥ2 Z ′ (7) with Υ a matrix with υt on the diagonal truncated after the J’th Table 1: The data collected per user. ∗ Pages with less than five 1 eigenvalue, which gives the dual representation of vt = υt Hzt for images with gaze points were removed. Therefore users 4 and 5 ′ 2 only have 29 and 20 pages viewed respectively. t = 1, . . . , T , and since H Hzt = υt zt we are able to verify that 1 User # Pages Viewed 2 W W ′ vt = HH ′ vt = HH ′ Hzt = υt Hzt = υt vt . 
User 1 30 υt User 2 30 User 3 30 Restricting to the first T singular vectors allows us to express W ≈ User 4∗ 29 ′ W T = T vt (W ′ vt ) , which in turn results in User 5∗ 20 t=1 User 6 30 m t 1 t wx = vt = Hzt = βi φx (xi ), υt i=1 5.1 Performance Measure t 1 T √ where βi = αc υt i i k=1 λk zt uk k i . We can now also express We use the Normalised Discount Cumulative Gain m (NDCG) [J¨ rvelin and Kek¨ l¨ inen 2000] as our performance a aa 1 ′ metric, due to our task involving multiple ranks rather than a binary wy = W ′ vt = t W Hzt = t γi φy (yi ), υt i=1 choice. NDCG measures the usefulness, or gain, of a retrieved item based on its position in the result list. NDCG is designed for where γi = m αi ci βj κx (xi , xj ) are the dual variables of wy . t t t tasks which have more than two levels of relevance judgement, and j=1 We are therefore now able to decompose W into Wx , Wy without is defined as, accessing the feature space giving us the desired result. k 1 We are now able to compute, for a given t, the ranking scores NDCGk (r) = D(ri )ϕ(gi ) t ˆ Nn in the linear discriminant analysis form s = wx φ(X) = i=1 m t ˆ ˆ βi κx (xi , X) for new test images X. These are in turn sorted i=1 1 in order of magnitude (importance). Equally, we can project our with D(r) = log (1+r) and ϕ(g) = 2g − 1, where for a given 2 data into the new defined semantic space β where we train and test page r is rank position and k is a truncation level (position), N is a ˜ an SVM. i.e. we compute φ(x) = K x β, for the training samples, normalising constant which gives the perfect ranking (based on gi ) ˜ x and φ(xt ) = Kt β for our test samples. We explore both these equal to one, and gi is the categorical grade; e.g. grade is equal to 5 approaches in our experiments. for the 1st rank and 0 for the 6th . 293
  • 4. Figure 1: An example illustrating the outlay of the interface displaying the 10 images with the overlaid eye movement measurements. The circles indicate fixations. 6 Feature extraction 0.65 Random Performance 0.6 0.55 0.5 In the following experiments we use standard image histograms and 0.45 NDCGk features collected from eye-tracking. We compute a 256-bin grey 0.4 scale histogram on the whole image as the feature representation. 0.35 0.3 These features are intentionally kept relatively simple. Although, 0.25 a possible extension of the current representation is to segment the 0.2 image and only use regions that have gaze information. We intend 1 2 3 4 5 6 7 8 9 10 Positions to explore this extension in a future study. Figure 2: NDCG performance for predicting random rankings. The eye movement features are computed using only on the eye trajectory and locations of the images in the page. This type of features are general-purpose and are easily applicable to all appli- cation scenarios. The features are divided into two categories; the mended for media with mixed content1 . first category uses the raw measurements obtained from the eye- Some of the features are not invariant to the location of the image tracker, whereas the second category is based on fixations estimated on the screen. For example, the typical pattern of moving from from the raw data. A fixation means a period in which a user main- left to right means that the horizontal co-ordinate of the first fixa- tains their gaze around a given point. These are important as most tion for the left-most image of each row typically differs from the visual processing happens during fixations, due to blur and sac- corresponding measure on the other images. Features that were ob- cadic suppression during the rapid saccades between fixations (see, served to be position-dependent were normalised by removing the e.g. [Hammoud 2008]). Often visual attention features are based mean of all observations sharing the same position, and are marked solely on fixations and the relation between them [Rayner 1998]. in Table 2. Finally, each feature was normalised to have unit vari- However, raw measurement data might be able to overcome possi- ance and zero mean. ble problems caused by imperfect fixation detection. 7 Experiments In table 2 we list the candidate features we have considered. Most of the listed features are motivated by earlier studies in text retrieval We evaluate two different scenarios for learning the ranking of im- [Saloj¨ rvi et al. 2005]. The features cover the three main types a age based on image and eye features; 1. Predicting rankings on a of information typically considered in reading studies: fixations, page given only other data from a single specific user. 2. A global regressions (fixations to previously seen images), and re-fixations model using data from other users to predict rankings for a new (multiple fixations within the same image). However, the features unseen user. have been tailored to be more suitable for images, trying to include measures for things that are not relevant for text, such as the cover We compare our proposed tensor Ranking SVM algorithm which of the image. Similarly to the image features, the eye movement combines both information from eye movements and image his- features are intentionally kept relatively simple with the intent that togram features to a Ranking SVM using histogram features and they are more likely to generalise over different users. 
Fixations to a Ranking SVM using eye movements alone. We emphasis that were detected using the standard ClearView fixation filter provided with the Tobii eye-tracking software, with settings “radius 30 pix- 1 Tobii Technology, Ltd. Tobii Studio Help. url: els, minimum duration 100 ms”. These are also the settings recom- http://studiohelp.tobii.com/StudioHelp 1.2/ 294
  • 5. Table 2: We list the eye movement features considered in this study. The first 16 features are computed from the raw data, whereas the remainder are based on pre-detected fixations. We point out to the reader that features number 2 and 3 use both types of data since they are based on raw measurements not belonging to fixations. All the features are computed separately for each image. Features marked with ∗ are normalised for each image location. Number Name Description Raw data features 1 numMeasurements total number of measurements 2 numOutsideFix total number of measurements outside fixations 3 ratioInsideOutside percentage of measurements inside/outside fixations 4 xSpread difference between largest and smallest x-coordinate 5 ySpread difference between largest and smallest y-coordinate 6 elongation ySpread/xSpread 7 speed average distance between two consecutive measurements 8 coverage number of subimages covered by measurements1 9 normCoverage coverage normalized by numMeasurements 10∗ landX x-coordinate of the first measurement 11∗ landY y-coordinate of the first measurement 12∗ exitX x-coordinate of the last measurement 13∗ exitY y-coordinate of the last measurement 14 pupil maximal pupil diameter during viewing 15∗ nJumps1 number of breaks longer than 60 ms2 16∗ nJumps2 number of breaks longer than 600 ms2 Fixation features 17 numFix total number of fixations 18 meanFixLen mean length of fixations 19 totalFixLen total length of fixations 20 fixPrct percentage of time spent in fixations 21∗ nJumpsFix number of re-visits to the image 22 maxAngle maximal angle between two consecutive saccades3 23∗ landXFix x-coordinate of the first fixation 24∗ landYFix y-coordinate of the first fixation 25∗ exitXFix x-coordinate of the last fixation 26∗ exitYFix y-coordinate of the last fixation 27 xSpreadFix difference between largest and smallest x-coordinate 28 ySpreadFix difference between largest and smallest y-coordinate 29 elongationFix ySpreadFix/xSpreadFix 30 firstFixLen length of the first fixation 31 firstFixNum number of fixations during the first visit 32 distPrev distance to the fixation before the first 33 durPrev duration of the fixation before the first 1 The image was divided into a regular grid of 4x4 subimages. 2 A sequence of measurements outside the image occurring between two consecutive measure- ments within the image. 3 A transition from one fixation to another. training and testing a model using only eye movements is not re- remaining pages, from the same user, are used for training. alistic as there are no eye movements presented a-priori for new images, i.e. one can not test. This comparison provides us with We evaluate the proposed approach with the following four setting: a baseline as to how much it may be possible to improve on the • T 1: using the largest component of tensor decomposition in performance using eye movements. Furthermore, we are unable to the form of a linear discriminator. We use the weight vec- make direct comparison to [2009] as they had used an online learn- tor corresponding to the largest eigenvalue (as we have a t ing algorithm with different image features. weights). In the experiments we use a linear kernel function. Although, it is • T 2: we project the image features into the learnt semantic possible to use a non-linear kernel on the eye movement features as space (i.e. 
to a Ranking SVM using eye movements alone. We emphasise that training and testing a model using only eye movements is not realistic, as there are no eye movements presented a-priori for new images, i.e. one cannot test. This comparison provides us with a baseline as to how much it may be possible to improve on the performance using eye movements. Furthermore, we are unable to make a direct comparison to Pasupa et al. [2009] as they used an online learning algorithm with different image features.

In the experiments we use a linear kernel function, although it is possible to use a non-linear kernel on the eye movement features, as this would not affect the decomposition for the image weights (assuming that φx(xi) are taken as the image features in equation (6)). In figure 2 we give the NDCG performance for predicting a random ranking.

7.1 Page Generalisation

In the following section we focus on predicting rankings on a page given only other data from a single specific user (we repeat this for all users). We employ a leave-page-out routine where at each iteration a page, from a given user, is withheld for testing and the remaining pages, from the same user, are used for training.

We evaluate the proposed approach with the following four settings:

• T1: using the largest component of the tensor decomposition in the form of a linear discriminator. We use the weight vector corresponding to the largest eigenvalue (as we have t weights).

• T2: we project the image features into the learnt semantic space (i.e. the decomposition on the image source) and train and test within the projected space a secondary Ranking SVM.

• T1all: similar to T1, although here we use all t weight vectors and take the mean value across them as the final score.

• T1opt: similar to T1, although here we use the n largest components of the decomposition, i.e. we select n weight vectors and take the mean value across them as the final score.

We use leave-one-out cross-validation for T1opt to obtain the optimal model for the latter case, selected based on the maximum average NDCG across 10 positions.
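To make the T1 family of settings concrete, the following sketch scores images with the image-side weight vectors recovered from the decomposition, together with a common linear-gain variant of NDCG at position k (the paper does not spell out which NDCG variant it uses). The array shapes, variable names and the assumption that the weight columns are already sorted by eigenvalue are our own.

    import numpy as np

    def ndcg_at_k(relevance_in_predicted_order, k=10):
        """NDCG@k for one page; relevance labels ordered by the predicted ranking."""
        rel = np.asarray(relevance_in_predicted_order, dtype=float)[:k]
        disc = 1.0 / np.log2(np.arange(2, rel.size + 2))
        ideal = np.sort(np.asarray(relevance_in_predicted_order, dtype=float))[::-1][:k]
        idcg = float(np.sum(ideal * disc))
        return float(np.sum(rel * disc)) / idcg if idcg > 0 else 0.0

    def t1_scores(X, W, n=1):
        """Score images with the n largest components of the decomposition.

        X: (num_images, d) image feature matrix.
        W: (d, t) image-side weight vectors, columns sorted so that column 0
           corresponds to the largest eigenvalue.
        n = 1 gives T1, n = t gives T1all, and a cross-validated n gives T1opt.
        """
        return (X @ W[:, :n]).mean(axis=1)

Under this reading, T1opt simply evaluates t1_scores for every n and keeps the n whose held-out average NDCG over the first 10 positions is largest, while T2 instead projects X onto the learnt image-side space (e.g. X @ W) and trains a secondary Ranking SVM on the projections.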
Figure 4 (plot of NDCGk at positions 1-10 for the Images, Eyes, T1, T2, T1all and T1opt rankings): Average NDCG performance across all users for predicting rankings on a page given only other data from a single specific user.

We plot the user-specific leave-page-out NDCG performances in figure 3, where we are able to observe that T2 consistently outperforms the image feature Ranking SVM across all users, demonstrating that it is indeed possible to improve on the image ranking with the incorporation of eye movement features during training. Furthermore, it is interesting to observe that for certain users T1opt improves on the ranking performance, suggesting that there is an optimal combination of the decomposed features that may further improve on the results.

In figure 4 we plot the average performance across all users. The figure shows that T1 and T1all are slightly worse than using the image histogram alone. However, when the number of largest components of the tensor decomposition is selected using cross-validation, the performance of the classifier improves and outperforms the Ranking SVM with eye movements. Furthermore, we are able to observe that we perform better than random (figure 2). Using classifier T2, the performance is improved above the Ranking SVM with image features and is competitive with the Ranking SVM with eye movement features.

7.2 User Generalisation

In the following section we focus on learning a global model using data from other users to predict rankings for a new unseen user. As the experiment is set up such that all users view the same pages, we employ a leave-user-leave-page-out routine:

    For all users i
        Withhold data from user i
        For all pages j
            Withhold page j from all users
            Train on all pages except j from all users except i
            Test on page j from user i
        Endfor
    Endfor
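A hedged Python rendering of this routine is sketched below; the data container and the train/evaluate callables are placeholders for whatever ranking model and NDCG evaluation are used, not part of the original implementation.

    def leave_user_leave_page_out(data, users, pages, train_fn, eval_fn):
        """data[(user, page)] holds the (features, relevance) pairs for one page.

        For every withheld user i and page j, a model is trained on all other
        pages from all other users and evaluated on page j of user i.
        Returns the eval_fn output (e.g. NDCG) per (user, page) pair.
        """
        results = {}
        for i in users:                      # withhold data from user i
            for j in pages:                  # withhold page j from all users
                train = [data[(u, p)]
                         for u in users if u != i
                         for p in pages if p != j]
                model = train_fn(train)      # train on pages != j, users != i
                results[(i, j)] = eval_fn(model, data[(i, j)])
        return results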
Therefore we only use the users from table 1 who viewed the same number of pages, i.e. users 1, 2, 3 and 6, which we refer to henceforth as users 1-4.

We evaluate the proposed approach with the following two settings:

• T1: using the largest component of the tensor decomposition in the form of a linear discriminator. We use the weight vector corresponding to the largest eigenvalue (as we have t weights).

• T2: we project the image features into the learnt semantic space (i.e. the decomposition on the image source) and train and test within the projected space a secondary Ranking SVM.

We plot in figure 5 the resulting NDCG performance for the leave-user-out routine. We are able to observe, with the exception of user 2 in figure 5(b), that T2 is able to outperform the Ranking SVM on image features, indicating that it is possible to generalise our proposed approach across new unseen users. Furthermore, it is interesting to observe that T2 achieves a similar performance to that of a Ranking SVM trained and tested on the eye features. Finally, even though we do not improve when testing on data from user 2, we are able to observe that we perform as well as the baselines. In figure 5(e) we plot the average NDCG performance for the leave-user-out routine, demonstrating that on average we improve on the ranking of new images for new users and that we perform better than random (figure 2).

8 Discussion

Improving search and content-based retrieval systems with implicit feedback is an attractive possibility, given that a user is not required to explicitly provide information in order to improve, and personalise, their search strategy. This, in turn, can render such a system more user-friendly and simple to use (at least from the users' perspective). However, achieving such a goal is non-trivial, as one needs to be able to combine the implicit feedback information into the search system in a manner that does not then require the implicit information for testing. In our study we focus on implicit feedback in the form of eye movements, as these are easily available and can be measured in a non-intrusive manner.

Previous studies [Hardoon et al. 2007; Ajanki et al. 2009] have shown the feasibility of such systems using eye movements for a textual search task, demonstrating that it is indeed possible to 'enrich' a textual search with eye features. Their proposed approach is computationally complex since it requires the construction of a regression function on eye measurements for each word, which was not realistic in our setting. Furthermore, Pasupa et al. [2009] extended the underlying methodology of using eye movements as implicit feedback to an image retrieval system, combining eye movements with image features to improve the ranking of retrieved images. Still, the proposed approach required eye features for the test images, which would not be practical in a real system.

In this paper we present a novel search strategy for combining eye movements and image features with a tensor product kernel used in a ranking support vector machine framework. We further show that the joint learnt semantic space of eye and image features can be efficiently decomposed into its independent sources, allowing us to further test or train using only images. We explored two different search scenarios for learning the ranking of images based on image and eye features. The first was predicting the ranking on a page given only other data from a single specific user. This experiment was to test the fundamental question of whether eye movements are able to improve ranking for a user. Having demonstrated that it was indeed possible to improve in the single-subject setting, we then proceeded to our second setting, where we constructed a global model across users in an attempt to generalise to data from a new user. Again our results demonstrated that we are able to generalise our model to new users. Despite these promising results, it was also clear that using a single direction (weight vector) does not necessarily improve on the baseline result, motivating the need for a more sophisticated combination of the resulting weights. This, as well as extending our experiment to a much larger number of users, will be addressed in a future study. Finally, we would also explore the notion of image segmentation and the use of more sophisticated image features that are easily computable.
Figure 3 (six panels, (a)-(f), one per user, each plotting NDCGk at positions 1-10 for the Images, Eyes, T1, T2, T1all and T1opt rankings): In the following sub-figures 3(a)-3(f) we illustrate the NDCG performance for each user in a leave-page-out routine, i.e. here we aim to generalise over new pages rather than new users. We are able to observe that T2 and T1opt routinely outperform the ranking using only image features. The 'Eyes' plot in all the figures demonstrates how the ranking (only using eye movements) would perform if eye features were indeed available a-priori for new images.

Figure 5 (five panels; (a)-(d) plot NDCGk at positions 1-10 for each withheld user 1-4 trained on the other three users, and (e) plots the leave-user-out average, with curves for Images, Eyes, T1 and T2): In the following sub-figures 5(a)-5(d) we illustrate the NDCG performance in a leave-user-out (leave-page-out) routine. The average NDCG performance is given in sub-figure 5(e), where we are able to observe that T2 outperforms the ranking using only image features. The 'Eyes' plot in all the figures demonstrates how the ranking (only using eye movements) would perform if eye features were indeed available a-priori for new images.

Acknowledgements

The authors would like to acknowledge financial support from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 216529, Personal Information Navigator Adapting Through Viewing (PinView) project (http://www.pinview.eu). The authors would also like to thank Craig Saunders for data collection.

References

Ajanki, A., Hardoon, D. R., Kaski, S., Puolamäki, K., and Shawe-Taylor, J. 2009. Can eyes reveal interest? Implicit queries from gaze patterns. User Modeling and User-Adapted Interaction 19, 4, 307-339.

Ben-Hur, A., and Noble, W. S. 2005. Kernel methods for predicting protein-protein interactions. Bioinformatics 21, i38-i46.

Cristianini, N., and Shawe-Taylor, J. 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

Hammoud, R. 2008. Passive Eye Monitoring: Algorithms, Applications and Experiments. Springer-Verlag.

Hardoon, D. R., Ajanki, A., Puolamäki, K., Shawe-Taylor, J., and Kaski, S. 2007. Information retrieval by inferring implicit queries from eye movements. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS), electronic proceedings at www.stat.umn.edu/aistat/proceedings/start.htm.

Herbrich, R., Graepel, T., and Obermayer, K. 2000. Large margin rank boundaries for ordinal regression. MIT Press, Cambridge, MA.

Järvelin, K., and Kekäläinen, J. 2000. IR evaluation methods for retrieving highly relevant documents. In SIGIR '00: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, 41-48.
Joachims, T. 2002. Optimizing search engines using clickthrough data. In KDD '02: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York, NY, USA, 133-142.

Klami, A., Saunders, C., de Campos, T. E., and Kaski, S. 2008. Can relevance of images be inferred from eye movements? In MIR '08: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, ACM, New York, NY, USA, 134-140.

Laaksonen, J., Koskela, M., Laakso, S., and Oja, E. 2000. PicSOM - content-based image retrieval with self-organizing maps. Pattern Recognition Letters 21, 13-14, 1199-1207.

Martin, S., Roe, D., and Faulon, J.-L. 2005. Predicting protein-protein interactions using signature products. Bioinformatics 21, 218-226.

Oyekoya, O., and Stentiford, F. 2007. Perceptual image retrieval using eye movements. International Journal of Computer Mathematics 84, 9, 1379-1391.

Pasupa, K., Saunders, C., Szedmak, S., Klami, A., Kaski, S., and Gunn, S. 2009. Learning to rank images from eye movements. In HCI '09: Proceedings of the IEEE 12th International Conference on Computer Vision Workshops on Human-Computer Interaction, 2009-2016.

Pulmannová, S. 2004. Tensor products of Hilbert space effect algebras. Reports on Mathematical Physics 53, 2, 301-316.

Qiu, J., and Noble, W. S. 2008. Predicting co-complexed protein pairs from heterogeneous data. PLoS Computational Biology 4, 4, e1000054.

Rayner, K. 1998. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin 124, 3 (November), 372-422.

Salojärvi, J., Puolamäki, K., Simola, J., Kovanen, L., Kojo, I., and Kaski, S. 2005. Inferring relevance from eye movements: Feature extraction. Tech. Rep. A82, Computer and Information Science, Helsinki University of Technology.