SlideShare a Scribd company logo
Image Ranking with Implicit Feedback from Eye Movements
                                      David R. Hardoon∗                                                    Kitsuchart Pasupa†
                                  Data Mining Department                                       School of Electronics & Computer Science
                           Institute for Infocomm Research (I2 R)                                     University of Southampton


Abstract                                                                                           dure. Moreover, it is far from an ideal situation as both formulating
                                                                                                   an initial query and navigating the large number of retrieved hits
In order to help users navigate an image search system, one could                                  is a difficult. One image retrieval methodology which attempts to
provide explicit information on a small set of images as to which                                  address these issues, and has been a research topic since the early
of them are relevant or not to their task. These rankings are learned                              1990’s, is the so-called “Content-Based Image Retrieval”(CBIR).
in order to present a user with a new set of images that are rel-                                  The search of a CBIR system is analysed from the actual content
evant to their task. Requiring such explicit information may not                                   of the image which may includes colour, shape, and texture rather
be feasible in a number of cases, we consider the setting where                                    than using a textual annotation associated (if at all) with the image.
the user provides implicit feedback, eye movements, to assist when
performing such a task. This paper explores the idea of implic-                                    Relevance feedback, which is explicitly provided by the user while
itly incorporating eye movement features in an image ranking task                                  performing a search query on the quality of the retrieved images,
where only images are available during testing. Previous work had                                  has shown to be able to improve on the performance of CBIR sys-
demonstrated that combining eye movement and image features im-                                    tems, as it is able to handle the large variability in semantic in-
proved on the retrieval accuracy when compared to using each of                                    terpretation of images across users. Relevance feedback will iter-
the sources independently. Despite these encouraging results the                                   atively guide the system to retrieve images the user is genuinely
proposed approach is unrealistic as no eye movements will be pre-                                  interested in. Many systems rely on an explicit feedback mech-
sented a-priori for new images (i.e. only after the ranked images are                              anism, where the user explicitly indicates which images are rele-
presented would one be able to measure a user’s eye movements                                      vant for their search query and which ones are not. One can then
on them). We propose a novel search methodology which com-                                         use a machine learning algorithm to try and present a new set of
bines image features together with implicit feedback from users’                                   images to the users which are more relevant - thus helping them
eye movements in a tensor ranking Support Vector Machine and                                       navigate the large number of hits. An example of such systems
show that it is possible to extract the individual source-specific                                  is PicSOM [Laaksonen et al. 2000]. However, providing explicit
weight vectors. Furthermore, we demonstrate that the decomposed                                    feedback is also a laborious process as it requires continues user re-
image weight vector is able to construct a new image-based seman-                                  sponse. Alternatively, it is possible to use implicit feedback to infer
tic space that outperforms the retrieval accuracy than when solely                                 relevance of images. Examples of implicit feedback are eye move-
using the image-features.                                                                          ments, mouse pointer movements, blood pressure, gestures, etc. In
                                                                                                   other words, user responses that are implicitly related to the task
                                                                                                   performed.
CR Categories: G.3 [Probability and Statistics]: Multivariate
Statistics—; H.3.3 [Information Search and Retrieval]: Retrieval                                   In this study we explore the use of eye movements as a particu-
models—Relevance feedback                                                                          lar source of implicit feedback to assist a user when performing
                                                                                                   such a task (i.e. image retrieval). Eye movements can be treated
Keywords: Image Retrieval, Implicit Feedback, Tensor, Ranking,                                     as an implicit relevance feedback when the user is not consciously
Support Vector Machine                                                                             aware of their eye movements being tracked. Eye movement as
                                                                                                   implicit feedback has recently been used in the image retrieval set-
                                                                                                   ting [Oyekoya and Stentiford 2007; Klami et al. 2008; Pasupa et al.
1     Introduction                                                                                 2009]. [Oyekoya and Stentiford 2007; Klami et al. 2008] used eye
                                                                                                   movements to infer a binary judgement of relevance while [Pasupa
In recent years large digital image collections have been created in                               et al. 2009] makes the task more complex and realistic for search-
numerous areas, examples of these include the commercial, aca-                                     based task by asking the user to give multiple judgement of rele-
demic, and medical domains. Furthermore, these databases also in-                                  vance. Furthermore, earlier studies of Hardoon et al. [2007] and
clude the digitisation of analogue photographs, paintings and draw-                                Ajanki et al. [2009] explored the problem of where an implicit in-
ings. Conventionally, the images collected are manually tagged                                     formation retrieval query is inferred from eye movements measured
with various descriptors to allow retrieval to be performed over the                               during a reading task. The result of their empirical study is that it
annotated words. However, the process of manually tagging images                                   is possible to learn the implicit query from a small set of read doc-
is an extremely laborious, time consuming and an expensive proce-                                  uments, such that relevance predictions for a large set of unseen
    ∗ e-mail:
                                                                                                   documents are ranked better than by random guessing. More re-
                drhardoon@i2r.a-star.edu.sg
    † e-mail:                                                                                      cently, Pasupa et al. [2009] demonstrated that ranking of images
                kp2@ecs.soton.ac.uk
                                                                                                   can be inferred from eye movements using Ranking Support Vector
                                                                                                   Machine (Ranking SVM). Their experiment shows that the perfor-
                                                                                                   mance of the search can be improved when simple images features
                                                                                                   namely histograms are fused with the eye movement features.
Copyright © 2010 by the Association for Computing Machinery, Inc.                                  Despite Pasupa et al.’s [2009] encouraging results, their proposed
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
                                                                                                   approach is largely unrealistic as they combine image and eye fea-
for commercial advantage and that copies bear this notice and the full citation on the             tures for both training and testing. Whereas in a real scenario no
first page. Copyrights for components of this work owned by others than ACM must be                eye movements will be presented a-priori for new images. In other
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on         words, only after the ranked images are presented to a user, would
servers, or to redistribute to lists, requires prior specific permission and/or a fee.
Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail
                                                                                                   one be able to measure the users’ eye movements on them. There-
permissions@acm.org.                                                                               fore, we propose a novel search methodology which combines im-
ETRA 2010, Austin, TX, March 22 – 24, 2010.
© 2010 ACM 978-1-60558-994-7/10/0003 $10.00

                                                                                             291
age features together with implicit feedback from users’ eye move-                with label ck and m being the total number of new samples. Fi-
ments during training, such that we are able to rank new images                   nally, we quote from Cristianini and Shawe-Taylor [2000] the gen-
with only using image features. We believe it is indeed more real-                eral dual SVM optimisation as
istic to have images and eye-movements during the training phase
                                                                                                                m                  m
as these could be acquired deliberately to train up such a system.                                                            1
                                                                                         max W (α) =                  αi −                αi αj ci cj κ(xi , xj )   (3)
                                                                                           α                                  2
For this purpose, we propose using tensor kernels in the ranking                                                i=1               i,j=1
SVM framework. Tensors have been used in the machine learning                                             m
literature as a means of predicting edges in a protein interaction or                   subject to        i=1   αi ci = 0 and αi ≥ 0 i = 1, . . . , m,
co-complex network by using the tensor product transformation to
                                                                                  where we again use ci to represent the label and κ(xi , xj ) to be the
derive a kernel on protein pairs from a kernel on individual pro-
                                                                                  kernel function between φ(x)i and φ(x)j , where φ(·) is a mapping
teins [Ben-Hur and Noble 2005; Martin et al. 2005; Qiu and No-
                                                                                  from X (or Y ) to an (inner product) feature space F .
ble 2008]. In this study we use the tensor product to constructed
a joined semantics space by combining eye movements and im-
age features. Furthermore, we continue to show that the combined                  3   Tensor Ranking SVM
learnt semantic space can be efficiently decomposed into its con-
tributing sources (i.e. images and eye movements), which in turn                  In the following section we propose to construct a tensor kernel
can be used independently.                                                        on the ranked image and eye movements features, i.e. following
                                                                                  equation (1), to then to train an SVM. Therefore, let X ∈ Rn×m
The paper is organised as follows. In Section 2 we give a brief intro-            and Y ∈ Rℓ×m be the matrix of sample vectors, x and y, for the
duction to the ranking SVM methodology and continue to develop                    image and eye movements respectively, where n is the number of
in Section 3 our proposed tensor ranking SVM and the efficient de-                 image features and ℓ is the number of eye movement features and
composition of the joint semantic space into the individual sources.              m are the total number of samples. We continue to define K x , K y
In Section 5 we give our experimental set up whereas in Section 6                 as the kernel matrices for the ranked images and eye movements
we discusses the feature extraction and representation of the images              respectively. In our experiments we use linear kernels, i.e. K x =
and eye movements. In section 7 we bring forward our experiments                  X ′ X and K y = Y ′ Y . The resulting kernel matrix of the tensor
on page ranking for individual users as well as a feasibility study on            T = X ◦Y can be expressed as pair-wise product (see [Pulmannov´  a
user generalisation. Finally, we conclude our study with discussion               2004] for more details)
on our present methodology and results in Section 8.
                                                                                                          ¯
                                                                                                          Kij = (T ′ T )ij = Kij Kij .
                                                                                                                              x   y


2    Ranking SVM                                                                            ¯
                                                                                  We use K in conjunction with the vanilla SVM formulation as
                                                                                  given in equation (3). Whereas the set up and training are straight
The Ranking Support Vector Machine (SVM) was proposed by                          forward, the underlying problem is that for testing we do not have
Joachims [2002] which was adapted from ordinal regression [Her-                   the eye movements. Therefore we propose to decompose the re-
brich et al. 2000]. It is a pair-wise approach where the solution is              sulting weight matrix from its corresponding image and eye com-
a binary classification problem. Let xi denote some feature vector                 ponents such that each can be used independently.
and let ri denote the ranking assigned to xi . If r1 ≻ r2 , it means
that x1 is more relevance than x2 . Consider a linear ranking func-               The goal is to decompose the weight matrix W given by a dual
tion,                                                                             representation
               xi ≻ xj ⇐⇒ w, xi − w, xj > 0,                                                                     m

where w is a weight vector and ·, · denotes dot product between                                      W =               αi ci φx (xi ) ◦ φy (yi )
                                                                                                                  i
vectors. This can be placed in a binary SVM classification frame-
work where let ck be the new label indicating the quality of kth                  without accessing the feature space. Given the paired samples x, y
rank pair,                                                                        the decision function in equation is

           w, xi − xj =
                                 ck = +1         if ri ≻ rj
                                                            ,        (1)                       f (x, y)     =         W ◦ φx (x)φy (y)′                             (4)
                                 ck = −1         if rj ≻ ri                                                           m
                                                                                                            =               αi ci κx (xi , x)κy (yi , y).
which can be solved by the following optimisation problem,                                                            i=1

                              1
                      min       w, w + C         ξk                  (2)          4   Decomposition
                              2
                                             k
                                                                                  The resulting decision function in equation (4) requires both image
subject to the following constrains:                                              and eye movement (xi , yi ) data for training and testing. We want
                                                                                  to be able to test our model only using the image data. Therefore,
       ∀(i, j) ∈ r (k)    :    ck ( w, xi − xj + b) ≥ 1 − ξk                      we want to decompose the weight matrix (again without accessing
                ∀(k)      :    ξk ≥ 0                                             the feature space) into a sum of tensor products of corresponding
                                                                                  weight components for the images and eye movements
where r (k) = [r1 , r2 , . . . , rt ] for t rank values, furthermore C is                                                         T
a hyper-parameter which allows trade-off between margin size and                                                                               ′
                                                                                                          W ≈ WT =                       t  t
                                                                                                                                        wx wy ,                     (5)
training error, and ξk is training error. Alternatively, we are repre-                                                            t=1
sent the ranking SVM as a vanilla SVM where we re-represent our
samples as                                                                                                                                          t
                                                                                  such that the weights are a linear combination of the data, i.e. wx =
                                                                                    m
                           φ(x)k = xi − xj                                          i=1  βi φx (xi ) and wy = m γi φy (yi ) where β t , γ t are the
                                                                                          t                t
                                                                                                                    i=1
                                                                                                                        t




                                                                            292
t   t
dual variables of wx , wy . We proceed to define our decomposition                                         5     Experimental Setup
procedure such that we do not need to compute the (potentially non-
linear) feature projection φ. We compute                                                                  Our experimental set-up is as follows: Users are shown 10 images
                               m                                                                          on a single page as a five by two (5x2) grid and are asked to rank
         WW′ =                       αi αj ci cj κy (yi , yj )φx (xi )φx (xj )′               (6)         the top five images in order of relevance to the topic of “Transport”.
                               i,j
                                                                                                          This concept is deliberately slightly ambiguous given the context
                                                                                                          of images that were displayed. Each displayed page contained 1–3
and are able to express K y                                   =      (κy (yi , yj ))m         =           clearly relevant images (e.g. a freight train, cargo ship or airliner),
                                                                                    i,j=1
  K                   k′
                                                                                                          2–3 either borderline or marginally relevant images (e.g. bicycle or
  k=1  λk uk u = U ΛU ′ , where U = (u1 , . . . , uK ) by perform-                                        baby carrier), and the rest are non-relevant images (e.g. images of
ing an eigenvalue decomposition of the kernel matrix K y with en-                                         people sitting at a dining room table, or a picture of a cat).
       y
tries Kij = κy (yi , yj ). Substituting back into equation (6) gives
                                                                                                          The experiment had 30 pages in total, each showing 10 images from
                           K           m                                                                  the PASCAL Visual Objects Challenge 2007 database [Everingham
                 ′                                            ′
        WW =                     λk          αi αj ci cj uk uk φx (xi )φx (xj )′ .
                                                          i j                                             et al. ]. The interface consisted of selecting radio buttons (labelled
                           k           i,j                                                                1st to 5th under each image) then clicking on next to retrieve the
                                                                                                          next page. This represents data for a ranking task where explicit
                         m          k                            ′
Letting hk =             i=1 αi ci ui φx (xi ) we have W W           =                                    ranks are given to compliment any implicit information contained in
  K                                       √                √
             ′         ′
     λk hk hk = HH where H =                λ1 h1 , . . . , λK hK . We                                    the eye movements. An example of each page is shown in figure 1.
  k
would like to find the singular value decomposition of H = V ΥZ ′ .
Consider for A = diag(α) and C = diag(c) we have                                                          The experiment was performed by six different users, with their eye
                                                                                                          movements recorded by a Tobii X120 eye tracker which was con-
                                                                                                          nected to a PC using a 19-inch monitor (resolution of 1280x1024).
        H ′H         kℓ
                           =             λk λℓ           αi αj ci cj uk uℓ κx (xi , xj )
                                                                      i j                                 The eye tracker has approximately 0.5 degrees of accuracy with a
                                                   ij
                                                                                                          sample rate of 120 Hz and used infrared lens to detect pupil centres
                                                     1    ′                     1
                           =             CAU Λ 2              K x CAU Λ 2                 ,               and corneal reflection. The final data collected per user is illus-
                                                                                     kℓ                   trated in table 1. Any pages that contained less than five images
                                                                                                          with gaze points (for example due to the subject moving and the
which is computable without accessing the feature space. Perform-                                         eye-tracker temporarily losing track of the subject’s eyes) were dis-
ing an eigenvalue decomposition on H ′ H we have                                                          carded. Hence, only 29 and 20 pages were valid for users 4 and 5,
                                                                                                          respectively.
                      H ′H           =       ZΥV ′ V ΥZ ′ = ZΥ2 Z ′                           (7)

with Υ a matrix with υt on the diagonal truncated after the J’th                                          Table 1: The data collected per user. ∗ Pages with less than five
                                                           1
eigenvalue, which gives the dual representation of vt = υt Hzt for                                        images with gaze points were removed. Therefore users 4 and 5
                              ′       2                                                                   only have 29 and 20 pages viewed respectively.
t = 1, . . . , T , and since H Hzt = υt zt we are able to verify that

                                                  1                                                                             User #     Pages Viewed
                                                                          2
   W W ′ vt           =        HH ′ vt =             HH ′ Hzt = υt Hzt = υt vt .                                                User 1          30
                                                  υt
                                                                                                                                User 2          30
                                                                                                                                User 3          30
Restricting to the first T singular vectors allows us to express W ≈                                                             User 4∗         29
                           ′
W T = T vt (W ′ vt ) , which in turn results in                                                                                 User 5∗         20
           t=1
                                                                                                                                User 6          30
                                                              m
                       t                     1                       t
                      wx = vt =                 Hzt =               βi φx (xi ),
                                             υt               i=1
                                                                                                          5.1   Performance Measure
         t           1               T       √
where   βi   =         αc
                     υt i i          k=1         λk zt uk
                                                     k i      . We can now also express
                                                                                                          We use the Normalised Discount Cumulative Gain
                                                                    m                                     (NDCG) [J¨ rvelin and Kek¨ l¨ inen 2000] as our performance
                                                                                                                      a                  aa
                                             1 ′                                                          metric, due to our task involving multiple ranks rather than a binary
             wy = W ′ vt =
              t
                                                W Hzt =                    t
                                                                          γi φy (yi ),
                                             υt                     i=1
                                                                                                          choice. NDCG measures the usefulness, or gain, of a retrieved
                                                                                                          item based on its position in the result list. NDCG is designed for
where γi = m αi ci βj κx (xi , xj ) are the dual variables of wy .
        t                 t                                    t                                          tasks which have more than two levels of relevance judgement, and
                j=1
We are therefore now able to decompose W into Wx , Wy without                                             is defined as,
accessing the feature space giving us the desired result.
                                                                                                                                                 k
                                                                                                                                            1
We are now able to compute, for a given t, the ranking scores                                                             NDCGk (r) =                 D(ri )ϕ(gi )
                                                         t    ˆ                                                                            Nn
in the linear discriminant analysis form s = wx φ(X) =                                                                                          i=1
   m     t         ˆ                      ˆ
       βi κx (xi , X) for new test images X. These are in turn sorted
   i=1
                                                                                                                               1
in order of magnitude (importance). Equally, we can project our                                           with D(r) = log (1+r) and ϕ(g) = 2g − 1, where for a given
                                                                                                                             2
data into the new defined semantic space β where we train and test                                         page r is rank position and k is a truncation level (position), N is a
                             ˜
an SVM. i.e. we compute φ(x) = K x β, for the training samples,                                           normalising constant which gives the perfect ranking (based on gi )
     ˜             x
and φ(xt ) = Kt β for our test samples. We explore both these                                             equal to one, and gi is the categorical grade; e.g. grade is equal to 5
approaches in our experiments.                                                                            for the 1st rank and 0 for the 6th .


                                                                                                    293
Figure 1: An example illustrating the outlay of the interface displaying the 10 images with the overlaid eye movement measurements. The
circles indicate fixations.


6   Feature extraction                                                                                 0.65
                                                                                                                              Random Performance


                                                                                                        0.6

                                                                                                       0.55

                                                                                                        0.5

In the following experiments we use standard image histograms and                                      0.45




                                                                                               NDCGk
features collected from eye-tracking. We compute a 256-bin grey                                         0.4


scale histogram on the whole image as the feature representation.                                      0.35

                                                                                                        0.3
These features are intentionally kept relatively simple. Although,                                     0.25

a possible extension of the current representation is to segment the                                    0.2

image and only use regions that have gaze information. We intend                                              1   2   3   4       5         6      7   8     9      10
                                                                                                                                   Positions
to explore this extension in a future study.
                                                                                Figure 2: NDCG performance for predicting random rankings.
The eye movement features are computed using only on the eye
trajectory and locations of the images in the page. This type of
features are general-purpose and are easily applicable to all appli-
cation scenarios. The features are divided into two categories; the            mended for media with mixed content1 .
first category uses the raw measurements obtained from the eye-                 Some of the features are not invariant to the location of the image
tracker, whereas the second category is based on fixations estimated            on the screen. For example, the typical pattern of moving from
from the raw data. A fixation means a period in which a user main-              left to right means that the horizontal co-ordinate of the first fixa-
tains their gaze around a given point. These are important as most             tion for the left-most image of each row typically differs from the
visual processing happens during fixations, due to blur and sac-                corresponding measure on the other images. Features that were ob-
cadic suppression during the rapid saccades between fixations (see,             served to be position-dependent were normalised by removing the
e.g. [Hammoud 2008]). Often visual attention features are based                mean of all observations sharing the same position, and are marked
solely on fixations and the relation between them [Rayner 1998].                in Table 2. Finally, each feature was normalised to have unit vari-
However, raw measurement data might be able to overcome possi-                 ance and zero mean.
ble problems caused by imperfect fixation detection.

                                                                               7     Experiments
In table 2 we list the candidate features we have considered. Most of
the listed features are motivated by earlier studies in text retrieval
                                                                               We evaluate two different scenarios for learning the ranking of im-
[Saloj¨ rvi et al. 2005]. The features cover the three main types
       a
                                                                               age based on image and eye features; 1. Predicting rankings on a
of information typically considered in reading studies: fixations,
                                                                               page given only other data from a single specific user. 2. A global
regressions (fixations to previously seen images), and re-fixations
                                                                               model using data from other users to predict rankings for a new
(multiple fixations within the same image). However, the features
                                                                               unseen user.
have been tailored to be more suitable for images, trying to include
measures for things that are not relevant for text, such as the cover          We compare our proposed tensor Ranking SVM algorithm which
of the image. Similarly to the image features, the eye movement                combines both information from eye movements and image his-
features are intentionally kept relatively simple with the intent that         togram features to a Ranking SVM using histogram features and
they are more likely to generalise over different users. Fixations             to a Ranking SVM using eye movements alone. We emphasis that
were detected using the standard ClearView fixation filter provided
with the Tobii eye-tracking software, with settings “radius 30 pix-                1 Tobii   Technology, Ltd.          Tobii                               Studio        Help.   url:
els, minimum duration 100 ms”. These are also the settings recom-              http://studiohelp.tobii.com/StudioHelp 1.2/



                                                                         294
Table 2: We list the eye movement features considered in this study. The first 16 features are computed from the raw data, whereas the
remainder are based on pre-detected fixations. We point out to the reader that features number 2 and 3 use both types of data since they are
based on raw measurements not belonging to fixations. All the features are computed separately for each image. Features marked with ∗ are
normalised for each image location.

                                   Number        Name                 Description

                                                                        Raw data features
                                   1             numMeasurements      total number of measurements
                                   2             numOutsideFix        total number of measurements outside fixations
                                   3             ratioInsideOutside   percentage of measurements inside/outside fixations
                                   4             xSpread              difference between largest and smallest x-coordinate
                                   5             ySpread              difference between largest and smallest y-coordinate
                                   6             elongation           ySpread/xSpread
                                   7             speed                average distance between two consecutive measurements
                                   8             coverage             number of subimages covered by measurements1
                                   9             normCoverage         coverage normalized by numMeasurements
                                   10∗           landX                x-coordinate of the first measurement
                                   11∗           landY                y-coordinate of the first measurement
                                   12∗           exitX                x-coordinate of the last measurement
                                   13∗           exitY                y-coordinate of the last measurement
                                   14            pupil                maximal pupil diameter during viewing
                                   15∗           nJumps1              number of breaks longer than 60 ms2
                                   16∗           nJumps2              number of breaks longer than 600 ms2

                                                                        Fixation features
                                   17            numFix               total number of fixations
                                   18            meanFixLen           mean length of fixations
                                   19            totalFixLen          total length of fixations
                                   20            fixPrct               percentage of time spent in fixations
                                   21∗           nJumpsFix            number of re-visits to the image
                                   22            maxAngle             maximal angle between two consecutive saccades3
                                   23∗           landXFix             x-coordinate of the first fixation
                                   24∗           landYFix             y-coordinate of the first fixation
                                   25∗           exitXFix             x-coordinate of the last fixation
                                   26∗           exitYFix             y-coordinate of the last fixation
                                   27            xSpreadFix           difference between largest and smallest x-coordinate
                                   28            ySpreadFix           difference between largest and smallest y-coordinate
                                   29            elongationFix        ySpreadFix/xSpreadFix
                                   30            firstFixLen           length of the first fixation
                                   31            firstFixNum           number of fixations during the first visit
                                   32            distPrev             distance to the fixation before the first
                                   33            durPrev              duration of the fixation before the first
                                   1
                                       The image was divided into a regular grid of 4x4 subimages.
                                   2
                                       A sequence of measurements outside the image occurring between two consecutive measure-
                                       ments within the image.
                                   3
                                       A transition from one fixation to another.



training and testing a model using only eye movements is not re-                    remaining pages, from the same user, are used for training.
alistic as there are no eye movements presented a-priori for new
images, i.e. one can not test. This comparison provides us with                     We evaluate the proposed approach with the following four setting:
a baseline as to how much it may be possible to improve on the                          • T 1: using the largest component of tensor decomposition in
performance using eye movements. Furthermore, we are unable to                            the form of a linear discriminator. We use the weight vec-
make direct comparison to [2009] as they had used an online learn-                        tor corresponding to the largest eigenvalue (as we have a t
ing algorithm with different image features.                                              weights).
In the experiments we use a linear kernel function. Although, it is                     • T 2: we project the image features into the learnt semantic
possible to use a non-linear kernel on the eye movement features as                       space (i.e. the decomposition on the image source) and train
this would not effect the decomposition for the image weights (as-                        and test within the projected space a secondary Ranking SVM
suming that φx (xi ) are taken as the image features in equation (6)).                    (Ranking SVM).
In figure 2 we give the NDCG performance for predicting random
ranking.                                                                                • T 1all : similar to T 1 although here we use all t weight vectors
                                                                                          and take the mean value across as the final score.

7.1   Page Generalisation                                                               • T 1opt : similar to T 1 although here we use the n-largest com-
                                                                                          ponents of the decomposition. i.e. we select n weight vectors
                                                                                          to use and take the mean value across as the final score.
In the following section we focus on predicting rankings on a page
given only other data from a single specific user (we repeat this                    We use a leave-one-out cross-validation for T 1opt to obtain the op-
for all users). We employ a leave-page-out routine where at each                    timal model for the later case which are selected based on maximum
iteration a page, from a given user, is withheld for testing and the                average NDCG across 10 positions.


                                                                             295
Average
                0.7
                                                                                 • T 2: we project the image features into the learnt seman-
                                                                                   tic space (i.e. the decomposition on the image source) and
                0.6                                                                train and test within the projected space a secondary Ranking
                                                                                   SVM.
                0.5
                                                                             We plot in figure 5 the resulting NDCG performance for the leave-
        NDCGk




                0.4
                                                                             user-out routine. We are able to observe, with the exclusion of user
                                                                             2 in figure 5(b), that T 2 is able to outperform the Ranking SVM
                0.3                                    Images
                                                                             on image features. Indicating that it is possible to generalise our
                                                       Eyes                  proposed approach across new unseen users. Furthermore, it is in-
                                                       T1
                0.2                                    T2
                                                                             teresting to observe that T 2 achieves a similar performance to that
                                                       T1
                                                         all
                                                                             of a Ranking SVM trained and tested on the eye features. Finally,
                                                         opt
                                                       T1
                0.1
                                                                             even though we do not improve when testing on data from user 2,
                   1   2   3   4   5     6     7   8   9        10           we are able to observe that we perform as-good-as the baselines. In
                                   Positions
                                                                             figure 5(e) we plot the average NDCG performance on the leave-
                                                                             user-out routine, demonstrating that on average we improve on the
Figure 4: Average NDCG performance across all users for predict-             ranking of new images for new users and that we perform better
ing rankings on a page given only other data from a single specific           than random (figure 2).
user.

                                                                             8    Discussion
We plot the user specific leave-page-out NDCG performances in
figure 3 where we are able to observe that T 2 consistently outper-           Improving search and content based retrieval systems with implicit
forms the image feature Ranking SVM across all users, demonstrat-            feedback is an attractive possibility given that a user is not required
ing that it is indeed possible to improve on the image ranking with          to explicitly provide information to then improve, and personalise,
the incorporation of eye movement features during training. Fur-             their search strategy. This, in turn, can render such a system more
thermore, it is interesting to observe that for certain users T 1opt         user-friendly and simple to use (at least from the users’ perspec-
improves on the ranking performance, suggesting that there is an             tive). Although, achieving such a goal is non-trivial as one needs
optimal combination of the decomposed features that may further              to be able to combine the implicit feedback information into the
improve on the results.                                                      search system in a manner that does not then require the implicit
                                                                             information for testing. In our study we focus on implicit feedback
In figure 4 we plot the average performance across all users. The             in the form of eye movements, as these are easily available and can
figure shows that T 1 and T 1all are slightly worse than using image          be measured in a non-intrusive manner.
histogram alone. However, when selecting using cross-validation
the number of largest components in tensor decomposition, the per-           Previous studies [Hardoon et al. 2007; Ajanki et al. 2009] have
formance of the classifier is improved and outperforms the Ranking            shown the feasibility of such systems using eye moments for a tex-
SVM with eye movements. Furthermore, we are able to observe                  tual search task. Demonstrating that it is indeed possible to ‘en-
that we perform better than random (figure 2). Using classifier T 2,           rich’ a textual search with eye features. Their proposed approach
the performance is improved above the Ranking SVM with image                 is computationally complex since it requires the construction of a
features and it is competitive with Ranking SVM with eye move-               regression function on eye measurements on each word. This was
ments features.                                                              not realistic in our setting. Furthermore, Pasupa et al. [2009] had
                                                                             extend the underlying methodology of using eye movement as im-
7.2   User Generalisation                                                    plicit feedback to an image retrieval system, combining eye move-
                                                                             ments with image features to improve the ranking of retrieved im-
In the following section we focus on learning a global model using           ages. Although, still, the proposed approach required eye features
data from other users to predict rankings for a new unseen user.             for the test images which would not be practical in a real system.
Although, as the experiment is set up such that all users view the           In this paper we present a novel search strategy for combining eye
same pages, we employ a leave-user-leave-page-out routine, i.e;              movements and image features with a tensor product kernel used in
For all users                                                                a ranking support vector machine framework. We continue to show
   Withhold data from user i                                                 that the joint learnt semantic space of eye and image features can
   For all pages                                                             be efficiently decomposed into its independent sources allowing us
      Withhold page j from all users                                         to further test or train only using images. We explored two different
      Train on all pages-j from all users - i                                search scenarios for learning the ranking of images based on image
                                                                             and eye features. The first was predicting ranking on a page given
      Test on page j from user i                                             only other data from a single specific user. This experiment was
   Endfor                                                                    to test the fundamental question of whether eye movement are able
Endfor                                                                       to improve ranking for a user. Demonstrating that it was indeed
                                                                             possible to improve in the single subject setting, we then proceeded
Therefore we only use the users from table 1 who viewed the same             to our second setting where we constructed a global model across
number of pages, i.e. users 1, 2, 3 and 6, which we refer to hence-          users in attempt to generalise on data from a new user. Again our
forth as users 1-4.                                                          results demonstrated that we are able to generalise out model to new
We evaluate the proposed approach with the following two setting:            users. Despite these promising results, it was also clear that using
                                                                             a single direction (weight vector) does not necessarily improve on
  • T 1: using the largest component of tensor decomposition in              the baseline result. Motivating the need for a more sophisticated
    the form of a linear discriminator. We use the weight vec-               combination of the resulting weights. This, as well as extending
    tor corresponding to the largest eigenvalue (as we have a t              our experiment to a much larger number of users, will be addressed
    weights).                                                                in a future study. Finally, we would also explore the notion of image


                                                                       296
User 1                                                      User 2                                                      User 3
                  0.7                                                         0.7                                                         0.7


                  0.6                                                         0.6                                                         0.6


                  0.5                                                         0.5                                                         0.5


                  0.4                                                         0.4                                                         0.4
          NDCGk




                                                                      NDCGk




                                                                                                                                  NDCGk
                  0.3                                                         0.3                                                         0.3

                                                        Images                                                      Images                                                      Images
                  0.2                                   Eyes                  0.2                                   Eyes                  0.2                                   Eyes
                                                        T1                                                          T1                                                          T1
                                                        T2                                                          T2                                                          T2
                  0.1                                     all                 0.1                                     all                 0.1                                     all
                                                        T1                                                          T1                                                          T1
                                                          opt                                                         opt                                                         opt
                                                        T1                                                          T1                                                          T1
                   0                                                           0                                                           0
                    1   2   3   4   5     6     7   8   9        10             1   2   3   4   5     6     7   8   9        10             1   2   3   4   5     6     7   8   9        10
                                    Positions                                                   Positions                                                   Positions

              (a) NDCG performance within user 1                          (b) NDCG performance within user 2                          (c) NDCG performance within user 3
                                    User 4                                                      User 5                                                      User 6
                  0.7                                                         0.7                                                         0.7


                  0.6                                                         0.6                                                         0.6


                  0.5                                                         0.5                                                         0.5


                  0.4                                                         0.4                                                         0.4
          NDCGk




                                                                      NDCGk




                                                                                                                                  NDCGk
                  0.3                                                         0.3                                                         0.3

                                                        Images                                                      Images                                                      Images
                  0.2                                   Eyes                  0.2                                   Eyes                  0.2                                   Eyes
                                                        T1                                                          T1                                                          T1
                                                        T2                                                          T2                                                          T2
                  0.1                                     all                 0.1                                     all                 0.1                                     all
                                                        T1                                                          T1                                                          T1
                                                          opt                                                         opt                                                         opt
                                                        T1                                                          T1                                                          T1
                   0                                                           0                                                           0
                    1   2   3   4   5     6     7   8   9        10             1   2   3   4   5     6     7   8   9        10             1   2   3   4   5     6     7   8   9        10
                                    Positions                                                   Positions                                                   Positions

              (d) NDCG performance within user 4                          (e) NDCG performance within user 5                              (f) NDCG performance within user 6


Figure 3: In the following sub-figures 3(a)-3(f) we illustrate the NDCG performance for each user in a leave-page-out routine, i.e. here
we aim to generalise over new pages rather new users. We are able to observe that T 2 and T 1opt routinely outperform the ranking with
only using image features. The ‘Eyes’ plot in all the figures demonstrates how the ranking (only using eye-movements) would perform if
eye-features were indeed available a-priori for new images.


segmentation and the use of more sophisticated image features that                                        C RISTIANINI , N., AND S HAWE -TAYLOR , J. 2000. An Introduc-
are easily computable.                                                                                       tion to Support Vector Machines and other kernel-based learning
                                                                                                             methods. Cambridge University Press.
Acknowledgements                                                                                          E VERINGHAM , M., VAN G OOL , L., W ILLIAMS , C. K. I.,
                                                                                                             W INN , J., AND Z ISSERMAN , A. The PASCAL Visual Object
The authors would like to acknowledge financial support from                                                  Classes Challenge 2007 (VOC2007) Results. http://www.pascal-
the European Community’s Seventh Framework Programme                                                         network.org/challenges/VOC/voc2007/workshop/index.html.
(FP7/2007–2013) under grant agreement n◦ 216529, Personal In-
formation Navigator Adapting Through Viewing (PinView) project                                            H AMMOUD , R. 2008. Passive Eye Monitoring: Algorithms, Appli-
(http://www.pinview.eu). The authors would also like to thank                                                cations and Experiments. Springer-Verlag.
Craig Saunders for data collection.                                                                                                                      ¨
                                                                                                          H ARDOON , D. R., A JANKI , A., P UOLAM AKI , K., S HAWE -
                                                                                                             TAYLOR , J., AND K ASKI , S. 2007. Information retrieval
References                                                                                                   by inferring implicit queries from eye movements. In Pro-
                                                                                                             ceedings of the 11th International Conference on Artificial In-
                                                       ¨                                                     telligence and Statistics (AISTATS), Electronic proceedings at
A JANKI , A., H ARDOON , D. R., K ASKI , S., P UOLAM AKI , K.,
                                                                                                             www.stat.umn.edu/ aistat/proceedings/start.htm.
   AND S HAWE -TAYLOR , J. 2009. Can eyes reveal interest?
   implicit queries from gaze patterns. User Modeling and User-                                           H ERBRICH , R., G RAEPEL , T., AND O BERMAYER , K. 2000.
   Adapted Interaction 19, 4, 307–339.                                                                       Large margin rank boundaries for ordinal regression. MIT
                                                                                                             Press, Cambridge, MA.
B EN -H UR , A., AND N OBLE , W. S. 2005. Kernel methods for
   predicting protein-protein interactions. Bioinformatics 21, i38–                                         ¨                        ¨ ¨
                                                                                                          J ARVELIN , K., AND K EK AL AINEN , J. 2000. IR evaluation meth-
   i46.                                                                                                       ods for retrieving highly relevant documents. In SIGIR ’00:


                                                                                                297
User 1 − withheld (trained on 3 users)                                                                    User 2 − withheld (trained on 3 users)                                                               User 3 − withheld (trained on 3 users)
                 0.65                                                                                                        0.65                                                                                                0.8

                                Images                                                                                                                                                                                                         Images
                  0.6                                                                                                         0.6           Images
                                Eyes                                                                                                                                                                                             0.7           Eyes
                                                                                                                                            Eyes
                                T1                                                                                                                                                                                                             T1
                                                                                                                                            T1
                 0.55           T2                                                                                           0.55                                                                                                              T2
                                                                                                                                            T2
                                                                                                                                                                                                                                 0.6
                  0.5                                                                                                         0.5


                 0.45                                                                                                        0.45                                                                                                0.5
            k




                                                                                                                     NDCGk




                                                                                                                                                                                                                         NDCGk
          NDCG




                  0.4                                                                                                         0.4                                                                                                0.4

                 0.35                                                                                                        0.35
                                                                                                                                                                                                                                 0.3
                  0.3                                                                                                         0.3

                                                                                                                                                                                                                                 0.2
                 0.25                                                                                                        0.25


                  0.2                                                                                                         0.2                                                                                                0.1
                        1   2      3          4       5       6               7   8       9        10                               1   2     3         4       5       6               7   8       9    10                            1   2      3          4       5       6         7         8   9   10
                                                      Positions                                                                                                 Positions                                                                                            Positions


                    (a) NDCG performance on user 1                                                                              (b) NDCG performance on user 2                                                                     (c) NDCG performance on user 3

                                                                                               User 4 − withheld (trained on 3 users)                                                                    Leave User Out − Average
                                                                       0.65                                                                                                      0.65
                                                                                      Images                                                                                                    Images
                                                                        0.6           Eyes                                                                                        0.6           Eyes
                                                                                      T1                                                                                                        T1
                                                                       0.55           T2                                                                                                        T2
                                                                                                                                                                                 0.55

                                                                        0.5
                                                                                                                                                                                  0.5
                                                                       0.45
                                                                                                                                                                                 0.45
                                                               NDCGk




                                                                        0.4                                                                                              NDCGk

                                                                                                                                                                                  0.4
                                                                       0.35
                                                                                                                                                                                 0.35
                                                                        0.3

                                                                                                                                                                                  0.3
                                                                       0.25

                                                                        0.2                                                                                                      0.25


                                                                                                                                                                                  0.2
                                                                              1   2       3         4       5       6               7   8      9       10                               1   2       3    4      5       6              7   8       9        10
                                                                                                            Positions                                                                                           Positions


                                                                          (d) NDCG performance on user 4                                                                                (e) Average NDCG performance


Figure 5: In the following sub-figures 5(a)–5(d) we illustrate the NDCG performance in a leave-user-out (leave-page-out) routine. The
average NDCG performance is given in sub-figure 5(e) where we are able to observe that T 2 outperforms the ranking of only using image
features. The ‘Eyes’ plot in all the figures demonstrates how the ranking (only using eye-movements) would perform if eye-features were
indeed available a-priori for new images.


  Proceedings of the 23rd annual international ACM SIGIR con-                                                                                                               OYEKOYA , O., AND S TENTIFORD , F. 2007. Perceptual image re-
  ference on Research and development in information retrieval,                                                                                                               trieval using eye movements. International Journal of Computer
  ACM, New York, NY, USA, 41–48.                                                                                                                                              Mathematics 84, 9, 1379–1391.

J OACHIMS , T. 2002. Optimizing search engines using clickthrough                                                                                                           PASUPA , K., S AUNDERS , C., S ZEDMAK , S., K LAMI , A., K ASKI ,
   data. In KDD ’02: Proceedings of the 8th ACM SIGKDD inter-                                                                                                                 S., AND G UNN , S. 2009. Learning to rank images from eye
   national conference on Knowledge discovery and data mining,                                                                                                                movements. In HCI ’09: Proceeding of the IEEE 12th Inter-
   ACM Press, New York, NY, USA, 133–142.                                                                                                                                     national Conference on Computer Vision Workshops on Human-
                                                                                                                                                                              Computer Interaction, 2009–2016.
K LAMI , A., S AUNDERS , C., DE C AMPOS , T. E., AND K ASKI , S.                                                                                                                          ´
                                                                                                                                                                            P ULMANNOV A , S. 2004. Tensor products of hilbert space effect
   2008. Can relevance of images be inferred from eye movements?                                                                                                               algebras. Reports on Mathematical Physics 53(2), 301–316.
   In MIR ’08: Proceeding of the 1st ACM international confer-
   ence on Multimedia information retrieval, ACM, New York, NY,                                                                                                             Q IU , J., AND N OBLE , W. S. 2008. Predicting co-complexed pro-
   USA, 134–140.                                                                                                                                                               tein pairs from heterogeneous data. PLoS Computational Biol-
                                                                                                                                                                               ogy 4(4), e1000054.
L AAKSONEN , J., KOSKELA , M., L AAKSO , S., AND O JA ,                                                                                                                     R AYNER , K. 1998. Eye movements in reading and information
   E. 2000. Picsom–content-based image retrieval with self-                                                                                                                    processing: 20 years of research. Psychological Bulletin 124, 3
   organizing maps. Pattern Recognition Letter 21, 13–14, 1199–                                                                                                                (November), 372–422.
   1207.
                                                                                                                                                                                   ¨                   ¨
                                                                                                                                                                            S ALOJ ARVI , J., P UOLAM AKI , K., S IMOLA , J., KOVANEN , L.,
M ARTIN , S., ROE , D., AND FAULON , J.-L. 2005. Predicting                                                                                                                    KOJO , I., AND K ASKI , S. 2005. Inferring relevance from eye
  protein-protein interactions using signature products. Bioinfor-                                                                                                             movements: Feature extraction. Tech. Rep. A82, Computer and
  matics 21, 218–226.                                                                                                                                                          Information Science, Helsinki University of Technology.


                                                                                                                                                            298

More Related Content

PDF
Faro Visual Attention For Implicit Relevance Feedback In A Content Based Imag...
PDF
An Experimental Study into Objective Quality Assessment of Watermarked Images
PDF
Volume 2-issue-6-1960-1964
PDF
META-HEURISTICS BASED ARF OPTIMIZATION FOR IMAGE RETRIEVAL
PDF
Paper id 212014133
PDF
Zhang Eye Movement As An Interaction Mechanism For Relevance Feedback In A Co...
PDF
Rae
PDF
Q01754118128
Faro Visual Attention For Implicit Relevance Feedback In A Content Based Imag...
An Experimental Study into Objective Quality Assessment of Watermarked Images
Volume 2-issue-6-1960-1964
META-HEURISTICS BASED ARF OPTIMIZATION FOR IMAGE RETRIEVAL
Paper id 212014133
Zhang Eye Movement As An Interaction Mechanism For Relevance Feedback In A Co...
Rae
Q01754118128

What's hot (18)

PDF
26.motion and feature based person tracking
PDF
Bieg Eye And Pointer Coordination In Search And Selection Tasks
PDF
Hg3512751279
PDF
Discovering Anomalies Based on Saliency Detection and Segmentation in Surveil...
PDF
Novel Hybrid Approach to Visual Concept Detection Using Image Annotation
PDF
A NOVEL WEB IMAGE RE-RANKING APPROACH BASED ON QUERY SPECIFIC SEMANTIC SIGNAT...
PDF
Survey Paper on Image Denoising Using Spatial Statistic son Pixel
PDF
Effect of kernel size on Wiener and Gaussian image filtering
PDF
IRJET - Direct Me-Nevigation for Blind People
PDF
Deep learning for person re-identification
PDF
An Impact on Content Based Image Retrival A Perspective View
PDF
V2 i2087
PPTX
Assessing 3DTV QoE and beyond a look on testing methodologies
PDF
A review on Development of novel algorithm by combining Wavelet based Enhance...
PDF
VIDEO SEGMENTATION FOR MOVING OBJECT DETECTION USING LOCAL CHANGE & ENTROPY B...
PDF
20120140503009
PDF
MobiCom CHANTS
PDF
A Framework for Human Action Detection via Extraction of Multimodal Features
26.motion and feature based person tracking
Bieg Eye And Pointer Coordination In Search And Selection Tasks
Hg3512751279
Discovering Anomalies Based on Saliency Detection and Segmentation in Surveil...
Novel Hybrid Approach to Visual Concept Detection Using Image Annotation
A NOVEL WEB IMAGE RE-RANKING APPROACH BASED ON QUERY SPECIFIC SEMANTIC SIGNAT...
Survey Paper on Image Denoising Using Spatial Statistic son Pixel
Effect of kernel size on Wiener and Gaussian image filtering
IRJET - Direct Me-Nevigation for Blind People
Deep learning for person re-identification
An Impact on Content Based Image Retrival A Perspective View
V2 i2087
Assessing 3DTV QoE and beyond a look on testing methodologies
A review on Development of novel algorithm by combining Wavelet based Enhance...
VIDEO SEGMENTATION FOR MOVING OBJECT DETECTION USING LOCAL CHANGE & ENTROPY B...
20120140503009
MobiCom CHANTS
A Framework for Human Action Detection via Extraction of Multimodal Features
Ad

Viewers also liked (20)

RTF
Successful Music Video Research
PPTX
Galerija Magicus Dnevnik Esencija Do 21 3 2010 Ciklus Cernik I Madonin Sv...
PPTX
감사 보고서 시즌,
PPT
Sniffer1
PPT
מצגת סופית פרטיות בעולם האינטרנט ב
KEY
Evaluating Packages
PPT
Total
PPT
הורדיון ירושלים
KEY
English presentation
PDF
Software Packet Manager
PDF
Guestrin Listings And Donders Laws And The Estimation Of The Point Of Gaze
PPT
Web 2 0
PPT
Mitex Pine
PPTX
Taylor Swift - NMik
PPTX
PDF
Predstavljanje poslovanja - press konferencija 15.06.12
PPT
Hyves Cbw Mitex Harry Van Wouter
PDF
To Brooklyn and Back Viewer Discussion Guide
PPTX
Homophones Lesson
PPT
Measuring social media
Successful Music Video Research
Galerija Magicus Dnevnik Esencija Do 21 3 2010 Ciklus Cernik I Madonin Sv...
감사 보고서 시즌,
Sniffer1
מצגת סופית פרטיות בעולם האינטרנט ב
Evaluating Packages
Total
הורדיון ירושלים
English presentation
Software Packet Manager
Guestrin Listings And Donders Laws And The Estimation Of The Point Of Gaze
Web 2 0
Mitex Pine
Taylor Swift - NMik
Predstavljanje poslovanja - press konferencija 15.06.12
Hyves Cbw Mitex Harry Van Wouter
To Brooklyn and Back Viewer Discussion Guide
Homophones Lesson
Measuring social media
Ad

Similar to Hardoon Image Ranking With Implicit Feedback From Eye Movements (20)

PDF
Paper id 25201471
PDF
Nakayama Estimation Of Viewers Response For Contextual Understanding Of Tasks...
PDF
Volume 2-issue-6-1960-1964
PDF
Active reranking for web image search
PDF
Ryan Match Moving For Area Based Analysis Of Eye Movements In Natural Tasks
PDF
Hussain Learning Relevant Eye Movement Feature Spaces Across Users
PDF
Integrated Hidden Markov Model and Kalman Filter for Online Object Tracking
PDF
PDF
PDF
E03404025032
PDF
A Review on Matching For Sketch Technique
PDF
Image retrieval and re ranking techniques - a survey
PDF
Precision face image retrieval by extracting the face features and comparing ...
PDF
Design of an Effective Method for Image Retrieval
PDF
Stellmach Advanced Gaze Visualizations For Three Dimensional Virtual Environm...
PDF
Engelman.2011.exploring interaction modes for image retrieval
PDF
CrossScenarioTransferPersonReidentification_finalManuscript
PDF
Droege Pupil Center Detection In Low Resolution Images
DOCX
Assignment 2 Application Case 6-5 Efficient Image Recognition and Cate.docx
Paper id 25201471
Nakayama Estimation Of Viewers Response For Contextual Understanding Of Tasks...
Volume 2-issue-6-1960-1964
Active reranking for web image search
Ryan Match Moving For Area Based Analysis Of Eye Movements In Natural Tasks
Hussain Learning Relevant Eye Movement Feature Spaces Across Users
Integrated Hidden Markov Model and Kalman Filter for Online Object Tracking
E03404025032
A Review on Matching For Sketch Technique
Image retrieval and re ranking techniques - a survey
Precision face image retrieval by extracting the face features and comparing ...
Design of an Effective Method for Image Retrieval
Stellmach Advanced Gaze Visualizations For Three Dimensional Virtual Environm...
Engelman.2011.exploring interaction modes for image retrieval
CrossScenarioTransferPersonReidentification_finalManuscript
Droege Pupil Center Detection In Low Resolution Images
Assignment 2 Application Case 6-5 Efficient Image Recognition and Cate.docx

More from Kalle (20)

PDF
Blignaut Visual Span And Other Parameters For The Generation Of Heatmaps
PDF
Yamamoto Development Of Eye Tracking Pen Display Based On Stereo Bright Pupil...
PDF
Wastlund What You See Is Where You Go Testing A Gaze Driven Power Wheelchair ...
PDF
Vinnikov Contingency Evaluation Of Gaze Contingent Displays For Real Time Vis...
PDF
Urbina Pies With Ey Es The Limits Of Hierarchical Pie Menus In Gaze Control
PDF
Urbina Alternatives To Single Character Entry And Dwell Time Selection On Eye...
PDF
Tien Measuring Situation Awareness Of Surgeons In Laparoscopic Training
PDF
Takemura Estimating 3 D Point Of Regard And Visualizing Gaze Trajectories Und...
PDF
Stevenson Eye Tracking With The Adaptive Optics Scanning Laser Ophthalmoscope
PDF
Skovsgaard Small Target Selection With Gaze Alone
PDF
San Agustin Evaluation Of A Low Cost Open Source Gaze Tracker
PDF
Rosengrant Gaze Scribing In Physics Problem Solving
PDF
Qvarfordt Understanding The Benefits Of Gaze Enhanced Visual Search
PDF
Prats Interpretation Of Geometric Shapes An Eye Movement Study
PDF
Porta Ce Cursor A Contextual Eye Cursor For General Pointing In Windows Envir...
PDF
Pontillo Semanti Code Using Content Similarity And Database Driven Matching T...
PDF
Park Quantification Of Aesthetic Viewing Using Eye Tracking Technology The In...
PDF
Palinko Estimating Cognitive Load Using Remote Eye Tracking In A Driving Simu...
PDF
Nagamatsu User Calibration Free Gaze Tracking With Estimation Of The Horizont...
PDF
Nagamatsu Gaze Estimation Method Based On An Aspherical Model Of The Cornea S...
Blignaut Visual Span And Other Parameters For The Generation Of Heatmaps
Yamamoto Development Of Eye Tracking Pen Display Based On Stereo Bright Pupil...
Wastlund What You See Is Where You Go Testing A Gaze Driven Power Wheelchair ...
Vinnikov Contingency Evaluation Of Gaze Contingent Displays For Real Time Vis...
Urbina Pies With Ey Es The Limits Of Hierarchical Pie Menus In Gaze Control
Urbina Alternatives To Single Character Entry And Dwell Time Selection On Eye...
Tien Measuring Situation Awareness Of Surgeons In Laparoscopic Training
Takemura Estimating 3 D Point Of Regard And Visualizing Gaze Trajectories Und...
Stevenson Eye Tracking With The Adaptive Optics Scanning Laser Ophthalmoscope
Skovsgaard Small Target Selection With Gaze Alone
San Agustin Evaluation Of A Low Cost Open Source Gaze Tracker
Rosengrant Gaze Scribing In Physics Problem Solving
Qvarfordt Understanding The Benefits Of Gaze Enhanced Visual Search
Prats Interpretation Of Geometric Shapes An Eye Movement Study
Porta Ce Cursor A Contextual Eye Cursor For General Pointing In Windows Envir...
Pontillo Semanti Code Using Content Similarity And Database Driven Matching T...
Park Quantification Of Aesthetic Viewing Using Eye Tracking Technology The In...
Palinko Estimating Cognitive Load Using Remote Eye Tracking In A Driving Simu...
Nagamatsu User Calibration Free Gaze Tracking With Estimation Of The Horizont...
Nagamatsu Gaze Estimation Method Based On An Aspherical Model Of The Cornea S...

Hardoon Image Ranking With Implicit Feedback From Eye Movements

  • 1. Image Ranking with Implicit Feedback from Eye Movements David R. Hardoon∗ Kitsuchart Pasupa† Data Mining Department School of Electronics & Computer Science Institute for Infocomm Research (I2 R) University of Southampton Abstract dure. Moreover, it is far from an ideal situation as both formulating an initial query and navigating the large number of retrieved hits In order to help users navigate an image search system, one could is a difficult. One image retrieval methodology which attempts to provide explicit information on a small set of images as to which address these issues, and has been a research topic since the early of them are relevant or not to their task. These rankings are learned 1990’s, is the so-called “Content-Based Image Retrieval”(CBIR). in order to present a user with a new set of images that are rel- The search of a CBIR system is analysed from the actual content evant to their task. Requiring such explicit information may not of the image which may includes colour, shape, and texture rather be feasible in a number of cases, we consider the setting where than using a textual annotation associated (if at all) with the image. the user provides implicit feedback, eye movements, to assist when performing such a task. This paper explores the idea of implic- Relevance feedback, which is explicitly provided by the user while itly incorporating eye movement features in an image ranking task performing a search query on the quality of the retrieved images, where only images are available during testing. Previous work had has shown to be able to improve on the performance of CBIR sys- demonstrated that combining eye movement and image features im- tems, as it is able to handle the large variability in semantic in- proved on the retrieval accuracy when compared to using each of terpretation of images across users. Relevance feedback will iter- the sources independently. Despite these encouraging results the atively guide the system to retrieve images the user is genuinely proposed approach is unrealistic as no eye movements will be pre- interested in. Many systems rely on an explicit feedback mech- sented a-priori for new images (i.e. only after the ranked images are anism, where the user explicitly indicates which images are rele- presented would one be able to measure a user’s eye movements vant for their search query and which ones are not. One can then on them). We propose a novel search methodology which com- use a machine learning algorithm to try and present a new set of bines image features together with implicit feedback from users’ images to the users which are more relevant - thus helping them eye movements in a tensor ranking Support Vector Machine and navigate the large number of hits. An example of such systems show that it is possible to extract the individual source-specific is PicSOM [Laaksonen et al. 2000]. However, providing explicit weight vectors. Furthermore, we demonstrate that the decomposed feedback is also a laborious process as it requires continues user re- image weight vector is able to construct a new image-based seman- sponse. Alternatively, it is possible to use implicit feedback to infer tic space that outperforms the retrieval accuracy than when solely relevance of images. Examples of implicit feedback are eye move- using the image-features. ments, mouse pointer movements, blood pressure, gestures, etc. In other words, user responses that are implicitly related to the task performed. CR Categories: G.3 [Probability and Statistics]: Multivariate Statistics—; H.3.3 [Information Search and Retrieval]: Retrieval In this study we explore the use of eye movements as a particu- models—Relevance feedback lar source of implicit feedback to assist a user when performing such a task (i.e. image retrieval). Eye movements can be treated Keywords: Image Retrieval, Implicit Feedback, Tensor, Ranking, as an implicit relevance feedback when the user is not consciously Support Vector Machine aware of their eye movements being tracked. Eye movement as implicit feedback has recently been used in the image retrieval set- ting [Oyekoya and Stentiford 2007; Klami et al. 2008; Pasupa et al. 1 Introduction 2009]. [Oyekoya and Stentiford 2007; Klami et al. 2008] used eye movements to infer a binary judgement of relevance while [Pasupa In recent years large digital image collections have been created in et al. 2009] makes the task more complex and realistic for search- numerous areas, examples of these include the commercial, aca- based task by asking the user to give multiple judgement of rele- demic, and medical domains. Furthermore, these databases also in- vance. Furthermore, earlier studies of Hardoon et al. [2007] and clude the digitisation of analogue photographs, paintings and draw- Ajanki et al. [2009] explored the problem of where an implicit in- ings. Conventionally, the images collected are manually tagged formation retrieval query is inferred from eye movements measured with various descriptors to allow retrieval to be performed over the during a reading task. The result of their empirical study is that it annotated words. However, the process of manually tagging images is possible to learn the implicit query from a small set of read doc- is an extremely laborious, time consuming and an expensive proce- uments, such that relevance predictions for a large set of unseen ∗ e-mail: documents are ranked better than by random guessing. More re- drhardoon@i2r.a-star.edu.sg † e-mail: cently, Pasupa et al. [2009] demonstrated that ranking of images kp2@ecs.soton.ac.uk can be inferred from eye movements using Ranking Support Vector Machine (Ranking SVM). Their experiment shows that the perfor- mance of the search can be improved when simple images features namely histograms are fused with the eye movement features. Copyright © 2010 by the Association for Computing Machinery, Inc. Despite Pasupa et al.’s [2009] encouraging results, their proposed Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed approach is largely unrealistic as they combine image and eye fea- for commercial advantage and that copies bear this notice and the full citation on the tures for both training and testing. Whereas in a real scenario no first page. Copyrights for components of this work owned by others than ACM must be eye movements will be presented a-priori for new images. In other honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on words, only after the ranked images are presented to a user, would servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail one be able to measure the users’ eye movements on them. There- permissions@acm.org. fore, we propose a novel search methodology which combines im- ETRA 2010, Austin, TX, March 22 – 24, 2010. © 2010 ACM 978-1-60558-994-7/10/0003 $10.00 291
  • 2. age features together with implicit feedback from users’ eye move- with label ck and m being the total number of new samples. Fi- ments during training, such that we are able to rank new images nally, we quote from Cristianini and Shawe-Taylor [2000] the gen- with only using image features. We believe it is indeed more real- eral dual SVM optimisation as istic to have images and eye-movements during the training phase m m as these could be acquired deliberately to train up such a system. 1 max W (α) = αi − αi αj ci cj κ(xi , xj ) (3) α 2 For this purpose, we propose using tensor kernels in the ranking i=1 i,j=1 SVM framework. Tensors have been used in the machine learning m literature as a means of predicting edges in a protein interaction or subject to i=1 αi ci = 0 and αi ≥ 0 i = 1, . . . , m, co-complex network by using the tensor product transformation to where we again use ci to represent the label and κ(xi , xj ) to be the derive a kernel on protein pairs from a kernel on individual pro- kernel function between φ(x)i and φ(x)j , where φ(·) is a mapping teins [Ben-Hur and Noble 2005; Martin et al. 2005; Qiu and No- from X (or Y ) to an (inner product) feature space F . ble 2008]. In this study we use the tensor product to constructed a joined semantics space by combining eye movements and im- age features. Furthermore, we continue to show that the combined 3 Tensor Ranking SVM learnt semantic space can be efficiently decomposed into its con- tributing sources (i.e. images and eye movements), which in turn In the following section we propose to construct a tensor kernel can be used independently. on the ranked image and eye movements features, i.e. following equation (1), to then to train an SVM. Therefore, let X ∈ Rn×m The paper is organised as follows. In Section 2 we give a brief intro- and Y ∈ Rℓ×m be the matrix of sample vectors, x and y, for the duction to the ranking SVM methodology and continue to develop image and eye movements respectively, where n is the number of in Section 3 our proposed tensor ranking SVM and the efficient de- image features and ℓ is the number of eye movement features and composition of the joint semantic space into the individual sources. m are the total number of samples. We continue to define K x , K y In Section 5 we give our experimental set up whereas in Section 6 as the kernel matrices for the ranked images and eye movements we discusses the feature extraction and representation of the images respectively. In our experiments we use linear kernels, i.e. K x = and eye movements. In section 7 we bring forward our experiments X ′ X and K y = Y ′ Y . The resulting kernel matrix of the tensor on page ranking for individual users as well as a feasibility study on T = X ◦Y can be expressed as pair-wise product (see [Pulmannov´ a user generalisation. Finally, we conclude our study with discussion 2004] for more details) on our present methodology and results in Section 8. ¯ Kij = (T ′ T )ij = Kij Kij . x y 2 Ranking SVM ¯ We use K in conjunction with the vanilla SVM formulation as given in equation (3). Whereas the set up and training are straight The Ranking Support Vector Machine (SVM) was proposed by forward, the underlying problem is that for testing we do not have Joachims [2002] which was adapted from ordinal regression [Her- the eye movements. Therefore we propose to decompose the re- brich et al. 2000]. It is a pair-wise approach where the solution is sulting weight matrix from its corresponding image and eye com- a binary classification problem. Let xi denote some feature vector ponents such that each can be used independently. and let ri denote the ranking assigned to xi . If r1 ≻ r2 , it means that x1 is more relevance than x2 . Consider a linear ranking func- The goal is to decompose the weight matrix W given by a dual tion, representation xi ≻ xj ⇐⇒ w, xi − w, xj > 0, m where w is a weight vector and ·, · denotes dot product between W = αi ci φx (xi ) ◦ φy (yi ) i vectors. This can be placed in a binary SVM classification frame- work where let ck be the new label indicating the quality of kth without accessing the feature space. Given the paired samples x, y rank pair, the decision function in equation is w, xi − xj = ck = +1 if ri ≻ rj , (1) f (x, y) = W ◦ φx (x)φy (y)′ (4) ck = −1 if rj ≻ ri m = αi ci κx (xi , x)κy (yi , y). which can be solved by the following optimisation problem, i=1 1 min w, w + C ξk (2) 4 Decomposition 2 k The resulting decision function in equation (4) requires both image subject to the following constrains: and eye movement (xi , yi ) data for training and testing. We want to be able to test our model only using the image data. Therefore, ∀(i, j) ∈ r (k) : ck ( w, xi − xj + b) ≥ 1 − ξk we want to decompose the weight matrix (again without accessing ∀(k) : ξk ≥ 0 the feature space) into a sum of tensor products of corresponding weight components for the images and eye movements where r (k) = [r1 , r2 , . . . , rt ] for t rank values, furthermore C is T a hyper-parameter which allows trade-off between margin size and ′ W ≈ WT = t t wx wy , (5) training error, and ξk is training error. Alternatively, we are repre- t=1 sent the ranking SVM as a vanilla SVM where we re-represent our samples as t such that the weights are a linear combination of the data, i.e. wx = m φ(x)k = xi − xj i=1 βi φx (xi ) and wy = m γi φy (yi ) where β t , γ t are the t t i=1 t 292
  • 3. t t dual variables of wx , wy . We proceed to define our decomposition 5 Experimental Setup procedure such that we do not need to compute the (potentially non- linear) feature projection φ. We compute Our experimental set-up is as follows: Users are shown 10 images m on a single page as a five by two (5x2) grid and are asked to rank WW′ = αi αj ci cj κy (yi , yj )φx (xi )φx (xj )′ (6) the top five images in order of relevance to the topic of “Transport”. i,j This concept is deliberately slightly ambiguous given the context of images that were displayed. Each displayed page contained 1–3 and are able to express K y = (κy (yi , yj ))m = clearly relevant images (e.g. a freight train, cargo ship or airliner), i,j=1 K k′ 2–3 either borderline or marginally relevant images (e.g. bicycle or k=1 λk uk u = U ΛU ′ , where U = (u1 , . . . , uK ) by perform- baby carrier), and the rest are non-relevant images (e.g. images of ing an eigenvalue decomposition of the kernel matrix K y with en- people sitting at a dining room table, or a picture of a cat). y tries Kij = κy (yi , yj ). Substituting back into equation (6) gives The experiment had 30 pages in total, each showing 10 images from K m the PASCAL Visual Objects Challenge 2007 database [Everingham ′ ′ WW = λk αi αj ci cj uk uk φx (xi )φx (xj )′ . i j et al. ]. The interface consisted of selecting radio buttons (labelled k i,j 1st to 5th under each image) then clicking on next to retrieve the next page. This represents data for a ranking task where explicit m k ′ Letting hk = i=1 αi ci ui φx (xi ) we have W W = ranks are given to compliment any implicit information contained in K √ √ ′ ′ λk hk hk = HH where H = λ1 h1 , . . . , λK hK . We the eye movements. An example of each page is shown in figure 1. k would like to find the singular value decomposition of H = V ΥZ ′ . Consider for A = diag(α) and C = diag(c) we have The experiment was performed by six different users, with their eye movements recorded by a Tobii X120 eye tracker which was con- nected to a PC using a 19-inch monitor (resolution of 1280x1024). H ′H kℓ = λk λℓ αi αj ci cj uk uℓ κx (xi , xj ) i j The eye tracker has approximately 0.5 degrees of accuracy with a ij sample rate of 120 Hz and used infrared lens to detect pupil centres 1 ′ 1 = CAU Λ 2 K x CAU Λ 2 , and corneal reflection. The final data collected per user is illus- kℓ trated in table 1. Any pages that contained less than five images with gaze points (for example due to the subject moving and the which is computable without accessing the feature space. Perform- eye-tracker temporarily losing track of the subject’s eyes) were dis- ing an eigenvalue decomposition on H ′ H we have carded. Hence, only 29 and 20 pages were valid for users 4 and 5, respectively. H ′H = ZΥV ′ V ΥZ ′ = ZΥ2 Z ′ (7) with Υ a matrix with υt on the diagonal truncated after the J’th Table 1: The data collected per user. ∗ Pages with less than five 1 eigenvalue, which gives the dual representation of vt = υt Hzt for images with gaze points were removed. Therefore users 4 and 5 ′ 2 only have 29 and 20 pages viewed respectively. t = 1, . . . , T , and since H Hzt = υt zt we are able to verify that 1 User # Pages Viewed 2 W W ′ vt = HH ′ vt = HH ′ Hzt = υt Hzt = υt vt . User 1 30 υt User 2 30 User 3 30 Restricting to the first T singular vectors allows us to express W ≈ User 4∗ 29 ′ W T = T vt (W ′ vt ) , which in turn results in User 5∗ 20 t=1 User 6 30 m t 1 t wx = vt = Hzt = βi φx (xi ), υt i=1 5.1 Performance Measure t 1 T √ where βi = αc υt i i k=1 λk zt uk k i . We can now also express We use the Normalised Discount Cumulative Gain m (NDCG) [J¨ rvelin and Kek¨ l¨ inen 2000] as our performance a aa 1 ′ metric, due to our task involving multiple ranks rather than a binary wy = W ′ vt = t W Hzt = t γi φy (yi ), υt i=1 choice. NDCG measures the usefulness, or gain, of a retrieved item based on its position in the result list. NDCG is designed for where γi = m αi ci βj κx (xi , xj ) are the dual variables of wy . t t t tasks which have more than two levels of relevance judgement, and j=1 We are therefore now able to decompose W into Wx , Wy without is defined as, accessing the feature space giving us the desired result. k 1 We are now able to compute, for a given t, the ranking scores NDCGk (r) = D(ri )ϕ(gi ) t ˆ Nn in the linear discriminant analysis form s = wx φ(X) = i=1 m t ˆ ˆ βi κx (xi , X) for new test images X. These are in turn sorted i=1 1 in order of magnitude (importance). Equally, we can project our with D(r) = log (1+r) and ϕ(g) = 2g − 1, where for a given 2 data into the new defined semantic space β where we train and test page r is rank position and k is a truncation level (position), N is a ˜ an SVM. i.e. we compute φ(x) = K x β, for the training samples, normalising constant which gives the perfect ranking (based on gi ) ˜ x and φ(xt ) = Kt β for our test samples. We explore both these equal to one, and gi is the categorical grade; e.g. grade is equal to 5 approaches in our experiments. for the 1st rank and 0 for the 6th . 293
  • 4. Figure 1: An example illustrating the outlay of the interface displaying the 10 images with the overlaid eye movement measurements. The circles indicate fixations. 6 Feature extraction 0.65 Random Performance 0.6 0.55 0.5 In the following experiments we use standard image histograms and 0.45 NDCGk features collected from eye-tracking. We compute a 256-bin grey 0.4 scale histogram on the whole image as the feature representation. 0.35 0.3 These features are intentionally kept relatively simple. Although, 0.25 a possible extension of the current representation is to segment the 0.2 image and only use regions that have gaze information. We intend 1 2 3 4 5 6 7 8 9 10 Positions to explore this extension in a future study. Figure 2: NDCG performance for predicting random rankings. The eye movement features are computed using only on the eye trajectory and locations of the images in the page. This type of features are general-purpose and are easily applicable to all appli- cation scenarios. The features are divided into two categories; the mended for media with mixed content1 . first category uses the raw measurements obtained from the eye- Some of the features are not invariant to the location of the image tracker, whereas the second category is based on fixations estimated on the screen. For example, the typical pattern of moving from from the raw data. A fixation means a period in which a user main- left to right means that the horizontal co-ordinate of the first fixa- tains their gaze around a given point. These are important as most tion for the left-most image of each row typically differs from the visual processing happens during fixations, due to blur and sac- corresponding measure on the other images. Features that were ob- cadic suppression during the rapid saccades between fixations (see, served to be position-dependent were normalised by removing the e.g. [Hammoud 2008]). Often visual attention features are based mean of all observations sharing the same position, and are marked solely on fixations and the relation between them [Rayner 1998]. in Table 2. Finally, each feature was normalised to have unit vari- However, raw measurement data might be able to overcome possi- ance and zero mean. ble problems caused by imperfect fixation detection. 7 Experiments In table 2 we list the candidate features we have considered. Most of the listed features are motivated by earlier studies in text retrieval We evaluate two different scenarios for learning the ranking of im- [Saloj¨ rvi et al. 2005]. The features cover the three main types a age based on image and eye features; 1. Predicting rankings on a of information typically considered in reading studies: fixations, page given only other data from a single specific user. 2. A global regressions (fixations to previously seen images), and re-fixations model using data from other users to predict rankings for a new (multiple fixations within the same image). However, the features unseen user. have been tailored to be more suitable for images, trying to include measures for things that are not relevant for text, such as the cover We compare our proposed tensor Ranking SVM algorithm which of the image. Similarly to the image features, the eye movement combines both information from eye movements and image his- features are intentionally kept relatively simple with the intent that togram features to a Ranking SVM using histogram features and they are more likely to generalise over different users. Fixations to a Ranking SVM using eye movements alone. We emphasis that were detected using the standard ClearView fixation filter provided with the Tobii eye-tracking software, with settings “radius 30 pix- 1 Tobii Technology, Ltd. Tobii Studio Help. url: els, minimum duration 100 ms”. These are also the settings recom- http://studiohelp.tobii.com/StudioHelp 1.2/ 294
  • 5. Table 2: We list the eye movement features considered in this study. The first 16 features are computed from the raw data, whereas the remainder are based on pre-detected fixations. We point out to the reader that features number 2 and 3 use both types of data since they are based on raw measurements not belonging to fixations. All the features are computed separately for each image. Features marked with ∗ are normalised for each image location. Number Name Description Raw data features 1 numMeasurements total number of measurements 2 numOutsideFix total number of measurements outside fixations 3 ratioInsideOutside percentage of measurements inside/outside fixations 4 xSpread difference between largest and smallest x-coordinate 5 ySpread difference between largest and smallest y-coordinate 6 elongation ySpread/xSpread 7 speed average distance between two consecutive measurements 8 coverage number of subimages covered by measurements1 9 normCoverage coverage normalized by numMeasurements 10∗ landX x-coordinate of the first measurement 11∗ landY y-coordinate of the first measurement 12∗ exitX x-coordinate of the last measurement 13∗ exitY y-coordinate of the last measurement 14 pupil maximal pupil diameter during viewing 15∗ nJumps1 number of breaks longer than 60 ms2 16∗ nJumps2 number of breaks longer than 600 ms2 Fixation features 17 numFix total number of fixations 18 meanFixLen mean length of fixations 19 totalFixLen total length of fixations 20 fixPrct percentage of time spent in fixations 21∗ nJumpsFix number of re-visits to the image 22 maxAngle maximal angle between two consecutive saccades3 23∗ landXFix x-coordinate of the first fixation 24∗ landYFix y-coordinate of the first fixation 25∗ exitXFix x-coordinate of the last fixation 26∗ exitYFix y-coordinate of the last fixation 27 xSpreadFix difference between largest and smallest x-coordinate 28 ySpreadFix difference between largest and smallest y-coordinate 29 elongationFix ySpreadFix/xSpreadFix 30 firstFixLen length of the first fixation 31 firstFixNum number of fixations during the first visit 32 distPrev distance to the fixation before the first 33 durPrev duration of the fixation before the first 1 The image was divided into a regular grid of 4x4 subimages. 2 A sequence of measurements outside the image occurring between two consecutive measure- ments within the image. 3 A transition from one fixation to another. training and testing a model using only eye movements is not re- remaining pages, from the same user, are used for training. alistic as there are no eye movements presented a-priori for new images, i.e. one can not test. This comparison provides us with We evaluate the proposed approach with the following four setting: a baseline as to how much it may be possible to improve on the • T 1: using the largest component of tensor decomposition in performance using eye movements. Furthermore, we are unable to the form of a linear discriminator. We use the weight vec- make direct comparison to [2009] as they had used an online learn- tor corresponding to the largest eigenvalue (as we have a t ing algorithm with different image features. weights). In the experiments we use a linear kernel function. Although, it is • T 2: we project the image features into the learnt semantic possible to use a non-linear kernel on the eye movement features as space (i.e. the decomposition on the image source) and train this would not effect the decomposition for the image weights (as- and test within the projected space a secondary Ranking SVM suming that φx (xi ) are taken as the image features in equation (6)). (Ranking SVM). In figure 2 we give the NDCG performance for predicting random ranking. • T 1all : similar to T 1 although here we use all t weight vectors and take the mean value across as the final score. 7.1 Page Generalisation • T 1opt : similar to T 1 although here we use the n-largest com- ponents of the decomposition. i.e. we select n weight vectors to use and take the mean value across as the final score. In the following section we focus on predicting rankings on a page given only other data from a single specific user (we repeat this We use a leave-one-out cross-validation for T 1opt to obtain the op- for all users). We employ a leave-page-out routine where at each timal model for the later case which are selected based on maximum iteration a page, from a given user, is withheld for testing and the average NDCG across 10 positions. 295
  • 6. Average 0.7 • T 2: we project the image features into the learnt seman- tic space (i.e. the decomposition on the image source) and 0.6 train and test within the projected space a secondary Ranking SVM. 0.5 We plot in figure 5 the resulting NDCG performance for the leave- NDCGk 0.4 user-out routine. We are able to observe, with the exclusion of user 2 in figure 5(b), that T 2 is able to outperform the Ranking SVM 0.3 Images on image features. Indicating that it is possible to generalise our Eyes proposed approach across new unseen users. Furthermore, it is in- T1 0.2 T2 teresting to observe that T 2 achieves a similar performance to that T1 all of a Ranking SVM trained and tested on the eye features. Finally, opt T1 0.1 even though we do not improve when testing on data from user 2, 1 2 3 4 5 6 7 8 9 10 we are able to observe that we perform as-good-as the baselines. In Positions figure 5(e) we plot the average NDCG performance on the leave- user-out routine, demonstrating that on average we improve on the Figure 4: Average NDCG performance across all users for predict- ranking of new images for new users and that we perform better ing rankings on a page given only other data from a single specific than random (figure 2). user. 8 Discussion We plot the user specific leave-page-out NDCG performances in figure 3 where we are able to observe that T 2 consistently outper- Improving search and content based retrieval systems with implicit forms the image feature Ranking SVM across all users, demonstrat- feedback is an attractive possibility given that a user is not required ing that it is indeed possible to improve on the image ranking with to explicitly provide information to then improve, and personalise, the incorporation of eye movement features during training. Fur- their search strategy. This, in turn, can render such a system more thermore, it is interesting to observe that for certain users T 1opt user-friendly and simple to use (at least from the users’ perspec- improves on the ranking performance, suggesting that there is an tive). Although, achieving such a goal is non-trivial as one needs optimal combination of the decomposed features that may further to be able to combine the implicit feedback information into the improve on the results. search system in a manner that does not then require the implicit information for testing. In our study we focus on implicit feedback In figure 4 we plot the average performance across all users. The in the form of eye movements, as these are easily available and can figure shows that T 1 and T 1all are slightly worse than using image be measured in a non-intrusive manner. histogram alone. However, when selecting using cross-validation the number of largest components in tensor decomposition, the per- Previous studies [Hardoon et al. 2007; Ajanki et al. 2009] have formance of the classifier is improved and outperforms the Ranking shown the feasibility of such systems using eye moments for a tex- SVM with eye movements. Furthermore, we are able to observe tual search task. Demonstrating that it is indeed possible to ‘en- that we perform better than random (figure 2). Using classifier T 2, rich’ a textual search with eye features. Their proposed approach the performance is improved above the Ranking SVM with image is computationally complex since it requires the construction of a features and it is competitive with Ranking SVM with eye move- regression function on eye measurements on each word. This was ments features. not realistic in our setting. Furthermore, Pasupa et al. [2009] had extend the underlying methodology of using eye movement as im- 7.2 User Generalisation plicit feedback to an image retrieval system, combining eye move- ments with image features to improve the ranking of retrieved im- In the following section we focus on learning a global model using ages. Although, still, the proposed approach required eye features data from other users to predict rankings for a new unseen user. for the test images which would not be practical in a real system. Although, as the experiment is set up such that all users view the In this paper we present a novel search strategy for combining eye same pages, we employ a leave-user-leave-page-out routine, i.e; movements and image features with a tensor product kernel used in For all users a ranking support vector machine framework. We continue to show Withhold data from user i that the joint learnt semantic space of eye and image features can For all pages be efficiently decomposed into its independent sources allowing us Withhold page j from all users to further test or train only using images. We explored two different Train on all pages-j from all users - i search scenarios for learning the ranking of images based on image and eye features. The first was predicting ranking on a page given Test on page j from user i only other data from a single specific user. This experiment was Endfor to test the fundamental question of whether eye movement are able Endfor to improve ranking for a user. Demonstrating that it was indeed possible to improve in the single subject setting, we then proceeded Therefore we only use the users from table 1 who viewed the same to our second setting where we constructed a global model across number of pages, i.e. users 1, 2, 3 and 6, which we refer to hence- users in attempt to generalise on data from a new user. Again our forth as users 1-4. results demonstrated that we are able to generalise out model to new We evaluate the proposed approach with the following two setting: users. Despite these promising results, it was also clear that using a single direction (weight vector) does not necessarily improve on • T 1: using the largest component of tensor decomposition in the baseline result. Motivating the need for a more sophisticated the form of a linear discriminator. We use the weight vec- combination of the resulting weights. This, as well as extending tor corresponding to the largest eigenvalue (as we have a t our experiment to a much larger number of users, will be addressed weights). in a future study. Finally, we would also explore the notion of image 296
  • 7. User 1 User 2 User 3 0.7 0.7 0.7 0.6 0.6 0.6 0.5 0.5 0.5 0.4 0.4 0.4 NDCGk NDCGk NDCGk 0.3 0.3 0.3 Images Images Images 0.2 Eyes 0.2 Eyes 0.2 Eyes T1 T1 T1 T2 T2 T2 0.1 all 0.1 all 0.1 all T1 T1 T1 opt opt opt T1 T1 T1 0 0 0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 Positions Positions Positions (a) NDCG performance within user 1 (b) NDCG performance within user 2 (c) NDCG performance within user 3 User 4 User 5 User 6 0.7 0.7 0.7 0.6 0.6 0.6 0.5 0.5 0.5 0.4 0.4 0.4 NDCGk NDCGk NDCGk 0.3 0.3 0.3 Images Images Images 0.2 Eyes 0.2 Eyes 0.2 Eyes T1 T1 T1 T2 T2 T2 0.1 all 0.1 all 0.1 all T1 T1 T1 opt opt opt T1 T1 T1 0 0 0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 Positions Positions Positions (d) NDCG performance within user 4 (e) NDCG performance within user 5 (f) NDCG performance within user 6 Figure 3: In the following sub-figures 3(a)-3(f) we illustrate the NDCG performance for each user in a leave-page-out routine, i.e. here we aim to generalise over new pages rather new users. We are able to observe that T 2 and T 1opt routinely outperform the ranking with only using image features. The ‘Eyes’ plot in all the figures demonstrates how the ranking (only using eye-movements) would perform if eye-features were indeed available a-priori for new images. segmentation and the use of more sophisticated image features that C RISTIANINI , N., AND S HAWE -TAYLOR , J. 2000. An Introduc- are easily computable. tion to Support Vector Machines and other kernel-based learning methods. Cambridge University Press. Acknowledgements E VERINGHAM , M., VAN G OOL , L., W ILLIAMS , C. K. I., W INN , J., AND Z ISSERMAN , A. The PASCAL Visual Object The authors would like to acknowledge financial support from Classes Challenge 2007 (VOC2007) Results. http://www.pascal- the European Community’s Seventh Framework Programme network.org/challenges/VOC/voc2007/workshop/index.html. (FP7/2007–2013) under grant agreement n◦ 216529, Personal In- formation Navigator Adapting Through Viewing (PinView) project H AMMOUD , R. 2008. Passive Eye Monitoring: Algorithms, Appli- (http://www.pinview.eu). The authors would also like to thank cations and Experiments. Springer-Verlag. Craig Saunders for data collection. ¨ H ARDOON , D. R., A JANKI , A., P UOLAM AKI , K., S HAWE - TAYLOR , J., AND K ASKI , S. 2007. Information retrieval References by inferring implicit queries from eye movements. In Pro- ceedings of the 11th International Conference on Artificial In- ¨ telligence and Statistics (AISTATS), Electronic proceedings at A JANKI , A., H ARDOON , D. R., K ASKI , S., P UOLAM AKI , K., www.stat.umn.edu/ aistat/proceedings/start.htm. AND S HAWE -TAYLOR , J. 2009. Can eyes reveal interest? implicit queries from gaze patterns. User Modeling and User- H ERBRICH , R., G RAEPEL , T., AND O BERMAYER , K. 2000. Adapted Interaction 19, 4, 307–339. Large margin rank boundaries for ordinal regression. MIT Press, Cambridge, MA. B EN -H UR , A., AND N OBLE , W. S. 2005. Kernel methods for predicting protein-protein interactions. Bioinformatics 21, i38– ¨ ¨ ¨ J ARVELIN , K., AND K EK AL AINEN , J. 2000. IR evaluation meth- i46. ods for retrieving highly relevant documents. In SIGIR ’00: 297
  • 8. User 1 − withheld (trained on 3 users) User 2 − withheld (trained on 3 users) User 3 − withheld (trained on 3 users) 0.65 0.65 0.8 Images Images 0.6 0.6 Images Eyes 0.7 Eyes Eyes T1 T1 T1 0.55 T2 0.55 T2 T2 0.6 0.5 0.5 0.45 0.45 0.5 k NDCGk NDCGk NDCG 0.4 0.4 0.4 0.35 0.35 0.3 0.3 0.3 0.2 0.25 0.25 0.2 0.2 0.1 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 Positions Positions Positions (a) NDCG performance on user 1 (b) NDCG performance on user 2 (c) NDCG performance on user 3 User 4 − withheld (trained on 3 users) Leave User Out − Average 0.65 0.65 Images Images 0.6 Eyes 0.6 Eyes T1 T1 0.55 T2 T2 0.55 0.5 0.5 0.45 0.45 NDCGk 0.4 NDCGk 0.4 0.35 0.35 0.3 0.3 0.25 0.2 0.25 0.2 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 Positions Positions (d) NDCG performance on user 4 (e) Average NDCG performance Figure 5: In the following sub-figures 5(a)–5(d) we illustrate the NDCG performance in a leave-user-out (leave-page-out) routine. The average NDCG performance is given in sub-figure 5(e) where we are able to observe that T 2 outperforms the ranking of only using image features. The ‘Eyes’ plot in all the figures demonstrates how the ranking (only using eye-movements) would perform if eye-features were indeed available a-priori for new images. Proceedings of the 23rd annual international ACM SIGIR con- OYEKOYA , O., AND S TENTIFORD , F. 2007. Perceptual image re- ference on Research and development in information retrieval, trieval using eye movements. International Journal of Computer ACM, New York, NY, USA, 41–48. Mathematics 84, 9, 1379–1391. J OACHIMS , T. 2002. Optimizing search engines using clickthrough PASUPA , K., S AUNDERS , C., S ZEDMAK , S., K LAMI , A., K ASKI , data. In KDD ’02: Proceedings of the 8th ACM SIGKDD inter- S., AND G UNN , S. 2009. Learning to rank images from eye national conference on Knowledge discovery and data mining, movements. In HCI ’09: Proceeding of the IEEE 12th Inter- ACM Press, New York, NY, USA, 133–142. national Conference on Computer Vision Workshops on Human- Computer Interaction, 2009–2016. K LAMI , A., S AUNDERS , C., DE C AMPOS , T. E., AND K ASKI , S. ´ P ULMANNOV A , S. 2004. Tensor products of hilbert space effect 2008. Can relevance of images be inferred from eye movements? algebras. Reports on Mathematical Physics 53(2), 301–316. In MIR ’08: Proceeding of the 1st ACM international confer- ence on Multimedia information retrieval, ACM, New York, NY, Q IU , J., AND N OBLE , W. S. 2008. Predicting co-complexed pro- USA, 134–140. tein pairs from heterogeneous data. PLoS Computational Biol- ogy 4(4), e1000054. L AAKSONEN , J., KOSKELA , M., L AAKSO , S., AND O JA , R AYNER , K. 1998. Eye movements in reading and information E. 2000. Picsom–content-based image retrieval with self- processing: 20 years of research. Psychological Bulletin 124, 3 organizing maps. Pattern Recognition Letter 21, 13–14, 1199– (November), 372–422. 1207. ¨ ¨ S ALOJ ARVI , J., P UOLAM AKI , K., S IMOLA , J., KOVANEN , L., M ARTIN , S., ROE , D., AND FAULON , J.-L. 2005. Predicting KOJO , I., AND K ASKI , S. 2005. Inferring relevance from eye protein-protein interactions using signature products. Bioinfor- movements: Feature extraction. Tech. Rep. A82, Computer and matics 21, 218–226. Information Science, Helsinki University of Technology. 298