Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags
Tina Walber, Ansgar Scherp, Steffen Staab
University of Koblenz-Landau, Koblenz, Germany

Multimedia Modeling Conference
Klagenfurt, Austria
January 4-6, 2012
Motivation: Image Tagging
[Example image annotated with tags: tree, girl, car, store, people, sidewalk]

 Find specific objects in images
 Analyzing the user's gaze path only
Research Questions


1. Which fixation measure best finds the correct image region for a given tag?

2. Can we differentiate two regions in the same image?


3 Steps Conducted by Users




[Screenshots of the three experiment steps]
 Look at red blinking dot
 View the image and the provided tag
 Decide whether tag can be seen (“y” or “n”)
Dataset
 LabelMe community images
   Manually drawn polygons
   Regions annotated with tags
 182.657 images (August 2010)



 High-quality segmentation and annotation
 Used as ground truth

Experiment Images and Tags
 Randomly selected 51 images
 Contain at least two tagged regions

 Created two tag sets for the 51 images
 Each image is assigned two tags (one per set)

 Tags are either “true” or “false”
   “true” → object described by the tag can be seen
   “false” → object cannot be seen in the image
 False tags keep subjects concentrated during the experiment
Subjects & Experiment System
 20 subjects
   16 male, 4 female (age: 23-40, Ø=29.6)
   Undergraduates (6), PhD students (12), office clerks (2)


 Experiment system
    Simple web page in Internet Explorer
    Standard notebook, resolution 1680×1050
    Tobii X60 eye-tracker (60 Hz, 0.5° accuracy)

Conducting the Experiment
 Each user looked at 51 tag-image-pairs
 First tag-image pair excluded from the analysis

 94.3% correct answers
 Equal for true/false tags
 ~3s until decision (average)

 85% of users strongly agreed or agreed that
  they felt comfortable during the experiment
   The eye tracker had little influence on comfort
Pre-processing of Eye-tracking Data
 Obtained 547 gaze paths from 20 users where
   Users gave correct answers
   Image has a “true” tag assigned
 Fixation extraction
   Tobii Studio's velocity & distance thresholds
   Fixation: focus on a particular point on the screen

 Requirement: at least one fixation inside or near the correct region
 476 (87%) gaze paths fulfill this requirement

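As an illustration of the fixation-extraction step, here is a minimal velocity-threshold filter in Python. Tobii Studio's actual filter and its thresholds are not spelled out on the slide, so the parameter values and the (x, y) sample layout below are assumptions.

```python
# Minimal sketch of velocity-threshold fixation extraction.
# Assumptions: gaze samples arrive at 60 Hz as (x, y) screen pixels;
# the velocity and minimum-duration thresholds are illustrative only.

def extract_fixations(samples, max_velocity_px=35.0, min_samples=5):
    """Group consecutive low-velocity samples into fixations and
    return a list of (centroid_x, centroid_y, n_samples)."""
    fixations, current = [], []

    def close_group():
        if len(current) >= min_samples:
            cx = sum(x for x, _ in current) / len(current)
            cy = sum(y for _, y in current) / len(current)
            fixations.append((cx, cy, len(current)))
        current.clear()

    for (x0, y0), (x1, y1) in zip(samples, samples[1:]):
        # Pixel distance between successive samples approximates velocity.
        if ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 < max_velocity_px:
            current.append((x1, y1))
        else:
            close_group()
    close_group()
    return fixations
```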
Analysis of Gaze Fixations (1)
 Applied 13 fixation measures to the 476 paths
  (2 new, 7 standard Tobii, 4 from the literature)

 Fixation measure: a function computed on users' gaze paths
 Calculated for each image region, over all users viewing the same tag-image pair




Considered Fixation Measures
Nr  Name                     Favorite region r                               Origin
1   firstFixation            No. of fixations before 1st on r                Tobii
2   secondFixation           No. of fixations before 2nd on r                [13]
3   fixationsAfter           No. of fixations after last on r                [4]
4   fixationsBeforeDecision  fixationsAfter, but before decision             New
5   fixationsAfterDecision   fixationsBeforeDecision and after               New
6   fixationDuration         Total duration of all fixations on r            Tobii
7   firstFixationDuration    Duration of first fixation on r                 Tobii
8   lastFixationDuration     Duration of last fixation on r                  [11]
9   fixationCount            Number of fixations on r                        Tobii
10  maxVisitDuration         Max time from first fixation until outside r    Tobii
11  meanVisitDuration        Mean time from first fixation until outside r   Tobii
12  visitCount               No. of fixations until outside r                Tobii
13  saccLength               Saccade length before fixation on r             [6]
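To make the table concrete, here is a sketch of three of the measures in Python. Fixations are assumed to be time-ordered (x, y, duration_ms) tuples and `inside` any point-in-region predicate; these names and shapes are assumptions, not the paper's code.

```python
# Sketch of measures 9 (fixationCount), 6 (fixationDuration), and
# 11 (meanVisitDuration) from the table. inside(x, y) tests whether a
# point lies in region r; fixations are time-ordered (x, y, duration_ms).

def fixation_count(fixations, inside):
    return sum(1 for x, y, _ in fixations if inside(x, y))

def fixation_duration(fixations, inside):
    return sum(d for x, y, d in fixations if inside(x, y))

def mean_visit_duration(fixations, inside):
    # A visit lasts from the first fixation inside r until the gaze
    # leaves r; average the accumulated duration over all visits.
    visits, current = [], 0.0
    for x, y, d in fixations:
        if inside(x, y):
            current += d
        elif current:
            visits.append(current)
            current = 0.0
    if current:
        visits.append(current)
    return sum(visits) / len(visits) if visits else 0.0
```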
Analysis of Gaze Fixations (2)




 For every image region (b) the fixation
  measure is calculated over all gaze paths (c)
 Results are summed up per region
 Regions ordered according to fixation measure
 If favorite region (d) and tag (a) match, result is
  true positive (tp), otherwise false positive (fp)
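A sketch of this evaluation loop, under the assumption that regions, the ground-truth tag, and a measure function are given as below. Note that for rank-style measures such as firstFixation the smallest value wins, so the ordering direction must match the measure.

```python
# Sketch of slide 12's procedure: sum one fixation measure over all gaze
# paths per region, pick the favorite region, and score tp/fp against the
# ground-truth tag. All data shapes here are assumptions for illustration.

def evaluate_pair(regions, true_tag, gaze_paths, measure, lower_is_better=False):
    scores = {tag: sum(measure(path, region) for path in gaze_paths)
              for tag, region in regions.items()}
    pick = min if lower_is_better else max
    favorite = pick(scores, key=scores.get)
    return favorite == true_tag  # True counts as tp, False as fp
```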
Precision per Fixation Measure

[Bar chart: precision P per fixation measure, computed over the sum of tp and fp assignments; labels call out meanVisitDuration, fixationsBeforeDecision, lastFixationDuration, and fixationDuration]
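For reference, the precision P on the chart's y-axis is simply the fraction of tag-image pairs whose favorite region matched the tag:

```latex
P = \frac{tp}{tp + fp}
```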
Adding Boundaries and Weights
 Take eye-tracker inaccuracies into account
 Extension of region boundaries by 13 pixels




 Larger regions are more likely to be fixated
 Up-weight regions covering < 5% of the image size
 meanVisitDuration increases to P = 0.67
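A sketch of both corrections, using shapely for the polygon geometry. The 13-pixel margin is from the slide; the weighting scheme for small regions is not specified there, so the boost factor below is a made-up placeholder.

```python
# Sketch: dilate region polygons by 13 px to absorb eye-tracker
# inaccuracy, and up-weight small regions. The boost factor is an
# assumption; the slide only states that regions < 5% of the image
# size receive extra weight.

from shapely.geometry import Point, Polygon

def extend_region(coords, margin_px=13.0):
    return Polygon(coords).buffer(margin_px)

def weighted_score(score, region_area, image_area, boost=1.5):
    return score * boost if region_area / image_area < 0.05 else score

# A fixation now counts for a region if it lands in the extended polygon:
region = extend_region([(10, 10), (60, 10), (60, 40), (10, 40)])
print(region.contains(Point(70, 25)))  # True: within the 13 px margin
```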
Examples: Tag-Region-Assignments

[Example images with gaze-based tag-region assignments]
Comparison with Baselines




 Naïve baseline: the largest region r is the favorite
 Random baseline: randomly select the favorite r

 Gaze / Gaze* (with extended boundaries and weights) significantly better (χ², α < 0.001)

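The significance test can be reproduced as a χ² test on the tp/fp counts of two methods; the counts in this sketch are placeholders, not the paper's numbers.

```python
# Sketch of the chi-square comparison between gaze-based assignments and
# a baseline. The tp/fp counts below are invented for illustration.

from scipy.stats import chi2_contingency

observed = [[320, 156],   # gaze-based: [tp, fp] (placeholder)
            [150, 326]]   # baseline:   [tp, fp] (placeholder)

chi2, p, dof, _ = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p:.2g}")  # significant if p < alpha
```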
Effect of Gaze Path Aggregation
[Plot: precision P vs. number of aggregated gaze paths]

 Precision P for Gaze* when aggregating gaze paths from multiple users

 Even a single user is significantly better than the baselines (χ²: naïve at α < 0.001, random at α < 0.002)
Research Questions


1. Which fixation measure best finds the correct image region for a given tag?
    meanVisitDuration with a precision of 67%

2. Can we differentiate two regions in the same image?


Differentiate Two Objects
 Use the second tag set to identify two different objects in the same image
 16 images (of our 51) have two “true” tags
 For 6 of these images, both regions were correctly identified
   Accuracy of 38% (6 of 16)

 Average precision for a single object is 67%
  Expected correct assignment of both tags: 44% (sanity check below)


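As a sanity check on the 44% figure: if the two per-object decisions were independent (an assumption, since both come from the same gaze path) and the single-object precision of 67% is read as roughly 2/3, the expected joint accuracy is

```latex
P_{\text{both}} = P_{\text{single}}^{2} \approx \left(\tfrac{2}{3}\right)^{2} = \tfrac{4}{9} \approx 0.44
```

The observed 38% falls somewhat below this estimate.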
Correctly Differentiated Objects

[Example images where both tagged regions were correctly identified]
Research Questions


1. Which fixation measure best finds the correct image region for a given tag?
    meanVisitDuration with a precision of 67%

2. Can we differentiate two regions in the same image?
    Accuracy of 38%
Acknowledgement: This research was partially supported by the EU projects Petamedia (FP7-216444) and SocialSensor (FP7-287975).
Influence of Red Dot




 First 5 fixations, over all subjects and all images
Experiment Data Cleaning
 Manually replaced images with
a) Tags that are incomprehensible, require expert knowledge, or are nonsense
b) Tags referring to multiple regions of which not all are drawn in the image (e.g., bicycle)
c) Occluded objects (e.g., a bicycle behind a car)
d) “False” tags that actually refer to a visible part of the image and thus were “true” tags


