MediaEval 2016 - HUCVL Predicting Interesting Key Frames with Deep Models

HUCVL@MediaEval 2016:
Predicting Interesting Key Frames with Deep Models
Göksu Erdoğan, Aykut Erdem, Erkut Erdem
{goksuerdogan, aykut, erkut}@cs.hacettepe.edu.tr
Experimental Results
• We submitted three runs in total on the test set.
• Each run corresponds to a different deep model.
2016 Predicting Media Interestingness Task
• The MediaEval 2016 Predicting Media Interestingness Task is a
new task [1].
• We focus on the image interestingness subtask.
• Goal: Automatically identify interesting key frames of a given
movie trailer.
References
[1] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. K. Duong,
and F. Lefebvre, "MediaEval 2016 Predicting Media Interestingness Task", In
Proc. of the MediaEval 2016 Workshop, 2016.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with
deep convolutional neural networks", Advances in Neural Information Processing
Systems, pages 1097-1105, 2012.
[3] A. Khosla, A. S. Raju, A. Torralba, and A. Oliva, "Understanding and predicting
image memorability at a large scale", In Proc. International Conference on
Computer Vision, pages 2390-2398, 2015.
[4] X. Wang and A. Gupta, "Unsupervised learning of visual representations using
videos", In Proc. International Conference on Computer Vision, pages 2794-2802,
2015.
Conclusion & Future Work
• The imbalanced data makes the training process challenging.
• Future directions:
– Context of a local temporal neighborhood or the whole video.
– Multi-task learning scheme: jointly perform classification
(interestingness label) and regression (interestingness score).
Our Approach
• We propose three different deep models:
– AlexNet
– MemNet
– Triplet Loss
• The problem is formulated as a regression problem.
• Post-processing is employed to find labels.
[Figure: deep network mapping key frames to interestingness scores. Plot axes: frame (x), interestingness score (y). Labels: interesting, uninteresting.]
Acknowledgement
This work is partially supported by the Scientific and Technological
Research Council of Turkey (Award #113E497).
Runs    Network model
Run 1   AlexNet
Run 2   MemNet
Run 3   Triplet Loss

Runs    mAP      Accuracy
Run 1   0.2125   0.8224
Run 2   0.2121   0.8275
Run 3   0.2001   0.8249
Run 1          Run 2          Run 3
1890   211     1896   205     1893   208
 205    36      202    42      202    39
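The relation between a confusion matrix and the accuracy column can be checked with a short calculation. This is a minimal sketch that assumes the diagonal entries count correctly classified frames; the poster does not label the rows and columns, and the function name `accuracy_from_confusion` is hypothetical. It uses the first matrix (1890, 211 / 205, 36).

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correct predictions (diagonal) / all predictions."""
    # Assumption: matrix[i][i] holds the correctly classified frames of class i.
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# First confusion matrix from the poster (listed under Run 1).
run1 = [[1890, 211],
        [205, 36]]
acc_run1 = accuracy_from_confusion(run1)
```

Under this assumption, (1890 + 36) / 2342 rounds to the 0.8224 accuracy reported for Run 1.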
AlexNet [2]
• ImageNet dataset
• ILSVRC 2012 task
• Object classification
Change:
• The softmax loss layer is replaced by a Euclidean loss layer.
• Training lasted approximately 2000 epochs.
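Replacing the softmax loss with a Euclidean loss turns the classifier into a score regressor. The sketch below shows one common definition of that loss, (1 / 2N) Σ (p − t)²; the exact normalization used in the authors' training setup is not stated on the poster, so treat it as an assumption. The function name and the sample values are hypothetical.

```python
def euclidean_loss(predictions, targets):
    """Half mean squared error: (1 / 2N) * sum((p - t)^2)."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(predictions)
    # Sum of squared differences between predicted and annotated scores.
    squared = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return squared / (2 * n)

# Illustrative values only, not taken from the task data.
predicted = [0.5, 1.0, 0.25]
annotated = [1.0, 1.0, 0.0]
loss = euclidean_loss(predicted, annotated)
```

The loss is zero only when every predicted score equals its target, which is what lets the network output real-valued interestingness scores instead of class probabilities.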
MemNet [3]
• Memorability and interestingness are both intrinsic image
properties.
• No change: MemNet fits our problem directly.
Triplet Loss [4]
• Triplet loss function:
  $L(x, x^{+}, x^{-}) = \max\{0,\ D(x, x^{+}) - D(x, x^{-}) + M\}$
• Learning a 1D embedding space for images in which
– interesting images become closer,
– uninteresting frames become distant from interesting
images.
Change:
• The size of the fc8 layer is reduced to one.
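The triplet loss above can be sketched for the 1D embedding case, where each image is mapped to a single scalar. Here x is the anchor, x+ an image of the same class, x− an image of the other class, and M the margin. The poster does not specify the distance D or the value of M; the absolute difference and `margin=0.5` below are assumptions, and the function name is hypothetical.

```python
def triplet_loss(x, x_pos, x_neg, margin=0.5):
    """Hinge triplet loss: max(0, D(x, x+) - D(x, x-) + M)."""
    # Assumption: D is the absolute difference between scalar embeddings.
    d_pos = abs(x - x_pos)
    d_neg = abs(x - x_neg)
    return max(0.0, d_pos - d_neg + margin)

# Positive is close to the anchor and negative is far: no penalty.
loss_satisfied = triplet_loss(1.0, 0.75, 0.0)
# Negative is closer to the anchor than the positive: penalized.
loss_violated = triplet_loss(1.0, 0.5, 0.75)
```

The loss vanishes once the negative is at least M farther from the anchor than the positive, which is what pushes uninteresting frames away from interesting ones in the embedding.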
Method overview.
Interestingness Classification
• Our CNN models compute real-valued interestingness scores for
each key frame of a given video sequence.
• These scores must then be converted to interestingness labels
(interesting/uninteresting).
Frames          Mean   Std
interesting     0.11   0.08
uninteresting   0.89   0.08
Distributions of the confidence values for interesting/uninteresting
frames over the training data (left) and a sample video (right).
Statistics for the confidence values for interesting and
uninteresting frames over the training data.
Evaluation results on the test set.
Confusion matrices
• A global threshold for interestingness over all training data does not
exist.
• So the threshold for interestingness is set at the video level.
• Key frames are sorted by interestingness score, then the top
10% of frames are classified as interesting.
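The per-video labeling step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `label_top_fraction` is hypothetical, and rounding the 10% cut-off up (so that every video gets at least one interesting frame) is an assumption the poster does not state.

```python
import math

def label_top_fraction(scores, fraction=0.10):
    """Label the top `fraction` of frames by score as 1 (interesting), else 0."""
    n = len(scores)
    if n == 0:
        return []
    # Assumption: round up, and keep at least one interesting frame per video.
    k = max(1, math.ceil(fraction * n))
    # Frame indices ordered from highest to lowest score.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    return [1 if i in top else 0 for i in range(n)]

# Illustrative scores for the key frames of one video.
scores = [0.7, 1.0, 0.2, 0.15, 0.1, 0.9, 0.4]
labels = label_top_fraction(scores)
```

Because the cut-off is computed per video, a frame's label depends on its rank among the frames of the same trailer rather than on its absolute score.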
