Enrichment of News Show Videos with Multimodal Semi-Automatic Analysis

Television Linked To The Web

Daniel Stein1, Evlampios Apostolidis2, Vasileios Mezaris2,
Nicolas de Abreu Pereira3, Jennifer Müller3,
Mathilde Sahuguet4, Benoit Huet4, Ivo Lašek5

1 Fraunhofer Institute IAIS, Schloss Birlinghoven, Germany
2 Information Technologies Institute, CERTH, Greece
3 rbb - Rundfunk Berlin-Brandenburg, 14482 Potsdam, Germany
4 Eurecom, Sophia Antipolis, France
5 Czech Technical University and University of Economics, Prague, Czech Republic
NEM Summit, Istanbul, October 2012
www.linkedtv.eu

Synopsis

 Introduction: LinkedTV Project
 Use Cases
 Intelligent Video Analysis
 Results
 Conclusions & future plans


LinkedTV ― Television Linked To the Web

 Vision:
 hypervideo
 ubiquitously online cloud of Networked Audio-Visual Content
 decoupled from place, device or source
 12 Excellent Partners: Fraunhofer, Eurecom, STI GMBH, Condat, CERTH, BEELD EN GELUID, UEP, Noterik, UMONS, U. ST GALLEN, CWI, RBB
 Aim:
 provide interactive multimedia service for non-professional end-users
 Focus on television broadcast content as seed videos
 Web: http://www.linkedtv.eu


LinkedTV Workflow

Overall Architecture

Use Case Scenarios

Intelligent Video Analysis

Linking Hypervideo to Web Content
Contextualization and Personalization
Interface and Presentation Engine




Two Use Case Scenarios in LinkedTV

Scenario 1 (this talk): Interactive News Show
 Professional news content produced by RBB
 Seed content: news show "rbb Aktuell"
 Due to legal constraints: whitelist
 Detailed scenario archetype description: news topic, local people, locations, objects etc.

Scenario 2 (not covered here): Hyperlinked Documentary
 Cultural content from S&V (1700 hours of cultural heritage AV-content under CCL)
 Seed content: "Antique Roadshow"



Segmentation

 Shot segmentation technique
 [Tsamoura et al., 2008]
 News show video performance: "remarkably well"
 Out of 269 shots detected: 2 had wrong starting points, 4 contained multiple shots, 11 were too short to evaluate

 Spatio-temporal segmentation
 [Mezaris et al., 2004]
 News show performance: good
 False positives due to: camera movement or zoom in/out (~55 %), gradual transition between frames (~10 %), erroneous motion vectors (~35 %)
 Unwanted effect: false recognition of properly moving banners, which do not yield additional information
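The shot segmentation counts above can be turned into a quick summary. The sketch below is only arithmetic on the reported figures; it assumes the three issue categories do not overlap (the slide does not say), and the function name is made up for illustration.

```python
def summarize_shot_evaluation(total_shots, issue_counts):
    """Summarize a manual shot-segmentation check.

    Assumption (not stated on the slide): each shot falls into at most
    one issue category, so the counts can simply be added up.
    """
    flagged = sum(issue_counts.values())
    clean = total_shots - flagged
    return {
        "flagged": flagged,
        "clean": clean,
        "clean_share_percent": round(100 * clean / total_shots, 1),
    }


# Figures as reported for the news show video.
reported_issues = {
    "wrong starting point": 2,
    "contained multiple shots": 4,
    "too short to evaluate": 11,
}
summary = summarize_shot_evaluation(269, reported_issues)
print(summary)
```

Under that assumption, 17 of the 269 detected shots were flagged and 252 (about 93.7 %) were not.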

V. Mezaris, I. Kompatsiaris, N. V. Boulgouris, and M. G. Strintzis, "Real-time compressed-domain spatiotemporal segmentation and ontologies for video indexing and retrieval", IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 606-621, May 2004.

E. Tsamoura, V. Mezaris, I. Kompatsiaris, "Gradual transition detection using color coherence and other criteria in a video shot meta-segmentation framework", IEEE International Conference on Image Processing, Workshop on Multimedia Information Retrieval (ICIP-MIR 2008), San Diego, CA, USA, October 2008, pp. 45-48.


Concept Detection

 Method described in [Moumtzidou et al., 2011]
 346 concepts from the TRECVID 2011 SIN task
 Overall performance:
 Correctly detected concepts: > 64 %
 About 25 % of them are characterized as particularly useful, mostly related to detecting persons (e.g., person, face, adult)
 Erroneous concepts vary between 22 % and 42 % and in many cases achieve high scores (e.g., outdoor, amateur video)
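Because erroneous concepts can also score highly, a score threshold alone may not be enough for a given kind of content. The following is a minimal sketch of one possible post-filter, not the method of [Moumtzidou et al., 2011]: it combines a threshold with a per-genre exclusion list. The function name, the threshold, the exclusion list and all scores are hypothetical.

```python
def filter_concepts(scores, min_score=0.5, excluded=frozenset(), top_k=5):
    """Keep the top_k concepts that pass the threshold and are not excluded.

    scores:   mapping of concept name -> detector confidence in [0, 1]
    excluded: concepts judged implausible for this genre (an assumption
              made here for illustration, e.g. "amateur video" for a
              professionally produced news show)
    """
    kept = [
        (concept, score)
        for concept, score in scores.items()
        if score >= min_score and concept not in excluded
    ]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


# Hypothetical detector output for one shot (not measured values).
shot_scores = {
    "person": 0.91,
    "outdoor": 0.88,
    "face": 0.84,
    "amateur video": 0.79,
    "adult": 0.62,
    "boat": 0.12,
}
news_show_excluded = frozenset({"amateur video"})
selected = filter_concepts(shot_scores, excluded=news_show_excluded)
print(selected)
```

Note that a concept such as "outdoor" would still pass this filter if it scores highly; the exclusion list only helps for concepts that are implausible for the genre as a whole.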

Visit: http://mklab.iti.gr/eventdetection-linkedtv/

A. Moumtzidou, P. Sidiropoulos, S. Vrochidis, N. Gkalelis, S. Nikolopoulos, V. Mezaris, I. Kompatsiaris, I. Patras, "ITI-CERTH participation
to TRECVID 2011", Proc. TRECVID 2011 Workshop, December 2011, Gaithersburg, MD, USA.


Automatic Speech Recognition

 Automatic speech recognition for German (using [Schneider08]):

Segment of one news show   WER (%)   Notes
new airport                36.2      outdoor, spontaneous
soccer riot                44.2      tavern, dialect, background noise
various news I              9.5
murder case                24.0
boxing                     50.6      dialect, very spontaneous
various news II            20.9
rbb game                   39.1
weather report             46.7      spontaneous, casual

 Main obstacles: local dialect, spontaneous speech, background noise
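For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below shows that standard definition in plain Python; it is not the scoring tool used for the figures above, and the example sentences are invented. The per-segment values are copied from the table; their mean is unweighted because segment lengths are not given.

```python
from statistics import mean


def word_error_rate(reference, hypothesis):
    """Return the WER in percent, using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Invented example: one substitution and one deletion over five reference words.
example_wer = word_error_rate("der neue flughafen wird gebaut", "der neue hafen wird")

# WER values as reported per segment in the table above.
reported_wer = {
    "new airport": 36.2, "soccer riot": 44.2, "various news I": 9.5,
    "murder case": 24.0, "boxing": 50.6, "various news II": 20.9,
    "rbb game": 39.1, "weather report": 46.7,
}
unweighted_mean_wer = round(mean(reported_wer.values()), 1)
print(example_wer, unweighted_mean_wer)
```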

Schneider, D., Schon, J., and Eickeler, S. (2008). Towards Large Scale Vocabulary Independent Spoken Term Detection: Advances
in the Fraunhofer IAIS Audiomining System. In Proc. SIGIR, Singapore.


Person Detection

 Face clustering using the face.com API
 Result: generally very good, some erroneous clusters due to side views

 Speaker identification using a GMM-HMM model, with 253 German parliament speakers
 Result: 8.0 % Equal Error Rate
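The equal error rate (EER) is the operating point at which the false acceptance rate and the false rejection rate coincide. As a rough illustration of how such a figure is obtained from verification scores, the sketch below sweeps every observed score as a threshold and reports the point where the two rates are closest. It is a generic approximation, not the evaluation code behind the 8.0 % result, and the scores are invented.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER (as a fraction) by sweeping observed thresholds.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, best_rate = None, None
    for threshold in thresholds:
        # False rejection: genuine speaker scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: impostor scored at or above the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate


# Invented scores for illustration only.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.5, 0.3, 0.2]
eer = equal_error_rate(genuine, impostor)
print(eer)
```

With only a handful of trials the two rates may never meet exactly, which is why the sketch averages them at the closest point.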

Conclusions

 We have established:
 all the different video analysis techniques
 their exact functionality
 the connections among them

 Preliminary results provide a solid basis for further improvements

 Many challenges have been addressed, but several aspects of the analysis techniques leave much room for improvement, e.g.,
 over-sensitivity of the spatiotemporal segmentation algorithm to gradual transitions and camera movement
 adaptation of several TRECVID concepts to the needs of each specific type of multimedia content (news show, documentary, art show)
 over-sensitivity of the speech recognizer to local dialects and background noise


Future Plans

 Incorporate new methods:
 Near-duplicate Content Detection
 Goal: find parts that the viewer has already watched
 Optical Character Recognition
 Goal: exploit banner information to obtain a database for face and speaker recognition
 Topic Segmentation
 Goal: improve scene segmentation

 Find synergies between methods:
 ASR + Speaker Recognition + Face Detection → Person Detection
 ASR + Topic Classification + Shot Segmentation → Story Segmentation
 Concept Detection + Keywords Extraction + Topic Segmentation → Video Similarity/Clustering



Questions?

More information:
http://www.iti.gr/~bmezaris
bmezaris@iti.gr

http://www.linkedtv.eu

