Deep learning for semantic
analysis and annotation of
conventional and 360° video
Hannes Fassold
Who we are
• Smart Media Solutions Team
• CCM research group @ DIGITAL, JOANNEUM RESEARCH, Graz, Austria
• Content-based quality analysis & restoration of film and video
• http://vidicert.com
• http://www.hs-art.com
• Semantic video analysis
• Extraction of semantic information from a video
(with deep learning and classical methods)
• Shot & cadence detection
• Brand monitoring
• Object detection & recognition (faces, persons, …)
• Most components are real-time capable
2
Presentation overview
• Deep learning in a nutshell
• Face detection & recognition
• State of the art & issues
• Object detection & tracking
• State of the art & issues
• Applications
• Semi-automatic annotation of video
• Generating non-interactive version
of 360° video
3
Deep learning in a nutshell
• Deep neural networks (DNNs)
• Mimic the structure of the human brain
• Training
• Learn the weights for all layers
• A huge annotated (‚ground truth‘) dataset
is needed for training ‚from scratch‘
• Inference
• Run the network (classify / detect / …) for one image
• Both training and inference are usually done on graphics cards (GPUs); see the inference sketch below
4
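The inference step mentioned above boils down to a single forward pass through the network. A minimal sketch in Python, assuming PyTorch/torchvision and using a generic pretrained ImageNet classifier (ResNet-50) for illustration rather than any of the models discussed later:

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Load a pretrained DNN (ImageNet ResNet-50) and switch it to inference mode
model = models.resnet50(pretrained=True).eval()

# Standard ImageNet preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Run the network for one image, on the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
image = Image.open("frame.jpg").convert("RGB")
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0).to(device))
print("Predicted ImageNet class id:", logits.argmax(dim=1).item())
```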
Face detection
• State of the art approaches
• Multi-Task CNN, RetinaFace, …
• Face detection is more or less ‚solved‘
• Works well even for small faces
and profile views
• Accuracy of > 90 % (mAP)
• Real-time capable
(on the GPU)
5
Result of our face detection algorithm on a region of an image from a 360° video.
Content provided by Mediaset for Hyper360 project.
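For illustration, a face detection call with one of the approaches listed above (Multi-Task CNN) might look as follows. This is a minimal sketch based on the facenet-pytorch package, not the in-house implementation shown in the figure; the file name is a placeholder:

```python
from facenet_pytorch import MTCNN
from PIL import Image

# Multi-task CNN face detector; keep_all=True returns every face in the image
detector = MTCNN(keep_all=True)

img = Image.open("frame_region.jpg").convert("RGB")
boxes, probs = detector.detect(img)   # boxes: N x 4 (x1, y1, x2, y2), probs: confidence per face

if boxes is not None:
    for box, prob in zip(boxes, probs):
        print(f"Face at {box.round().tolist()} with confidence {prob:.2f}")
```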
Face recognition
• Most algorithms rely on „closed world assumption“
• All faces occurring in the videos are known, meaning that the face recognition
algorithm has been trained on them
• State of the art approaches
• FaceNet, ArcFace, SphereFace, …
• Accuracy of > 98 % on the standard databases, processing in real-time
• Factors influencing the recognition result negatively
• Small face (or low resolution video)
• Profile view
• Bad lighting conditions
6
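Conceptually, these approaches map each face to an embedding vector and recognize a person by comparing embeddings against a gallery of known persons. A minimal sketch, assuming the facenet-pytorch package (FaceNet-style InceptionResnetV1 pretrained on VGGFace2) and an illustrative similarity threshold:

```python
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                                  # face detector / aligner
embedder = InceptionResnetV1(pretrained="vggface2").eval()    # FaceNet-style embedding network

def embed(pil_image):
    """Detect one face and map it to a 512-d embedding (assumes a detectable face)."""
    face = mtcnn(pil_image)                  # aligned 3 x 160 x 160 face crop
    with torch.no_grad():
        return embedder(face.unsqueeze(0))   # 1 x 512 embedding

def recognize(query_image, gallery, threshold=0.6):
    """gallery: dict mapping person name -> reference embedding (closed world)."""
    q = embed(query_image)
    sims = {name: torch.cosine_similarity(q, ref).item() for name, ref in gallery.items()}
    best = max(sims, key=sims.get)
    return best if sims[best] >= threshold else "unknown"   # threshold is illustrative
```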
Face recognition – challenges & issues
• „Closed world assumption“ is difficult to achieve in practice
• You do not want to retrain your DNN if you want to recognize a new person,
as training takes quite some time …
• Incremental training can help here
• Not easy – you first have to identify that a person is ‚new‘ and then retrain the DNN on the fly
• We have added incremental training to our in-house face framework (a simplified sketch of the idea follows this slide)
• You may not have enough annotations (samples) for each person
• 50–100 annotations per person‘s face are typically employed in these databases
• Training with less data is an active research area („few-shot learning“)
7
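One simple way to approximate incremental training is to keep the embedding DNN fixed and maintain a gallery of per-person prototype embeddings, so that a new person only requires a few samples instead of a full retraining. The sketch below is an illustrative toy version of this idea, not the actual in-house framework; the class name and threshold are assumptions:

```python
import torch

class FaceGallery:
    """Toy identity gallery on top of a fixed face-embedding DNN.

    A new person is added by storing a prototype (mean) embedding computed from
    a handful of samples, so the DNN itself never has to be retrained when a
    previously unseen person shows up.
    """

    def __init__(self, threshold=0.6):           # illustrative similarity threshold
        self.threshold = threshold
        self.prototypes = {}                      # person name -> prototype embedding (1 x 512)

    def add_person(self, name, embeddings):
        """embeddings: list of 1 x 512 tensors for the new person (few-shot setting)."""
        self.prototypes[name] = torch.stack(embeddings).mean(dim=0)

    def identify(self, embedding):
        """Return the best-matching known person, or 'unknown' for a new face."""
        if not self.prototypes:
            return "unknown"
        sims = {name: torch.cosine_similarity(embedding, proto).item()
                for name, proto in self.prototypes.items()}
        best = max(sims, key=sims.get)
        return best if sims[best] >= self.threshold else "unknown"
```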
Face recognition – challenges & issues
• Class imbalance
• Some classes are under-represented in the dataset used for training the DNN
• Ethnic bias
• Publicly available face datasets contain mostly faces of Caucasian people
• Error rates for African people are about twice as high as for Caucasian people [1]
• Few faces with glasses in most face datasets, although many Asian people wear glasses
• Active research on methods to mitigate class imbalance
• Better data augmentation strategies
• Data crawling
• Synthetic generation of additional training data samples (‚face synthesis‘)
• Domain adaptation & unsupervised learning
8
[1] https://arxiv.org/pdf/1812.00194.pdf
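As an example of one mitigation strategy from the list above (data augmentation combined with oversampling of rare classes), here is a minimal PyTorch sketch; the transforms and the inverse-frequency weighting are illustrative choices, not a description of an actual training pipeline:

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms

# Simple augmentation pipeline, meant to be used as the dataset's transform,
# so that rare classes at least get more varied views of their few samples
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
])

def make_balanced_loader(dataset, labels, batch_size=32):
    """Oversample under-represented classes: each sample is drawn with a
    probability inversely proportional to its class frequency."""
    labels = torch.tensor(labels)
    class_counts = torch.bincount(labels)
    sample_weights = 1.0 / class_counts[labels].float()
    sampler = WeightedRandomSampler(sample_weights,
                                    num_samples=len(labels),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```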
Object detection & tracking
• Task
• Detect an object of a certain class (e.g. person, dog, car, …)
and track it through its lifetime (each object gets a unique id)
• State of the art approaches
• RetinaNet, YOLOv3, Faster R-CNN, …
• Usually detect the 80 classes from MS COCO
• Our in-house algorithm
• Detects & tracks general objects,
faces, text and logo in real-time
• << Demovideo >>
9
Result of our object detection & tracking algorithm
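A hedged sketch of the detection-plus-tracking idea, using a torchvision detector trained on the 80 MS COCO classes and a deliberately simple greedy IoU tracker in which newly appearing objects get a new id (this mirrors the simple strategy mentioned on the next slide, but is not the in-house algorithm):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Detector trained on the 80 MS COCO classes
detector = fasterrcnn_resnet50_fpn(pretrained=True).eval()

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class GreedyIoUTracker:
    """Toy tracker: match each detection to an existing track by IoU,
    otherwise start a new track (newly appearing objects get a new id)."""

    def __init__(self, iou_threshold=0.5):
        self.iou_threshold = iou_threshold
        self.tracks = {}        # track id -> last known box
        self.next_id = 0

    def update(self, boxes):
        """boxes: list of (x1, y1, x2, y2); returns one track id per box."""
        ids, used = [], set()
        for box in boxes:
            best_id, best_iou = None, self.iou_threshold
            for tid, prev in self.tracks.items():
                overlap = iou(box, prev)
                if overlap > best_iou and tid not in used:
                    best_id, best_iou = tid, overlap
            if best_id is None:
                best_id = self.next_id
                self.next_id += 1
            self.tracks[best_id] = box
            used.add(best_id)
            ids.append(best_id)
        return ids

# Usage per frame (frame: 3 x H x W float tensor in [0, 1]):
#   with torch.no_grad():
#       pred = detector([frame])[0]
#   boxes = pred["boxes"][pred["scores"] > 0.5].tolist()
#   track_ids = tracker.update(boxes)
tracker = GreedyIoUTracker()
```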
Object detection & tracking – challenges & issues
• Current state
• Algorithms are really usable in practice: robust (mAP > 60 %) and fast (real-time)
• Remaining issues
• Re-identification of objects is challenging
• E.g. persons who get occluded and then reappear (crowded scene)
• One can use the object‘s appearance, but what if all objects look the same (e.g. soccer players)?
• Simple strategy used in our framework – newly appearing objects get a new id
• Quite limited number of object classes
• E.g. the MS COCO dataset [1] has classes for a few animals
(dog, cat, horse, cow, …), but what if the subject
of your documentary video is dinosaurs?
10
[1] https://arxiv.org/pdf/1405.0312.pdf
Semi-automatic video annotation
• Automate the annotation process of archive videos
• Who appears in the video (and with whom), and in which video sections
• Other potentially useful metadata: facial emotion, what action is he / she doing,
what is he / she saying, what logos appear, what are the ‚video highlights‘, …
• Semi-automatic video annotation workflow
• Deep learning algorithms (face recognition, object detection & tracking, …)
do the first pass and generate the „raw metadata“
• Raw metadata is inspected and corrected (false detections, multiple ids for one
person, …) by a human operator with a convenient tool
• Hopefully the whole process is more efficient than the ‚human-only‘ workflow ☺
11
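To make the workflow concrete, the „raw metadata“ from the first automatic pass could be serialized roughly as sketched below; all field names and values are purely illustrative assumptions, not the schema actually used in the project:

```python
import json

# Purely illustrative "raw metadata" record for one archive video, as it could be
# produced by the automatic first pass; a human operator would later correct false
# detections and merge multiple ids belonging to the same person.
raw_metadata = {
    "video": "archive_clip_0042.mp4",
    "persons": [
        {"id": 3, "name": "unknown",
         "appearances": [{"start": "00:01:12", "end": "00:01:45"}]},
        {"id": 7, "name": "Jane Doe",
         "appearances": [{"start": "00:02:03", "end": "00:02:30"}]},
    ],
    "objects": [
        {"id": 12, "class": "dog",
         "appearances": [{"start": "00:00:05", "end": "00:00:21"}]},
    ],
    "logos": [
        {"brand": "ExampleBrand", "frame": 1530},
    ],
}

with open("archive_clip_0042.metadata.json", "w") as f:
    json.dump(raw_metadata, f, indent=2, ensure_ascii=False)
```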
Non-interactive version of 360° video
• Generate non-interactive version of 360° video
• For archiving purposes, a preview version of the video
(in addition to the original 360° video) can be useful
• For consumption of 360° video on old TV sets, or as
„lean-back mode“ for users who do not want to interact
• Rough algorithm workflow
• Works iteratively, shot-per-shot
• Extract all scene objects (currently focusing on persons)
• Determine the most „interesting“ person for the current shot
(based on size, movement, what we have seen in the last shot, etc.)
and track it (see the scoring sketch below)
12
Non-interactive version of a 360° music video
(each row is one generated shot)
Content provided by RBB for Hyper360 project.
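A possible way to score the „most interesting“ person, as referenced in the workflow above, is sketched below; the features, weights and formula are illustrative assumptions, not the actual algorithm:

```python
def interest_score(person, previous_subject_id=None,
                   w_size=0.5, w_motion=0.3, w_continuity=0.2):
    """Score one tracked person for the current shot.

    person: dict with 'id', 'mean_area' (average fraction of the frame covered, 0..1)
    and 'mean_motion' (normalized average displacement per frame, 0..1).
    Larger, more dynamic persons score higher; the person followed in the
    previous shot gets a continuity bonus. Weights are illustrative.
    """
    score = w_size * person["mean_area"] + w_motion * person["mean_motion"]
    if previous_subject_id is not None and person["id"] == previous_subject_id:
        score += w_continuity
    return score

def pick_subject(persons, previous_subject_id=None):
    """Choose the person to track for the current shot of the 360° video."""
    return max(persons, key=lambda p: interest_score(p, previous_subject_id))
```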
Non-interactive version of 360° video – outlook
• << Demovideo >>
• Currently working on addressing some limitations of the original algorithm
• More diverse shot types: close-up, wide-angle shot, panning shot, …
(currently, all shots are tracking shots with horizontal FOV of 75°)
• Employ best-practice rules for framing and „continuity editing“
• Avoid jump-cuts
• 180° rule
• …
• Goal is a „virtual director“ which tries to mimic a particular human director‘s style
13
Acknowledgments
• Thanks to the “Hyper360” project partners RBB, Mediaset, Fraunhofer Fokus, Drukka for providing the
360° video sequences for research and development purposes within the project.
• The research leading to these results has received funding from the European Union’s Horizon 2020
research and innovation programme under grant agreement No 761934 - Hyper360 and grant
agreement No. 761802 – MARCONI
• http://www.hyper360.eu/
• https://www.projectmarconi.eu/
14
Thank you for your attention!
Contact:
hannes.fassold@joanneum.at
JOANNEUM RESEARCH
Forschungsgesellschaft mbH
DIGITAL– Institut für Informations-
und Kommunikationstechnologien
Steyrergasse 17, 8010 Graz
Tel. +43 316 876-5000
digital@joanneum.at
www.joanneum.at/digital