Deep learning for semantic
analysis and annotation of
conventional and 360° video
Hannes Fassold
Who we are
• Smart Media Solutions Team
• CCM research group @ DIGITAL, JOANNEUM RESEARCH, Graz, Austria
• Content-based quality analysis & restoration of film and video
• http://vidicert.com
• http://www.hs-art.com
• Semantic video analysis
• Extraction of semantic information from a video
(with deep learning and classical methods)
• Shot & cadence detection
• Brand monitoring
• Object detection & recognition (faces, persons, …)
• Most components are real-time capable
2
Presentation overview
• Deep learning in a nutshell
• Face detection & recognition
• State of the art & issues
• Object detection & tracking
• State of the art & issues
• Applications
• Semi-automatic annotation of video
• Generating non-interactive version
of 360° video
3
Deep learning in a nutshell
• Deep neural networks (DNNs)
• Mimic the structure of the human brain
• Training
• Learn the weights for all layers
• A huge annotated (‚ground truth‘) dataset
is needed for training ‚from scratch‘
• Inference
• Run the network (classify / detect / …) for one image
• Both training and inference are usually done on graphics cards (GPUs); see the inference sketch below
4
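The inference step mentioned above boils down to a single forward pass through the network. A minimal sketch in Python, assuming PyTorch/torchvision and using a generic pretrained ImageNet classifier (ResNet-50) for illustration rather than any of the models discussed later:

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Load a pretrained DNN (ImageNet ResNet-50) and switch it to inference mode
model = models.resnet50(pretrained=True).eval()

# Standard ImageNet preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Run the network for one image, on the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
image = Image.open("frame.jpg").convert("RGB")
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0).to(device))
print("Predicted ImageNet class id:", logits.argmax(dim=1).item())
```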
Face detection
• State of the art approaches
• Multi-Task CNN, RetinaFace, …
• Face detection is more or less ‚solved‘
• Works well even for small faces
and profile views
• Accuracy of > 90 % (mAP)
• Real-time capable
(on the GPU)
5
Result of our face detection algorithm on a region of an image from a 360° video.
Content provided by Mediaset for Hyper360 project.
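For illustration, a face detection call with one of the approaches listed above (Multi-Task CNN) might look as follows. This is a minimal sketch based on the facenet-pytorch package, not the in-house implementation shown in the figure; the file name is a placeholder:

```python
from facenet_pytorch import MTCNN
from PIL import Image

# Multi-task CNN face detector; keep_all=True returns every face in the image
detector = MTCNN(keep_all=True)

img = Image.open("frame_region.jpg").convert("RGB")
boxes, probs = detector.detect(img)   # boxes: N x 4 (x1, y1, x2, y2), probs: confidence per face

if boxes is not None:
    for box, prob in zip(boxes, probs):
        print(f"Face at {box.round().tolist()} with confidence {prob:.2f}")
```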
Face recognition
• Most algorithms rely on „closed world assumption“
• All faces occurring in the videos are known, meaning that the face recognition
algorithm has been trained on them
• State of the art approaches
• FaceNet, ArcFace, SphereFace, …
• Accuracy of > 98 % on the standard databases, processing in real-time
• Factors influencing the recognition result negatively
• Small face (or low resolution video)
• Profile view
• Bad lighting conditions
6
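Conceptually, these approaches map each face to an embedding vector and recognize a person by comparing embeddings against a gallery of known persons. A minimal sketch, assuming the facenet-pytorch package (FaceNet-style InceptionResnetV1 pretrained on VGGFace2) and an illustrative similarity threshold:

```python
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                                  # face detector / aligner
embedder = InceptionResnetV1(pretrained="vggface2").eval()    # FaceNet-style embedding network

def embed(pil_image):
    """Detect one face and map it to a 512-d embedding (assumes a detectable face)."""
    face = mtcnn(pil_image)                  # aligned 3 x 160 x 160 face crop
    with torch.no_grad():
        return embedder(face.unsqueeze(0))   # 1 x 512 embedding

def recognize(query_image, gallery, threshold=0.6):
    """gallery: dict mapping person name -> reference embedding (closed world)."""
    q = embed(query_image)
    sims = {name: torch.cosine_similarity(q, ref).item() for name, ref in gallery.items()}
    best = max(sims, key=sims.get)
    return best if sims[best] >= threshold else "unknown"   # threshold is illustrative
```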
Face recognition – challenges & issues
• „Closed world assumption“ is difficult to achieve in practice
• You do not want to retrain your DNN if you want to recognize a new person,
as training takes quite some time …
• Incremental training can help here
• Not easy – you first have to identify that a person is ‚new‘ and then retrain the DNN on the fly
• We have added incremental training to our in-house face framework (a simplified sketch of the idea follows this slide)
• You may not have enough annotations (samples) for each person
• 50–100 annotations per person‘s face are typically employed in these databases
• Training with less data is an active research area („few-shot learning“)
7
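One simple way to approximate incremental training is to keep the embedding DNN fixed and maintain a gallery of per-person prototype embeddings, so that a new person only requires a few samples instead of a full retraining. The sketch below is an illustrative toy version of this idea, not the actual in-house framework; the class name and threshold are assumptions:

```python
import torch

class FaceGallery:
    """Toy identity gallery on top of a fixed face-embedding DNN.

    A new person is added by storing a prototype (mean) embedding computed from
    a handful of samples, so the DNN itself never has to be retrained when a
    previously unseen person shows up.
    """

    def __init__(self, threshold=0.6):           # illustrative similarity threshold
        self.threshold = threshold
        self.prototypes = {}                      # person name -> prototype embedding (1 x 512)

    def add_person(self, name, embeddings):
        """embeddings: list of 1 x 512 tensors for the new person (few-shot setting)."""
        self.prototypes[name] = torch.stack(embeddings).mean(dim=0)

    def identify(self, embedding):
        """Return the best-matching known person, or 'unknown' for a new face."""
        if not self.prototypes:
            return "unknown"
        sims = {name: torch.cosine_similarity(embedding, proto).item()
                for name, proto in self.prototypes.items()}
        best = max(sims, key=sims.get)
        return best if sims[best] >= self.threshold else "unknown"
```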
Face recognition – challenges & issues
• Class imbalance
• Some classes are under-represented in the dataset used for training the DNN
• Ethnic bias
• Publicly available face datasets contain mostly faces of Caucasian people
• Error rates for African people are about twice as high as for Caucasian people [1]
• Few faces with glasses in most face datasets, although many Asian people wear glasses
• Active research on methods to mitigate class imbalance
• Better data augmentation strategies
• Data crawling
• Synthetic generation of additional training data samples (‚face synthesis‘)
• Domain adaptation & unsupervised learning
8
[1] https://arxiv.org/pdf/1812.00194.pdf
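As an example of one mitigation strategy from the list above (data augmentation combined with oversampling of rare classes), here is a minimal PyTorch sketch; the transforms and the inverse-frequency weighting are illustrative choices, not a description of an actual training pipeline:

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms

# Simple augmentation pipeline, meant to be used as the dataset's transform,
# so that rare classes at least get more varied views of their few samples
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
])

def make_balanced_loader(dataset, labels, batch_size=32):
    """Oversample under-represented classes: each sample is drawn with a
    probability inversely proportional to its class frequency."""
    labels = torch.tensor(labels)
    class_counts = torch.bincount(labels)
    sample_weights = 1.0 / class_counts[labels].float()
    sampler = WeightedRandomSampler(sample_weights,
                                    num_samples=len(labels),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```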
Object detection & tracking
• Task
• Detect an object of a certain class (e.g. person, dog, car, …)
and track it through its lifetime (each object gets a unique id)
• State of the art approaches
• RetinaNet, YOLOv3, Faster R-CNN, …
• Usually detect the 80 classes from MS COCO
• Our in-house algorithm
• Detects & tracks general objects,
faces, text and logo in real-time
• << Demovideo >>
9
Result of our object detection & tracking algorithm
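A hedged sketch of the detection-plus-tracking idea, using a torchvision detector trained on the 80 MS COCO classes and a deliberately simple greedy IoU tracker in which newly appearing objects get a new id (this mirrors the simple strategy mentioned on the next slide, but is not the in-house algorithm):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Detector trained on the 80 MS COCO classes
detector = fasterrcnn_resnet50_fpn(pretrained=True).eval()

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class GreedyIoUTracker:
    """Toy tracker: match each detection to an existing track by IoU,
    otherwise start a new track (newly appearing objects get a new id)."""

    def __init__(self, iou_threshold=0.5):
        self.iou_threshold = iou_threshold
        self.tracks = {}        # track id -> last known box
        self.next_id = 0

    def update(self, boxes):
        """boxes: list of (x1, y1, x2, y2); returns one track id per box."""
        ids, used = [], set()
        for box in boxes:
            best_id, best_iou = None, self.iou_threshold
            for tid, prev in self.tracks.items():
                overlap = iou(box, prev)
                if overlap > best_iou and tid not in used:
                    best_id, best_iou = tid, overlap
            if best_id is None:
                best_id = self.next_id
                self.next_id += 1
            self.tracks[best_id] = box
            used.add(best_id)
            ids.append(best_id)
        return ids

# Usage per frame (frame: 3 x H x W float tensor in [0, 1]):
#   with torch.no_grad():
#       pred = detector([frame])[0]
#   boxes = pred["boxes"][pred["scores"] > 0.5].tolist()
#   track_ids = tracker.update(boxes)
tracker = GreedyIoUTracker()
```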
Object detection & tracking – challenges & issues
• Current state
• Algorithms are really usable in practice: robust (mAP > 60 %) and fast (real-time)
• Remaining issues
• Re-identification of objects is challenging
• E.g. persons who get occluded and then reappear (crowded scene)
• One can use the object‘s appearance, but what if all objects look the same (e.g. soccer players)?
• Simple strategy used in our framework – newly appearing objects get a new id
• Quite limited number of object classes
• E.g. the MS COCO dataset [1] has classes for a few animals
(dog, cat, horse, cow, …), but what if the subject
of your documentary video is dinosaurs?
10
[1] https://arxiv.org/pdf/1405.0312.pdf
Semi-automatic video annotation
• Automate the annotation process of archive videos
• Who appears in the video (and with whom), and in which video sections
• Other potentially useful metadata: facial emotion, what action is he / she doing,
what is he / she saying, what logos appear, what are the ‚video highlights‘, …
• Semi-automatic video annotation workflow
• Deep learning algorithms (face recognition, object detection & tracking, …)
do the first pass and generate the „raw metadata“
• Raw metadata is inspected and corrected (false detections, multiple ids for one
person, …) by a human operator with a convenient tool
• Hopefully the whole process is more efficient than the ‚human-only‘ workflow ☺
11
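To make the workflow concrete, the „raw metadata“ from the first automatic pass could be serialized roughly as sketched below; all field names and values are purely illustrative assumptions, not the schema actually used in the project:

```python
import json

# Purely illustrative "raw metadata" record for one archive video, as it could be
# produced by the automatic first pass; a human operator would later correct false
# detections and merge multiple ids belonging to the same person.
raw_metadata = {
    "video": "archive_clip_0042.mp4",
    "persons": [
        {"id": 3, "name": "unknown",
         "appearances": [{"start": "00:01:12", "end": "00:01:45"}]},
        {"id": 7, "name": "Jane Doe",
         "appearances": [{"start": "00:02:03", "end": "00:02:30"}]},
    ],
    "objects": [
        {"id": 12, "class": "dog",
         "appearances": [{"start": "00:00:05", "end": "00:00:21"}]},
    ],
    "logos": [
        {"brand": "ExampleBrand", "frame": 1530},
    ],
}

with open("archive_clip_0042.metadata.json", "w") as f:
    json.dump(raw_metadata, f, indent=2, ensure_ascii=False)
```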
Non-interactive version of 360° video
• Generate non-interactive version of 360° video
• For archiving purposes, a preview version of the video
(in addition to the original 360° video) can be useful
• For consumption of 360° video on old TV sets, or as
„lean-back mode“ for users who do not want to interact
• Rough algorithm workflow
• Works iteratively, shot-per-shot
• Extract all scene objects (currently focusing on persons)
• Determine the most „interesting“ person for the current shot
(based on size, movement, what we have seen in the last shot, etc.)
and track it (see the scoring sketch below)
12
Non-interactive version of a 360° music video
(each row is one generated shot)
Content provided by RBB for Hyper360 project.
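A possible way to score the „most interesting“ person, as referenced in the workflow above, is sketched below; the features, weights and formula are illustrative assumptions, not the actual algorithm:

```python
def interest_score(person, previous_subject_id=None,
                   w_size=0.5, w_motion=0.3, w_continuity=0.2):
    """Score one tracked person for the current shot.

    person: dict with 'id', 'mean_area' (average fraction of the frame covered, 0..1)
    and 'mean_motion' (normalized average displacement per frame, 0..1).
    Larger, more dynamic persons score higher; the person followed in the
    previous shot gets a continuity bonus. Weights are illustrative.
    """
    score = w_size * person["mean_area"] + w_motion * person["mean_motion"]
    if previous_subject_id is not None and person["id"] == previous_subject_id:
        score += w_continuity
    return score

def pick_subject(persons, previous_subject_id=None):
    """Choose the person to track for the current shot of the 360° video."""
    return max(persons, key=lambda p: interest_score(p, previous_subject_id))
```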
Non-interactive version of 360° video – outlook
• << Demovideo >>
• Currently working on addressing some limitations of the original algorithm
• More diverse shot types: close-up, wide-angle shot, panning shot, …
(currently, all shots are tracking shots with horizontal FOV of 75°)
• Employ best-practice rules for framing and „continuity editing“
• Avoid jump-cuts
• 180° rule
• …
• Goal is a „virtual director“ which tries to mimic a particular human director‘s style
13
Acknowledgments
• Thanks to the “Hyper360” project partners RBB, Mediaset, Fraunhofer Fokus, Drukka for providing the
360° video sequences for research and development purposes within the project.
• The research leading to these results has received funding from the European Union’s Horizon 2020
research and innovation programme under grant agreement No 761934 - Hyper360 and grant
agreement No. 761802 – MARCONI
• http://www.hyper360.eu/
• https://www.projectmarconi.eu/
14
Thank you for your attention!
Contact:
hannes.fassold@joanneum.at
JOANNEUM RESEARCH
Forschungsgesellschaft mbH
DIGITAL– Institut für Informations-
und Kommunikationstechnologien
Steyrergasse 17, 8010 Graz
Tel. +43 316 876-5000
digital@joanneum.at
www.joanneum.at/digital