Automatic multi-modal
metadata annotation
based on trained
cognitive solutions
Jakob Rosinski
Lead Architect Video & Broadcast
IBM GBS Europe
Lead Architect Video & Broadcast, IBM GBS Europe
Member IBM Global Center of Competence Telco, Media & Entertainment
Member IBM Technical Expert Council Central (TEC CR)
Product Owner IBM AREMA
Jakob Rosinski is the Lead Architect for Video & Broadcast at IBM Global Business
Services Europe and a member of IBM's Global Center of Competence for Telecom,
Media & Entertainment. In this role he is also the product owner of IBM AREMA, a
workflow and essence management solution widely used at broadcasters for essence
archives and workflow automation.
Over the last decade Jakob has been responsible for various projects in the media
industry at HBO, France24, ORF, SRF, RTL Mediengruppe and Deutsche Bundesliga/Sportcast.
He is a subject matter expert for multi-site & multi-tier essence management and
workflow automation for ingest, archive, production & distribution.
He is also well recognized in topics such as cognitive content enrichment and
broadcast integration.
Dipl.-Inf. (M.Sc.) Jakob Rosinski
2
1. Introduction
2. Components
3. Training & Optimization
4. Analysis & Aggregation
5. Overall process & Integration
Agenda
3
Introduction
"Rich metadata is the key to content discovery and monetization. It powers
advanced video search and recommendation engines..."
FKTG Magazin 03/2017, p. 84
5
Scene Detection / Segmentation
Deep Video Analysis
• People, Object and Context Detection
• Classification of actors based on 24 emotions
• Classification of scenes based on 22,000 categories
Deep Audio Analysis
• Background
• Actor sentiment and tone
Analysis of scene composition
• Classification of light and color
Analysis of successful trailers
https://www.youtube.com/watch?v=gJEzuYynaiw
6
7
Automatic content enrichment of 40+ years of soccer content
• Annotation using a portfolio of cognitive solutions (IBM, FRH, Google, MS)
• Audio: Speech-to-text / Transcript
• Audio: Speaker Detection
• Audio: Atmosphere (cheers, whistles, ...)
• Video: Angle/Camera & Context Detection
• Video: Face & Object Detection
• Domain-trained services including a training portal
• Sharpening of results through domain knowledge, creation of timelines, identification of concepts
Link with game and player data
• Optimize content analysis and search based on game and player statistics
• Guided search
Persona-based User Experience
• Personalized discovery, suggestions, design & projects
Content enrichment for
Bundesliga archive
8
Components
Magical Metadata
10
Visual recognition allows us to understand the contents of an image or video frame,
answering the question: "What is in this image?" Returns class, class description,
face detection, and text recognition.
Enhanced and automated understanding of personalities and objects present in the frame
Speech to text / Audiomining lets us transcribe audio into text by leveraging machine
intelligence to combine information about grammar and language structure with
knowledge of the composition of the audio signal.
Activate decades-old material by running it through the STT API and then performing
deeper analytics
Natural Language Understanding delivers several tools to distill text and dialogue into
fundamental concepts of relevance, such as concepts, document-level emotions,
sentiment, entities, keywords, language, etc.
Deeper understanding of concepts, recognized entities, keywords, and relationships
Pattern Detection & Similarity Search indexes visual content based on patterns and
makes a similarity search available.
Search image and video data for objects or contexts that have not been trained.
Target: Deeply enriched content, second by second
Magical Metadata
11
(Overview slide repeated; content identical to slide 10.)
IBM Watson Visual Recognition
Visual Recognition understands the contents of images - visual concepts
tag the image, find human faces, approximate age and gender, and find
similar images in a collection. You can also train the service by creating
your own custom concepts. Use Visual Recognition to detect a dress
type in retail, identify spoiled fruit in inventory, and more.
• Image Recognition
• Text Recognition
• Face & Person Detection
• Pattern Search / Collections
• Trainable
12
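To make the integration concrete, here is a minimal sketch of sending one keyframe to the Visual Recognition "classify" endpoint. The host, version date and parameter names reflect the public v3 REST API as it existed around 2017 and may differ today; the API key and keyframe URL are placeholders.

```python
# Hedged sketch: classify one keyframe with Watson Visual Recognition v3.
import requests

WATSON_VR_URL = "https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classify"
API_KEY = "your-api-key"                                           # placeholder credential
KEYFRAME_URL = "https://example.org/keyframes/match_000123.jpg"    # hypothetical keyframe

resp = requests.get(
    WATSON_VR_URL,
    params={"api_key": API_KEY, "url": KEYFRAME_URL, "version": "2016-05-20"},
    timeout=30,
)
resp.raise_for_status()

# Each classifier returns classes with confidence scores; keep only reasonably
# confident labels as candidate annotations for the metadata timeline.
for image in resp.json().get("images", []):
    for classifier in image.get("classifiers", []):
        for cls in classifier.get("classes", []):
            if cls.get("score", 0.0) >= 0.6:
                print(cls["class"], cls["score"])
```

In a batch workflow the same call would be issued per extracted keyframe and the surviving classes written to the metadata repository.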
13
IBM Watson Visual Recognition
IBM Watson Visual Recognition – a multi-layered, trainable architecture for image analysis
• Need to learn effective semantic classifiers using a wide diversity of audio-visual features and models
• Need to design a rich space of semantic concepts that captures multiple facets of audio-visual content
[Architecture diagram: low-level features (color, texture, shape, edges, motion, camera motion, energy, frequencies, spectrum, zero-crossings, shot boundaries, regions, tracks, scene dynamics) feed a broad set of models (SVMs, GMM, AdaBoost, k-means, regression, Bayes nets, nearest neighbor, neural nets, deep belief nets, clustering, Markov models, decision trees, expectation maximization, factor graphs, ensemble classifiers, active learning) trained on labeled (positive/negative) and unlabeled examples, yielding semantics over multimedia data: scenes, locations, settings, places, objects, people, vehicles, animals, faces, events, activities, actions and behaviors.]
14
Microsoft Cognitive Services
• Image Recognition
This feature returns information about visual content found in an image.
Use tagging, descriptions and domain-specific models to identify
content and label it with confidence. Apply the adult/racy settings to
enable automated restriction of adult content. Identify image types and
color schemes in pictures.
• Text Recognition
Optical Character Recognition (OCR) detects text in an image and
extracts the recognized words into a machine-readable character
stream. Analyze images to detect embedded text, generate character
streams and enable searching. Allow users to take photos of text
instead of copying to save time and effort.
• Face & Person Detection
The Celebrity Model is an example of the domain-specific models. The
celebrity recognition model recognizes 200K celebrities from
business, politics, sports and entertainment around the world. Domain-specific
models are a continuously evolving feature within the Computer Vision API.
• Emotion Detection
15
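Below is a hedged sketch of the corresponding Computer Vision "analyze" call (v1.0 REST API as of 2017); the region, the visualFeatures values and the response fields are assumptions from that era, and the subscription key and image URL are placeholders.

```python
# Hedged sketch: analyze one keyframe with Microsoft Cognitive Services Computer Vision.
import requests

ENDPOINT = "https://westus.api.cognitive.microsoft.com/vision/v1.0/analyze"
HEADERS = {
    "Ocp-Apim-Subscription-Key": "your-subscription-key",   # placeholder
    "Content-Type": "application/json",
}
PARAMS = {"visualFeatures": "Tags,Description,Faces", "details": "Celebrities"}

resp = requests.post(
    ENDPOINT,
    headers=HEADERS,
    params=PARAMS,
    json={"url": "https://example.org/keyframes/interview_0042.jpg"},   # hypothetical
    timeout=30,
)
resp.raise_for_status()
result = resp.json()

# Tags and captions become candidate annotations for the frame.
for tag in result.get("tags", []):
    print("tag:", tag["name"], tag["confidence"])
for caption in result.get("description", {}).get("captions", []):
    print("caption:", caption["text"], caption["confidence"])
```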
Google Vision
Google Cloud Vision API enables developers to understand
the content of an image by encapsulating powerful machine
learning models in an easy to use REST API. It quickly
classifies images into thousands of categories (e.g., "sailboat",
"lion", "Eiffel Tower"), detects individual objects and faces
within images, and finds and reads printed words contained
within images. You can build metadata on your image catalog,
moderate offensive content, or enable new marketing scenarios
through image sentiment analysis. Analyze images uploaded
in the request or integrate with your image storage on Google
Cloud Storage.
• Image Recognition
• Text Recognition
• Face Detection
• Emotion Detection
• Text Analysis (not available in German)
16
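A minimal sketch of the equivalent Cloud Vision "images:annotate" request, asking for labels and text on one keyframe; the API key and image URI are placeholders.

```python
# Hedged sketch: request label and text detection from the Google Cloud Vision v1 API.
import requests

API_KEY = "your-google-api-key"   # placeholder
body = {
    "requests": [{
        "image": {"source": {"imageUri": "gs://my-bucket/keyframes/goal_0815.jpg"}},  # hypothetical
        "features": [
            {"type": "LABEL_DETECTION", "maxResults": 10},
            {"type": "TEXT_DETECTION"},
        ],
    }]
}

resp = requests.post(
    "https://vision.googleapis.com/v1/images:annotate",
    params={"key": API_KEY},
    json=body,
    timeout=30,
)
resp.raise_for_status()

for response in resp.json().get("responses", []):
    for label in response.get("labelAnnotations", []):
        print("label:", label["description"], label["score"])
    # The first textAnnotation aggregates all detected text (e.g. scoreboard overlays).
    texts = response.get("textAnnotations", [])
    if texts:
        print("text:", texts[0]["description"])
```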
OpenCV
OpenCV is released under a BSD license and hence it’s free for both academic and commercial use. It has C++, C, Python and
Java interfaces and supports Windows, Linux, Mac OS, iOS and Android. OpenCV was designed for computational efficiency and
with a strong focus on real-time applications. Written in optimized C/C++, the library can take advantage of multi-core processing.
Enabled with OpenCL, it can take advantage of the hardware acceleration of the underlying heterogeneous compute platform.
Adopted all around the world, OpenCV has more than 47 thousand people of user community and estimated number of
downloads exceeding 14 million. Usage ranges from interactive art, to mines inspection, stitching maps on the web or through
advanced robotics.
• Image Recognition
• Face & Person Detection
• Trainable
17
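As a local, non-cloud counterpart, here is a minimal OpenCV face-detection sketch on a single keyframe using the bundled Haar cascade; paths are placeholders and the cascade location assumes a standard opencv-python installation.

```python
# Minimal OpenCV face-detection sketch for one keyframe.
import cv2

frame = cv2.imread("keyframe_000123.jpg")          # hypothetical keyframe on disk
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# cv2.data.haarcascades is available in recent opencv-python builds.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

# Returns bounding boxes (x, y, w, h); the crops can be handed to a
# person-identification service or a domain-trained classifier.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    print("face at", x, y, w, h)
```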
Clarifai Image and Video Recognition API
Predict / Classify
• Predict analyzes your images and tells you what's inside of them.
• The API will return a list of concepts with corresponding probabilities indicating how
likely it is that these concepts are contained within the image.
Search
• The Search API allows you to send images (URL or bytes) to the service and have them
indexed by 'general' model concepts and their visual representations.
• Once indexed, you can search for images by concept or using reverse image search.
Train
• Clarifai provides many different models that 'see' the world differently. A model contains
a group of concepts. A model will only see the concepts it contains.
18
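A hedged sketch of a Clarifai "predict" call over the v2 REST API (shape as of 2017); the API key and model ID are placeholders, and Clarifai also ships official client libraries.

```python
# Hedged sketch: Clarifai v2 predict on one keyframe URL.
import requests

API_KEY = "your-clarifai-api-key"   # placeholder
MODEL_ID = "your-model-id"          # e.g. the 'general' model; look up the ID in the Clarifai docs

resp = requests.post(
    f"https://api.clarifai.com/v2/models/{MODEL_ID}/outputs",
    headers={"Authorization": f"Key {API_KEY}"},
    json={"inputs": [{"data": {"image": {"url": "https://example.org/keyframes/crowd.jpg"}}}]},
    timeout=30,
)
resp.raise_for_status()

# Each output lists concepts with probabilities ("value").
for output in resp.json().get("outputs", []):
    for concept in output.get("data", {}).get("concepts", []):
        print(concept["name"], concept["value"])
```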
Imagga Auto-Tagging
Imagga is an Image Recognition
Platform-as-a-Service providing
Image Tagging APIs for
developers & businesses to
build scalable, image intensive
cloud apps.
19
Magical Metadata
20
(Overview slide repeated; content identical to slide 10.)
Fraunhofer IAIS Audiomining
• Segmentation
• Speaker and Language Detection
• Emotion Detection
• Trainable
• Keyword Extraction
Alternatives
• IBM Watson Speech2Text (see later)
• Microsoft Cognitive Services – Bing Speech
• Google Speech
21
22
{"segments": [
…
{
"segmentNumber": 1,
"startTime": 4480,
"duration": 3190,
"endTime": 7670,
"speaker": 1,
"gender": "female",
"transcript": "Hier ist das erste deutsche Fernsehen mit der Tagesschau."
},
...
{
"segmentNumber": 20,
"startTime": 238980,
"duration": 23620,
"endTime": 262600,
"speaker": 2,
"gender": "male",
"transcript": "Großbritannien raus aus der Europäischen Union für viele unvorstellbar
das weiß auch der britische Premierminister Cameron und er nutzt es um die EU Partner
unter Druck zu setzen entweder das Staatenbündnis ist zu Reformen bereit oder bei der
geplanten Volksabstimmung über die EU Mitgliedschaft droht ein Nein heute hatte EU
Ratspräsident Tosca ein Kompromisspapier vorgelegt dass die Briten besänftigen soll."
},
Fraunhofer IAIS Audiomining
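To show how such segments become searchable metadata, here is a minimal sketch that turns the Audiomining segment JSON above into a keyword index on the media timeline (the times appear to be milliseconds); the file name and the keyword list are hypothetical.

```python
# Sketch: index Audiomining transcript segments by keyword with timecodes.
import json

def ms_to_tc(ms: int) -> str:
    """Convert milliseconds to hh:mm:ss for display."""
    s = ms // 1000
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"

with open("audiomining_result.json", encoding="utf-8") as fh:
    result = json.load(fh)

KEYWORDS = ["europäischen union", "tagesschau"]   # hypothetical search terms

for seg in result.get("segments", []):
    text = seg.get("transcript", "").lower()
    for kw in KEYWORDS:
        if kw in text:
            print(f"{ms_to_tc(seg['startTime'])}-{ms_to_tc(seg['endTime'])} "
                  f"speaker {seg['speaker']} ({seg['gender']}): {kw}")
```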
IBM Watson Speech to Text
23
https://www-03.ibm.com/press/us/en/pressrelease/51790.wss
24
Magical Metadata
25
(Overview slide repeated; content identical to slide 10.)
IBM Watson Natural Language Understanding (NLU)
Extraction of
• Sentiment
• Emotion
• Keywords
• Entities
• Categories
• Concepts
• Semantic Roles
26
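A hedged sketch of sending a transcript snippet to the NLU "analyze" endpoint; the host and version date reflect the v1 API around 2017, the credentials are placeholders, and only a few of the listed features are requested here.

```python
# Hedged sketch: extract keywords, entities and concepts from a transcript with Watson NLU.
import requests

NLU_URL = "https://gateway.watsonplatform.net/natural-language-understanding/api/v1/analyze"
AUTH = ("service-username", "service-password")   # placeholder credentials

body = {
    "text": "Großbritannien raus aus der Europäischen Union – für viele unvorstellbar ...",
    "language": "de",
    "features": {"keywords": {"limit": 10}, "entities": {}, "concepts": {}},
}

resp = requests.post(NLU_URL, auth=AUTH, params={"version": "2017-02-27"}, json=body, timeout=60)
resp.raise_for_status()
analysis = resp.json()

for kw in analysis.get("keywords", []):
    print("keyword:", kw["text"], kw.get("relevance"))
for ent in analysis.get("entities", []):
    print("entity:", ent.get("type"), ent["text"])
```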
Magical Metadata
27
(Overview slide repeated; content identical to slide 10.)
Visual Atoms
FIND is a high-speed, high-accuracy, image visual search solution.
Our state-of-the-art visual search engine enables the matching of images
depicting the same objects or scenes based on visual similarities, without the
need for manual annotations or metadata.
If you are a provider of image editing or management solutions, the
FIND engine will equip your product with the necessary tools for the creation
of image databases which are searchable using images as queries. Your
end users will be able to create and maintain their own image databases and
efficiently organise, manage and search their image assets.
For providers of image hosting solutions, the FIND engine will allow the
creation of image databases which users can search using visual queries.
For developers of mobile apps, such as for e-commerce, tourism
or entertainment, the FIND engine will give your app cloud-based and/or
terminal based visual search functionality for retrieval of relevant images
and associated information.
With a streamlined API, the FIND engine is designed so that it can be
easily integrated in any third-party application or workflow.
Alternatives: IBM Watson VR Collections, Clarifai Search
28
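To illustrate the general idea of similarity search (this is not how the FIND engine works internally), here is a minimal sketch that indexes keyframes by a perceptual "average hash" and queries by visual similarity; it needs Pillow and NumPy, and the file names are hypothetical.

```python
# Illustrative similarity-search sketch using a hand-rolled average hash.
import numpy as np
from PIL import Image

def average_hash(path: str, size: int = 8) -> np.ndarray:
    """64-bit average hash: downscale, grayscale, threshold against the mean."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = np.asarray(img, dtype=np.float32)
    return (pixels > pixels.mean()).flatten()

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

# Index a few keyframes, then rank them by distance to a query image.
index = {name: average_hash(name) for name in
         ["frame_0001.jpg", "frame_0002.jpg", "frame_0003.jpg"]}
query = average_hash("query.jpg")

for name, h in sorted(index.items(), key=lambda kv: hamming(query, kv[1])):
    print(name, "distance:", hamming(query, h))
```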
Training & Optimization
...
Why is training necessary?
30
Visual Recognition - Training
31
Domain-specific model
32
Domain-specific model – Trainer
33
...
Optimization of keyframe extraction – when extraction quality is poor, use adaptive extraction
34
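A minimal sketch of what "adaptive" extraction could look like: instead of sampling frames at a fixed interval, emit a keyframe whenever the histogram distance to the last keyframe exceeds a threshold. The video path and threshold are hypothetical, and this is only one of several possible strategies.

```python
# Sketch: adaptive keyframe extraction via histogram differencing with OpenCV.
import cv2

VIDEO = "match_1998_highlights.mp4"   # hypothetical input
THRESHOLD = 0.4                       # tune per material

cap = cv2.VideoCapture(VIDEO)
last_hist = None
keyframes = []
idx = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
    cv2.normalize(hist, hist)
    # Bhattacharyya distance is ~0 for identical frames, ~1 for very different ones.
    if last_hist is None or cv2.compareHist(last_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > THRESHOLD:
        keyframes.append(idx)
        cv2.imwrite(f"keyframe_{idx:06d}.jpg", frame)
        last_hist = hist
    idx += 1

cap.release()
print(len(keyframes), "keyframes extracted")
```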
...
Analysis & Aggregation
Cognitive model for German Soccer League Archive
36
Metadata (technical, statistics, ticker, etc.)
Essences (audio, video, keyframes, etc.)
Analyses of different orders (audio mining, image recognition, face recognition, pattern recognition, etc.)
Timelines of different orders (atmosphere, context, perspective, people, etc.)
Cognitive model for German Soccer League Archive
– multi-modal analysis
37
38
Cognitive model for German Soccer League Archive
– example for timeline of first order
Just uses results from analysis
39
Cognitive model for German Soccer League Archive
– example for timeline of second order
Uses results from analyses as well as other timelines
40
Cognitive model for German Soccer League Archive
– example for timeline of third order
Uses results from analyses as well as other timelines
41
Camera Timeline
Speed Timeline
Cognitive Aggregator for
Timelines
42
Detected events on the camera and speed timelines, with confidences: Normal: 60 %, Spidercam: 80 %, SlowMo: 55 %, CloseUp: 83 %, Normal: 67 %, Goalline: 77 %, Normal: 83 %, Spidercam: 76 %, Normal: 87 %, Spidercam: 77 %
Reduce and sharpen from 20 analysis events to 4: combine timelines; combine and sharpen SlowMo; combine timelines and frames due to near similarity (+20 %)
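The following sketch illustrates the aggregation idea from this slide: merge adjacent analysis events of the same class into one timeline segment, and sharpen confidence when an overlapping event from another timeline supports it. The data values are made up, and the real aggregator applies richer domain rules over many timelines.

```python
# Sketch: combine and sharpen timeline events from multiple analyses.
from dataclasses import dataclass

@dataclass
class Event:
    start: float      # seconds
    end: float
    label: str
    confidence: float

camera = [
    Event(0, 4, "Normal", 0.60), Event(4, 8, "Spidercam", 0.80),
    Event(8, 10, "Spidercam", 0.76), Event(10, 14, "CloseUp", 0.83),
]
speed = [Event(4, 10, "SlowMo", 0.55)]

def merge_adjacent(events, gap=0.5):
    """Combine consecutive events with the same label into one segment."""
    merged = []
    for ev in sorted(events, key=lambda e: e.start):
        if merged and merged[-1].label == ev.label and ev.start - merged[-1].end <= gap:
            merged[-1].end = ev.end
            merged[-1].confidence = max(merged[-1].confidence, ev.confidence)
        else:
            merged.append(Event(ev.start, ev.end, ev.label, ev.confidence))
    return merged

def sharpen(primary, supporting, boost=0.2):
    """Raise confidence of primary events that overlap a supporting-timeline event."""
    for ev in primary:
        if any(s.start < ev.end and ev.start < s.end for s in supporting):
            ev.confidence = min(1.0, ev.confidence + boost)
    return primary

camera_timeline = sharpen(merge_adjacent(camera), merge_adjacent(speed))
for ev in camera_timeline:
    print(f"{ev.start:>5.1f}-{ev.end:<5.1f} {ev.label:<10} {ev.confidence:.2f}")
```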
Overall process & Integration
IBM AREMA & Watson at Hackdays/SRF
„Die Zukunft der Mediennutzung“
44
Involving now:
• Watson VR - ClassifyImage
• Watson VR - DetectFaces
• Watson VR - RecognizeText
• Watson Speech2Text
• Alchemy API
Used to find meaningful content in SRG's archives
45
IBM AREMA & Watson at Hackdays/SRF
„Die Zukunft der Mediennutzung“
46
IBM AREMA & Watson at Hackdays/SRF
„Die Zukunft der Mediennutzung“
Cognitive Process with Trainer, Analysis Workflow and Aggregator
47
[Workflow diagram components: Cognitive Analysis Workflow, Cognitive Trainer, Cognitive Aggregator, Image Classifier Inbox, Taxonomy Database, Image Classifier Repository, Media Ingestion, Metadata Repository (MAM)]
1. Configure taxonomy (add classifiers, categories, etc.)
2. Show and organize classifier images
3. Move good classifiers to the repository to optimize training
4. Use the classifier repository to train services and perform custom analysis (see the sketch below)
5. Move the current frame to the inbox when confidence is sufficient
6. Use taxonomy for rule creation
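As a hedged illustration of step 4, here is a sketch of training a custom Watson Visual Recognition classifier from zipped image sets exported from the classifier repository. The endpoint, version date and multipart field names reflect the v3 API around 2017 and may differ today; the zip file names and API key are placeholders.

```python
# Hedged sketch: create a custom Visual Recognition classifier from curated image sets.
import requests

VR_CLASSIFIERS_URL = "https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classifiers"
API_KEY = "your-api-key"   # placeholder

with open("spidercam_positive.zip", "rb") as pos, open("camera_negative.zip", "rb") as neg:
    resp = requests.post(
        VR_CLASSIFIERS_URL,
        params={"api_key": API_KEY, "version": "2016-05-20"},
        files={
            "spidercam_positive_examples": pos,   # one zip of positive examples per class
            "negative_examples": neg,             # counter-examples sharpen the model
        },
        data={"name": "bundesliga_camera_angles"},
        timeout=600,
    )
resp.raise_for_status()
print(resp.json())   # returns the classifier_id used for later /classify calls
```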
Future?
Upcoming: Watson For Media, announced in April 2017
First use cases available at IBC in September 2017
49