Automatic multi-modal
metadata annotation
based on trained
cognitive solutions
Jakob Rosinski
Lead Architect Video & Broadcast
IBM GBS Europe
Lead Architect Video & Broadcast, IBM GBS Europe
Member IBM Global Center of Competence Telco, Media & Entertainment
Member IBM Technical Expert Council Central (TEC CR)
Product Owner IBM AREMA
Jakob Rosinski is the Lead Architect for Video & Broadcast at IBM Global Business
Services Europe and a member of IBM's Global Center of Competence for Telecom,
Media & Entertainment. In this role he is also the product owner of IBM AREMA, a
workflow and essence management solution widely used at broadcasters for essence
archives and workflow automation.
Over the last decade Jakob has been responsible for various projects in the media
industry at HBO, France24, ORF, SRF, RTL Mediengruppe and Deutsche Bundesliga/Sportcast.
He is a subject matter expert for multi-site & multi-tier essence management and
workflow automation for ingest, archive, production & distribution.
He is also well recognized in topics such as cognitive content enrichment and
broadcast integration.
Dipl.-Inf. (M.Sc.) Jakob Rosinski
2
1. Introduction
2. Components
3. Training & Optimization
4. Analysis & Aggregation
5. Overall process & Integration
Agenda
3
Introduction
"Rich metadata is the key to content discovery and monetization. It powers
advanced video search and recommendation engines..."
FKTG Magazin 03/2017, p. 84
5
Scene Detection / Segmentation
Deep Video Analysis
• People, Object and Context Detection
• Classification of actors based on 24 emotions
• Classification of scenes based on 22,000 categories
Deep Audio Analysis
• Background
• Actor sentiment and tone
Analysis of scene composition
• Classification of light and color
Analysis of successful trailers
https://www.youtube.com/watch?v=gJEzuYynaiw
6
7
Automatic content enrichment of 40+ years of soccer content
• Annotation using a portfolio of cognitive solutions (IBM, FRH, Google, MS)
• Audio: Speech-to-text / Transcript
• Audio: Speaker Detection
• Audio: Atmosphere (cheers, whistles, ...)
• Video: Angle/Camera & Context Detection
• Video: Face & Object Detection
• Domain-trained services including a training portal
• Sharpening of results through domain knowledge, creation of timelines, identification of concepts
Link with game and player data
• Optimize content analysis and search based on game and player statistics
• Guided search
Persona-based User Experience
• Personalized discovery, suggestions, design & projects
Content enrichment for
Bundesliga archive
8
Components
Magical Metadata
10
Visual recognition allows us to understand the contents of an image or video frame,
answering the question: "What is in this image?" Returns class, class description,
face detection, and text recognition.
Enhanced and automated understanding of personalities and objects present in the frame
Speech to text / Audiomining lets us transcribe audio into text by leveraging machine
intelligence to combine information about grammar and language structure with
knowledge of the composition of the audio signal.
Activate decades-old material by running it through the STT API and then performing
deeper analytics
Natural Language Understanding delivers several tools to distill text and dialogue into
fundamental concepts of relevance, such as concepts, document-level emotions,
sentiment, entities, keywords, language, etc.
Deeper understanding of concepts, recognized entities, keywords, and relationships
Pattern Detection & Similarity Search indexes visual content based on patterns and
makes a similarity search available.
Search image and video data for objects or contexts that have not been trained.
Target: Deeply enriched content, second by second
Magical Metadata
11
(Overview slide repeated; content identical to slide 10.)
IBM Watson Visual Recognition
Visual Recognition understands the contents of images - visual concepts
tag the image, find human faces, approximate age and gender, and find
similar images in a collection. You can also train the service by creating
your own custom concepts. Use Visual Recognition to detect a dress
type in retail, identify spoiled fruit in inventory, and more.
• Image Recognition
• Text Recognition
• Face & Person Detection
• Pattern Search / Collections
• Trainable
12
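To make the integration concrete, here is a minimal sketch of sending one keyframe to the Visual Recognition "classify" endpoint. The host, version date and parameter names reflect the public v3 REST API as it existed around 2017 and may differ today; the API key and keyframe URL are placeholders.

```python
# Hedged sketch: classify one keyframe with Watson Visual Recognition v3.
import requests

WATSON_VR_URL = "https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classify"
API_KEY = "your-api-key"                                           # placeholder credential
KEYFRAME_URL = "https://example.org/keyframes/match_000123.jpg"    # hypothetical keyframe

resp = requests.get(
    WATSON_VR_URL,
    params={"api_key": API_KEY, "url": KEYFRAME_URL, "version": "2016-05-20"},
    timeout=30,
)
resp.raise_for_status()

# Each classifier returns classes with confidence scores; keep only reasonably
# confident labels as candidate annotations for the metadata timeline.
for image in resp.json().get("images", []):
    for classifier in image.get("classifiers", []):
        for cls in classifier.get("classes", []):
            if cls.get("score", 0.0) >= 0.6:
                print(cls["class"], cls["score"])
```

In a batch workflow the same call would be issued per extracted keyframe and the surviving classes written to the metadata repository.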
13
IBM Watson Visual Recognition
IBM Watson Visual Recognition – a multi-layered, trainable architecture for image analysis
• Need to learn effective semantic classifiers using a wide diversity of audio-visual features and models
• Need to design a rich space of semantic concepts that captures multiple facets of audio-visual content
[Architecture diagram: low-level features (color, texture, shape, edges, motion, camera motion, energy, frequencies, spectrum, zero-crossings, shot boundaries, regions, tracks, scene dynamics) feed a broad set of models (SVMs, GMM, AdaBoost, k-means, regression, Bayes nets, nearest neighbor, neural nets, deep belief nets, clustering, Markov models, decision trees, expectation maximization, factor graphs, ensemble classifiers, active learning) trained on labeled (positive/negative) and unlabeled examples, yielding semantics over multimedia data: scenes, locations, settings, places, objects, people, vehicles, animals, faces, events, activities, actions and behaviors.]
14
Microsoft Cognitive Services
• Image Recognition
This feature returns information about visual content found in an image.
Use tagging, descriptions and domain-specific models to identify
content and label it with confidence. Apply the adult/racy settings to
enable automated restriction of adult content. Identify image types and
color schemes in pictures.
• Text Recognition
Optical Character Recognition (OCR) detects text in an image and
extracts the recognized words into a machine-readable character
stream. Analyze images to detect embedded text, generate character
streams and enable searching. Allow users to take photos of text
instead of copying to save time and effort.
• Face & Person Detection
The Celebrity Model is an example of the domain-specific models. The
celebrity recognition model recognizes 200K celebrities from
business, politics, sports and entertainment around the world. Domain-specific
models are a continuously evolving feature within the Computer Vision API.
• Emotion Detection
15
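Below is a hedged sketch of the corresponding Computer Vision "analyze" call (v1.0 REST API as of 2017); the region, the visualFeatures values and the response fields are assumptions from that era, and the subscription key and image URL are placeholders.

```python
# Hedged sketch: analyze one keyframe with Microsoft Cognitive Services Computer Vision.
import requests

ENDPOINT = "https://westus.api.cognitive.microsoft.com/vision/v1.0/analyze"
HEADERS = {
    "Ocp-Apim-Subscription-Key": "your-subscription-key",   # placeholder
    "Content-Type": "application/json",
}
PARAMS = {"visualFeatures": "Tags,Description,Faces", "details": "Celebrities"}

resp = requests.post(
    ENDPOINT,
    headers=HEADERS,
    params=PARAMS,
    json={"url": "https://example.org/keyframes/interview_0042.jpg"},   # hypothetical
    timeout=30,
)
resp.raise_for_status()
result = resp.json()

# Tags and captions become candidate annotations for the frame.
for tag in result.get("tags", []):
    print("tag:", tag["name"], tag["confidence"])
for caption in result.get("description", {}).get("captions", []):
    print("caption:", caption["text"], caption["confidence"])
```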
Google Vision
Google Cloud Vision API enables developers to understand
the content of an image by encapsulating powerful machine
learning models in an easy to use REST API. It quickly
classifies images into thousands of categories (e.g., "sailboat",
"lion", "Eiffel Tower"), detects individual objects and faces
within images, and finds and reads printed words contained
within images. You can build metadata on your image catalog,
moderate offensive content, or enable new marketing scenarios
through image sentiment analysis. Analyze images uploaded
in the request or integrate with your image storage on Google
Cloud Storage.
• Image Recognition
• Text Recognition
• Face Detection
• Emotion Detection
• Text Analysis (not available in German)
16
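A minimal sketch of the equivalent Cloud Vision "images:annotate" request, asking for labels and text on one keyframe; the API key and image URI are placeholders.

```python
# Hedged sketch: request label and text detection from the Google Cloud Vision v1 API.
import requests

API_KEY = "your-google-api-key"   # placeholder
body = {
    "requests": [{
        "image": {"source": {"imageUri": "gs://my-bucket/keyframes/goal_0815.jpg"}},  # hypothetical
        "features": [
            {"type": "LABEL_DETECTION", "maxResults": 10},
            {"type": "TEXT_DETECTION"},
        ],
    }]
}

resp = requests.post(
    "https://vision.googleapis.com/v1/images:annotate",
    params={"key": API_KEY},
    json=body,
    timeout=30,
)
resp.raise_for_status()

for response in resp.json().get("responses", []):
    for label in response.get("labelAnnotations", []):
        print("label:", label["description"], label["score"])
    # The first textAnnotation aggregates all detected text (e.g. scoreboard overlays).
    texts = response.get("textAnnotations", [])
    if texts:
        print("text:", texts[0]["description"])
```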
OpenCV
OpenCV is released under a BSD license and hence it’s free for both academic and commercial use. It has C++, C, Python and
Java interfaces and supports Windows, Linux, Mac OS, iOS and Android. OpenCV was designed for computational efficiency and
with a strong focus on real-time applications. Written in optimized C/C++, the library can take advantage of multi-core processing.
Enabled with OpenCL, it can take advantage of the hardware acceleration of the underlying heterogeneous compute platform.
Adopted all around the world, OpenCV has more than 47 thousand people of user community and estimated number of
downloads exceeding 14 million. Usage ranges from interactive art, to mines inspection, stitching maps on the web or through
advanced robotics.
• Image Recognition
• Face & Person Detection
• Trainable
17
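As a local, non-cloud counterpart, here is a minimal OpenCV face-detection sketch on a single keyframe using the bundled Haar cascade; paths are placeholders and the cascade location assumes a standard opencv-python installation.

```python
# Minimal OpenCV face-detection sketch for one keyframe.
import cv2

frame = cv2.imread("keyframe_000123.jpg")          # hypothetical keyframe on disk
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# cv2.data.haarcascades is available in recent opencv-python builds.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

# Returns bounding boxes (x, y, w, h); the crops can be handed to a
# person-identification service or a domain-trained classifier.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    print("face at", x, y, w, h)
```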
Clarifai Image and Video Recognition API
Predict / Classify
• Predict analyzes your images and tells you what's inside of them.
• The API will return a list of concepts with corresponding probabilities indicating how
likely it is that these concepts are contained within the image.
Search
• The Search API allows you to send images (URL or bytes) to the service and have them
indexed by 'general' model concepts and their visual representations.
• Once indexed, you can search for images by concept or using reverse image search.
Train
• Clarifai provides many different models that 'see' the world differently. A model contains
a group of concepts. A model will only see the concepts it contains.
18
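A hedged sketch of a Clarifai "predict" call over the v2 REST API (shape as of 2017); the API key and model ID are placeholders, and Clarifai also ships official client libraries.

```python
# Hedged sketch: Clarifai v2 predict on one keyframe URL.
import requests

API_KEY = "your-clarifai-api-key"   # placeholder
MODEL_ID = "your-model-id"          # e.g. the 'general' model; look up the ID in the Clarifai docs

resp = requests.post(
    f"https://api.clarifai.com/v2/models/{MODEL_ID}/outputs",
    headers={"Authorization": f"Key {API_KEY}"},
    json={"inputs": [{"data": {"image": {"url": "https://example.org/keyframes/crowd.jpg"}}}]},
    timeout=30,
)
resp.raise_for_status()

# Each output lists concepts with probabilities ("value").
for output in resp.json().get("outputs", []):
    for concept in output.get("data", {}).get("concepts", []):
        print(concept["name"], concept["value"])
```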
Imagga Auto-Tagging
Imagga is an Image Recognition
Platform-as-a-Service providing
Image Tagging APIs for
developers & businesses to
build scalable, image intensive
cloud apps.
19
Magical Metadata
20
(Overview slide repeated; content identical to slide 10.)
Fraunhofer IAIS Audiomining
• Segmentation
• Speaker and Language Detection
• Emotion Detection
• Trainable
• Keyword Extraction
Alternatives
• IBM Watson Speech2Text (see later)
• Microsoft Cognitive Services – Bing Speech
• Google Speech
21
22
{"segments": [
…
{
"segmentNumber": 1,
"startTime": 4480,
"duration": 3190,
"endTime": 7670,
"speaker": 1,
"gender": "female",
"transcript": "Hier ist das erste deutsche Fernsehen mit der Tagesschau."
},
...
{
"segmentNumber": 20,
"startTime": 238980,
"duration": 23620,
"endTime": 262600,
"speaker": 2,
"gender": "male",
"transcript": "Großbritannien raus aus der Europäischen Union für viele unvorstellbar
das weiß auch der britische Premierminister Cameron und er nutzt es um die EU Partner
unter Druck zu setzen entweder das Staatenbündnis ist zu Reformen bereit oder bei der
geplanten Volksabstimmung über die EU Mitgliedschaft droht ein Nein heute hatte EU
Ratspräsident Tosca ein Kompromisspapier vorgelegt dass die Briten besänftigen soll."
},
Fraunhofer IAIS Audiomining
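To show how such segments become searchable metadata, here is a minimal sketch that turns the Audiomining segment JSON above into a keyword index on the media timeline (the times appear to be milliseconds); the file name and the keyword list are hypothetical.

```python
# Sketch: index Audiomining transcript segments by keyword with timecodes.
import json

def ms_to_tc(ms: int) -> str:
    """Convert milliseconds to hh:mm:ss for display."""
    s = ms // 1000
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"

with open("audiomining_result.json", encoding="utf-8") as fh:
    result = json.load(fh)

KEYWORDS = ["europäischen union", "tagesschau"]   # hypothetical search terms

for seg in result.get("segments", []):
    text = seg.get("transcript", "").lower()
    for kw in KEYWORDS:
        if kw in text:
            print(f"{ms_to_tc(seg['startTime'])}-{ms_to_tc(seg['endTime'])} "
                  f"speaker {seg['speaker']} ({seg['gender']}): {kw}")
```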
IBM Watson Speech to Text
23
https://www-03.ibm.com/press/us/en/pressrelease/51790.wss
24
Magical Metadata
25
(Overview slide repeated; content identical to slide 10.)
IBM Watson Natural Language Understanding (NLU)
Extraction of
• Sentiment
• Emotion
• Keywords
• Entities
• Categories
• Concepts
• Semantic Roles
26
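A hedged sketch of sending a transcript snippet to the NLU "analyze" endpoint; the host and version date reflect the v1 API around 2017, the credentials are placeholders, and only a few of the listed features are requested here.

```python
# Hedged sketch: extract keywords, entities and concepts from a transcript with Watson NLU.
import requests

NLU_URL = "https://gateway.watsonplatform.net/natural-language-understanding/api/v1/analyze"
AUTH = ("service-username", "service-password")   # placeholder credentials

body = {
    "text": "Großbritannien raus aus der Europäischen Union – für viele unvorstellbar ...",
    "language": "de",
    "features": {"keywords": {"limit": 10}, "entities": {}, "concepts": {}},
}

resp = requests.post(NLU_URL, auth=AUTH, params={"version": "2017-02-27"}, json=body, timeout=60)
resp.raise_for_status()
analysis = resp.json()

for kw in analysis.get("keywords", []):
    print("keyword:", kw["text"], kw.get("relevance"))
for ent in analysis.get("entities", []):
    print("entity:", ent.get("type"), ent["text"])
```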
Magical Metadata
27
(Overview slide repeated; content identical to slide 10.)
Visual Atoms
FIND is a high-speed, high-accuracy, image visual search solution.
Our state-of-the-art visual search engine enables the matching of images
depicting the same objects or scenes based on visual similarities, without the
need for manual annotations or metadata.
If you are a provider of image editing or management solutions, the
FIND engine will equip your product with the necessary tools for the creation
of image databases which are searchable using images as queries. Your
end users will be able to create and maintain their own image databases and
efficiently organise, manage and search their image assets.
For providers of image hosting solutions, the FIND engine will allow the
creation of image databases which users can search using visual queries.
For developers of mobile apps, such as for e-commerce, tourism
or entertainment, the FIND engine will give your app cloud-based and/or
terminal based visual search functionality for retrieval of relevant images
and associated information.
With a streamlined API, the FIND engine is designed so that it can be
easily integrated in any third-party application or workflow.
Alternatives: IBM Watson VR Collections, Clarifai Search
28
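To illustrate the general idea of similarity search (this is not how the FIND engine works internally), here is a minimal sketch that indexes keyframes by a perceptual "average hash" and queries by visual similarity; it needs Pillow and NumPy, and the file names are hypothetical.

```python
# Illustrative similarity-search sketch using a hand-rolled average hash.
import numpy as np
from PIL import Image

def average_hash(path: str, size: int = 8) -> np.ndarray:
    """64-bit average hash: downscale, grayscale, threshold against the mean."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = np.asarray(img, dtype=np.float32)
    return (pixels > pixels.mean()).flatten()

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

# Index a few keyframes, then rank them by distance to a query image.
index = {name: average_hash(name) for name in
         ["frame_0001.jpg", "frame_0002.jpg", "frame_0003.jpg"]}
query = average_hash("query.jpg")

for name, h in sorted(index.items(), key=lambda kv: hamming(query, kv[1])):
    print(name, "distance:", hamming(query, h))
```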
Training & Optimization
...
Why is training necessary?
30
Visual Recognition - Training
31
Domain-specific model
32
Domain-specific model – Trainer
33
...
Optimization of keyframe extraction – when extraction quality is poor, use adaptive extraction
34
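A minimal sketch of what "adaptive" extraction could look like: instead of sampling frames at a fixed interval, emit a keyframe whenever the histogram distance to the last keyframe exceeds a threshold. The video path and threshold are hypothetical, and this is only one of several possible strategies.

```python
# Sketch: adaptive keyframe extraction via histogram differencing with OpenCV.
import cv2

VIDEO = "match_1998_highlights.mp4"   # hypothetical input
THRESHOLD = 0.4                       # tune per material

cap = cv2.VideoCapture(VIDEO)
last_hist = None
keyframes = []
idx = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
    cv2.normalize(hist, hist)
    # Bhattacharyya distance is ~0 for identical frames, ~1 for very different ones.
    if last_hist is None or cv2.compareHist(last_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > THRESHOLD:
        keyframes.append(idx)
        cv2.imwrite(f"keyframe_{idx:06d}.jpg", frame)
        last_hist = hist
    idx += 1

cap.release()
print(len(keyframes), "keyframes extracted")
```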
...
Analysis & Aggregation
Cognitive model for German Soccer League Archive
36
Metadata (technical, statistics, ticker, etc.)
Essences (audio, video, keyframes, etc.)
Analyses of different orders (audio mining, image recognition, face recognition, pattern recognition, etc.)
Timelines of different orders (atmosphere, context, perspective, people, etc.)
Cognitive model for German Soccer League Archive
– multi-modal analysis
37
38
Cognitive model for German Soccer League Archive
– example for timeline of first order
Just uses results from analysis
39
Cognitive model for German Soccer League Archive
– example for timeline of second order
Uses results from analyses as well as other timelines
40
Cognitive model for German Soccer League Archive
– example for timeline of third order
Uses results from analyses as well as other timelines
41
Camera Timeline
Speed Timeline
Cognitive Aggregator for
Timelines
42
Detected events on the camera and speed timelines, with confidences: Normal: 60 %, Spidercam: 80 %, SlowMo: 55 %, CloseUp: 83 %, Normal: 67 %, Goalline: 77 %, Normal: 83 %, Spidercam: 76 %, Normal: 87 %, Spidercam: 77 %
Reduce and sharpen from 20 analysis events to 4: combine timelines; combine and sharpen SlowMo; combine timelines and frames due to near similarity (+20 %)
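The following sketch illustrates the aggregation idea from this slide: merge adjacent analysis events of the same class into one timeline segment, and sharpen confidence when an overlapping event from another timeline supports it. The data values are made up, and the real aggregator applies richer domain rules over many timelines.

```python
# Sketch: combine and sharpen timeline events from multiple analyses.
from dataclasses import dataclass

@dataclass
class Event:
    start: float      # seconds
    end: float
    label: str
    confidence: float

camera = [
    Event(0, 4, "Normal", 0.60), Event(4, 8, "Spidercam", 0.80),
    Event(8, 10, "Spidercam", 0.76), Event(10, 14, "CloseUp", 0.83),
]
speed = [Event(4, 10, "SlowMo", 0.55)]

def merge_adjacent(events, gap=0.5):
    """Combine consecutive events with the same label into one segment."""
    merged = []
    for ev in sorted(events, key=lambda e: e.start):
        if merged and merged[-1].label == ev.label and ev.start - merged[-1].end <= gap:
            merged[-1].end = ev.end
            merged[-1].confidence = max(merged[-1].confidence, ev.confidence)
        else:
            merged.append(Event(ev.start, ev.end, ev.label, ev.confidence))
    return merged

def sharpen(primary, supporting, boost=0.2):
    """Raise confidence of primary events that overlap a supporting-timeline event."""
    for ev in primary:
        if any(s.start < ev.end and ev.start < s.end for s in supporting):
            ev.confidence = min(1.0, ev.confidence + boost)
    return primary

camera_timeline = sharpen(merge_adjacent(camera), merge_adjacent(speed))
for ev in camera_timeline:
    print(f"{ev.start:>5.1f}-{ev.end:<5.1f} {ev.label:<10} {ev.confidence:.2f}")
```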
Overall process & Integration
IBM AREMA & Watson at Hackdays/SRF
„Die Zukunft der Mediennutzung“
44
Involving now:
• Watson VR - ClassifyImage
• Watson VR - DetectFaces
• Watson VR - RecognizeText
• Watson Speech2Text
• Alchemy API
Used to find meaningful content in SRG's archives
45
IBM AREMA & Watson at Hackdays/SRF
„Die Zukunft der Mediennutzung“
46
IBM AREMA & Watson at Hackdays/SRF
„Die Zukunft der Mediennutzung“
Cognitive Process with Trainer, Analysis Workflow and Aggregator
47
[Workflow diagram components: Cognitive Analysis Workflow, Cognitive Trainer, Cognitive Aggregator, Image Classifier Inbox, Taxonomy Database, Image Classifier Repository, Media Ingestion, Metadata Repository (MAM)]
1. Configure taxonomy (add classifiers, categories, etc.)
2. Show and organize classifier images
3. Move good classifiers to the repository to optimize training
4. Use the classifier repository to train services and perform custom analysis (see the sketch below)
5. Move the current frame to the inbox when confidence is sufficient
6. Use taxonomy for rule creation
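As a hedged illustration of step 4, here is a sketch of training a custom Watson Visual Recognition classifier from zipped image sets exported from the classifier repository. The endpoint, version date and multipart field names reflect the v3 API around 2017 and may differ today; the zip file names and API key are placeholders.

```python
# Hedged sketch: create a custom Visual Recognition classifier from curated image sets.
import requests

VR_CLASSIFIERS_URL = "https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classifiers"
API_KEY = "your-api-key"   # placeholder

with open("spidercam_positive.zip", "rb") as pos, open("camera_negative.zip", "rb") as neg:
    resp = requests.post(
        VR_CLASSIFIERS_URL,
        params={"api_key": API_KEY, "version": "2016-05-20"},
        files={
            "spidercam_positive_examples": pos,   # one zip of positive examples per class
            "negative_examples": neg,             # counter-examples sharpen the model
        },
        data={"name": "bundesliga_camera_angles"},
        timeout=600,
    )
resp.raise_for_status()
print(resp.json())   # returns the classifier_id used for later /classify calls
```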
Future?
Upcoming: Watson For Media, announced in April 2017
First use cases available at IBC in September 2017
49