Enhancing Object Detectors Using the Collective Intelligence of Social Media
Manish Kumar
Summary of the presentation
•  Recently, we have been witnessing the rapid growth of Social Media that emerged as the result of
users’ willingness to communicate, socialize, collaborate and share content.
•  The outcome of this massive activity was the generation of a tremendous volume of user-contributed
data that have been made available on the Web, usually along with an indication of their meaning
(i.e., tags).
•  This has motivated the research objective of investigating whether the Collective Intelligence that
emerges from the users’ contributions inside a Web 2.0 application can be used to remove the
need for dedicated human supervision during the process of learning.
•  In this presentation, I deal with a very demanding learning problem in computer vision that consists
of detecting and localizing an object within the image content.
Background - 1
•  The recent advances of Web technologies have effectively turned ordinary
people into active members of the Web who generate, share, contribute and
exchange various types of information.
•  Based on this huge repository of content, various services have evolved,
ranging from the field of eCommerce, to emergency response and consumer
collective applications such as realtravel.com.
•  The intelligence provided by single users organized in communities takes a
radically new shape in the context of Web 2.0: that of Collective Intelligence.
Background - 2
•  The MIT Center for Collective Intelligence frames the research question as
“How can people and computers be connected so that-collectively-they act
more intelligently than any individuals, groups, or computers have ever done
before?”.
•  In this presentation, I try to investigate whether the Collective Intelligence
derived from user-contributed content can be used to guide a learning
process that will teach the machine how to recognize objects from visual
content, the way a human does.
Learning and Web 2.0 Multimedia (1)
•  If we wish to construct a visual system that is able to scale to an arbitrarily large
number of concepts, effortless learning is crucial. To solve this issue, we have to
address these questions:
1.  Can a computer program learn how to recognize semantic concepts from images?
2.  What is the process of learning?
3.  What is the mechanism that allows humans, who initially require many examples to learn (as babies do),
to learn from just a few examples once they have learned how to learn?
4.  Most importantly, what is the role of the teacher in this process, and what is the minimum amount of
supervision that is absolutely necessary for facilitating efficient learning?
Learning and Web 2.0 Multimedia (2)
1.  Annotation-based learning model
a.  use labels provided by human annotators
b.  the amount of human effort that is required for annotation increases linearly
c.  Social Tagging Systems (STS) is the main driving factor
2.  Search-based model
a.  use models automatically obtained from the Web
b.  classification performance is lower for search-based methods
Social Tagging System - STS (1)
•  An STS is a web-based application, where users, either as individuals or more
commonly as members of a community (i.e., social networks), assign labels (i.e.,
arbitrary textual descriptions) to digital resources. Their motivation for tagging is
information organization and sharing.
•  Social tagging systems tend to form rich knowledge repositories that enable the
extraction of patterns reflecting the way content semantics is perceived by the web
users.
•  The tag proportions each resource receives crystallize after about 100 annotations;
this behavior is attributed to the users’ common background and their tendency to
imitate other users’ tagging habits.
Social Tagging System - STS (2)
•  Limitations of STS:
1.  Users are prone to make mistakes and they often suggest invalid metadata (tag spamming).
2.  The lack of (hierarchical) structure of information results in tag ambiguity (a tag may have many
senses), tag synonymy (two different tags may have the same meaning) and granularity variation
(users do not use the same description level, when they refer to a concept).
•  The correlations between the tag and visual information spaces that are
established when users suggest tags for the uploaded visual content are
mostly treated as complementary sources of information that both contribute to
the semantic description of the resources.
Techniques used for Multimedia Analysis (1)
•  The very first attempts at image retrieval were based on keyword search applied
either to the associated annotations (assuming that annotations existed) or to the
images’ file names. But keywords are rarely as descriptive as the multimedia content
itself.
•  To overcome these limitations, the use of the image visual characteristics has been
proposed. In this case, the visual content is utilized by extracting a set of visual
features from each image or image region. By comparing the visual features, an
algorithm can decide whether two images/regions represent the same semantic
concept. Then, image retrieval is performed by comparing the visual features of an
example image/region that is associated with a semantic concept by the user to the
visual features of all images in a given collection.
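A minimal sketch of this retrieval-by-visual-similarity scheme, assuming toy 3-bin color histograms as the extracted features and invented file names (none of these specifics come from the presentation):

```python
import math

def feature_distance(f1, f2):
    """Euclidean distance between two visual feature vectors
    (here: toy 3-bin color histograms)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))

def retrieve(query, collection, top_k=2):
    """Rank every image in the collection by visual similarity to the
    example image/region the user associated with a concept."""
    ranked = sorted(collection,
                    key=lambda name: feature_distance(query, collection[name]))
    return ranked[:top_k]

# Hypothetical collection: image name -> feature vector.
collection = {
    "beach.jpg":  [0.20, 0.30, 0.50],
    "forest.jpg": [0.10, 0.80, 0.10],
    "ocean.jpg":  [0.12, 0.30, 0.58],
}
# Query features extracted from an example region labeled "sea".
print(retrieve([0.15, 0.30, 0.55], collection))  # ['ocean.jpg', 'beach.jpg']
```

Any distance (or learned similarity) over the feature space can replace the Euclidean one; the ranking step is what turns a single labeled example into retrieval over the whole collection.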
Techniques used for Multimedia Analysis (2)
Pattern classification currently lies at the core of most image analysis techniques, as a means of attaching meaning to visual
patterns. A typical pattern classification problem can be considered to include a series of sub-problems, the most important of which are:
•  a) determining the optimal feature space,
•  b) removing noisy data that can be misleading,
•  c) avoiding over-fitting on the training data,
•  d) using the most appropriate distribution for the model,
•  e) making good use of any prior knowledge that may help in making the correct choices,
•  f) performing meaningful segmentation when the related task requires it,
•  g) exploiting the analysis context, etc.
All of the above are crucial for initiating a learning process that aims at using the available training samples to estimate the parameters of a model representing a semantic concept.
Techniques used for Multimedia Analysis (3)
•  Many problems derive from the fact that it is very difficult to describe visual content effectively in
a form that can be handled by machines.
•  In general, feature extraction is a domain-dependent problem and it is unlikely that a good
feature extractor for a specific domain will work as well in another domain.
•  Additionally, many problems derive from the fact that images tend to include more than one
object in their content, which decreases the descriptiveness of the feature space and raises
the need for segmentation.
•  The segmentation of images into regions, and the use of a separate set of features for each
region, was introduced to address the aforementioned issue. Segmentation techniques seek to
detect groups of pixels sharing similar visual characteristics and in this way identify meaningful
objects (similar to the ones identified by the human visual system).
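The pixel-grouping idea can be illustrated with naive region growing on a toy grayscale grid (the image values and the tolerance are made up for the example; real segmentation algorithms are far more sophisticated):

```python
def segment(image, tol=10):
    """Group adjacent pixels whose intensities differ by at most `tol`
    into regions (naive region growing, 4-connectivity)."""
    h, w = len(image), len(image[0])
    labels = [[None] * w for _ in range(h)]
    regions = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy][sx] is not None:
                continue
            regions += 1                      # start a new region here
            stack = [(sy, sx)]
            labels[sy][sx] = regions
            while stack:                      # flood-fill similar neighbors
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and labels[ny][nx] is None
                            and abs(image[ny][nx] - image[y][x]) <= tol):
                        labels[ny][nx] = regions
                        stack.append((ny, nx))
    return labels, regions

# Toy 4x4 grayscale "image": a bright sky area above a dark sea area.
image = [
    [200, 201, 199, 202],
    [198, 200, 201, 200],
    [ 30,  32,  31,  29],
    [ 31,  30,  33,  32],
]
labels, n = segment(image)
print(n)  # 2 regions: sky and sea
```

Each region can then be described by its own feature set, restoring the descriptiveness that a single global feature vector loses on multi-object images.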
Learning mechanism for Multimedia Analysis (2)
Unsupervised Learning
•  Unsupervised learning is a class of problems in which one seeks to determine how the data are organized. It is distinguished from supervised learning in that the learner is given only unlabeled examples. e.g. Clustering algorithms
Strongly-Supervised Learning
•  In strongly-supervised learning there is prior knowledge about the labels of the training samples, and there is a one-to-one relation between a sample and its label.
•  The aim of strongly-supervised learning is to generate a global model that maps input objects to the desired outputs and generalizes from the presented data to unseen situations in a “reasonable” way.
Semi-supervised Learning
•  Semi-supervised learning algorithms try to exploit unlabeled data, which are usually of low cost and can be obtained in high quantities, in conjunction with some supervision information. In this case, only a
small portion of the data is labeled and the algorithm aims at propagating the labels to the unlabeled data.
Weakly-Supervised Learning
•  By weakly-supervised we refer to the process of learning using weakly labeled data (i.e., samples labeled as containing the semantic concept of interest, but without indication of which segments/parts of
the sample are observations of that concept). In this case, the basic idea is to introduce a set of latent variables that encode hidden states of the world, where each state induces a joint distribution on the
space of semantic labels and image visual features. New images are annotated by maximizing the joint density of the semantic labels, given the visual features of the new image.
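The semi-supervised idea of propagating labels can be sketched with classical self-training on 1-D "features" and a nearest-class-mean classifier (the points, labels and confidence threshold are invented for illustration; real systems use proper classifiers and confidence scores):

```python
def self_train(labeled, unlabeled, rounds=3, conf=1.2):
    """Toy self-training: fit class means on the labeled points, then at
    each step adopt the unlabeled points lying within `conf` of a mean
    and refit -- propagating labels to the unlabeled data."""
    labeled = dict(labeled)          # point -> label
    pool = list(unlabeled)
    for _ in range(rounds):
        means = {lab: sum(p for p, l in labeled.items() if l == lab)
                      / sum(1 for l in labeled.values() if l == lab)
                 for lab in set(labeled.values())}
        adopted = []
        for p in pool:
            lab, dist = min(((l, abs(p - m)) for l, m in means.items()),
                            key=lambda t: t[1])
            if dist <= conf:         # only confident predictions are adopted
                labeled[p] = lab
                adopted.append(p)
        pool = [p for p in pool if p not in adopted]
    return labeled

# Two labeled seeds and four unlabeled points; labels spread outward
# from the seeds over successive rounds.
result = self_train({0.0: "sea", 10.0: "sky"}, [0.8, 1.5, 9.2, 8.6])
print(result)
```

Note how 1.5 is too far from the initial "sea" mean to be adopted in round one, but becomes adoptable after 0.8 has shifted the mean: this stepwise expansion is the essence of label propagation.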
Annotation Cost for Learning
•  Object detection schemes always employ some form of supervision as it is practically impossible to
detect and recognize an object without using any semantic information during training.
•  Semantic labels may be provided at different levels of granularity (global or region level) and
preciseness (one-to-one or many-to-many relation between objects and labels), imposing different
requirements on the effort required to generate them.
•  There is a clear distinction between the strong and accurate annotations that are usually generated
manually and constitute a laborious and time-consuming task, and the weak and noisy annotations
that are usually generated by web users for their personal interest and can be obtained in large
quantities from the Web or collaborative tagging environments like Flickr.
•  The goal is to highlight the tradeoff between the annotation cost for preparing the necessary training
samples and the quality of the resulting models.
Pros & Cons for the different types of annotation

Annotation Type                                  | Automated Annotation | Scaling Capability | Training Efficiency | Learning Mechanism
Region-level (manual)                            | Poor                 | Poor               | Excellent           | strongly-supervised
Global-level (manual)                            | Fair                 | Fair               | Good                | weakly-supervised
Global-level (automatically via Search Engines)  | Excellent            | Excellent          | Poor                | weakly-supervised
Global-level (automatically via Social Networks) | Excellent            | Excellent          | Fair                | weakly-supervised
Social Media for Training Object Detectors
Machine learning algorithms fall into two main categories in terms of the annotation granularity:
•  The algorithms that are designed to learn from strongly annotated samples (i.e., samples in which the exact location of
an object within an image is known). The goal in this case is to learn a mapping from visual features fi to semantic
labels ci, given a training set made of pairs (fi, ci). Such samples are very expensive to obtain.
•  The algorithms that learn from weakly annotated samples (i.e., samples in which it is known that an object is depicted in
the image, but its location is unknown). The goal in this case is to estimate the joint probability distribution between the
visual features fi and the semantic labels ci, given a training set made of pairs between sets {(f1, ..., fn), (c1, ..., cm)}.
Weakly annotated samples can be found in large quantities especially from sources related to social networks.
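The difference between the two kinds of training pairs can be made concrete (the feature vectors and tags below are fabricated for illustration): with strong annotation each (fi, ci) pair is given, while with weak annotation any of the m labels could explain any of the n regions.

```python
# Strongly annotated: one-to-one pairs (fi, ci) -- expensive to obtain.
strong_samples = [
    ([0.1, 0.3, 0.6], "sea"),
    ([0.6, 0.3, 0.1], "sand"),
]

# Weakly annotated: a bag of region features plus an unordered tag set
# {(f1,...,fn), (c1,...,cm)}; which region depicts which tag is unknown.
weak_sample = (
    [[0.1, 0.3, 0.6], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]],  # f1..fn
    {"sea", "sand", "beach"},                              # c1..cm
)

def candidate_assignments(features, tags):
    """Under weak supervision every region could depict any tag, so the
    space of region-to-label assignments grows as m ** n."""
    return len(tags) ** len(features)

print(candidate_assignments(*weak_sample))  # 3 tags, 3 regions -> 27
```

This combinatorial ambiguity is exactly what makes weakly annotated samples cheap to collect but hard to learn from.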
This work aims at combining the advantages of both strongly supervised (learn model parameters
more efficiently) and weakly supervised (learn from samples obtained at low cost) methods, by
allowing the strongly supervised methods to learn object detection models from training samples
that are found in collaborative tagging environments.
Problem Formulation
Drawing from a large pool of weakly annotated images, the goal is to benefit from
the knowledge that can be extracted from social tagging systems, in order to
automatically transform some of the weakly annotated images into strongly
annotated ones.
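A heavily simplified sketch of such a transformation, under assumptions of my own (the slides do not specify the framework): pool the regions of images weakly tagged with the target concept, group them by appearance, and promote the dominant group to region-level labels. The quantisation step below stands in for a proper region-clustering algorithm, and all data are invented.

```python
from collections import defaultdict

def weak_to_strong(images, concept, bin_size=0.2):
    """Pool regions from all images weakly tagged with `concept`, group
    them by quantised appearance, and promote the largest group to strong
    (region-level) annotations -- assuming the concept is the visual
    pattern the tagged images share."""
    groups = defaultdict(list)
    for tags, regions in images:
        if concept not in tags:
            continue
        for f in regions:
            key = tuple(int(v / bin_size) for v in f)  # crude "clustering"
            groups[key].append(f)
    dominant = max(groups.values(), key=len)
    return [(f, concept) for f in dominant]

# Three weakly annotated images tagged "sea"; each contains one sea-like
# region (similar histograms) plus one unrelated region.
images = [
    ({"sea", "beach"},   [[0.10, 0.30, 0.50], [0.90, 0.10, 0.10]]),
    ({"sea", "holiday"}, [[0.15, 0.35, 0.45], [0.50, 0.50, 0.50]]),
    ({"sea"},            [[0.05, 0.25, 0.55], [0.70, 0.70, 0.10]]),
]
strong = weak_to_strong(images, "sea")
print(len(strong))  # 3 region-level "sea" annotations recovered
```

The recovered (region, label) pairs can then feed any strongly supervised detector, which is the combination of advantages the previous slide describes.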
Framework Description
To-Do
References
•  Flickr
•  Google Images
•  New Directions in Web Data Management, Vakali & Jain
