A Higher-Level Visual Representation for Semantic Learning in Image Databases
Ismail EL SAYAD, 18/07/2011
Overview
- Introduction
- Related works
- Our approach
  - Enhanced Bag of Visual Words (E-BOW)
  - Multilayer Semantically Significant Analysis Model (MSSA)
  - Semantically Significant Invariant Visual Glossary (SSIVG)
- Experiments
  - Image retrieval
  - Image classification
  - Object recognition
- Conclusion and perspectives
Motivation
- Digital content grows rapidly: personal acquisition devices, broadcast TV, surveillance.
- Such content is relatively easy to store, but useless without automatic processing, classification, and retrieval.
- The usual way to solve this problem is to describe images with keywords, but this method suffers from subjectivity, text ambiguity, and the lack of automatic annotation.
Visual representations
- Two families: image-based representations and part-based representations.
- Image-based representations rely on global visual features extracted over the whole image, such as color, color moments, shape, or texture.
Visual representations (continued)
- Main drawbacks of image-based representations:
  - High sensitivity to scale, pose, lighting-condition changes, and occlusions.
  - They cannot capture the local information of an image.
- Part-based representations are based on the statistics of features extracted from segmented image regions.
Part-based representations: bag of visual words
- Compute local descriptors over the image.
- Cluster the features in feature space to build a visual word vocabulary (VW1, VW2, VW3, VW4, ...).
- Represent each image as a histogram of visual word frequencies.
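As a toy illustration of this pipeline (synthetic descriptors and a naive k-means, not the thesis's actual features or clustering):

```python
# Minimal bag-of-visual-words sketch: cluster local descriptors into a
# visual-word vocabulary, then describe an image as a histogram of
# visual-word counts. Toy data stands in for real local descriptors.
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Naive k-means over all local descriptors; returns k centroids."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((descriptors[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def bow_histogram(image_descriptors, centers):
    """Assign each descriptor to its nearest visual word and count."""
    labels = np.argmin(((image_descriptors[:, None] - centers) ** 2).sum(-1), axis=1)
    return np.bincount(labels, minlength=len(centers))

rng = np.random.default_rng(1)
all_desc = rng.normal(size=(200, 8))        # stand-in local descriptors
vocab = build_vocabulary(all_desc, k=4)
hist = bow_histogram(all_desc[:50], vocab)  # histogram for one image's 50 descriptors
```

Note how the histogram keeps only counts: the spatial-information loss criticized on the next slide is visible directly in `bow_histogram`.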
Bag of visual words (BOW) drawbacks
- Spatial information loss: the histogram records the number of occurrences but ignores the positions.
- Only keypoint-based intensity descriptors are used: neither shape nor color information is exploited.
- Feature quantization is noisy: unnecessary and insignificant visual words are generated.
BOW drawbacks (continued)
- Low discrimination power: different image semantics are represented by the same visual word.
- Low invariance to visual diversity: one image semantic is represented by different visual words.
Objectives
- An enhanced BOW representation using:
  - different local information (intensity, color, shape, ...)
  - the spatial constitution of the image
  - an efficient visual word vocabulary structure
- A higher-level visual representation that is:
  - less noisy
  - more discriminative
  - more invariant to visual diversity
Overview of the proposed higher-level visual representation
Set of images -> visual word vocabulary building -> E-BOW representation -> learning the MSSA model -> SSIVW and SSIVP generation -> SSIVG representation.
Related works
- Spatial Pyramid Matching kernel (SPM) and sparse coding
- Visual phrase and descriptive visual phrase
- Visual phrase pattern and visual synset
Spatial Pyramid Matching kernel (SPM) and sparse coding
- Lazebnik et al. [CVPR06]: SPM exploits the spatial layout of local regions.
- Yang et al. [CVPR09]: SPM combined with sparse coding, replacing k-means in the SPM pipeline.
Visual phrase and descriptive visual phrase
- Zheng and Gao [TOMCCAP08]: a visual phrase is a pair of spatially adjacent local image patches.
- Zhang et al. [ACM MM09]: descriptive visual phrases are selected according to the frequencies of their constituent visual word pairs.
Visual phrase pattern and visual synset
- Yuan et al. [CVPR07]: a visual phrase pattern is a spatially co-occurring group of visual words.
- Zheng et al. [CVPR08]: a visual synset is a relevance-consistent group of visual words or phrases, in the spirit of the text synset.
Comparison of the different enhancements of the BOW
Our approach
- Enhanced Bag of Visual Words (E-BOW)
- Multilayer Semantically Significant Analysis Model (MSSA)
- Semantically Significant Invariant Visual Glossary (SSIVG)
Enhanced Bag of Visual Words (E-BOW)
Pipeline: SURF and Edge Context extraction -> feature fusion -> hierarchical feature quantization -> E-BOW representation.
E-BOW feature extraction
- Detect interest points and edge points, after color filtering with a vector median filter (VMF).
- Extract a SURF feature vector at each interest point.
- Extract color features at each interest and edge point; cluster the color and position vectors with a Gaussian mixture model.
- Extract an Edge Context feature vector at each interest point.
- Fuse the SURF and Edge Context feature vectors.
- Collect all vectors over the whole image set and build the visual word vocabulary with HAC and divisive hierarchical k-means clustering.
Feature extraction: SURF
- SURF is a low-level feature descriptor: it describes how pixel intensities are distributed within a scale-dependent neighborhood of each interest point.
- Good at handling serious blurring and image rotation; it is also efficient.
- Poor at handling illumination changes.
Feature extraction: Edge Context descriptor
- The Edge Context descriptor at each interest point is a histogram of the vectors drawn to the edge points:
  - 6 bins for the magnitude of the vectors
  - 4 bins for the orientation angle
Edge Context invariance
The descriptor is invariant to:
- Translation: the distribution of the edge points is measured with respect to fixed points.
- Scale: the radial distance is normalized by the mean distance over the whole set of points within the same Gaussian.
- Rotation: all angles are measured relative to the tangent angle of each interest point.
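A minimal sketch of how such a descriptor could be computed, following my reading of these two slides (24-D = 6 distance bins x 4 orientation bins; mean-distance normalization and tangent-relative angles give the stated invariances). The exact thesis implementation may differ in binning and normalization details:

```python
# Sketch of the edge-context idea: at an interest point, histogram the
# vectors drawn to the edge points. Distances are normalized by their
# mean (scale invariance); angles are measured relative to the interest
# point's tangent angle (rotation invariance).
import numpy as np

def edge_context(interest_pt, edge_pts, tangent_angle,
                 n_dist_bins=6, n_ang_bins=4):
    v = edge_pts - interest_pt                 # vectors to the edge points
    dist = np.linalg.norm(v, axis=1)
    dist = dist / dist.mean()                  # scale invariance
    ang = (np.arctan2(v[:, 1], v[:, 0]) - tangent_angle) % (2 * np.pi)
    hist, _, _ = np.histogram2d(
        dist, ang,
        bins=[n_dist_bins, n_ang_bins],
        range=[[0.0, dist.max() + 1e-9], [0.0, 2 * np.pi]])
    return hist.ravel() / hist.sum()           # 24-D normalized descriptor

edge_pts = np.random.default_rng(0).uniform(size=(30, 2))
d = edge_context(np.array([0.5, 0.5]), edge_pts, tangent_angle=0.3)
```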
Hierarchical feature quantization
The visual word vocabulary is created by clustering the observed merged features (SURF + Edge Context, 88-D) in two steps:
- Hierarchical Agglomerative Clustering (HAC), stopped at the desired level k (yielding k clusters).
- Divisive hierarchical k-means clustering: the tree is determined level by level, down to a maximum number of levels L, with each node divided into k parts.
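The divisive step can be sketched as a recursive k-way split (a toy version; the HAC seeding step and the real 88-D fused features are omitted for brevity):

```python
# Toy divisive hierarchical k-means: recursively split the data into k
# clusters, level by level, down to `levels` levels, yielding a
# vocabulary tree with up to k**levels leaf centroids.
import numpy as np

def kmeans(x, k, iters=15, seed=0):
    rng = np.random.default_rng(seed)
    c = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        lab = np.argmin(((x[:, None] - c) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (lab == j).any():
                c[j] = x[lab == j].mean(axis=0)
    return c, lab

def vocab_tree(x, k, levels):
    """Return the leaf centroids of a k-way tree of the given depth."""
    if levels == 0 or len(x) < k:
        return [x.mean(axis=0)]
    _, lab = kmeans(x, k)
    leaves = []
    for j in range(k):
        xs = x[lab == j]
        if len(xs):                       # skip empty clusters
            leaves += vocab_tree(xs, k, levels - 1)
    return leaves

data = np.random.default_rng(2).normal(size=(300, 8))
leaves = vocab_tree(data, k=3, levels=2)  # up to 9 leaf visual words
```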
Multilayer Semantically Significant Analysis (MSSA) model
Building on the E-BOW representation: generative process -> parameter estimation -> estimation of the number of latent topics -> estimation of the semantic inference of the visual words.
MSSA generative process
- Different visual aspects can belong to one higher-level aspect (e.g., the higher-level aspect "People").
- A topic model that takes this hierarchical structure into account is needed.
MSSA generative process (continued)
In the MSSA, there are two different kinds of latent (hidden) topics:
- high latent topics, which represent the higher-level aspects
- visual latent topics, which represent the visual aspects
MSSA parameter estimation
- The generative process defines the model's probability distribution function and its log-likelihood function.
- Gaussier et al. [ACM SIGIR05]: maximizing the likelihood can be seen as a Nonnegative Matrix Factorization (NMF) problem under the generalized KL divergence, which yields the objective function.
MSSA parameter estimation (continued)
- The KKT conditions are used to derive the multiplicative update rules that minimize the objective function.
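The slide's updates are thesis-specific, but they belong to the same family as the classic multiplicative update rules for NMF under the generalized KL divergence (Lee and Seung), which can be sketched as:

```python
# Multiplicative updates for NMF under the generalized KL divergence:
#   H <- H * (W^T (V / WH)) / (W^T 1)
#   W <- W * ((V / WH) H^T) / (1 H^T)
# Both factors stay nonnegative because updates only multiply by
# nonnegative ratios.
import numpy as np

def nmf_kl(V, r, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.uniform(0.1, 1.0, (n, r))
    H = rng.uniform(0.1, 1.0, (r, m))
    for _ in range(iters):
        WH = W @ H + 1e-12
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + 1e-12)
        WH = W @ H + 1e-12
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + 1e-12)
    return W, H

V = np.random.default_rng(1).uniform(0.0, 1.0, (20, 15))
W, H = nmf_kl(V, r=4)
```

The nonnegativity of `W` and `H` is what lets their normalized columns be read as probability distributions, which is the bridge to pLSA-style models that Gaussier et al. exploit.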
Number of latent topics estimation
- The Minimum Description Length (MDL) principle is used as the model selection criterion for:
  - the number of high latent topics (L)
  - the number of visual latent topics (K)
- The criterion trades off the log-likelihood of the data against the number of free parameters of the model.
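The generic form of the criterion can be sketched as follows (the exact free-parameter count for the MSSA model is thesis-specific and not reproduced here; the numbers below are illustrative):

```python
# MDL model selection, generic form:
#   MDL(K) = -log-likelihood + (m_K / 2) * log N
# where m_K is the number of free parameters and N the sample size.
# The candidate minimizing MDL balances fit against complexity.
import math

def mdl_score(log_likelihood, n_free_params, n_samples):
    return -log_likelihood + 0.5 * n_free_params * math.log(n_samples)

def pick_k(candidates):
    """candidates: list of (K, log_likelihood, n_free_params, n_samples)."""
    return min(candidates, key=lambda c: mdl_score(c[1], c[2], c[3]))[0]

# Toy numbers: the likelihood improves with K, but the penalty grows,
# so an intermediate K wins.
cands = [(5, -5000.0, 500, 10_000),
         (10, -2000.0, 1000, 10_000),
         (20, -1950.0, 2000, 10_000)]
best_k = pick_k(cands)
```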
Semantically Significant Invariant Visual Glossary (SSIVG) representation
Pipeline: estimate the semantic inference of the visual words with the MSSA model -> select SSVWs (SSVW representation) -> generate SSVPs (SSVP representation) -> apply distributional clustering -> SSIVW and SSIVP representations, which together form the SSIVG representation.
Semantically Significant Visual Word (SSVW)
- Using the MSSA model, estimate the set of relevant visual topics and the semantic inference of each visual word.
- The set of SSVWs is the subset of visual words strongly related to the relevant visual topics.
Semantically Significant Visual Phrase (SSVP)
- SSVP: a higher-level, more discriminative representation built from SSVWs and their inter-relationships.
- SSVPs are formed from SSVW sets that satisfy all of the following conditions:
  - They occur in the same spatial context.
  - They are involved in strong association rules, i.e., rules with high support and confidence.
  - They have the same semantic meaning: a high probability of relating to at least one common visual latent topic.
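The association-rule condition can be illustrated with a toy support/confidence filter (the thresholds and the pairwise restriction here are illustrative choices, not the thesis's values):

```python
# Toy association-rule mining over visual-word co-occurrences: each
# "transaction" is the set of visual words found in one spatial
# neighborhood; keep rules w1 -> w2 with high support and confidence.
from collections import Counter
from itertools import combinations

def mine_pairs(transactions, min_support=0.3, min_confidence=0.6):
    n = len(transactions)
    item_count, pair_count = Counter(), Counter()
    for t in transactions:
        items = set(t)
        item_count.update(items)
        pair_count.update(combinations(sorted(items), 2))
    rules = []
    for (a, b), c in pair_count.items():
        support = c / n                      # fraction of transactions with both
        if support < min_support:
            continue
        for head, tail in ((a, b), (b, a)):
            conf = c / item_count[head]      # P(tail | head)
            if conf >= min_confidence:
                rules.append((head, tail, support, conf))
    return rules

# w1 and w2 co-occur in 3 of 4 neighborhoods -> strong rule.
tx = [{"w1", "w2"}, {"w1", "w2", "w3"}, {"w1", "w2"}, {"w3"}]
rules = mine_pairs(tx)
```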
Semantically Significant Visual Phrase (SSVP)
Figure: example phrase matches across images (SSIVP126, SSIVP326, SSIVP304).
Invariance problem
- Co-occurrence and spatial-scatter information make the image representation more discriminative, but the invariance power of SSVWs and SSVPs is still low.
- Analogy with text documents: synonymous words can be clustered into one synonymy set to improve document categorization performance.
SSIVG representation
SSIVG: a higher-level visual representation composed of two layers, obtained by applying distributional clustering to the SSVWs and SSVPs estimated with the MSSA model:
- Semantically Significant Invariant Visual Words (SSIVWs): SSVWs re-indexed after a distributional clustering.
- Semantically Significant Invariant Visual Phrases (SSIVPs): SSVPs re-indexed after a distributional clustering.
Experiments
- Image retrieval
- Image classification
- Object recognition
Assessment of the SSIVG representation performance in image retrieval
- Evaluation criterion: Mean Average Precision (MAP)
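For reference, MAP follows its standard definition, which can be computed in a few lines:

```python
# Mean Average Precision: AP for one query is the mean of the precision
# values at each rank where a relevant item appears; MAP averages AP
# over all queries.
def average_precision(ranked_relevance):
    """ranked_relevance: list of 0/1 relevance flags in ranked order."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)     # precision at this relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    return sum(average_precision(q) for q in queries) / len(queries)

ap = average_precision([1, 0, 1, 0])        # precisions 1/1 and 2/3
map_score = mean_average_precision([[1, 0], [0, 1]])
```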

Editor's Notes

  • #3: Talk briefly about the introduction. Talk briefly about the different parts of the approach.
  • #5: Nowadays, images can be described using their visual content.
  • #9: Talk about the analogy between text and images. Different visual appearances; the same visual word can appear in two different images describing different semantics.
  • #10: This work aims at addressing these drawbacks with the following objectives.
  • #12: All the related works are based on the BOW representation; they propose different higher-level representations.
  • #13: The spatial pyramid extends the BOW representation. Example of a spatial pyramid with three different spatial levels and resolutions.
  • #14: Zhang et al. enhanced this approach by selecting descriptive visual phrases from the constructed visual phrases according to the frequencies of their constituent visual word pairs.
  • #16: Other approaches use an efficient structure only for the visual word vocabulary; we use it for both words and phrases.
  • #21: Motivation for the Edge Context descriptor.
  • #26: Add training images
  • #27: This generative process leads to the following conditional probability distribution. Following the maximum likelihood principle, the parameters can be estimated by maximizing the log-likelihood function.
  • #29: The number of high latent topics, L, and the number of visual latent topics, K, are determined in advance for model fitting, based on the Minimum Description Length (MDL) principle.
  • #31: Check the size of the cylinders
  • #34: See right part
  • #35: Correct the animation. Make the boxes bigger.
  • #36: Add a global-approach slide before this slide.
  • #37: Add parameter settings
  • #39: Correct the spelling of "descriptive" and the upper cases. Add references.
  • #41: Add this slide at the end of the presentation and add sub-points to the slide before.
  • #42: Upper case corrections
  • #45: Add a global-approach slide before this slide.
  • #46: Add parameter settings
  • #47: Parameters update: it will be essential to design on-line algorithms that continuously (re-)learn the parameters of the proposed MSSA model, as the content of digital databases is modified by the regular upload or deletion of images. Invariance issue: it will be interesting to investigate the invariance issue further, especially in the context of large-scale databases where large intra-class variations can occur. Cross-modality extension: the proposed higher-level visual representation can be extended to video content, based on cross-modal data (visual content and textual closed captions). Video summarization: a new generic framework for video summarization can be designed on top of the extended higher-level semantic representation of video content. Mention that this work is applied at the frame level.