A Higher-Level Visual Representation for Semantic Learning in Image Databases
Ismail EL SAYAD, 18/07/2011
Overview
- Introduction
- Related works
- Our approach
  - Enhanced Bag of Visual Words (E-BOW)
  - Multilayer Semantically Significant Analysis Model (MSSA)
  - Semantically Significant Invariant Visual Glossary (SSIVG)
- Experiments
  - Image retrieval
  - Image classification
  - Object recognition
- Conclusion and perspectives
Motivation
- Digital content grows rapidly: personal acquisition devices, broadcast TV, surveillance.
- Such content is relatively easy to store, but useless without automatic processing, classification, and retrieval.
- The usual way to solve this problem is to describe images with keywords, but this method suffers from subjectivity, text ambiguity, and the lack of automatic annotation.
Visual representations
- Two families: image-based representations and part-based representations.
- Image-based representations rely on global visual features extracted over the whole image, such as color, color moments, shape, or texture.
Visual representations (continued)
- Main drawbacks of image-based representations:
  - High sensitivity to scale, pose, lighting-condition changes, and occlusions.
  - They cannot capture the local information of an image.
- Part-based representations are based on the statistics of features extracted from segmented image regions.
Part-based representations: bag of visual words
- Compute local descriptors over the image.
- Cluster the features in feature space to build a visual word vocabulary (VW1, VW2, VW3, VW4, ...).
- Represent each image as a histogram of visual word frequencies.
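As a toy illustration of this pipeline (synthetic descriptors and a naive k-means, not the thesis's actual features or clustering):

```python
# Minimal bag-of-visual-words sketch: cluster local descriptors into a
# visual-word vocabulary, then describe an image as a histogram of
# visual-word counts. Toy data stands in for real local descriptors.
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Naive k-means over all local descriptors; returns k centroids."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((descriptors[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def bow_histogram(image_descriptors, centers):
    """Assign each descriptor to its nearest visual word and count."""
    labels = np.argmin(((image_descriptors[:, None] - centers) ** 2).sum(-1), axis=1)
    return np.bincount(labels, minlength=len(centers))

rng = np.random.default_rng(1)
all_desc = rng.normal(size=(200, 8))        # stand-in local descriptors
vocab = build_vocabulary(all_desc, k=4)
hist = bow_histogram(all_desc[:50], vocab)  # histogram for one image's 50 descriptors
```

Note how the histogram keeps only counts: the spatial-information loss criticized on the next slide is visible directly in `bow_histogram`.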
Bag of visual words (BOW) drawbacks
- Spatial information loss: the histogram records the number of occurrences but ignores the positions.
- Only keypoint-based intensity descriptors are used: neither shape nor color information is exploited.
- Feature quantization is noisy: unnecessary and insignificant visual words are generated.
BOW drawbacks (continued)
- Low discrimination power: different image semantics are represented by the same visual word.
- Low invariance to visual diversity: one image semantic is represented by different visual words.
Objectives
- An enhanced BOW representation using:
  - different local information (intensity, color, shape, ...)
  - the spatial constitution of the image
  - an efficient visual word vocabulary structure
- A higher-level visual representation that is:
  - less noisy
  - more discriminative
  - more invariant to visual diversity
Overview of the proposed higher-level visual representation
Set of images -> visual word vocabulary building -> E-BOW representation -> learning the MSSA model -> SSIVW and SSIVP generation -> SSIVG representation.
Related works
- Spatial Pyramid Matching kernel (SPM) and sparse coding
- Visual phrase and descriptive visual phrase
- Visual phrase pattern and visual synset
Spatial Pyramid Matching kernel (SPM) and sparse coding
- Lazebnik et al. [CVPR06]: SPM exploits the spatial layout of local regions.
- Yang et al. [CVPR09]: SPM combined with sparse coding, replacing k-means in the SPM pipeline.
Visual phrase and descriptive visual phrase
- Zheng and Gao [TOMCCAP08]: a visual phrase is a pair of spatially adjacent local image patches.
- Zhang et al. [ACM MM09]: descriptive visual phrases are selected according to the frequencies of their constituent visual word pairs.
Visual phrase pattern and visual synset
- Yuan et al. [CVPR07]: a visual phrase pattern is a spatially co-occurring group of visual words.
- Zheng et al. [CVPR08]: a visual synset is a relevance-consistent group of visual words or phrases, in the spirit of the text synset.
Comparison of the different enhancements of the BOW
Our approach
- Enhanced Bag of Visual Words (E-BOW)
- Multilayer Semantically Significant Analysis Model (MSSA)
- Semantically Significant Invariant Visual Glossary (SSIVG)
Enhanced Bag of Visual Words (E-BOW)
Pipeline: SURF and Edge Context extraction -> feature fusion -> hierarchical feature quantization -> E-BOW representation.
E-BOW feature extraction
- Detect interest points and edge points, after color filtering with a vector median filter (VMF).
- Extract a SURF feature vector at each interest point.
- Extract color features at each interest and edge point; cluster the color and position vectors with a Gaussian mixture model.
- Extract an Edge Context feature vector at each interest point.
- Fuse the SURF and Edge Context feature vectors.
- Collect all vectors over the whole image set and build the visual word vocabulary with HAC and divisive hierarchical k-means clustering.
Feature extraction: SURF
- SURF is a low-level feature descriptor: it describes how pixel intensities are distributed within a scale-dependent neighborhood of each interest point.
- Good at handling serious blurring and image rotation; it is also efficient.
- Poor at handling illumination changes.
Feature extraction: Edge Context descriptor
- The Edge Context descriptor at each interest point is a histogram of the vectors drawn to the edge points:
  - 6 bins for the magnitude of the vectors
  - 4 bins for the orientation angle
Edge Context invariance
The descriptor is invariant to:
- Translation: the distribution of the edge points is measured with respect to fixed points.
- Scale: the radial distance is normalized by the mean distance over the whole set of points within the same Gaussian.
- Rotation: all angles are measured relative to the tangent angle of each interest point.
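A minimal sketch of how such a descriptor could be computed, following my reading of these two slides (24-D = 6 distance bins x 4 orientation bins; mean-distance normalization and tangent-relative angles give the stated invariances). The exact thesis implementation may differ in binning and normalization details:

```python
# Sketch of the edge-context idea: at an interest point, histogram the
# vectors drawn to the edge points. Distances are normalized by their
# mean (scale invariance); angles are measured relative to the interest
# point's tangent angle (rotation invariance).
import numpy as np

def edge_context(interest_pt, edge_pts, tangent_angle,
                 n_dist_bins=6, n_ang_bins=4):
    v = edge_pts - interest_pt                 # vectors to the edge points
    dist = np.linalg.norm(v, axis=1)
    dist = dist / dist.mean()                  # scale invariance
    ang = (np.arctan2(v[:, 1], v[:, 0]) - tangent_angle) % (2 * np.pi)
    hist, _, _ = np.histogram2d(
        dist, ang,
        bins=[n_dist_bins, n_ang_bins],
        range=[[0.0, dist.max() + 1e-9], [0.0, 2 * np.pi]])
    return hist.ravel() / hist.sum()           # 24-D normalized descriptor

edge_pts = np.random.default_rng(0).uniform(size=(30, 2))
d = edge_context(np.array([0.5, 0.5]), edge_pts, tangent_angle=0.3)
```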
Hierarchical feature quantization
The visual word vocabulary is created by clustering the observed merged features (SURF + Edge Context, 88-D) in two steps:
- Hierarchical Agglomerative Clustering (HAC), stopped at the desired level k (yielding k clusters).
- Divisive hierarchical k-means clustering: the tree is determined level by level, down to a maximum number of levels L, with each node divided into k parts.
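The divisive step can be sketched as a recursive k-way split (a toy version; the HAC seeding step and the real 88-D fused features are omitted for brevity):

```python
# Toy divisive hierarchical k-means: recursively split the data into k
# clusters, level by level, down to `levels` levels, yielding a
# vocabulary tree with up to k**levels leaf centroids.
import numpy as np

def kmeans(x, k, iters=15, seed=0):
    rng = np.random.default_rng(seed)
    c = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        lab = np.argmin(((x[:, None] - c) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (lab == j).any():
                c[j] = x[lab == j].mean(axis=0)
    return c, lab

def vocab_tree(x, k, levels):
    """Return the leaf centroids of a k-way tree of the given depth."""
    if levels == 0 or len(x) < k:
        return [x.mean(axis=0)]
    _, lab = kmeans(x, k)
    leaves = []
    for j in range(k):
        xs = x[lab == j]
        if len(xs):                       # skip empty clusters
            leaves += vocab_tree(xs, k, levels - 1)
    return leaves

data = np.random.default_rng(2).normal(size=(300, 8))
leaves = vocab_tree(data, k=3, levels=2)  # up to 9 leaf visual words
```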
Multilayer Semantically Significant Analysis (MSSA) model
Building on the E-BOW representation: generative process -> parameter estimation -> estimation of the number of latent topics -> estimation of the semantic inference of the visual words.
MSSA generative process
- Different visual aspects can belong to one higher-level aspect (e.g., the higher-level aspect "People").
- A topic model that takes this hierarchical structure into account is needed.
MSSA generative process (continued)
In the MSSA, there are two different kinds of latent (hidden) topics:
- high latent topics, which represent the higher-level aspects
- visual latent topics, which represent the visual aspects
MSSA parameter estimation
- The generative process defines the model's probability distribution function and its log-likelihood function.
- Gaussier et al. [ACM SIGIR05]: maximizing the likelihood can be seen as a Nonnegative Matrix Factorization (NMF) problem under the generalized KL divergence, which yields the objective function.
MSSA parameter estimation (continued)
- The KKT conditions are used to derive the multiplicative update rules that minimize the objective function.
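The slide's updates are thesis-specific, but they belong to the same family as the classic multiplicative update rules for NMF under the generalized KL divergence (Lee and Seung), which can be sketched as:

```python
# Multiplicative updates for NMF under the generalized KL divergence:
#   H <- H * (W^T (V / WH)) / (W^T 1)
#   W <- W * ((V / WH) H^T) / (1 H^T)
# Both factors stay nonnegative because updates only multiply by
# nonnegative ratios.
import numpy as np

def nmf_kl(V, r, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.uniform(0.1, 1.0, (n, r))
    H = rng.uniform(0.1, 1.0, (r, m))
    for _ in range(iters):
        WH = W @ H + 1e-12
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + 1e-12)
        WH = W @ H + 1e-12
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + 1e-12)
    return W, H

V = np.random.default_rng(1).uniform(0.0, 1.0, (20, 15))
W, H = nmf_kl(V, r=4)
```

The nonnegativity of `W` and `H` is what lets their normalized columns be read as probability distributions, which is the bridge to pLSA-style models that Gaussier et al. exploit.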
Number of latent topics estimation
- The Minimum Description Length (MDL) principle is used as the model selection criterion for:
  - the number of high latent topics (L)
  - the number of visual latent topics (K)
- The criterion trades off the log-likelihood of the data against the number of free parameters of the model.
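The generic form of the criterion can be sketched as follows (the exact free-parameter count for the MSSA model is thesis-specific and not reproduced here; the numbers below are illustrative):

```python
# MDL model selection, generic form:
#   MDL(K) = -log-likelihood + (m_K / 2) * log N
# where m_K is the number of free parameters and N the sample size.
# The candidate minimizing MDL balances fit against complexity.
import math

def mdl_score(log_likelihood, n_free_params, n_samples):
    return -log_likelihood + 0.5 * n_free_params * math.log(n_samples)

def pick_k(candidates):
    """candidates: list of (K, log_likelihood, n_free_params, n_samples)."""
    return min(candidates, key=lambda c: mdl_score(c[1], c[2], c[3]))[0]

# Toy numbers: the likelihood improves with K, but the penalty grows,
# so an intermediate K wins.
cands = [(5, -5000.0, 500, 10_000),
         (10, -2000.0, 1000, 10_000),
         (20, -1950.0, 2000, 10_000)]
best_k = pick_k(cands)
```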
Semantically Significant Invariant Visual Glossary (SSIVG) representation
Pipeline: estimate the semantic inference of the visual words with the MSSA model -> select SSVWs (SSVW representation) -> generate SSVPs (SSVP representation) -> apply distributional clustering -> SSIVW and SSIVP representations, which together form the SSIVG representation.
Semantically Significant Visual Word (SSVW)
- Using the MSSA model, estimate the set of relevant visual topics and the semantic inference of each visual word.
- The set of SSVWs is the subset of visual words strongly related to the relevant visual topics.
Semantically Significant Visual Phrase (SSVP)
- SSVP: a higher-level, more discriminative representation built from SSVWs and their inter-relationships.
- SSVPs are formed from SSVW sets that satisfy all of the following conditions:
  - They occur in the same spatial context.
  - They are involved in strong association rules, i.e., rules with high support and confidence.
  - They have the same semantic meaning: a high probability of relating to at least one common visual latent topic.
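The association-rule condition can be illustrated with a toy support/confidence filter (the thresholds and the pairwise restriction here are illustrative choices, not the thesis's values):

```python
# Toy association-rule mining over visual-word co-occurrences: each
# "transaction" is the set of visual words found in one spatial
# neighborhood; keep rules w1 -> w2 with high support and confidence.
from collections import Counter
from itertools import combinations

def mine_pairs(transactions, min_support=0.3, min_confidence=0.6):
    n = len(transactions)
    item_count, pair_count = Counter(), Counter()
    for t in transactions:
        items = set(t)
        item_count.update(items)
        pair_count.update(combinations(sorted(items), 2))
    rules = []
    for (a, b), c in pair_count.items():
        support = c / n                      # fraction of transactions with both
        if support < min_support:
            continue
        for head, tail in ((a, b), (b, a)):
            conf = c / item_count[head]      # P(tail | head)
            if conf >= min_confidence:
                rules.append((head, tail, support, conf))
    return rules

# w1 and w2 co-occur in 3 of 4 neighborhoods -> strong rule.
tx = [{"w1", "w2"}, {"w1", "w2", "w3"}, {"w1", "w2"}, {"w3"}]
rules = mine_pairs(tx)
```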
Semantically Significant Visual Phrase (SSVP)
Figure: example phrase matches across images (SSIVP126, SSIVP326, SSIVP304).
Invariance problem
- Co-occurrence and spatial-scatter information make the image representation more discriminative, but the invariance power of SSVWs and SSVPs is still low.
- Analogy with text documents: synonymous words can be clustered into one synonymy set to improve document categorization performance.
SSIVG representation
SSIVG: a higher-level visual representation composed of two layers, obtained by applying distributional clustering to the SSVWs and SSVPs estimated with the MSSA model:
- Semantically Significant Invariant Visual Words (SSIVWs): SSVWs re-indexed after a distributional clustering.
- Semantically Significant Invariant Visual Phrases (SSIVPs): SSVPs re-indexed after a distributional clustering.
Experiments
- Image retrieval
- Image classification
- Object recognition
Assessment of the SSIVG representation performance in image retrieval
- Evaluation criterion: Mean Average Precision (MAP)
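For reference, MAP follows its standard definition, which can be computed in a few lines:

```python
# Mean Average Precision: AP for one query is the mean of the precision
# values at each rank where a relevant item appears; MAP averages AP
# over all queries.
def average_precision(ranked_relevance):
    """ranked_relevance: list of 0/1 relevance flags in ranked order."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)     # precision at this relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    return sum(average_precision(q) for q in queries) / len(queries)

ap = average_precision([1, 0, 1, 0])        # precisions 1/1 and 2/3
map_score = mean_average_precision([[1, 0], [0, 1]])
```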

Editor's Notes

  • #3: Talk briefly about the introduction. Talk briefly about the different parts of the approach.
  • #5: Nowadays, images can be described using their visual content.
  • #9: Talk about the analogy between text and images. Different visual appearances; the same visual word can appear in two different images describing different semantics.
  • #10: This work aims at addressing these drawbacks with the following objectives.
  • #12: All the related works are based on the BOW representation; they propose different higher-level representations.
  • #13: The spatial pyramid extends the BOW representation. Example of a spatial pyramid with three different spatial levels and resolutions.
  • #14: Zhang et al. enhanced this approach by selecting descriptive visual phrases from the constructed visual phrases according to the frequencies of their constituent visual word pairs.
  • #16: Other approaches use an efficient structure only for the visual word vocabulary; we use it for both words and phrases.
  • #21: Motivation for the Edge Context descriptor.
  • #26: Add training images
  • #27: This generative process leads to the following conditional probability distribution. Following the maximum likelihood principle, the parameters can be estimated by maximizing the log-likelihood function.
  • #29: The number of high latent topics, L, and the number of visual latent topics, K, are determined in advance for model fitting, based on the Minimum Description Length (MDL) principle.
  • #31: Check the size of the cylinders
  • #34: See right part
  • #35: Correct the animation. Make the boxes bigger.
  • #36: Add a global-approach slide before this slide.
  • #37: Add parameter settings
  • #39: Correct the spelling of "descriptive" and the upper cases. Add references.
  • #41: Add this slide at the end of the presentation and add sub-points to the slide before.
  • #42: Upper case corrections
  • #45: Add a global-approach slide before this slide.
  • #46: Add parameter settings
  • #47: Parameters update: it will be essential to design on-line algorithms that continuously (re-)learn the parameters of the proposed MSSA model, as the content of digital databases is modified by the regular upload or deletion of images. Invariance issue: it will be interesting to investigate the invariance issue further, especially in the context of large-scale databases where large intra-class variations can occur. Cross-modality extension: the proposed higher-level visual representation can be extended to video content, based on cross-modal data (visual content and textual closed captions). Video summarization: a new generic framework for video summarization can be designed on top of the extended higher-level semantic representation of video content. Mention that this work is applied at the frame level.