This paper presents Movinet, a framework that integrates vision transformers with natural language processing models for cross-modal scene understanding, with the goal of improving spatial awareness in robotics and autonomous navigation. The framework is evaluated on scene classification, object detection, and semantic segmentation, where it outperforms traditional unimodal baselines. These findings highlight the potential of combining visual and textual information to advance machine understanding of real-world environments.
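As a rough illustration of the kind of vision-plus-language fusion the abstract describes, the sketch below projects a visual embedding and a text embedding into a shared space and classifies the fused representation. Everything in it (module names, embedding dimensions, and the concatenation-based late fusion) is a hypothetical stand-in; the abstract does not specify Movinet's actual architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Toy cross-modal scene classifier: visual and text embeddings are
    projected into a shared space, concatenated, and classified.
    All dimensions below are illustrative assumptions."""
    def __init__(self, vis_dim=768, txt_dim=512, shared_dim=256, num_classes=10):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, shared_dim)  # project image-encoder features
        self.txt_proj = nn.Linear(txt_dim, shared_dim)  # project language-model features
        self.classifier = nn.Sequential(
            nn.LayerNorm(2 * shared_dim),
            nn.Linear(2 * shared_dim, num_classes),     # scene-classification head
        )

    def forward(self, vis_feat, txt_feat):
        v = self.vis_proj(vis_feat)
        t = self.txt_proj(txt_feat)
        # Late fusion: concatenate the two modalities, then classify.
        return self.classifier(torch.cat([v, t], dim=-1))

# Random stand-in features; in practice these would come from a pretrained
# vision transformer and a pretrained language model.
model = CrossModalFusion()
vis = torch.randn(4, 768)   # batch of 4 image embeddings
txt = torch.randn(4, 512)   # batch of 4 caption embeddings
logits = model(vis, txt)    # shape: (4, 10)
print(logits.shape)
```

A concatenation head like this is only one of several common fusion strategies; cross-attention between the two encoders is an equally plausible design for the tasks listed above.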