CROSS-MODAL SCENE UNDERSTANDING: COMBINING VISION AND LANGUAGE FOR ENHANCED SPATIAL AWARENESS
NAME / ROLL NUMBER: ZAHRA WAHEED (231022)
SUBJECT TEACHER: Dr. Khurram Zeeshan Haider
ABSTRACT
KEY POINTS:
• This research proposes a cross-modal scene understanding framework that combines images
and textual descriptions using machine learning.
• The framework uses Vision Transformers (ViTs) and Natural Language Processing (NLP) models to
extract features from both modalities.
• It aims to improve spatial awareness and understanding for applications like robotics,
augmented reality, and autonomous navigation.
• The framework is evaluated on tasks like scene classification, object detection, and semantic
segmentation.
• The research highlights the potential of combining vision and language for better spatial
perception in AI systems.
Keywords:
Cross-modal scene understanding, Vision Transformers, natural language processing.
PROBLEM STATEMENT
Current methods struggle to effectively integrate visual and textual
data for comprehensive scene understanding, limiting performance in
tasks requiring spatial awareness. This research proposes the MoviNet
framework to enhance spatial awareness by combining Vision
Transformers and NLP models, demonstrating improved accuracy in
scene classification, object detection, and semantic segmentation.
INTRODUCTION
Motivation:
• AI needs a profound understanding of the world.
• Vision and language together unlock new dimensions of understanding.
• Spatial awareness is crucial for tasks like robotics and augmented reality.
• Traditional unimodal approaches have limitations.
Vision and Language Nexus:
• Humans use both vision and language for comprehensive understanding.
• This research aims to replicate this capability in AI.
• Cross-modal approach can provide AI with contextual knowledge.
INTRODUCTION
MoviNet Framework:
• Integrates Vision Transformers (ViTs) and Natural Language Processing (NLP) models.
• Aims to transcend limitations of unimodal approaches.
• Utilizes strengths of both ViTs and NLP models for cross-modal understanding.
Research Objectives:
• Unified Feature Extraction from visual and textual data.
• Evaluation on real-world datasets.
• Investigate scene classification capabilities.
• Assess object detection capabilities.
• Explore semantic segmentation potential.
INTRODUCTION
Overall Goal:
• Demonstrate the effectiveness of MoviNet in improving spatial awareness and
understanding.
• Pave the way for enhanced spatial awareness in AI systems.
Presentation Roadmap:
• Delve into the architecture and design of MoviNet.
• Describe the methodology, datasets, and evaluation results.
• Emphasize the potential of vision and language to redefine AI's understanding of
the world.
STATE OF THE ART
1. Cross-Modal Scene Understanding:
• Combines vision and language for better spatial awareness in AI.
• Vision Transformers (ViTs) are crucial for this approach.
• Multi-modal learning is important for fusing vision and language information.
2. Road Extraction:
• Traditionally done with Convolutional Neural Networks (CNNs).
• Recent research explores Vision Transformers (ViTs) for road extraction.
• Challenges include generating training data for historical maps.
3. Multimodality in Transportation:
• Using multiple modes of transport for a single trip.
• Encourages sustainable and public transport use.
• Multimodal data is valuable for transportation research.
4. Traffic Flow Prediction:
• Crucial for traffic modeling, operation, and management.
• Machine Learning (ML) and Deep Learning (DL) play a key role.
• Supervised learning is a common approach in traffic flow prediction.
STATE OF THE ART
Machine Learning Algorithms:
• SVM (Support Vector Machine): Used for classification; finds the maximum-margin separating boundary between classes in high-dimensional space.
• KNN (K-Nearest Neighbors): Classifies a sample based on the labels of its nearest neighbors in the training data.
• Logistic Regression: Predicts categorical outcomes by estimating the probability of an event.
• Linear Regression: Models and predicts continuous variables with linear relationships.
Road Extraction Method:
• Steps (see the code sketch after this list):
• Image Segmentation: Canny edge detector identifies road edges.
• Merging Segments: Full Lambda Schedule merges adjacent road segments.
• SVM Classification: Classifies image into road and non-road regions.
• Morphological Operations: Refines extracted road features.
• Accuracy: Higher for images with clear road-background distinction.
• Applications: Transportation planning, land use analysis, urban development.
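As a rough illustration of this classical pipeline, the sketch below chains Canny edge detection, SVM classification of pixel colour features, and a morphological refinement step. It is a minimal, hypothetical reconstruction using OpenCV and scikit-learn, not the exact system from the cited work: the function name extract_roads, the per-pixel RGB features, and the kernel sizes are assumptions, and the Full Lambda Schedule segment-merging step is omitted.

```python
# Minimal sketch of a semi-automatic road-extraction pipeline (details assumed).
import cv2                      # OpenCV for edge detection and morphology
import numpy as np
from sklearn.svm import SVC     # SVM for road vs. non-road classification

def extract_roads(image_bgr, road_pixels, nonroad_pixels):
    """image_bgr: HxWx3 uint8 image; *_pixels: Nx3 colour samples marked by a user."""
    # 1. Edge detection (stands in for the segmentation step).
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)

    # 2. Train an SVM on user-marked road / non-road colour samples.
    X = np.vstack([road_pixels, nonroad_pixels]).astype(np.float32) / 255.0
    y = np.hstack([np.ones(len(road_pixels)), np.zeros(len(nonroad_pixels))])
    svm = SVC(kernel="rbf").fit(X, y)

    # 3. Classify every pixel into road (1) / non-road (0).
    h, w, _ = image_bgr.shape
    flat = image_bgr.reshape(-1, 3).astype(np.float32) / 255.0
    road_mask = svm.predict(flat).reshape(h, w).astype(np.uint8)

    # 4. Morphological closing to refine the extracted road features.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    refined = cv2.morphologyEx(road_mask, cv2.MORPH_CLOSE, kernel)
    return edges, refined
```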
STATE OF THE ART
A flowchart of the semi-automatic road extraction system illustrates the
sequence of these steps.
METHODOLOGY
MoviNet Framework:
• Combines vision and language for spatial awareness.
• Uses ViTs (Vision Transformers) and NLP models.
• Aims to improve scene understanding in AI.
Datasets:
• Diverse datasets for training and evaluation.
• Scene classification, object detection, semantic segmentation.
Vision Transformer (ViT):
• Processes images as sequences of patches with positional encoding.
• Uses encoder blocks with multi-head attention and feed-forward layers (a minimal sketch follows this slide).
ViTAE Architecture:
• Introduces reduction cells and normal cells for scalability and context.
• ViTAEv2 is an optimized version with improved performance.
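To make the patch-embedding and encoder-block description above concrete, here is a minimal PyTorch sketch of a single ViT layer. The class name MiniViT, the patch size, and the embedding dimensions are illustrative assumptions, not the configuration used in MoviNet.

```python
# Minimal ViT sketch: patch embedding + one encoder block (dimensions assumed).
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, heads=8):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # Patch embedding: a strided convolution cuts the image into patches
        # and projects each patch to a `dim`-dimensional token.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Learnable positional encoding, one vector per patch token.
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        # One encoder block: multi-head self-attention + feed-forward, with LayerNorm.
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                                        # x: (B, 3, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = tokens + self.pos
        q = self.norm1(tokens)
        attn_out, _ = self.attn(q, q, q)
        tokens = tokens + attn_out                               # residual around attention
        tokens = tokens + self.ffn(self.norm2(tokens))           # residual around feed-forward
        return tokens                                            # per-patch visual features

# feats = MiniViT()(torch.randn(1, 3, 224, 224))   # -> shape (1, 196, 256)
```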
METHODOLOGY
Figure 1: The architecture of the MoviNet framework illustrating the
integration of visual and textual feature extraction through Vision
Transformers and NLP models, respectively, followed by a feature fusion
module and task-specific output layers.
METHODOLOGY
Figure 2: ViT architecture breakdown: patch embeddings, positional encoding, and the encoder block with multi-head attention and feed-forward layers.
Figure 3: Architectural diagram of ViTAE reduction and normal cells for an efficient vision transformer.
METHODOLOGY
1. Unified Feature Extraction:
• Develops a method to jointly extract features from:
• Images: Using ViTs to capture spatial relationships and context.
• Text: Using NLP models to understand language nuances.
• Fuses features at multiple levels for a comprehensive scene representation.
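A minimal sketch of the kind of fusion this step describes: a visual embedding from a ViT-style encoder and a textual embedding from an NLP encoder are projected into a shared space, concatenated, and passed to a task-specific head. The dimensions, pooling choice, and head name are assumptions for illustration; the actual MoviNet fusion module may differ.

```python
# Illustrative cross-modal fusion module (shapes and names assumed).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, vis_dim=256, txt_dim=768, joint_dim=512, num_classes=10):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, joint_dim)   # project ViT patch features
        self.txt_proj = nn.Linear(txt_dim, joint_dim)   # project NLP token features
        self.fuse = nn.Sequential(
            nn.Linear(2 * joint_dim, joint_dim), nn.ReLU(), nn.LayerNorm(joint_dim)
        )
        # Task-specific output head (scene classification shown here).
        self.scene_head = nn.Linear(joint_dim, num_classes)

    def forward(self, vis_tokens, txt_tokens):
        # Pool each token sequence into a single vector per modality.
        v = self.vis_proj(vis_tokens.mean(dim=1))       # (B, joint_dim)
        t = self.txt_proj(txt_tokens.mean(dim=1))       # (B, joint_dim)
        joint = self.fuse(torch.cat([v, t], dim=-1))    # fused scene representation
        return self.scene_head(joint)                   # scene-class logits

# logits = CrossModalFusion()(torch.randn(2, 196, 256), torch.randn(2, 12, 768))
```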
METHODOLOGY
2. Real-World Datasets:
• Evaluates MoviNet on diverse datasets with:
• Varying lighting, weather, and object complexities.
• Compares MoviNet's performance to traditional methods on real-world
challenges.
3. Scene Classification:
• Trains MoviNet on a scene classification dataset.
• Evaluates MoviNet's ability to:
• Understand scenes using both vision and language.
• Categorize scenes accurately (classification accuracy, precision, recall).
METHODOLOGY
4. Object Detection:
• Trains and tests MoviNet on object detection datasets.
• Evaluates MoviNet's ability to:
• Precisely detect and localize objects within scenes.
• Handle object-level spatial awareness (precision, recall).
5. Semantic Segmentation:
• Trains and tests MoviNet on pixel-level annotated datasets.
• Evaluates MoviNet's ability to:
• Understand the spatial distribution of objects in scenes.
• Perform accurate segmentation (IoU, pixel accuracy).
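For reference, the two segmentation metrics mentioned above can be computed as in the NumPy sketch below, which assumes integer class maps of identical shape and is independent of any particular model.

```python
# IoU and pixel accuracy for semantic segmentation (NumPy sketch).
import numpy as np

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class matches the ground truth."""
    return float((pred == target).mean())

def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over classes present in either map."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Example with two random 4x4 class maps over 3 classes:
# pred, gt = np.random.randint(0, 3, (4, 4)), np.random.randint(0, 3, (4, 4))
# print(pixel_accuracy(pred, gt), mean_iou(pred, gt, 3))
```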
METHODOLOGY
Datasets:
Several real-world datasets were employed to train and evaluate the MoviNet framework:
1. COCO (Common Objects in Context): Instrumental for training the object detection and segmentation modules.
2. Visual Genome: Utilized for scene understanding tasks, providing rich annotations of objects and their relationships.
3. Flickr30k: Used for training the integration of visual and textual data, essential for cross-modal scene understanding.
4. ADE20K: Provided diverse scene categories and object instances for training semantic segmentation models.
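As an illustration of how such datasets are typically loaded, the snippet below uses torchvision's CocoDetection wrapper. The local paths are placeholders, pycocotools must be installed, and equivalent loaders would be needed for Visual Genome, Flickr30k, and ADE20K.

```python
# Loading COCO with torchvision (paths are placeholders; requires pycocotools).
from torchvision import datasets, transforms

coco_train = datasets.CocoDetection(
    root="data/coco/train2017",                                   # image directory (assumed path)
    annFile="data/coco/annotations/instances_train2017.json",     # annotation file (assumed path)
    transform=transforms.ToTensor(),                              # images -> float tensors in [0, 1]
)
image, annotations = coco_train[0]   # a tensor image and its list of object annotations
```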
METHODOLOGY
Preprocessing and Augmentation:
Images were normalized, textual descriptions were tokenized, and data augmentation
techniques such as rotations, flips, and color adjustments were applied to enhance
training.
Image Normalization:
• Method: Each pixel value was scaled to a range of 0 to 1.
• Example: An image with pixel values ranging from 0 to 255 was normalized by
dividing each pixel value by 255.
Tokenization of Textual Descriptions:
• Method: Text descriptions were tokenized into words or subwords using
techniques like WordPiece or Byte Pair Encoding.
• Example: The sentence "A red ball on the grass" was tokenized to ['A', 'red', 'ball',
'on', 'the', 'grass'].
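A small sketch of both preprocessing steps, using a standard WordPiece tokenizer from Hugging Face (bert-base-uncased) as an assumed stand-in for whichever NLP tokenizer was actually used:

```python
# Pixel normalization and WordPiece tokenization (illustrative choices).
import numpy as np
from transformers import AutoTokenizer

# Image normalization: scale uint8 pixel values from [0, 255] to [0, 1].
image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
image_norm = image.astype(np.float32) / 255.0

# Tokenization: WordPiece via a pretrained BERT tokenizer (assumed choice).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("A red ball on the grass")
# -> ['a', 'red', 'ball', 'on', 'the', 'grass']  (lower-cased by this tokenizer)
ids = tokenizer("A red ball on the grass", return_tensors="pt")["input_ids"]
```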
METHODOLOGY
Data Augmentation:
• Rotations:
• Method: Images were rotated by random angles.
• Example: An image of a cat was rotated by 15 degrees.
• Flips:
• Method: Images were randomly flipped horizontally or vertically.
• Example: An image of a car was flipped horizontally to create a mirror image.
• Color Adjustments:
• Method: Adjustments to brightness, contrast, saturation, and hue were applied to images.
• Example: An image of a sunset had its brightness increased by 20% and contrast adjusted
by 15%.
These preprocessing steps and augmentations aimed to increase the variability of
the training data, enhancing the model's ability to generalize and perform well on
unseen data.
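The same augmentations can be expressed with torchvision transforms, roughly as follows; the parameter ranges (±15° rotation, 20% brightness, 15% contrast) mirror the examples above, while the saturation and hue values are additional assumptions.

```python
# Training-time augmentation pipeline (torchvision; parameter ranges illustrative).
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=15),             # rotate by a random angle in [-15°, 15°]
    transforms.RandomHorizontalFlip(p=0.5),            # mirror the image half of the time
    transforms.ColorJitter(brightness=0.2,             # ±20% brightness
                           contrast=0.15,              # ±15% contrast
                           saturation=0.1, hue=0.05),  # mild colour shifts
    transforms.ToTensor(),                             # HWC uint8 -> CHW float in [0, 1]
])
# augmented = train_transform(pil_image)   # applied to each PIL image during training
```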
RESULTS
Unified Feature Extraction Succeeds:
• MoviNet effectively combines visual (ViTs) and textual (NLP) features.
• This creates a richer understanding of scenes, including both visual and
semantic information.
Robustness on Real-World Data:
• MoviNet performs well on diverse datasets with varying conditions.
• It outperforms traditional methods in capturing complex scene details.
RESULTS
Superior Scene Classification:
• MoviNet excels at classifying scenes using both vision and language.
• Its accuracy, precision, and recall outperform unimodal approaches.
Accurate Object Detection:
• MoviNet effectively detects and localizes objects in scenes.
• It leverages combined features for precise object identification, exceeding
unimodal methods.
RESULTS
Accurate Object Detection:
• MoviNet effectively detects and localizes objects in scenes.
• It leverages combined features for precise object identification, exceeding unimodal
methods.
Strong Semantic Segmentation:
• MoviNet demonstrates success in understanding object distribution within scenes.
• Its detailed segmentation maps outperform unimodal methods in accuracy (IoU and
pixel accuracy).
• Overall, MoviNet's strong performance across various tasks highlights its effectiveness in
cross-modal scene understanding. By combining visual and textual information,
MoviNet achieves superior results compared to traditional unimodal methods. This
paves the way for advancements in AI's ability to understand and navigate the
complexities of the real world.
RESULTS
Confusion Matrix
The confusion matrix shows the number of correct and incorrect predictions
made by the logistic regression model on the test set.
RESULTS
F1 Score
• The F1 score for the model is 0.88, indicating a good balance between precision
and recall.
Accuracy
• The model achieved an accuracy of 0.90 on the test set.
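Figures of this kind can be reproduced with scikit-learn's metric helpers; the sketch below uses placeholder label arrays in place of the actual test-set labels and predictions.

```python
# Confusion matrix, accuracy, and F1 score on a test set (placeholder labels).
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])   # placeholder ground-truth labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 0])   # placeholder model predictions

print(confusion_matrix(y_true, y_pred))   # rows = true classes, columns = predicted classes
print(accuracy_score(y_true, y_pred))     # overall fraction of correct predictions
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
```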
RESULTS
Comparison Table

Model                 Accuracy   F1 Score   AUC
Logistic Regression   0.90       0.88       0.92
Decision Tree         0.85       0.83       0.87
Random Forest         0.92       0.90       0.94
SVM                   0.88       0.86       0.89
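A comparison table like this is typically produced by looping over scikit-learn estimators; the minimal sketch below uses synthetic data and default hyperparameters, so its numbers will not match the table above.

```python
# Comparing classifiers by accuracy, F1, and AUC (synthetic data, default settings).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),   # probability=True enables AUC
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: acc={accuracy_score(y_te, pred):.2f} "
          f"f1={f1_score(y_te, pred):.2f} auc={auc:.2f}")
```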
DISCUSSION
1. Cross-Modal Advantage:
• Processing both vision and text together (cross-modal) significantly improves
spatial awareness compared to traditional methods.
2. Real-World Applicability:
• MoviNet's performance on diverse datasets suggests its potential for real-world
applications like robotics and autonomous navigation due to its ability to
handle complex environments.
3. Practical Implications:
• MoviNet's success paves the way for advancements in AI systems requiring
spatial understanding, such as autonomous vehicles, human-robot interaction,
and augmented reality.
DISCUSSION
4. Outperforming Traditional Methods:
• MoviNet consistently outperforms traditional unimodal methods across
various tasks, highlighting the benefits of its cross-modal approach for
capturing context and achieving superior spatial awareness.
CONCLUSION
The results and discussions substantiate the effectiveness of the MoviNet
framework in achieving enhanced spatial awareness through cross-modal
scene understanding. The successful integration of visual and textual
information, demonstrated across various tasks, opens avenues for advancing
AI systems' spatial perception capabilities. Future work will examine the
practical implications, limitations, and avenues for further research to provide a
comprehensive perspective on the study's contributions.
REFERENCES
• Arkin, E., et al. (2021). A survey of object detection based on CNN and transformer. 2021 IEEE 2nd
international conference on pattern recognition and machine learning (PRML), IEEE.
• Bakhtiari, H. R. R., et al. (2017). "Semi automatic road extraction from digital images." The Egyptian
Journal of Remote Sensing and Space Science 20(1): 117-123.
• Jiao, C., et al. (2022). "A fast and effective deep learning approach for road extraction from historical
maps by automatically generating training data with symbol reconstruction." International Journal of
Applied Earth Observation and Geoinformation 113: 102980.
• Lemonde, C., et al. (2021). "Integrative analysis of multimodal traffic data: addressing open challenges
using big data analytics in the city of Lisbon." European transport research review 13: 1-22.
• Lin, M. and W.-J. Hsu (2014). "Mining GPS data for mobility patterns: A survey." Pervasive and mobile
computing 12: 1-16.
• Nam, W. and B. Jang (2023). "A survey on multimodal bidirectional machine learning translation of
image and natural language processing." Expert Systems with Applications: 121168.
• Nellore, K. and G. P. Hancke (2016). "A survey on urban traffic management system using wireless sensor
networks." Sensors 16(2): 157.
REFERENCES
• Sayed, S. A., et al. (2023). "Artificial intelligence-based traffic flow prediction: a comprehensive review." Journal of Electrical Systems
and Information Technology 10(1): 13.
• Seymour, Z., et al. (2021). Maast: Map attention with semantic transformers for efficient visual navigation. 2021 IEEE International
Conference on Robotics and Automation (ICRA), IEEE.
• Wang, Y., et al. (2022). Multimodal token fusion for vision transformers. Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition.
• Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). Soft-NMS - Improving Object Detection with One Line of Code. Proceedings of
the IEEE International Conference on Computer Vision, 2017-October, 5562–5570. https://doi.org/10.1109/ICCV.2017.593
• Huo, J., Sun, Q., Jiang, B., Lin, H., & Fu, Y. (n.d.). GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for
Vision-and-Language Navigation.
• Liu, S., Hui, T., Huang, S., Wei, Y., Li, B., & Li, G. (2022). Cross-Modal Progressive Comprehension for Referring Segmentation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 44(9), 4761–4775. https://doi.org/10.1109/TPAMI.2021.3079993
• Li, Y., Zhu, Z., Yu, J. G., & Zhang, Y. (2021). Learning Deep Cross-Modal Embedding Networks for Zero-Shot Remote Sensing Image
Scene Classification. IEEE Transactions on Geoscience and Remote Sensing, 59(12), 10590–10603.
https://doi.org/10.1109/TGRS.2020.3047447
• Luna-Jiménez, C., Cristóbal-Martín, J., Kleinlein, R., Gil-Martín, M., Moya, J. M., & Fernández-Martínez, F. (2021). Guided spatial
transformers for facial expression recognition. Applied Sciences (Switzerland), 11(16). https://doi.org/10.3390/app11167217
• Schrier, Karen., Swain, C., Wagner, M. (Michael G. ), & SIGGRAPH. (2008). Proceedings, Sandbox Symposium 2008 : 3rd ACM
SIGGRAPH videogame symposium, Los Angeles, California, August 9-10, 2008. Association for Computing Machinery.
REFERENCES
• Tang, X., Wang, Y., Ma, J., Zhang, X., Liu, F., & Jiao, L. (2023). Interacting-Enhancing Feature Transformer for Cross-Modal
Remote-Sensing Image and Text Retrieval. IEEE Transactions on Geoscience and Remote Sensing, 61.
https://doi.org/10.1109/TGRS.2023.3280546
• Wang, S., Wang, R., Yao, Z., Shan, S., & Chen, X. (n.d.). Cross-modal Scene Graph Matching for Relationship-aware Image-
Text Retrieval.
• Wang, X., Huang, Q., Celikyilmaz, A., Gao, J., Shen, D., Wang, Y. F., Wang, W. Y., & Zhang, L. (2019). Reinforced cross-
modal matching and self-supervised imitation learning for vision-language navigation. Proceedings of the IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, 2019-June, 6622–6631.
https://doi.org/10.1109/CVPR.2019.00679
• Wei, Y., Zhao, Y., Lu, C., Wei, S., Liu, L., Zhu, Z., & Yan, S. (2017). Cross-Modal Retrieval With CNN Visual Features: A New
Baseline. IEEE Transactions on Cybernetics, 47(2), 449–460. https://doi.org/10.1109/TCYB.2016.2519449
• Wu, X., Lau, K., Ferroni, F., Ošep, A., & Ramanan, D. (n.d.). Pix2Map: Cross-modal Retrieval for Inferring Street Maps from
Images.
• Zhang, H., Mao, Z., Zhang, K., & Zhang, Y. (2022). Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text
Matching. www.aaai.org
• Zhu, Y., Zhu, F., Zhan, Z., Lin, B., Jiao, J., Chang, X., & Liang, X. (2020). Vision-dialog navigation by exploring cross-modal
memory. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 10727–
10736. https://doi.org/10.1109/CVPR42600.2020.01074