1. Cityscapes Semantic Segmentation using FCN, U-Net, and U-Net++
The following slides explore how deep learning architectures such
as FCN, U-Net, and U-Net++ can be applied to pixel-wise image
segmentation on the Cityscapes dataset, enabling a detailed
understanding of urban scenes.
Ashpak Shaikh (33563)
2. Introduction
Semantic segmentation is a computer vision technique where each pixel in an image is classified into a particular
class. It's crucial in applications like:
Autonomous Driving: Understanding roads, lanes, pedestrians.
Medical Imaging: Detecting tumors or organs pixel-wise.
Agriculture and Satellite Imaging: Land use mapping.
In this project, we aim to compare FCN, U-Net, and U-Net++ on the Cityscapes dataset using TensorFlow 2.18 and
TPU v2, focusing on performance, generalization, and accuracy in complex urban scenes.
3. Abstract
This project focuses on semantic segmentation of urban street scenes using three powerful deep learning
architectures: FCN, U-Net, and U-Net++.
Implemented on the Cityscapes dataset featuring high-resolution images of street-level scenes across European cities.
Each model was trained on Google TPU v2 using TensorFlow 2.18, with custom training loops, loss functions, and
callbacks.
Evaluation was performed using multiple metrics, including IoU, Dice Coefficient, and Pixel Accuracy.
The study helps identify the strengths and trade-offs of each architecture in real-world segmentation tasks.
4. Problem Statement
Semantic segmentation is vital for safe and reliable decision-making in vision-based AI systems like autonomous vehicles.
Challenges Addressed:
Processing large-scale, high-resolution images in real-time.
Ensuring high accuracy at pixel level for all classes.
Handling imbalanced classes and fine-grained structures (like poles, pedestrians).
Efficient model training using Google TPUs and custom pipelines.
Balancing training efficiency with model performance across architectures.
5. Project Overview
This project implements and compares three deep learning segmentation models on the Cityscapes dataset. The key components include:
• Data Preprocessing:
• Resizing, normalizing, and label encoding images (512x512).
• Model Architectures:
• FCN, U-Net, and U-Net++ with encoder-decoder design.
• Custom Training Setup:
• Mixed precision, distributed strategy (TPU), and gradient accumulation.
• Loss Functions:
• SemanticSegmentationLoss and DeepSupervisionLoss.
• Metrics:
• IoU, Dice Coefficient, Per-Class Metrics, Pixel Accuracy.
• Callbacks:
• Advanced logging, checkpointing, and learning rate scheduling.
• Codebase:
• Implemented from scratch with modularity and experimentation in mind.
6. Evaluation Metrics
To fairly compare the models, we use the following evaluation metrics:
1. IoU (Intersection over Union):
• Measures overlap between predicted and ground truth masks.
• Higher IoU = better segmentation accuracy.
2. Per-Class IoU:
• Calculates IoU score for each class (e.g., road, pedestrian, car).
• Useful for spotting class imbalance and model bias.
3. Dice Coefficient:
• Measures similarity between two sets.
• Especially useful for overlapping and fine boundaries.
4. Per-Class Dice Score:
• Helps analyze performance on small or rare classes.
5. Pixel Accuracy:
• Fraction of correctly classified pixels.
• Simpler metric but less informative on imbalanced datasets.
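The metrics above can be sketched in plain NumPy (a minimal per-class version for illustration; the project's actual metric classes are not shown in these slides):

```python
import numpy as np

def iou(pred, gt, cls):
    """Intersection over Union for one class: |A ∩ B| / |A ∪ B|."""
    p, g = pred == cls, gt == cls
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union else 1.0

def dice(pred, gt, cls):
    """Dice coefficient for one class: 2|A ∩ B| / (|A| + |B|)."""
    p, g = pred == cls, gt == cls
    denom = p.sum() + g.sum()
    return 2 * np.logical_and(p, g).sum() / denom if denom else 1.0

def pixel_accuracy(pred, gt):
    """Fraction of pixels whose predicted class matches the label."""
    return (pred == gt).mean()
```

Per-class IoU and Dice are just these functions looped over the class IDs; mean IoU averages them, which is why it exposes weak classes that overall pixel accuracy hides.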
7. FCN Architecture
FCN (Fully Convolutional Network) is one of the earliest deep learning models for semantic segmentation.
🔧 How it Works:
• Replaces fully connected layers with convolutional ones.
• Upsamples using transpose convolutions or bilinear interpolation.
• Adds skip connections to recover spatial details.
✅ Advantages:
• Lightweight and easy to implement.
• Faster inference speed.
• Works well for coarse segmentation.
❌ Limitations:
• Loses fine spatial information.
• Less effective for complex or small objects.
• Lower accuracy on fine-grained classes (e.g., poles, signs).
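The three ideas above (1×1 convolutions for class scores, transpose-convolution upsampling, and an additive skip connection) can be sketched as a toy FCN head in Keras; layer counts and filter sizes are illustrative, not the project's configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def tiny_fcn(num_classes=19, size=512):
    inp = layers.Input((size, size, 3))
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inp)  # 1/2 res
    skip = x                                                                     # kept for detail
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)    # 1/4 res
    x = layers.Conv2D(num_classes, 1)(x)                       # 1x1 conv -> class scores
    x = layers.Conv2DTranspose(num_classes, 4, strides=2, padding="same")(x)     # up to 1/2
    skip = layers.Conv2D(num_classes, 1)(skip)                 # project skip to class scores
    x = layers.Add()([x, skip])                                # FCN-style additive skip
    x = layers.Conv2DTranspose(num_classes, 4, strides=2, padding="same")(x)     # full res
    return tf.keras.Model(inp, layers.Softmax()(x))
```

The additive fusion of coarse scores with an earlier feature map is what distinguishes FCN-8/16-style variants from a plain upsampled FCN-32.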
8. U-Net Architecture
🔍 What is U-Net?
U-Net is a popular encoder-decoder model known for its symmetric “U” shape.
🔧 How it Works:
• Encoder captures context via downsampling.
• Decoder restores resolution via upsampling.
• Skip connections bridge encoder and decoder layers to recover detail.
✅ Advantages:
• Great for medical and low-data domains.
• High localization accuracy.
• Efficient use of features through skip connections.
❌ Limitations:
• High memory and computational requirements.
• Sensitive to overfitting without proper regularization.
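The symmetric encoder-decoder with concatenating skip connections can be sketched as a toy two-level U-Net (illustrative sizes only, not the project's model):

```python
import tensorflow as tf
from tensorflow.keras import layers

def tiny_unet(num_classes=19, size=128):
    inp = layers.Input((size, size, 3))
    # Encoder: capture context via downsampling
    e1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
    e2 = layers.Conv2D(32, 3, padding="same", activation="relu")(layers.MaxPool2D()(e1))
    b  = layers.Conv2D(64, 3, padding="same", activation="relu")(layers.MaxPool2D()(e2))
    # Decoder: restore resolution, re-injecting encoder detail via concatenation
    u2 = layers.Concatenate()([layers.UpSampling2D()(b), e2])
    d2 = layers.Conv2D(32, 3, padding="same", activation="relu")(u2)
    u1 = layers.Concatenate()([layers.UpSampling2D()(d2), e1])
    d1 = layers.Conv2D(16, 3, padding="same", activation="relu")(u1)
    return tf.keras.Model(inp, layers.Conv2D(num_classes, 1, activation="softmax")(d1))
```

Note the contrast with FCN: U-Net concatenates the skip features (preserving them for the decoder to learn from) rather than adding pre-computed class scores.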
9. U-Net++ Architecture
U-Net++ builds on U-Net by introducing nested and dense skip connections and deep supervision.
🔧 What’s New:
• Redesigns skip pathways to reduce semantic gap between encoder and decoder.
• Allows multiple levels of intermediate predictions.
• Promotes better feature fusion.
✅ Advantages:
• Improved generalization and fine segmentation.
• Reduces overfitting and vanishing gradients.
• Performs best on complex datasets like Cityscapes.
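In U-Net++, each skip node convolves the concatenation of all earlier same-level nodes with the upsampled node from one level below; a single such node might be sketched as follows (illustrative, not the project's code):

```python
import tensorflow as tf
from tensorflow.keras import layers

def nested_node(same_level_feats, below_feat, filters=32):
    """One U-Net++ skip node: fuse same-level features with the
    upsampled features from the level below, then convolve."""
    up = layers.UpSampling2D()(below_feat)
    fused = layers.Concatenate()(list(same_level_feats) + [up])
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(fused)
```

Chaining these nodes along each skip pathway is what narrows the semantic gap between encoder and decoder, and each top-level node can emit an intermediate prediction for deep supervision.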
10. Working of the Project – Step-by-Step Pipeline
Data Preparation
• Resized Cityscapes images to 512x512.
• Split into training, validation, and testing sets.
• Normalized and one-hot encoded masks.
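The preprocessing step might look like the following sketch (the 19-class Cityscapes train set is assumed; the exact normalization in the project code is not shown on these slides):

```python
import tensorflow as tf

NUM_CLASSES = 19  # Cityscapes train classes

def preprocess(image, mask):
    # Bilinear resize + [0, 1] normalization for pixels
    image = tf.image.resize(image, (512, 512)) / 255.0
    # Nearest-neighbor resize for the label mask keeps class IDs intact
    mask = tf.image.resize(mask[..., None], (512, 512), method="nearest")
    # One-hot encode the integer labels per pixel
    mask = tf.one_hot(tf.squeeze(mask, axis=-1), NUM_CLASSES)
    return image, mask
```

Nearest-neighbor interpolation for masks matters: bilinear resizing would blend label IDs into meaningless fractional values at class boundaries.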
Model Definition
• FCN (4/8/16), U-Net, U-Net++ architectures implemented in TensorFlow.
• Used modular design for easy comparison.
Custom Loss & Metrics
• Combined SemanticSegmentationLoss and DeepSupervisionLoss.
• Integrated metrics like IoU, Dice, Pixel Accuracy.
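The slides name two custom classes, SemanticSegmentationLoss and DeepSupervisionLoss; their internals are not shown, but a common construction for such losses (an assumption, sketched as plain functions) is cross-entropy plus Dice, with deep supervision as a weighted sum over the intermediate U-Net++ outputs:

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-6):
    # Soft Dice over spatial dims, averaged over batch and classes
    inter = tf.reduce_sum(y_true * y_pred, axis=[1, 2])
    denom = tf.reduce_sum(y_true + y_pred, axis=[1, 2])
    return 1.0 - tf.reduce_mean((2.0 * inter + eps) / (denom + eps))

def seg_loss(y_true, y_pred):
    # Cross-entropy handles per-pixel classification; Dice handles class imbalance
    ce = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y_true, y_pred))
    return ce + dice_loss(y_true, y_pred)

def deep_supervision_loss(y_true, preds, weights=(0.25, 0.25, 0.5)):
    # Weighted sum of the loss at each intermediate output head
    return tf.add_n([w * seg_loss(y_true, p) for w, p in zip(weights, preds)])
```

The weights are hypothetical; deeper heads are often weighted more heavily, as here.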
TPU Training Setup
• Google TPU v2 with mixed precision.
• Strategy for batch splitting across 8 cores.
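A common pattern for this setup (not necessarily the project's exact code) is to detect an attached TPU and fall back to the default strategy otherwise:

```python
import tensorflow as tf

def get_strategy():
    """Return a TPUStrategy when a TPU is attached, else the default strategy."""
    try:
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        strategy = tf.distribute.TPUStrategy(resolver)  # 8 replicas on TPU v2
    except Exception:  # no TPU available in this environment
        strategy = tf.distribute.get_strategy()
    # Mixed precision on TPU would be enabled with:
    # tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")
    return strategy
```

Under a strategy, the global batch is split evenly across replicas, so the per-core batch is the global batch divided by `strategy.num_replicas_in_sync` (8 on TPU v2).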
Training Phase
• Used callbacks: EarlyStopping, LR Scheduler, Checkpoints, Master Logger.
• Tuned batch sizes and learning rates per model.
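The standard Keras pieces of that callback list can be sketched as follows (the Master Logger is a custom component not shown here; file paths and patience values are illustrative):

```python
import tensorflow as tf

callbacks = [
    # Stop when validation loss plateaus and keep the best weights
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    # Halve the learning rate when progress stalls
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
    # Checkpoint only when validation loss improves
    tf.keras.callbacks.ModelCheckpoint("checkpoints/best.keras", save_best_only=True),
]
```

The list is then passed to `model.fit(..., callbacks=callbacks)`, or invoked manually in a custom training loop.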
Evaluation & Visualization
• Generated segmentation masks.
• Plotted side-by-side comparisons with ground truth.
12. Final Highlights & Takeaways
• Key Features
✅ End-to-End Segmentation Pipeline: From data preprocessing to model evaluation.
⚡ TPU-Accelerated Training: Leveraged Google Cloud TPU v2 for faster, distributed learning.
📊 Rich Evaluation Metrics: Used IoU, Dice, and Pixel Accuracy (overall and per-class).
🧠 Model-Agnostic Framework: Easily switch between FCN, U-Net, and U-Net++.
🧪 Custom Loss Functions & Callbacks: Tailored training with Deep Supervision and adaptive scheduling.
🖼️ Visual Interpretation Tools: Predicted masks vs. ground truth comparisons for validation.
🌐 Cityscapes Dataset: Real-world, high-resolution urban scenes for robust model testing.
🧾 Conclusion
🏆 U-Net++ emerged as the best performer in accuracy and generalization due to its nested skip connections and deep supervision.
✅ All three models succeeded in segmenting complex urban scenes with high fidelity.
⏱️ Training with TPU + mixed precision greatly improved efficiency.
🔄 Custom training loop and modular codebase allow easy experimentation and tuning.
GitHub Repository