Would you build a Self-Driving Car?

Well, I had no choice: it was one of the most interesting deep learning projects in my DL course (MS-AI) at UT Austin. Let me walk you through it.


Imagine teaching a computer to drive a car using three completely different approaches to artificial intelligence. That's exactly what I recently accomplished, and the results were fascinating. This isn't just about autonomous vehicles - it's about understanding how different AI architectures solve the same problem in fundamentally different ways.

The challenge was to predict the optimal path for a vehicle by analyzing road boundaries and generating waypoints (future positions). Think of it as teaching AI to "see" the road and predict where the car should go next. We used the SuperTuxKart Drive Dataset, which contains 8,000 training samples from 16 different driving episodes and 2,000 validation samples from 4 episodes. This dataset includes multiple data types: RGB images, track boundary coordinates, and waypoint labels, making it perfect for testing different AI approaches.

Understanding the Dataset: SuperTuxKart Drive Dataset

Before diving into the architectures, let's understand the rich dataset that powers our autonomous driving models. The SuperTuxKart Drive Dataset is a comprehensive collection of driving scenarios that mimics real-world autonomous driving challenges.

Figure 1: Comprehensive dataset visualization showing different road scenarios, data distribution, and error patterns across models.

Dataset Composition and Structure

The dataset contains three distinct driving scenarios that test different aspects of autonomous navigation:

Straight Road Scenarios: These scenarios test the model's ability to maintain lane position and predict forward movement. The road boundaries are parallel, and the optimal path is a straight line down the center. This represents the most basic driving scenario but is crucial for highway driving and lane-keeping applications.

Curved Road Scenarios: These scenarios introduce lateral movement prediction challenges. The road boundaries curve, requiring the model to understand spatial relationships and predict appropriate steering angles. This represents typical urban driving conditions where roads aren't always straight.

Intersection Scenarios: These are the most complex scenarios, testing the model's ability to navigate through road intersections. The model must understand when to continue straight, when to turn, and how to maintain proper positioning through complex road geometries.

Data Types and Their Significance

Each sample in our dataset contains three types of information, each serving a specific purpose in autonomous driving:

RGB Images (H×W×3): These are visual representations of the road scene, similar to what a human driver sees through the windshield. The images contain rich visual information about road boundaries, lane markings, obstacles, and environmental context. For our CNN model, these images are the primary input, allowing the model to learn visual patterns and spatial relationships.

Figure 2: Sample road images from the SuperTuxKart Drive Dataset showing different driving scenarios including straight roads, curved roads, intersections, highways, urban roads, and country roads. Each image demonstrates the visual complexity that our CNN model must process.

Track Boundary Coordinates: These are precise numerical representations of the road's left and right boundaries. Each boundary is represented as a series of (x, y) coordinate pairs, typically 10 points per boundary. This structured data is perfect for our MLP model, which can learn patterns in the numerical representation of the road geometry.

Waypoint Labels: These are the ground truth values that our models must predict. Each sample contains three waypoints ahead of the current vehicle position, each with (x, y) coordinates. These waypoints represent the optimal path the vehicle should follow, similar to how a GPS navigation system provides turn-by-turn directions.

Dataset Statistics and Distribution

The dataset is carefully balanced to ensure robust model training:

  • 8,000 training samples from 16 different driving episodes provide sufficient data for learning complex patterns

  • 2,000 validation samples from 4 episodes ensure reliable model evaluation

  • Multiple data types per sample ensure comprehensive learning from different perspectives

  • Diverse road geometries test model generalization across different driving conditions

This rich dataset structure allows us to compare how different AI architectures process the same information in fundamentally different ways, leading to insights about which approach is best suited for different aspects of autonomous driving.

Understanding the Problem: Waypoint Prediction

Before diving into the architectures, let's understand what we're trying to solve. In autonomous driving, a waypoint is a future position where the vehicle should be. Our models need to predict three waypoints ahead of the current position, each with x and y coordinates.

The input data comes in two forms:

  1. Track boundaries: Left and right road edges as coordinate pairs

  2. RGB images: Visual representation of the road scene

Our models must output six values: three waypoints × two coordinates (x, y) each.
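
To make the tensor shapes concrete, here is a minimal sketch in PyTorch; the variable names and the image resolution are our own placeholders, not the dataset's actual API:

```python
import torch

# Hypothetical batch illustrating the shapes involved; names and the image
# resolution are placeholders, not the dataset's actual API.
batch_size = 32
track_left = torch.randn(batch_size, 10, 2)   # 10 (x, y) points on the left boundary
track_right = torch.randn(batch_size, 10, 2)  # 10 (x, y) points on the right boundary
image = torch.randn(batch_size, 3, 96, 128)   # RGB road image (3 x H x W)
waypoints = torch.randn(batch_size, 3, 2)     # ground truth: 3 future (x, y) positions

# Whatever its input, each model must regress 6 numbers per sample:
assert waypoints.flatten(1).shape == (batch_size, 6)
```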

The Three Approaches

Approach 1: Multi-Layer Perceptron (MLP) - The Pattern Recognition Expert

The Multi-Layer Perceptron is the foundation of neural networks. Think of it as the brain's basic neural network - simple, powerful, and surprisingly effective for structured data.

Our MLP architecture takes the track boundaries as input. We flatten the left and right track coordinates into a 40-dimensional vector (10 points per boundary × 2 boundaries × 2 coordinates). This vector flows through three fully connected layers with ReLU activation functions and dropout for regularization.
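
As a rough illustration, here is what such an MLP might look like in PyTorch. The hidden width and dropout rate are assumptions; the structure (flatten to 40 inputs, three fully connected layers with ReLU and dropout, six outputs) follows the description above:

```python
import torch
import torch.nn as nn

class WaypointMLP(nn.Module):
    """Maps flattened track boundaries (40 values) to 3 waypoints (6 values)."""

    def __init__(self, n_points: int = 10, hidden: int = 128, p_drop: float = 0.2):
        super().__init__()
        in_dim = n_points * 2 * 2  # 10 points x 2 boundaries x 2 coordinates = 40
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 6),  # 3 waypoints x (x, y)
        )

    def forward(self, track_left: torch.Tensor, track_right: torch.Tensor) -> torch.Tensor:
        x = torch.cat([track_left, track_right], dim=1).flatten(1)  # (B, 40)
        return self.net(x).view(-1, 3, 2)  # (B, 3, 2)
```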

The beauty of the MLP approach lies in its simplicity. It doesn't try to understand spatial relationships or visual patterns - it simply learns to recognize patterns in the numerical representation of the road boundaries. This is similar to how a human driver might learn to associate certain road shapes with specific driving actions.

Our MLP achieved remarkable performance with a lateral error of 0.425 and longitudinal error of 0.134. The lateral error measures how accurately the model predicts left-right movement (steering), while longitudinal error measures forward-backward movement (speed control). Both errors are measured in the same units as the road coordinates.

Figure 3: Performance comparison of all three models. The MLP shows the best overall performance, achieving the lowest lateral error and excellent longitudinal error.

Approach 2: Convolutional Neural Network (CNN) - The Visual Intelligence

While the MLP processes numerical coordinates, the CNN processes actual images of the road. This is like giving the AI "eyes" to see the world as humans do.

Our CNN architecture is more sophisticated, featuring residual connections and multiple convolutional layers. The input is a 3-channel RGB image that gets processed through several convolutional blocks, each followed by batch normalization and ReLU activation.

The CNN's strength lies in its ability to learn hierarchical features. The first layers learn simple features like edges and textures, while deeper layers learn complex patterns like road boundaries, curves, and obstacles. This hierarchical learning is similar to how the human visual cortex processes information.
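
A minimal sketch of this kind of residual CNN in PyTorch follows. The channel counts and strides are illustrative assumptions, while the residual connections, batch normalization, ReLU activations, and dropout rates (0.3, 0.3, 0.2) mirror the techniques listed below:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv -> BN -> ReLU twice, with a skip connection for gradient flow."""

    def __init__(self, c_in: int, c_out: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        # 1x1 conv so the skip path matches shape when channels/stride change
        self.skip = (
            nn.Identity()
            if c_in == c_out and stride == 1
            else nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.skip(x))

class WaypointCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            ResidualBlock(3, 32, stride=2),    # early layers: edges and textures
            ResidualBlock(32, 64, stride=2),   # mid layers: road boundaries
            ResidualBlock(64, 128, stride=2),  # deep layers: curves and layout
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.3), nn.Linear(128, 128), nn.ReLU(),
            nn.Dropout(0.3), nn.Linear(128, 64), nn.ReLU(),
            nn.Dropout(0.2), nn.Linear(64, 6),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(image)).view(-1, 3, 2)
```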

Figure 4: Visual comparison of the three architectures. Each approach processes information differently: MLP uses flattened coordinates, CNN processes visual features hierarchically, and Transformer uses attention mechanisms.

We implemented several advanced techniques to improve CNN performance:

  1. Residual Connections: Skip connections that help with gradient flow in deep networks

  2. Batch Normalization: Stabilizes training and improves convergence

  3. Dropout Regularization: Prevents overfitting with different rates (0.3, 0.3, 0.2)

  4. Weighted Loss Functions: We emphasized longitudinal error with a 3:1 weight ratio

The CNN achieved a lateral error of 0.438 and longitudinal error of 0.224. While both errors were higher than the MLP's, the weighted loss function described later improved the CNN's longitudinal error substantially over our unweighted baseline, showing that the model can learn forward-backward movement patterns when pushed to.

Approach 3: Transformer - The Attention Mechanism Master

The Transformer represents the cutting edge of neural network architecture. Unlike CNNs that process spatial information or MLPs that process flat vectors, Transformers use attention mechanisms to focus on relevant parts of the input.

Our Transformer implementation uses cross-attention, where learnable query embeddings for each waypoint attend to all track boundary points. This is like having a driver who can focus attention on multiple road elements simultaneously.

The Transformer's key innovation is the attention mechanism. Each waypoint query can "attend" to any track boundary point, learning which parts of the road are most relevant for predicting that specific waypoint. This is fundamentally different from CNNs, which process information locally, or MLPs, which process all information equally.
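
Here is a minimal sketch of this cross-attention design in PyTorch, using the final model dimensions reported later (d_model=128, nhead=4, num_layers=3). Note that PyTorch's decoder layer also adds self-attention among the waypoint queries, a detail our exact implementation may differ on:

```python
import torch
import torch.nn as nn

class WaypointTransformer(nn.Module):
    """Learnable waypoint queries cross-attend to encoded track boundary points."""

    def __init__(self, d_model: int = 128, nhead: int = 4, num_layers: int = 3):
        super().__init__()
        self.point_embed = nn.Linear(2, d_model)  # embed each (x, y) boundary point
        self.queries = nn.Parameter(torch.randn(3, d_model))  # one query per waypoint
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 2)  # each query regresses one (x, y) waypoint

    def forward(self, track_left: torch.Tensor, track_right: torch.Tensor) -> torch.Tensor:
        points = torch.cat([track_left, track_right], dim=1)  # (B, 20, 2)
        memory = self.point_embed(points)                     # (B, 20, d_model)
        q = self.queries.unsqueeze(0).expand(points.size(0), -1, -1)
        attended = self.decoder(q, memory)  # cross-attention to boundary points
        return self.head(attended)          # (B, 3, 2)
```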

Figure 5: Cross-attention visualization showing how each waypoint query (rows) attends to different track boundary points (columns). Darker colors indicate stronger attention weights, revealing which road features are most important for predicting each waypoint.

Our Transformer achieved a lateral error of 0.575 and longitudinal error of 0.141. While the lateral error was higher than the other approaches, the longitudinal error was second-best, close behind the MLP's 0.134, demonstrating the Transformer's ability to understand temporal relationships in the driving task.

Understanding Error Metrics

To properly evaluate our models, we need to understand what lateral and longitudinal errors mean in the context of autonomous driving. These metrics are fundamental to autonomous driving safety and performance.

Figure 11: Detailed explanation of lateral and longitudinal errors with visual examples, model comparisons, and real-world impact analysis. This visualization shows how these errors affect vehicle behavior and safety.

Lateral Error: Steering Accuracy

Lateral Error measures how accurately the model predicts left-right movement (steering). This is the horizontal deviation from the optimal path. A lateral error of 0.425 means the model's predicted waypoints deviate from the ground truth by an average of 0.425 units in the lateral direction.

Real-world Impact: High lateral error can cause the vehicle to drift into other lanes, potentially causing accidents. In our visualization, you can see how lateral errors manifest as horizontal deviations from the ideal path. The MLP achieved the lowest lateral error (0.425), followed by the CNN (0.438) and the Transformer (0.575).

Why it matters: Lateral accuracy is crucial for lane-keeping, curve navigation, and avoiding obstacles. Even small lateral errors can compound over time, leading to dangerous situations.

Longitudinal Error: Speed Control

Longitudinal Error measures how accurately the model predicts forward-backward movement (speed control). This is the vertical deviation from the optimal path. A longitudinal error of 0.134 means the model's predicted waypoints deviate from the ground truth by an average of 0.134 units in the forward direction.

Real-world Impact: High longitudinal error can cause the vehicle to brake or accelerate inappropriately, leading to uncomfortable rides and potential safety issues. The MLP achieved the best longitudinal error (0.134), followed closely by the Transformer (0.141), with the CNN trailing at 0.224.

Why it matters: Longitudinal accuracy affects speed control, following distance, and overall driving smoothness. Poor longitudinal prediction can lead to jerky acceleration or unsafe following distances.
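
As a concrete illustration, here is a minimal sketch of how such errors can be computed from predicted and ground-truth waypoints. The convention that index 0 is the lateral axis and index 1 the longitudinal axis is our assumption:

```python
import torch

def waypoint_errors(pred: torch.Tensor, target: torch.Tensor) -> tuple[float, float]:
    """Mean absolute lateral (left-right) and longitudinal (forward) errors.

    pred, target: (B, 3, 2) waypoints. We assume index 0 is the lateral axis
    and index 1 the longitudinal axis; the actual convention may differ.
    """
    abs_err = (pred - target).abs()
    lateral = abs_err[..., 0].mean().item()
    longitudinal = abs_err[..., 1].mean().item()
    return lateral, longitudinal
```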

Error Trade-offs and Model Characteristics

Our analysis reveals interesting trade-offs between the three models:

MLP: Achieves the best overall performance with balanced lateral (0.425) and longitudinal (0.134) errors. This suggests that simple, structured data processing can be highly effective for waypoint prediction.

CNN: Shows moderate lateral error (0.438) but higher longitudinal error (0.224). This indicates that while visual processing helps with steering accuracy, it may struggle with speed prediction without additional temporal context.

Transformer: Achieves a near-best longitudinal error (0.141) but the highest lateral error (0.575). This suggests that attention mechanisms excel at understanding temporal relationships (speed control) but may need more training data for spatial relationships (steering).

These error patterns are crucial for autonomous driving because:

  • High lateral error could cause the vehicle to drift into other lanes

  • High longitudinal error could cause the vehicle to brake or accelerate inappropriately

  • Balanced errors are essential for smooth, safe autonomous driving

Figure 6: Training progress for all three models. The MLP shows stable convergence, the CNN demonstrates more volatile training with our custom loss function, and the Transformer shows gradual improvement in longitudinal error.

Technical Challenges and Solutions

Model Size Constraints

One of the most interesting challenges we faced was the 20MB model size limit. Our initial Transformer model exceeded this limit at 50.26MB. This forced us to think creatively about model architecture.

We solved this by reducing the model complexity:

  • Reduced d_model from 384 to 128

  • Reduced nhead from 12 to 4

  • Reduced num_layers from 6 to 3

This constraint actually led to a better, more efficient design. The smaller model not only fit within the size limit but also trained faster and was less prone to overfitting.
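
A quick way to sanity-check a model against such a size budget is to count its parameters. Here is a small helper, assuming float32 weights at 4 bytes each (checkpoint overhead from optimizer state or metadata would add to this):

```python
import torch.nn as nn

def model_size_mb(model: nn.Module) -> float:
    """Estimate the on-disk size of a model's float32 parameters in megabytes."""
    n_params = sum(p.numel() for p in model.parameters())
    return n_params * 4 / (1024 ** 2)  # 4 bytes per float32 parameter

# Quick sanity check with a stand-in model:
print(f"{model_size_mb(nn.Linear(40, 6)):.4f} MB")
```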

Figure 7: Advanced heatmap analysis showing model performance comparison, training convergence patterns, feature importance distribution, and error correlation matrices. These visualizations reveal the complex relationships between different model characteristics and performance metrics.

Training Optimization

We experimented with different optimization strategies for each architecture:

For MLP: Used Adam optimizer with learning rate 1e-3 and CosineAnnealingLR scheduler. The MLP benefited from a higher learning rate due to its simpler architecture.

For CNN: Used AdamW optimizer with learning rate 5e-5, weight decay 1e-4, and OneCycleLR scheduler. The CNN required more careful optimization due to its complexity.

For Transformer: Used Adam optimizer with learning rate 1e-4 and CosineAnnealingLR scheduler. The Transformer's attention mechanisms required stable training.
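
For reference, here is a sketch of these three setups in PyTorch. The epoch count, steps per epoch, and scheduler arguments are assumptions; the optimizers, learning rates, and weight decay match the values above:

```python
import torch
import torch.nn as nn

# Stand-in models so this snippet runs on its own; in practice these are the
# MLP, CNN, and Transformer architectures described earlier.
mlp, cnn, transformer = nn.Linear(40, 6), nn.Linear(40, 6), nn.Linear(40, 6)
epochs, steps_per_epoch = 50, 250  # assumed: 8,000 samples at batch size 32

mlp_opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
mlp_sched = torch.optim.lr_scheduler.CosineAnnealingLR(mlp_opt, T_max=epochs)

cnn_opt = torch.optim.AdamW(cnn.parameters(), lr=5e-5, weight_decay=1e-4)
cnn_sched = torch.optim.lr_scheduler.OneCycleLR(
    cnn_opt, max_lr=5e-5, total_steps=epochs * steps_per_epoch
)

tfm_opt = torch.optim.Adam(transformer.parameters(), lr=1e-4)
tfm_sched = torch.optim.lr_scheduler.CosineAnnealingLR(tfm_opt, T_max=epochs)
```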

Custom Loss Functions

One of our key innovations was a weighted loss function for the CNN.
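
A minimal sketch of such a loss in PyTorch, assuming each waypoint is a (lateral, longitudinal) pair; the squared-error form is an assumption, but the 3:1 weighting matches what we used:

```python
import torch
import torch.nn as nn

class WeightedWaypointLoss(nn.Module):
    """Waypoint regression loss that up-weights the longitudinal component 3:1.

    Assumes pred and target are (B, 3, 2) tensors where index 0 is the
    lateral axis and index 1 the longitudinal axis.
    """

    def __init__(self, longitudinal_weight: float = 3.0):
        super().__init__()
        self.w = longitudinal_weight

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        err = (pred - target) ** 2          # per-coordinate squared errors
        lateral_loss = err[..., 0].mean()
        longitudinal_loss = err[..., 1].mean()
        return lateral_loss + self.w * longitudinal_loss
```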

This custom loss function emphasized longitudinal error (3x weight) because we found that CNNs struggled more with forward-backward prediction than left-right prediction. This insight led to a 33% improvement in longitudinal error.

Figure 8: Prediction uncertainty visualization using fan charts. Each model shows different uncertainty patterns: MLP has low, consistent uncertainty; CNN shows moderate uncertainty with visual processing; Transformer exhibits higher uncertainty but better longitudinal prediction.

Real-World Applications and Industry Context

Current Industry Applications

The automotive industry is actively using these same architectures:

Tesla Autopilot: Primarily uses CNN-based systems for visual processing. Their "HydraNet" architecture processes multiple camera feeds simultaneously, similar to our CNN approach.

Waymo: Combines multiple architectures for robust perception. They use CNNs for visual processing, Transformers for understanding spatial relationships, and traditional MLPs for decision-making.

Mobileye: Specializes in computer vision for Advanced Driver Assistance Systems (ADAS). Their EyeQ chips process visual data using CNN-like architectures.

Future Possibilities

The field of autonomous driving is rapidly evolving, and our work demonstrates several important trends:

  1. Edge Computing: Deploying these models on vehicle hardware for real-time inference

  2. Multi-Modal Fusion: Combining camera, LiDAR, and radar data using attention mechanisms

  3. Real-time Adaptation: Models that learn from driver behavior and adapt to different driving styles

  4. Safety Systems: Redundancy through multiple AI approaches, similar to our three-model approach

Key Insights and Takeaways

The Power of Multiple Approaches

Our most important finding was that different architectures excel at different aspects of the driving task. The MLP delivered the best overall performance, the CNN excelled at visual understanding, and the Transformer showed promise for longitudinal (speed) prediction through its attention mechanisms.

This suggests that real-world autonomous driving systems should combine multiple approaches rather than relying on a single architecture.

Figure 9: Radar chart comparing model characteristics across multiple dimensions. The MLP excels in training speed and model size, CNN shows balanced performance, while Transformer demonstrates strengths in interpretability and robustness despite larger model size.

The Importance of Constraints

The 20MB model size constraint forced us to think creatively about efficiency. This is a common challenge in real-world applications where computational resources are limited. Our experience shows that constraints often lead to better, more practical solutions.

The Value of Custom Loss Functions

Our weighted loss function for the CNN demonstrates the importance of understanding your specific problem domain. By emphasizing longitudinal error, we achieved significant performance improvements. This kind of domain-specific optimization is crucial for real-world AI applications.

Figure 10: Sophisticated attention flow diagram showing how the Transformer's cross-attention mechanism processes track boundary points and waypoint queries. This visualization demonstrates the complex information flow that enables the Transformer to understand spatial relationships in driving scenarios.

How You Can Apply These Lessons

In Your Projects

  1. Start Simple: Begin with MLPs before jumping to complex architectures. Simple solutions often outperform complex ones.

  2. Experiment with Loss Functions: Custom loss functions can dramatically improve results for specific metrics.

  3. Consider Multiple Approaches: Different problems need different solutions. Don't put all your eggs in one AI basket.

  4. Focus on Data Quality: Good data beats fancy algorithms. The SuperTuxKart dataset was crucial to our success.

In Your Career

  1. Build Portfolio Projects: Real implementations speak louder than theory. Our three-model approach demonstrates practical AI skills.

  2. Understand Trade-offs: Every architecture has strengths and weaknesses. Being able to choose the right tool for the job is a valuable skill.

  3. Stay Current: Follow the latest developments in transformer variants and attention mechanisms. The field evolves rapidly.

Conclusion

The beauty of AI lies in understanding which tool is right for which job. Whether you're building autonomous vehicles, recommendation systems, or natural language processors, the principles remain the same: start simple, iterate quickly, and always keep the real-world application in mind.

Our journey through three different neural network architectures demonstrates that there's no one-size-fits-all solution in AI. Each approach has its strengths, and the key is understanding when to use each one. The MLP taught us that simple solutions can be surprisingly effective. The CNN showed us how visual processing can enhance understanding. The Transformer introduced us to the power of attention mechanisms.

As we move toward a future of autonomous vehicles, these lessons become increasingly important. The cars of tomorrow won't rely on a single AI approach - they'll combine multiple architectures, each optimized for specific aspects of the driving task. Our work provides a glimpse into that future, where different AI approaches work together to create safer, more intelligent transportation systems.

The field of autonomous driving is still in its infancy, but the foundations we're building today will shape the transportation systems of tomorrow. Whether you're an AI engineer, a product manager, or simply someone interested in the future of technology, understanding these different approaches to artificial intelligence is crucial for navigating the road ahead.

