[DSC Europe 24] Aleksandar Cvejic - Rivian Autonomy and AI: Building a Scalable Platform for Data and ML Workloads

Rivian Internal
Rivian Autonomy and AI
Building a Scalable Platform for Data and ML Workloads

Aleksandar Cvejic
Data Applications and ML Infrastructure
Autonomy and AI, Rivian

Agenda
- ADAS challenges
- ADAS at Rivian
- Polaris - Data Platform
- Polaris Apps
- Simulations
- ML workloads

ADAS - Trends and Challenges
- Data volume and variety
- Processing and orchestration
- Cost/time to market

ADAS Data Variety
- 11 Cameras
  - 55 MP of real-time imaging
  - State-of-the-art resolution and dynamic range
- 5 Radars
  - Multi-modal sensing supported by 5 advanced radars
  - 300 meters of forward-facing detection range
- 360° Sensing
  - Overlapping sensors provide redundancy, robustness, and performance

ADAS Data Volumes
- RPD (Records Per Day): 6B -> 40B -> 2T
- Total Volume: 14 TB -> 50 TB -> 1.6 PB
- SLA (data availability): 150 h -> 38 h -> 4 h
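As a quick check on the scale of these changes, the overall growth factors can be worked out from the first and last value of each row. This is a minimal sketch that assumes "b" and "t" mean billion and trillion and that 1.6 PB is 1600 TB; the names `milestones` and `overall_factor` are illustrative only. The results (about 333x, 114x and 37.5x) are close to the 300x, 115x and 40x figures that appear on the closing slide, which may be what those figures refer to.

```python
# Figures from the slide. "b"/"t" are read as billion/trillion (assumption).
milestones = {
    "records_per_day": (6e9, 40e9, 2e12),
    "total_volume_tb": (14.0, 50.0, 1600.0),  # 1.6 PB taken as 1600 TB (assumption)
    "sla_hours": (150.0, 38.0, 4.0),
}

# Metrics that improve by shrinking rather than growing.
shrinking = {"sla_hours"}


def overall_factor(name, values):
    """Return the first-to-last improvement factor for one metric."""
    first, last = values[0], values[-1]
    return first / last if name in shrinking else last / first


factors = {name: overall_factor(name, values) for name, values in milestones.items()}

for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x")
```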

ADAS Data Loop
- Uploaded & Anonymized
- Labelling in Cloud
- Model Training
- New Software OTA

ADAS - Data pipeline
- Data Collection
- Loggers
- Ingestion
- Metadata Tagging
- Curation
- Labeling
- Data Set Management
- Model Training
- Deployment
- OTA Updates

POLARIS - Autonomy Data Platform
Principles:
- Ease of use and Reusability
- Data Management, Governance and Security
- Data Operations, Automation and Collaboration Platform
- Data Visualization and Analytics
- Support Simulations and ML Workloads
Apps:
- Data Discovery
- Metadata Management Framework
- Labeling, Visualization
- Simdash, Cosmo and Rivx

ADAS Architecture for ML workload
- Model Training: S3, EKS, KubeRay, MLflow
- Simulation and Validation: S3, AWS Batch, Databricks, GitLab
- Data ingestion and processing: SQS, Argo Workflows, AWS Batch, S3
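To make the ingestion path more concrete, here is a minimal sketch of how an SQS message could be turned into an AWS Batch job request. It assumes, without confirmation from the slides, that S3 event notifications are delivered through SQS and that each uploaded log file becomes one Batch job. The queue name, job definition name, environment variable names, and the sample message are all hypothetical. The code only builds the keyword arguments for the Batch SubmitJob API and does not call AWS.

```python
import json
import re
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body):
    """Extract (bucket, key) pairs from an S3 event notification in an SQS body."""
    event = json.loads(body)
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def batch_job_request(bucket, key, job_queue, job_definition):
    """Build keyword arguments for the AWS Batch SubmitJob API for one object."""
    # Batch job names allow letters, digits, hyphens and underscores (max 128 chars).
    safe_name = re.sub(r"[^A-Za-z0-9_-]", "-", key)[:100]
    return {
        "jobName": f"ingest-{safe_name}",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "environment": [
                {"name": "INPUT_BUCKET", "value": bucket},  # hypothetical variable
                {"name": "INPUT_KEY", "value": key},  # hypothetical variable
            ]
        },
    }


# Hypothetical message body, shaped like an S3 event notification.
sample_body = json.dumps(
    {
        "Records": [
            {
                "s3": {
                    "bucket": {"name": "example-logs"},
                    "object": {"key": "drive+01/log%3A1.bin"},
                }
            }
        ]
    }
)

job_requests = [
    batch_job_request(bucket, key, "ingest-queue", "ingest-jobdef")
    for bucket, key in s3_objects_from_sqs_body(sample_body)
]
print(job_requests[0]["jobName"])
```

With boto3, each request could then be passed on as `boto3.client("batch").submit_job(**request)`. Whether the real pipeline submits to Batch directly or goes through Argo Workflows is not stated in the slides.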

Simulations Architecture
Bottlenecks:
- TPS
- Startup time
- Large simulations
Optimizations:
- Bundling jobs
- Caching Docker images
- Fair-share policies
  - shareIdentifier
  - weightFactor
  - shareDecay
- Compute capacity reservation
Components: Simdash, Databricks, Amazon S3, AWS Batch, Streamlit, PolarisViz
Simulation and validation triggers: GitLab CI (on demand), EventBridge (scheduled)
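Two of the optimizations listed above can be sketched in a few lines. This is an illustrative example, not the presenter's implementation. `bundle` groups many small simulation cases into fewer submissions, which reduces the number of job-submission API calls. `fair_share_policy` builds the `fairsharePolicy` structure accepted by the AWS Batch CreateSchedulingPolicy API, where the decay parameter is named `shareDecaySeconds`. The team names, weights, and bundle size are hypothetical.

```python
from itertools import islice


def bundle(items, bundle_size):
    """Yield lists of at most bundle_size items, preserving input order."""
    iterator = iter(items)
    while chunk := list(islice(iterator, bundle_size)):
        yield chunk


def fair_share_policy(weights, decay_seconds):
    """Build the fairsharePolicy structure for an AWS Batch scheduling policy.

    weights maps a shareIdentifier to its weightFactor. In AWS Batch, a lower
    weightFactor gives that share a larger portion of compute.
    """
    return {
        "shareDecaySeconds": decay_seconds,
        "shareDistribution": [
            {"shareIdentifier": name, "weightFactor": weight}
            for name, weight in sorted(weights.items())
        ],
    }


# Hypothetical workload: 1000 simulation cases, 50 cases per submitted job.
bundles = list(bundle(range(1000), 50))
print(len(bundles), "submissions instead of 1000")

# Hypothetical share identifiers and weights.
policy = fair_share_policy({"nightly": 1.0, "ci": 0.5}, decay_seconds=3600)
print(policy)
```

With boto3, the policy could be registered through `create_scheduling_policy(name=..., fairsharePolicy=policy)`, and each submitted job would then carry a matching `shareIdentifier`.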

ML Infrastructure
- AWS CloudWatch, AWS Managed Grafana, AWS Managed Prometheus
- AWS EC2, Amazon S3
- Plugins: EFA, Fluentd, EBS, Prometheus, EFS, GPU operator, S3 Mountpoint
- Job Management: Kueue, RayJob, PyTorch
- RayJobs, Experiment Tracking
- Rivx CLI, Polaris, Cosmo
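As an illustration of how Kueue and KubeRay fit together in the job-management layer, the sketch below builds a RayJob manifest that Kueue can queue. It is an assumption about how such a job might look, not the rivx or Cosmo implementation. The job name, queue name, image, and entrypoint are hypothetical. The `kueue.x-k8s.io/queue-name` label and the `ray.io/v1` RayJob fields follow the public Kueue and KubeRay conventions.

```python
import json


def rayjob_manifest(name, queue, image, entrypoint, workers, gpus_per_worker):
    """Build a KubeRay RayJob manifest labeled for admission by a Kueue queue."""
    worker_container = {
        "name": "ray-worker",
        "image": image,
        "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
    }
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {
            "name": name,
            # Kueue picks up jobs through this label on the job object.
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            # Created suspended so that Kueue decides when the job may start.
            "suspend": True,
            "shutdownAfterJobFinishes": True,
            "entrypoint": entrypoint,
            "rayClusterSpec": {
                "headGroupSpec": {
                    "rayStartParams": {},
                    "template": {
                        "spec": {"containers": [{"name": "ray-head", "image": image}]}
                    },
                },
                "workerGroupSpecs": [
                    {
                        "groupName": "gpu-workers",
                        "replicas": workers,
                        "minReplicas": workers,
                        "maxReplicas": workers,
                        "rayStartParams": {},
                        "template": {"spec": {"containers": [worker_container]}},
                    }
                ],
            },
        },
    }


# All values below are hypothetical placeholders.
manifest = rayjob_manifest(
    name="train-example",
    queue="team-queue",
    image="example.registry/ray-train:latest",
    entrypoint="python train.py",
    workers=4,
    gpus_per_worker=8,
)
print(json.dumps(manifest, indent=2))
```

The JSON output can be applied with `kubectl apply -f -`, since Kubernetes accepts JSON manifests as well as YAML.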

ML Infrastructure Framework - rivx

ML Infrastructure Framework

Rivian Internal
ML Infrastructure Framework - Cosmo

What’s next?
- Stability
- Optimization
- Capabilities
300x
115x
40x
