Leveraging Apache Spark to Develop AI-Enabled Products and Services at Bosch
Agenda
• Manufacturing Analytics Solution: Prasanth Lade
• Financial Forecasting: Goktug Cinar
Robert Bosch – a worldwide leading IoT Company
• 268 manufacturing sites
• 1000s of assembly lines
• 409,881 associates
• 60 countries
• 460 local subsidiaries
Four business sectors
• Mobility Solutions
• Industrial Technology
• Energy & Building Technology
• Consumer Goods
Bosch Center for Artificial Intelligence: Sunnyvale, Pittsburgh, Renningen, Tübingen, Haifa, Bangalore, Shanghai
Manufacturing Analytics Solution
Manufacturing Analytics using Spark
Self-Serve Analytics Pipeline
• Automate data pipelining and preparation
• Centralize data storage across assembly lines and plants
• Scalable compute and storage resources
• Standard analytics dashboards
• Self-service analysis
• Advanced analytics tools such as root cause analysis
[Pipeline diagram: Bosch manufacturing plants, Kafka, Hadoop File System, Data Preparation, Root Cause Analysis, Apache Impala, Tableau Extracts, Tableau Server]
Manufacturing Analytics using Spark
Why are parts failing quality checks?
Process 1, Process 2, Process 3, Process 4, Process 5
Potential root causes
• Measured process parameters
• Machine configurations
• Tools and components used
• Locations visited
Target of interest
• Identify quality test failures for certain parts.
Manufacturing Analytics using Spark
Root Cause Analysis: Modules
1. Part graph generation: the assembly process of every unique part is represented as a graph.
2. Feature extraction: features are extracted from the part graph.
3. Feature matrix generation: target variables are mapped to features.
4. Root cause modeling: statistical models are applied to extract potential root causes.
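As a rough illustration of the root cause modeling step, the sketch below ranks feature values by how far the failure rate of parts carrying that value sits above the overall failure rate. The talk does not say which statistical models are used, so this simple rate comparison is a stand-in, and the feature names, part records, and the `rank_root_causes` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical feature matrix rows: one dict per part, plus a pass/fail target.
parts = [
    {"tool": "T1", "location": "LOC 1", "failed": True},
    {"tool": "T1", "location": "LOC 2", "failed": True},
    {"tool": "T2", "location": "LOC 1", "failed": False},
    {"tool": "T2", "location": "LOC 2", "failed": False},
    {"tool": "T1", "location": "LOC 1", "failed": False},
    {"tool": "T2", "location": "LOC 2", "failed": False},
]

def rank_root_causes(rows, target="failed"):
    """Rank (feature, value) pairs by failure rate minus the overall failure rate."""
    overall = sum(row[target] for row in rows) / len(rows)
    counts = defaultdict(lambda: [0, 0])  # (feature, value) -> [failures, total]
    for row in rows:
        for feature, value in row.items():
            if feature == target:
                continue
            counts[(feature, value)][0] += row[target]
            counts[(feature, value)][1] += 1
    scores = {key: fails / total - overall for key, (fails, total) in counts.items()}
    # Highest excess failure rate first: these are the candidate root causes.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = rank_root_causes(parts)
for (feature, value), score in ranking:
    print(f"{feature}={value}: {score:+.2f}")
```

In this toy data the tool value `T1` ranks first because two of its three parts fail, while both locations fail at exactly the overall rate and score zero.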
Manufacturing Analytics using Spark
Root Cause Analysis: Sample code
[Figure: sample part graph over LOC 1, LOC 2, LOC 3, and LOC 4; each location carries parameters, tests, tools, etc.]
Part graphs
PART_ID PART_GRAPH
B6788098 [graph]
FF556828 [graph]
A6678B34 [graph]
Feature extractor
Features
PART_ID FEATURES
B6788098 [f1, f2]
FF556828 [f1, f2, f3, f4]
A6678B34 [f2, f3]
Manufacturing Analytics using Spark
Root Cause Analysis: Sample code
Feature extractor example
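The code shown on this slide did not survive extraction. As a minimal sketch of what a feature extractor over a part graph could look like, the example below flattens one graph into named features. The graph layout (an ordered list of location nodes carrying parameters, tests, and tools), the feature naming scheme, and the `extract_features` helper are assumptions for illustration, not the code from the talk.

```python
# Hypothetical part graph: nodes are the locations a part visited, in order,
# each carrying the parameters, tests, and tools recorded there.
part_graph = {
    "part_id": "B6788098",
    "nodes": [
        {
            "location": "LOC 1",
            "parameters": {"torque": 4.2},
            "tests": {"leak": "pass"},
            "tools": ["T17"],
        },
        {
            "location": "LOC 2",
            "parameters": {"pressure": 1.1},
            "tests": {},
            "tools": [],
        },
    ],
}

def extract_features(graph):
    """Flatten one part graph into a {feature_name: value} dict."""
    features = {}
    for node in graph["nodes"]:
        loc = node["location"]
        for name, value in node["parameters"].items():
            features[f"{loc}/parameter/{name}"] = value
        for name, result in node["tests"].items():
            features[f"{loc}/test/{name}"] = result
        for tool in node["tools"]:
            features[f"{loc}/tool/{tool}"] = True
    # The visit order itself is kept as a feature ("locations visited").
    features["path"] = " > ".join(node["location"] for node in graph["nodes"])
    return features

features = extract_features(part_graph)
for name, value in features.items():
    print(name, value)
```

Keeping the location in each feature name means the same parameter measured at two stations stays distinguishable when the per-part dicts are later combined into a matrix.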
Manufacturing Analytics using Spark
Root Cause Analysis: Computational Complexity
▪ The volume of computations needed to identify root causes on a monthly basis:
Total assembly lines: ~10,000
Avg. # of parts produced (per assembly line): ~2 million
Avg. # of data records in HDFS (per assembly line): ~30 billion
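To make the scale concrete, the slide's three figures can be multiplied out. This is only back-of-the-envelope arithmetic, assuming the per-line averages apply uniformly to every assembly line and cover the same period; the variable names are illustrative.

```python
# Figures quoted on the slide (all approximate).
assembly_lines = 10_000
parts_per_line = 2_000_000
records_per_line = 30_000_000_000

total_parts = assembly_lines * parts_per_line          # 2e10 parts
total_records = assembly_lines * records_per_line      # 3e14 records
records_per_part = records_per_line / parts_per_line   # 15,000 records per part

print(f"{total_parts:.1e} parts")
print(f"{total_records:.1e} records")
print(f"{records_per_part:,.0f} records per part")
```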
Manufacturing Analytics using Spark
Root Cause Analysis: The Challenge
Feature matrix generation
PART_ID FEATURES
B6788098 [f1, f2]
FF556828 [f1, f2, f3, f4]
A6678B34 [f2, f3]
PART_ID FEATURES
B6788098 [g1]
FF556828 [g1, g5, g6]
A6678B34 [g1, g2]
X =
DEPENDENT INDEPENDENT
f1 [[g1], [g1], [g1]]
f2 [[g1, None], [g1, None], [g1, g2]]
f3 [[None, None], [g5, g6], [None, None]]
Challenge
• How to scale feature matrix generation for products with increasing volumes.
Solution
• Replaced loops with Python functional constructs such as map, filter, reduce, and partial functions
Before: 7 hours
After: 2 hours
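The functional style named in the solution can be sketched with the standard library alone. The example below builds matrix rows with `map` and `partial`, collects column names with `reduce`, and drops empty rows with `filter`. The numeric values, the helper names, and the choice of passing the column list explicitly are assumptions; the slide does not state how the independent columns for each dependent feature are chosen.

```python
from functools import partial, reduce

# Hypothetical independent-feature values per part (the slide shows only names).
parts = {
    "B6788098": {"g1": 0.7},
    "FF556828": {"g1": 0.4, "g5": 1.2, "g6": 3.0},
    "A6678B34": {"g1": 0.9, "g2": 5.1},
}

def row_for(columns, features):
    """Build one matrix row, using None where the part lacks a column."""
    return list(map(features.get, columns))

def feature_matrix(parts, columns):
    """One row per part, in insertion order, restricted to the given columns."""
    return list(map(partial(row_for, columns), parts.values()))

def all_columns(parts):
    """Union of every feature name seen on any part, folded with reduce."""
    return sorted(reduce(lambda seen, features: seen | set(features), parts.values(), set()))

def non_empty(rows):
    """Drop rows that carry no value at all."""
    return list(filter(lambda row: any(v is not None for v in row), rows))

x_g1_g2 = feature_matrix(parts, ["g1", "g2"])
x_g5_g6 = feature_matrix(parts, ["g5", "g6"])
print(x_g1_g2)
print(non_empty(x_g5_g6))
```

With columns `g1, g2` and `g5, g6` the rows have the same None pattern as the slide's `f2` and `f3` entries. Because each row is produced by a pure function of one part, the same `map` calls translate directly to a distributed map over parts.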
Financial Forecasting
Large Scale Forecasting using Spark: Background and Motivation
Team
▪ Collaboration between controllers, programmers, data engineers, and data scientists
Goal
• Automatically generate sales forecasts
• Increase efficiency, objectivity, and accuracy
• Improve financial decision making for Bosch
Results
• Monthly forecast of KPIs (>300,000 time series; target 3-4M time series)
• Combination of 15+ cutting-edge mathematical models (with two different data transformations) in one tool
• Automated model selection and hierarchically consistent forecasts
Large Scale Forecasting using Spark
Scale of the task
15+ companies under the Bosch group
• Each company has a specific business structure
• The first application is revenue forecasting
• Revenue can be broken down by customer, product, region, and business division
• Forecasts are needed monthly, immediately after the month-closing calculations.
Task: Millions of forecasts within a few hours
• Assume we have 1 million time series
• 5 models per time series → 5M forecasts
• ~5 seconds per model → compute time of 15M seconds
• 1000s of cores needed
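The sizing argument can be checked with a few lines of arithmetic. Note that the inputs as listed (1 million series, 5 models each, ~5 seconds per model) multiply to 25M seconds; the 15M seconds quoted on the slide would correspond to roughly 3 seconds per model. Either figure leads to the same conclusion of thousands of cores. The 3-hour budget below is an assumption standing in for "a few hours", and `cores_for` is a hypothetical helper.

```python
import math

time_series = 1_000_000
models_per_series = 5
seconds_per_model = 5
budget_hours = 3  # assumption: "a few hours" taken as 3

def cores_for(total_seconds, hours):
    """Cores needed to finish total_seconds of work within the time budget,
    assuming perfect parallelism and no scheduling overhead."""
    return math.ceil(total_seconds / (hours * 3600))

forecasts = time_series * models_per_series
compute_seconds = forecasts * seconds_per_model
cores_needed = cores_for(compute_seconds, budget_hours)

print(forecasts, compute_seconds, cores_needed)
print(cores_for(15_000_000, budget_hours))  # using the slide's 15M figure
```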
Large Scale Forecasting Using Spark
Technical Architecture
1. Create Hierarchical Time Series
2. Automated Model Selection using AI
3. AI-based Time Series Forecast
4. Consolidate Hierarchy
Model families: Traditional Models, Hybrid Models, Hierarchical Models, State Space Models
Kubernetes
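One common way to obtain hierarchically consistent forecasts is bottom-up aggregation: forecast at the most detailed level, then sum upward so every parent equals the sum of its children. The talk does not say which reconciliation method the consolidation step uses, so the sketch below is only an illustration of the idea; the hierarchy levels, the numbers, and the `bottom_up` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical leaf-level forecasts keyed by (division, region, product).
leaf_forecasts = {
    ("Mobility", "EU", "P1"): 120.0,
    ("Mobility", "EU", "P2"): 80.0,
    ("Mobility", "US", "P1"): 50.0,
    ("Consumer", "EU", "P3"): 30.0,
}

def bottom_up(leaves):
    """Sum leaf forecasts into every ancestor level so all totals are consistent."""
    totals = defaultdict(float)
    for key, value in leaves.items():
        # Prefixes (), (division,), (division, region), ... are the ancestors;
        # the full key is the leaf itself.
        for depth in range(len(key) + 1):
            totals[key[:depth]] += value
    return dict(totals)

consolidated = bottom_up(leaf_forecasts)
print(consolidated[()])                    # group total
print(consolidated[("Mobility",)])         # one division
print(consolidated[("Mobility", "EU")])    # one division and region
```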
Large Scale Forecasting using Spark
▪ The task is embarrassingly parallel!
Why R?
• The latest and most popular forecasting models are published in R.
• We can use these packages via user-defined functions in Spark.
Why Spark?
• Each core receives one time series and the names of the models to be applied.
• Compute forecasts.
• Return the combined results to the master node.
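The fan-out pattern described here (one time series plus a list of model names per worker, results gathered centrally) can be sketched locally. The example uses a thread pool from the Python standard library as a stand-in for Spark executors, and two trivial forecasters in place of the R packages used in the talk; `MODELS`, `forecast_one`, and `forecast_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical stand-in models; the talk applies R forecasting packages instead.
MODELS = {
    "naive": lambda series: series[-1],   # repeat the last observation
    "mean": lambda series: mean(series),  # average of the history
}

def forecast_one(task):
    """Worker: receives one series and the model names, returns all its forecasts."""
    series_id, series, model_names = task
    return series_id, {name: MODELS[name](series) for name in model_names}

def forecast_all(tasks, workers=4):
    """Fan tasks out to workers and gather the combined results in one dict."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(forecast_one, tasks))

tasks = [
    ("ts-1", [10, 12, 14], ["naive", "mean"]),
    ("ts-2", [5, 5, 8], ["naive"]),
]
results = forecast_all(tasks)
print(results)
```

Because no task depends on another, the work splits cleanly across however many workers are available, which is what makes the problem embarrassingly parallel.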
Large Scale Forecasting using Spark
sparklyr vs. SparkR
▪ sparklyr
  ▪ Accepts data frames
  ▪ Returns data frames
▪ SparkR
  ▪ Accepts data frames or lists
  ▪ Returns data frames or lists
  ▪ More flexibility
sparklyr UDF API
▪ spark_apply(): applies a function to each partition or group of a Spark DataFrame
Large Scale Forecasting using Spark
Spark – lessons learned
▪ User-defined functions (UDFs) in SparkR via spark.lapply()
▪ UDFs over lists are more flexible
▪ They allow changes to the modeling and the use of heterogeneous data without much change to the overall architecture
▪ Use SparkR::spark.addFile to send files needed on all executors
▪ SparkR::spark.lapply() fails on a list with more than ~46k elements (solved in JIRA issue [SPARK-25234])
Large Scale Forecasting using Spark
Performance Gains
[Chart: computation time for 1893 time series]
Thank you!
Abhirup Mallik (Bosch)
Abishek Prasanna (Bosch)
Jeff Thompson (Bosch)
Kasia Vitanachy (Bosch)
Lisa Marion Garcia (Bosch)
Matthew Jones (Bosch)
Nicolas Douard (Virtue Foundation)
Patrick Emmerich (Bosch)
Phil Gaudreau (LinkedIn)
Ruobing Chen (Facebook)
Sascha Vetter (Bosch)
Zichu Li (University of Rochester)