Spark Technology Center
IBM Apache Spark
The start of something big in data and design.
J. White Bear
Spark Technology Center
IBM Spark
IBM Investment in Computing
Linux, 1999
13,000,000 lines of code.
500+ Server Solutions
Ushered in Computer Science
System 360, 1964
10,000,000 lines of code.
54 Peripheral Solutions
Ushered in Information Science
Apache Spark, 2015
400,000 lines of code.
20 Data & Analytics Solutions
Ushered in Data Science
About Me
Education
• University of Michigan: Computer Science
• Databases, Machine Learning/Computational Biology, Cryptography
• University of California, San Francisco
• Multi-objective Optimization/Computational Biology/Bioinformatics
• McGill University
• Machine Learning/Multi-objective Optimization for Path Planning/Cryptography
Industry
• IBM (6 months)
• Amazon
• TeraGrid
• Pfizer
• Research at UC Berkeley, Purdue University, and every university I ever attended.
Fun Facts (?)
I love research for its own sake. I like robots, helping to cure diseases, advocating for social change and reform, and breaking encryption. I also enjoy all activities involving the ocean, and I usually hate taking pictures.
Outline
Introduction
• Brief overview of the state and direction of robotics
What is SLAM?
• Definition of SLAM
• Key Challenges
Why SLAM on IoT/Spark?
• Benefits
• Current Approaches
The Framework
• The Approach
• Framework and Architecture
The Results
• Challenges / Recommendations
Next Steps
Demo with Gazebo
Questions and Answers
Introduction: Robotics Today
FIRST Robotics World Championship
NASA Glenn Research Center in Cleveland sponsored
Tri-C's team.
Tartan Racing’s Boss, the robotic SUV that won the 2007 DARPA Urban Challenge
South Korean team KAIST wins the DARPA Robotics Challenge
Amazon Drones
Introduction: Robotics Tomorrow
Google Tango: navigate stores, museums and other indoor locations, with directions overlaid onto your surroundings.
Nanorobots wade through blood to deliver drugs.
Space, underground, and underwater rescue and exploration: places humans can’t go.
SLAM and ML on an automated wheelchair.
What is SLAM?
Simultaneous Localization and Mapping (SLAM)
• Formal Definition
• Given a series of sensor observations over discrete time steps, the SLAM problem is to compute an estimate of the agent's location and a map of the environment. All quantities are usually probabilistic, so the objective is to compute a probability distribution over the location and the map given the observations.
• The computational problem of constructing or updating a map of an unknown environment while simultaneously keeping track of an agent's location within it.
• SLAM algorithms use heuristics, machine learning, and probabilistic models to make this problem tractable.
• GPS alone cannot account for unknown barriers, precision navigation, moving objects, or areas with satellite interference, including weather phenomena.
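The formal definition above is usually written as a posterior distribution. The form below is one common variant, given here as a sketch; the symbols are assumptions, not taken from the slide: o_{1:t} denotes the sensor observations up to time step t, x_t the agent's location, and m_t the map.

```latex
\[
  P\left(m_t,\, x_t \mid o_{1:t}\right)
\]
```

Read it as: the joint probability of the map and the agent's location, conditioned on everything observed so far.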
What is SLAM?
What are some of the key challenges in SLAM?
• Computer vision: correctly identifying the images observed.
• Moving objects: non-static environments, such as those containing other vehicles or pedestrians, continue to present research challenges (collision detection).
• Data association: the problem of ascertaining which parts of one image correspond to which parts of another image, where differences are due to movement of the camera, the elapse of time, and/or movement of objects in the photos.
• Loop closure: the problem of recognizing a previously visited location and updating the states accordingly.
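To make the data association challenge concrete, here is a minimal sketch of one simple strategy, nearest-neighbour matching with a distance gate. It assumes 2D point landmarks rather than image regions; the function name `associate` and the gate value are hypothetical and are not taken from the talk.

```python
import math

def associate(observations, landmarks, gate=1.0):
    """Match each observed point to the nearest known landmark.

    Returns one entry per observation: the index of the matched
    landmark, or None when no landmark lies within `gate`
    (such observations are candidates for new landmarks).
    """
    matches = []
    for ox, oy in observations:
        best_index, best_dist = None, gate
        for index, (lx, ly) in enumerate(landmarks):
            dist = math.hypot(ox - lx, oy - ly)
            if dist <= best_dist:
                best_index, best_dist = index, dist
        matches.append(best_index)
    return matches

# Two known landmarks; the third observation is far from both.
landmarks = [(0.0, 0.0), (5.0, 5.0)]
observations = [(0.2, -0.1), (4.9, 5.3), (10.0, 10.0)]
print(associate(observations, landmarks))  # [0, 1, None]
```

A wrong match here feeds a wrong correction into the filter, which is why association errors are costly in practice.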
Why SLAM on IoT?
SLAM in IoT
• "[SLAM] is one of the fundamental challenges of robotics . . . [but it] seems that almost all the current approaches can not perform consistent maps for large areas, mainly due to the increase of the computational cost and due to the uncertainties that become prohibitive when the scenario becomes larger."[12] Generally, complete 3D SLAM solutions are highly computationally intensive, as they use complex real-time particle filters, sub-mapping strategies, or hierarchical combinations of metric topological representations, etc. (Wikipedia)
• Computational costs become prohibitive on embedded systems, especially smaller robotic modules. The data becomes large, and the calculations and corrections over time and space become much more important. Specifically, the computational cost of SLAM increases exponentially with the number of landmarks found.
• The state uncertainty increases with time and space, and must be bounded by some form of machine learning to predict and use accurate corrections in the algorithm.
• Additional sensors, rapid movements, and processing of visual input add further computational burdens.
Why SLAM on IoT?
The Benefits
• Seamless integration and scaling, allowing users to easily improve the heuristics of the algorithm without losing any of the performance expectations of an embedded system.
• Applications include smart cities, lawn mowing, dog walking, kitchen appliances, or even communication inside the human body, creating a truly unique interaction between humans and robotics.
• Large-scale evaluation of performance metrics for all IoT systems (Big Data).
• Monitoring and control of sensors based on stored data (e.g. reducing sensor usage to conserve power).
Why SLAM on IoT?
Current Approaches
• Robot Operating System (ROS): a collection of software frameworks for robot software development.
• Provides operating system-like functionality on a heterogeneous computer cluster.
• Hardware abstraction, low-level device control, implementation of commonly used functionality, message-passing between processes, and package management.
• No true real-time analytics! Despite the importance of reactivity and low latency in robot control, ROS is not a real-time OS.
• Difficult to scale in IoT! Adding a heterogeneous swarm or integrating interactions requires significant planning.
• There is a need! "Are there any plans to build Kalman filtering and system identification into this framework?" https://github.com/sryza/spark-timeseries/issues/19
• We need a framework that can do this! Enter Apache Kafka and Spark Streaming!
The Framework
The Approach
• Extended Kalman Filter (matrix-based update/estimation)
• Nonlinear version of the Kalman filter, which linearizes about an estimate of the current mean and covariance. It is the de facto standard in the theory of nonlinear state estimation, e.g. navigation systems and GPS. (Wikipedia)
• TurtleBot (standard robotics research bot)
• Gazebo Simulator (3D simulator with sensor input and feedback)
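The matrix-based predict/update cycle of an Extended Kalman Filter can be sketched as follows. This is a minimal illustration, not the talk's implementation: it assumes a planar robot pose (x, y, heading), a velocity/turn-rate motion model, and a range-bearing measurement to a single landmark whose position is already known. All function names and noise values are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def inv2(m):
    # Closed-form inverse of a 2x2 matrix.
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def wrap(angle):
    # Keep angles in (-pi, pi].
    return math.atan2(math.sin(angle), math.cos(angle))

def ekf_predict(x, P, v, w, dt, Q):
    """Propagate the pose with speed v and turn rate w; linearize via F."""
    px, py, th = x
    x_pred = [px + v * dt * math.cos(th),
              py + v * dt * math.sin(th),
              wrap(th + w * dt)]
    F = [[1.0, 0.0, -v * dt * math.sin(th)],   # Jacobian of the motion model
         [0.0, 1.0, v * dt * math.cos(th)],
         [0.0, 0.0, 1.0]]
    return x_pred, add(matmul(matmul(F, P), transpose(F)), Q)

def ekf_update(x, P, z, landmark, R):
    """Correct the pose with a (range, bearing) measurement z."""
    px, py, th = x
    dx, dy = landmark[0] - px, landmark[1] - py
    q = dx * dx + dy * dy
    r = math.sqrt(q)
    z_hat = [r, wrap(math.atan2(dy, dx) - th)]
    H = [[-dx / r, -dy / r, 0.0],              # Jacobian of the measurement
         [dy / q, -dx / q, -1.0]]
    S = add(matmul(matmul(H, P), transpose(H)), R)
    K = matmul(matmul(P, transpose(H)), inv2(S))   # Kalman gain (3x2)
    innov = [z[0] - z_hat[0], wrap(z[1] - z_hat[1])]
    x_new = [x[i] + K[i][0] * innov[0] + K[i][1] * innov[1] for i in range(3)]
    x_new[2] = wrap(x_new[2])
    KH = matmul(K, H)
    I_KH = [[(1.0 if i == j else 0.0) - KH[i][j] for j in range(3)]
            for i in range(3)]
    return x_new, matmul(I_KH, P)

# One cycle: drive 1 m forward, then observe a landmark at (2, 0).
P0 = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
Q = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]]
R = [[0.05, 0.0], [0.0, 0.01]]
x_pred, P_pred = ekf_predict([0.0, 0.0, 0.0], P0, v=1.0, w=0.0, dt=1.0, Q=Q)
x_new, P_new = ekf_update(x_pred, P_pred, z=[1.0, 0.0], landmark=(2.0, 0.0), R=R)
```

In a full EKF-SLAM formulation the landmark positions would also be part of the state vector, so the covariance matrix grows with every landmark added; the sketch keeps the landmark fixed to stay short.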
The Framework
Our cluster: IBM SoftLayer cluster with 3 nodes.
Node 1: Management Node
• Apache Kafka (multithreaded producers, each assigned a sensor)
• Simulator/Sensor Data
• Mapping Agent
Node 2: Hadoop/Spark
• Spark Streaming Consumer / Apache Kafka Producer to Simulator
• Spark Streaming
• Spark ML/Analytics
Node 3: Hadoop/Spark
• Spark Streaming Consumer / Apache Kafka Producer to Simulator
• Spark Streaming
• Spark ML/Analytics
The Framework
[Pipeline diagram: Simulated TurtleBot -> Apache Kafka -> Spark Streaming -> Spark ML/Analytics and Computation -> Apache Kafka -> Simulated TurtleBot]
• Odometry, pose and orientation data for every movement.
• Laser scan data every 30 ms with over 1200 data points per read!
• One robot, and not even all the sensors!
A high-performing plug-and-play cloud for smart robotics, drones and intelligent systems that allows easily tuneable interactions for scientists and industry in any environment!
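As a rough worked figure for the laser scanner alone, using only the numbers on this slide (one read every 30 ms, more than 1200 points per read), the sustained rate is at least:

```latex
\[
  \frac{1200\ \text{points}}{0.030\ \text{s}} = 40{,}000\ \text{points/s}
\]
```

Since each read carries "over 1200" points, this is a lower bound, and it covers one sensor on one robot.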
The Framework
• EKF is calculated primarily using matrix operations!
• Raw sensor data is distributed using Apache Kafka. The number of sensors is limited only by the Kafka cluster!
• Improved performance using RDDs and Spark ML for computationally intensive tasks!
• Fast/optimized learning and analytics!
• Real-time sensor messaging!
• Easy sensor integration and scaling!
• Retention of data over time for improved optimizations and accuracy!
The Framework: Apache Kafka
Kafka Integration
• Multithreaded producers for easy scaling and hardware timing
• Apache Kafka Java API backed by a thread pool to handle concurrency
• Allows shared instances of Producer
• Large-scale sensor distributions can be partitioned for easier analysis and significantly decreased latency
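The pattern described above (one producer task per sensor, a thread pool, a single shared producer instance, and partitioning by sensor) can be sketched without a Kafka cluster. The class below is an in-memory stand-in, not the Kafka client API: `InMemoryProducer`, `publish_readings`, and the sensor names are hypothetical, and CRC32 is used only to get a deterministic key hash for the illustration.

```python
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor

class InMemoryProducer:
    """Stand-in for a shared, thread-safe producer (not the Kafka client)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self._lock = threading.Lock()

    def send(self, key, value):
        # Keyed partitioning: a given sensor id always maps to one partition,
        # so readings from that sensor keep their relative order.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        with self._lock:
            self.partitions[index].append((key, value))
        return index

def publish_readings(producer, sensor_id, readings):
    """Task run by one worker thread on behalf of one sensor."""
    for reading in readings:
        producer.send(sensor_id, reading)
    return len(readings)

producer = InMemoryProducer(num_partitions=3)
sensors = {"odometry": [0.1, 0.2, 0.3], "laser": [1.0, 1.1], "pose": [5.0]}

# One task per sensor, all sharing the same producer instance.
with ThreadPoolExecutor(max_workers=len(sensors)) as pool:
    futures = [pool.submit(publish_readings, producer, sensor_id, data)
               for sensor_id, data in sensors.items()]
    sent = sum(f.result() for f in futures)

print(sent)  # 6
```

Keying messages by sensor is the design choice that makes later per-sensor analysis straightforward, because a consumer of one partition sees each sensor's readings in order.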
The Framework: Spark Streaming
Spark Streaming Integration
Apache Spark Streaming
• Replaces the Kafka consumer
• Adheres to fault tolerance policies, incl. WAL (write-ahead logs to HDFS)
• KafkaUtils.createDirectStream: direct approach without receivers in the new version, with better access to low-level Kafka metadata
• Micro-batch processing and better integration into Spark, incl. online learning
Apache Kafka Consumer
• Producer feeds directly to Spark Streaming
• Not necessarily thread safe (Java API)
• Auto-commit feature, partition replication, integration with ZooKeeper. Finely tuned metadata access and storage by topic and partition
• Buffered batches, developing streaming analytics capabilities
The Framework: Spark ML, RANSAC
Spark ML with RANSAC
• RANSAC (random sample consensus)
• One of many iterative methods for estimating the parameters of a mathematical model from a set of observed data that contains outliers.
• Default methodology for determining whether a series of landmarks forms a wall or structure.
• Ideal for consumption with high-throughput batches in Spark Streaming!
• Integrated in this framework as an online learning algorithm, running as a back-end iterative process in Spark Streaming/Spark!
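The RANSAC idea described above can be sketched for the wall-detection case: repeatedly fit a line to two randomly sampled points and keep the line that the most points agree with. This is a plain single-machine illustration, not the Spark ML integration from the talk; the function names, threshold, iteration count, and sample data are all hypothetical.

```python
import math
import random

def fit_line(p, q):
    """Return (a, b, c) with a*x + b*y + c = 0 and a^2 + b^2 = 1."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = math.hypot(a, b)
    if norm == 0.0:          # identical points define no line
        return None
    a, b = a / norm, b / norm
    return a, b, -(a * x1 + b * y1)

def ransac_line(points, iterations=200, threshold=0.05, seed=0):
    """Find the line supported by the most points within `threshold`."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        model = fit_line(*rng.sample(points, 2))
        if model is None:
            continue
        a, b, c = model
        # With a normalized line, |a*x + b*y + c| is the point-line distance.
        inliers = [(x, y) for x, y in points
                   if abs(a * x + b * y + c) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Twenty landmark points along a wall at y = 2, plus five outliers.
wall = [(0.1 * i, 2.0) for i in range(20)]
noise = [(0.3, 0.5), (1.2, 3.4), (0.7, -1.0), (1.8, 0.2), (0.5, 4.1)]
model, inliers = ransac_line(wall + noise)
print(len(inliers))  # 20
```

Each iteration is independent of the others, which is one reason the method lends itself to batch-parallel execution.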
The Results
Key Challenges
• Network latency
• Embedded vs. framework
• Matrix computations and updates to large matrices
• Jacobian (derivatives), inversion, transposition, multiplication, addition/subtraction, Gaussian
• Covariance/estimation computations
• Coordinating movement with computation
• Spark ML to correctly interpret visual landmark data, minimizing errors
The Results
Challenges
• ~4 KLOC (Java != verbose)
• Java lambda documentation
• Kafka topics from the Spark Streaming consumer
• Real-life latency depends on the type of connection and creates additional noise
• Matrix computation
• Defining heuristics
• Communicator to the simulator needs a solid class
The Results
28
Measuring Network Latency in artificially throttled IO simulators. Timing was kept static to
measure real delays in the messages over the cluster and between the simulator against file IO.
PERF1 (w/ Sim) vs PERF2 (file
IO) Iterations: 10
Iterations: 200
The Results
Measuring landmark acquisition and CPU time, embedded vs. framework, at 500 iterations.
• Framework: completed 500 iterations with the expected exponential growth.
• Embedded: failed to complete 500 iterations (reached ~300).
The Results
Measuring landmark acquisition and CPU time, embedded vs. framework, for a complete map. Both installations were run until the number of landmarks/maps was roughly equivalent, and the iterations were marked.
[Figure panels: Iterations: ~100, Time: ~2 min; Iterations: ~100, Time: ~30-40 s]
The Results
Forthcoming Benchmarks
[Figure panels: Iterations: ~100; Iterations: ~20]
• Apache Kafka latency to brokers
• RANSAC convergence of Spark Streaming batches
• Spark Streaming batch processing throughput in relation to processing time
The Results
Performance Tuning and Optimization
• Sparse and distributed matrices in Spark ML
• Optimize matrix computations (EKF)
• Separate threads for Apache Kafka producers
• Spark Streaming batches timed to sensor input cycles to avoid heavy loads and misaligned updates (this could also be tuned using device profiles)
• Slower movement/reduced data points to synchronize calculations with movement and discovery
• Rapid movements produce larger RDDs; create new RDDs and matrices for updates using existing heuristics, since updates can sometimes create bottlenecks
• Standard Spark performance tuning: CPU core maximization and executors
• *Scheduled feature extraction to minimize accumulated error in long runs
• *New parameters or a large skew from ground truth should trigger updates
Next Steps
35
• Expanded stochastic analysis beyond gradient descent
• Kalman Filter and Extended Kalman Filter
• Improving accuracy and precision with an end to end pipeline that allows
customization/optimization
• Path Planning algorithms to improve search and search times
• Incorporate swarms/particles
• A complete robotics library or even extension to handle robotics, computer vision or any
of the ai/machine learning problems specifics to robotics is publishable and opens the
door to a whole new group of scientists.
• Further scaling and optimization with robotic swarms and rapid/increased volume sensor
data
Conclusion
IBM IoT Cloud Open Platform for Industries
IBM Bluemix IoT Zone
IBM IoT Ecosystem
More to come….!!!
Demo (Simulation)
Q & A
Contact Information:
J. White Bear (jwhiteb@us.ibm.com)
IBM Spark Technology Center
425 Market St San Francisco, CA
Special thanks to IBM and the IBM Spark team at the Spark Technology Center for your input, for taking time to discuss, and for allowing me time to work on this project.
Sampada Basakar
Vijay Bommireddipalli
Fred Reiss
Luciano Resende

Spark Technology Center IBM

  • 1. Spark Technology Center IBM Apache Spark The start of something big in data and design. J. White Bear Spark Technology Center
  • 2. IBM Spark IBM Investment in Computing Linux, 1999 13,000,000 lines of code. 500+ Server Solutions Ushered in Computer Science System 360, 1964 10,000,000 lines of code. 54 Peripheral Solutions Ushered in Information Science Apache Spark, 2015 400,000 lines of code. 20 Data & Analytics Solutions Ushered in Data Science
  • 3. IBM Spark About Me 3 Education • University of Michigan- Computer Science • Databases, Machine Learning/Computational Biology, Cryptography • University of California San Francisco- • Multi-objective Optimization/Computational Biology/Bioinformatics • McGill University • Machine Learning/ Multi-objective Optimization for Path Planning/ Cryptography Industry • IBM (6 months) • Amazon • TeraGrid • Pfizer • Research at UC Berkeley, Purdue University, and every university I ever attended.  Fun Facts (?) I love research for its own sake. I like robots, helping to cure diseases, advocating for social change and reform, and breaking encryptions. Also, all activities involving the Ocean and I usually hate taking pictures. 
  • 4. IBM Spark Outline 4 • Brief overview of the state and direction of robotics Introduction • Definition of SLAM • Key Challenges What is SLaM? • Benefits • Current Approaches Why SLaM on IoT/Spark? • The Approach • Framework and Architecture The Framework • Challenges / Recommendations The Results Next Steps Demo with Gazebo Questions and Answers
  • 5. IBM Spark Introduction: Robotics Today 5 FIRST Robotics World Championship NASA Glenn Research Center in Cleveland sponsored Tri-C's team. Tartan Racing’s Boss, the robotic SUV that won the 2007 DARPA Urban Challenge, South Korean Team, KAIST wins the DARPA Robot Challenge Amazon Drones
  • 6. IBM Spark Introduction: Robotics Tomorrow 6 Navigate stores, museums and other indoor locations, with directions overlaid onto your surroundings. Google Tango Nanorobots wade through blood to deliver drugs Space/underground/underwater rescue and exploration. Places humans can’t go. SLaM and ML on automated wheelchair
  • 7. IBM Spark What is SLaM? 7 Simultaneous Localization and Mapping (SLAM) • Formal Definition • Given a series of sensor observations over discrete time steps the SLAM problem is to compute an estimate of the agent's and a map of the environment. All quantities are usually probabilistic, so the objective is to compute (as an example variant): •Computational problem of constructing or updating a map of an unknown environment while simultaneously keeping track of an agent's location within it. •SLAM algorithms use various implementations to attempt to find heuristics to make this problem tractable using machine learning and probabilistic models •GPS cannot account for unknown barriers, precision navigation, moving objects, or any areas with satellite interference including weather phenomena.
  • 8. IBM Spark What is SLAM? 8 What are some of the key challenges in SLAM? •Computer vision correctly and identifying images observed •Moving objects Non-static environments, such as those containing other vehicles or pedestrians, continue to present research challenges. (collision detection) •Data Association-refers to the problem of ascertaining which parts of one image correspond to which parts of another image, where differences are due to movement of the camera, the elapse of time, and/or movement of objects in the photos. •Loop closure is the problem of recognizing a previously visited location and updating the states accordingly.
  • 10. IBM Spark Why SLAM on IoT? 10 SLAM in IoT • "[SLAM] is one of the fundamental challenges of robotics . . . [but it] seems that almost all the current approaches can not perform consistent maps for large areas, mainly due to the increase of the computational cost and due to the uncertainties that become prohibitive when the scenario becomes larger."[12] Generally, complete 3D SLAM solutions are highly computationally intensive as they use complex real-time particle filters, sub-mapping strategies or hierarchical combination of metric topological representations, etc. (Wiki) • Computational costs become prohibitive on embedded systems, especially smaller robotic modules. The data becomes large and the calculations and corrections over time and space become much more important. Specifically, SlaM increases exponentially with the number of landmarks found. • The state uncertainty increase with time and space, and must be bounded by some form of machine learning to predict and use accurate corrections in the algorithm • Additional sensors, rapid movements, processing visual input adds additional computational burdens…
  • 11. IBM Spark Why SLAM on IoT? 11 The Benefits •Seamless integration and scaling allowing users to easily improve the heuristics of the algorithm without losing any of the performance expectations of an embedded system. •Including smart cities, lawn mowing, dog walking, kitchen appliances, or even communication inside the human body creating a truly unique interaction between humans and robotics •Large scale evaluation of performance metrics for all IoT systems (Big Data) •Monitoring and control of sensors based on stored data (eg reducing sensor usage to conserve power)
  • 12. IBM Spark Why SLaM on IoT? 12 Current Approaches • Robot Operating System (ROS) a collection of software frameworks for robot software development • Providing operating system-like functionality on a heterogeneous computer cluster. • Hardware abstraction, low-level device control, implementation of commonly used functionality, message-passing between processes, and package management. • No true real-time analytics! Despite the importance of reactivity and low latency in robot control, ROS is not a Realtime OS • Difficult to scale in IoT! Adding a heterogenous swarm, or integrating interactions requires significant planning. • There is a need! Are there any plans to build Kalman filtering and system identification into this framework? https://github.com/sryza/spark-timeseries/issues/19 • We need a framework that can do this! Enter Apache Kafka and Spark Streaming!
  • 13. IBM Spark The Framework 13 The Approach •Extended Kalman Filter (matrix based update/estimation) •Nonlinear version of the Kalman filter which linearizes about an estimate of the current mean and covariance. de facto standard in the theory of nonlinear state estimation eg navigation systems and GPS. (wiki) •TurtleBot (standard robotics research bot) •Gazebo Simulator (3D simulator with sensors input and feedback)
  • 16. The Framework: Our cluster: IBM SoftLayer cluster with 3 nodes.
  • 17. The Framework: IBM SoftLayer cluster with 3 nodes. • Node 1 (Management Node): Apache Kafka (multithreaded producers, each assigned a sensor); Simulator/Sensor Data Mapping Agent. • Node 2 (Hadoop/Spark): Spark Streaming Consumer / Apache Kafka Producer to Simulator; Spark Streaming; Spark ML/Analytics. • Node 3 (Hadoop/Spark): Spark Streaming Consumer / Apache Kafka Producer to Simulator; Spark Streaming; Spark ML/Analytics.
  • 18. The Framework: Diagram labels: Apache Kafka; Spark Streaming; Spark ML/Analytics and Computation; Apache Kafka; Simulated TurtleBot. • Odometry, pose, and orientation data for every movement. • Laser scan data every 30 ms with over 1,200 data points per read! • One robot, and not even all the sensors! A high-performing plug-and-play cloud for smart robotics, drones, and intelligent systems that allows easily tunable interactions for scientists and industry in any environment!
  • 19. The Framework: A high-performing plug-and-play cloud for smart robotics, drones, and intelligent systems that allows easily tunable interactions for scientists and industry in any environment! • EKF is calculated primarily using matrix operations! • Raw sensor data is distributed using Apache Kafka. The number of sensors is limited only by the Kafka cluster! • Improved performance using RDDs and Spark ML for computationally intensive tasks! • Fast/optimized learning and analytics! • Real-time sensor messaging! • Easy sensor integration and scaling! • Retention of data over time for improved optimizations and accuracy!
  • 20. The Framework: Apache Kafka. Kafka Integration • Multithreaded producers for easy scaling and hardware timing. • Apache Kafka Java API backed by a thread pool to handle concurrency. • Allows shared instances of a producer. • Large-scale sensor distributions can be partitioned for easier analysis and significantly lower latency.
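The producer pattern above (one task per sensor, a thread pool, a shared producer instance, key-based partitioning) can be sketched without a Kafka cluster. The block below is a stand-in, not the Kafka client and not the talk's Java code. `InMemoryProducer`, `publish_sensor`, and `run_producers` are hypothetical names, and CRC32 only imitates the idea of hashing a key to a partition (Kafka's own partitioner uses a different hash).

```python
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor

class InMemoryProducer:
    """Hypothetical stand-in for a shared, thread-safe producer."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self._lock = threading.Lock()

    def send(self, key, value):
        # Same key always maps to the same partition, so one sensor's
        # readings stay together and keep their order.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        with self._lock:
            self.partitions[index].append((key, value))
        return index

def publish_sensor(producer, sensor_id, readings):
    """Task run by one pool thread: publish every reading of one sensor."""
    for seq, reading in enumerate(readings):
        producer.send(sensor_id, (seq, reading))

def run_producers(sensor_data, num_partitions=3):
    """Start one publishing task per sensor, all sharing one producer."""
    producer = InMemoryProducer(num_partitions)
    with ThreadPoolExecutor(max_workers=len(sensor_data)) as pool:
        futures = [pool.submit(publish_sensor, producer, sensor_id, readings)
                   for sensor_id, readings in sensor_data.items()]
        for future in futures:
            future.result()  # re-raise any exception from a worker thread
    return producer
```

Keying messages by sensor ID is one possible choice. It keeps each sensor's stream ordered within a partition, which is convenient when a downstream consumer processes partitions independently.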
  • 21. The Framework: Apache Kafka
  • 22. The Framework: Spark Streaming. Spark Streaming Integration (Apache Spark Streaming / Apache Kafka Consumer) • Replaces Kafka Consumer • Producer feeds directly to Spark Streaming • Adheres to fault-tolerance policies, incl. WAL (write-ahead logs to HDFS) • Not necessarily thread safe (Java API) • KafkaUtils.createDirectStream: direct, without receivers in the new version; better access to low-level Kafka metadata • Auto-commit feature, partition replication, integration with ZooKeeper • Finely tuned metadata access and storage by topic and partition • Microbatch processing and better integration into Spark, incl. online learning • Buffered batches, developing streaming analytics capabilities
  • 23. The Framework: Spark Streaming
  • 24. The Framework: Spark ML, RANSAC. Spark ML with RANSAC • RANSAC: one of many iterative methods for estimating the parameters of a mathematical model from a set of observed data that contains outliers. • The default methodology for determining whether a series of landmarks forms a wall or structure. • Ideal for consumption with high-throughput batches in Spark Streaming! • Integrated (in this framework) as an online learning algorithm, running as a back-end iterative process in Spark Streaming/Spark!
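As a rough illustration of how RANSAC decides whether a set of landmark points forms a wall, here is a minimal 2D line-fitting sketch in plain Python. It is not the talk's Spark implementation. The function names, the default thresholds, and the idea of running it once per batch of points are assumptions made for the example.

```python
import math
import random

def fit_line(p, q):
    """Line through two points as (a, b, c) with a*x + b*y + c = 0, a^2 + b^2 = 1."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = math.hypot(a, b)
    if norm == 0:
        return None  # identical points do not define a line
    a, b = a / norm, b / norm
    return a, b, -(a * x1 + b * y1)

def ransac_line(points, iterations=200, threshold=0.05, min_inliers=10, rng=None):
    """Return (line, inliers) for the best-supported line, or (None, [])."""
    rng = rng or random.Random(0)
    best_line, best_inliers = None, []
    for _ in range(iterations):
        # 1. Hypothesize a model from a minimal random sample (two points).
        line = fit_line(*rng.sample(points, 2))
        if line is None:
            continue
        a, b, c = line
        # 2. Score it by counting points within `threshold` of the line.
        inliers = [pt for pt in points
                   if abs(a * pt[0] + b * pt[1] + c) <= threshold]
        # 3. Keep the hypothesis with the most support.
        if len(inliers) > len(best_inliers):
            best_line, best_inliers = line, inliers
    if len(best_inliers) < min_inliers:
        return None, []  # not enough support to call it a wall
    return best_line, best_inliers
```

Each iteration is independent of the others, which is part of why the method fits batch-oriented processing: the hypotheses for one batch of scan points can be evaluated in any order.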
  • 26. The Results: Key Challenges • Network latency • Embedded vs. framework • Matrix computations and updates to large matrices • Jacobian (derivatives), inversion, transposition, multiplication, addition/subtraction, Gaussian • Covariance/estimation computations • Coordinating movement with computation • Using Spark ML to correctly interpret visual landmark data while minimizing errors
  • 27. The Results: Challenges • ~4 KLOC (Java != verbose) • Java lambda documentation • Kafka topics from the Spark Streaming consumer • Real-life latency depends on the type of connection and creates additional noise • Matrix computation • Defining heuristics • The communicator to the simulator needs a solid class
  • 28. The Results: Measuring network latency in artificially throttled I/O simulators. Timing was kept static to measure real delays in the messages over the cluster and between the simulator against file I/O. [Figure: PERF1 (with simulator) vs. PERF2 (file I/O); panels at 10 and 200 iterations]
  • 29. The Results: Measuring landmark acquisition and CPU time, embedded vs. framework, at 500 iterations.
  • 30. The Results: Measuring landmark acquisition and CPU time, embedded vs. framework, at 500 iterations. The framework completed 500 iterations with the expected exponential growth. The embedded version failed to complete 500 iterations (it ran up to ~300).
  • 31. The Results: Measuring landmark acquisition and CPU time, embedded vs. framework, for a complete map. Both installations were run until the number of landmarks/maps was roughly equivalent, and the iterations were marked. [Figure panels: Iterations: ~100, Time: ~2 min; Iterations: ~100, Time: ~30-40 s]
  • 32. The Results: Forthcoming benchmarks. [Figure panels: Iterations: ~100; Iterations: ~20] • Apache Kafka latency to brokers • RANSAC convergence of Spark Streaming batches • Spark Streaming batch processing throughput in relation to processing time
  • 33. The Results: Performance Tuning and Optimization • Sparse and distributed matrices in Spark ML • Optimize matrix computations (EKF) • Separate threads for Apache Kafka producers • Spark Streaming batches timed to sensor input cycles to avoid heavy loads and misaligned updates (this could also be tuned using device profiles) • Slower movement/reduced data points to synchronize calculations with movement and discovery • Rapid movements produce larger RDDs; create new RDDs and matrices for updates using existing heuristics, since updates can sometimes create bottlenecks • Standard Spark performance tuning: CPU core maximization and executors • *Scheduled feature extraction to minimize accumulated error in long runs • *New parameters/large skew from ground truth should trigger updates
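One way to read "batches timed to sensor input cycles" is to pick a batch interval that is a whole multiple of the sensor period, so that no scan is split across two batches. The sketch below shows that arithmetic in plain Python. It does not use Spark, and the helper names plus the 100 ms target are assumptions for illustration (the 30 ms period is the laser scan rate mentioned earlier in the deck).

```python
from collections import defaultdict

def batch_interval_ms(sensor_period_ms, target_ms):
    """Smallest whole multiple of the sensor period that is >= target_ms."""
    cycles = max(1, -(-target_ms // sensor_period_ms))  # ceiling division
    return cycles * sensor_period_ms

def micro_batches(readings, interval_ms):
    """Group (timestamp_ms, value) readings into consecutive time windows."""
    batches = defaultdict(list)
    for timestamp_ms, value in readings:
        batches[timestamp_ms // interval_ms].append(value)
    return [batches[window] for window in sorted(batches)]

# Laser scans every 30 ms, with a desired batch length of roughly 100 ms.
interval = batch_interval_ms(sensor_period_ms=30, target_ms=100)
scans = [(t, f"scan@{t}") for t in range(0, 240, 30)]
aligned = micro_batches(scans, interval)
```

With an aligned interval, every batch carries the same number of scans, which keeps the per-batch matrix updates the same size from one batch to the next.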
  • 34. Next Steps • Expanded stochastic analysis beyond gradient descent • Kalman Filter and Extended Kalman Filter • Improving accuracy and precision with an end-to-end pipeline that allows customization/optimization • Path planning algorithms to improve search and search times • Incorporate swarms/particles • A complete robotics library, or even an extension to handle robotics, computer vision, or any of the AI/machine learning problems specific to robotics, is publishable and opens the door to a whole new group of scientists. • Further scaling and optimization with robotic swarms and rapid/increased-volume sensor data
  • 35. Conclusion: IBM IoT Cloud Open Platform for Industries IBM Bluemix IoT Zone IBM IoT Ecosystem More to come!
  • 37. Q & A: Contact Information: J. White Bear (jwhiteb@us.ibm.com), IBM Spark Technology Center, 425 Market St, San Francisco, CA. Special thanks to IBM and the IBM Spark team at the Spark Technology Center for your input, taking time to discuss, and allowing me time to work on this project: Sampada Basakar, Vijay Bommireddipalli, Fred Reiss, Luciano Resende.

Editor's Notes

  • #3: http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/system360/breakthroughs/ http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/linux/breakthroughs/
  • #7: http://ici.radio-canada.ca/regions/ontario/2016/06/14/016-robotique-chercheurs-sudbury-robots-drones.shtml
  • #9: Increase in error over time, and readjusting for it. Identifying landmarks.
  • #10: Not so simple after all. Actually, it's very computationally challenging, which is why we decided to move things to the cloud.
  • #13: ROS is great, but can you really fine-tune your parameters and ML algorithms? Is it easily portable and integrated with the next generation of robots that are going to need real-time processing and fast analytics spanning robots and sensors over time?
  • #14: Unlike its linear counterpart, the extended Kalman filter in general is not an optimal estimator (of course it is optimal if the measurement and the state transition model are both linear, as in that case the extended Kalman filter is identical to the regular one). In addition, if the initial estimate of the state is wrong, or if the process is modeled incorrectly, the filter may quickly diverge, owing to its linearization. Another problem with the extended Kalman filter is that the estimated covariance matrix tends to underestimate the true covariance matrix and therefore risks becoming inconsistent in the statistical sense without the addition of "stabilising noise". Having stated this, the extended Kalman filter can give reasonable performance, and is arguably the de facto standard in navigation systems and GPS.
  • #15: A simple graph, but what this looks like in code is quite different. Updating the main H matrix alone is the largest computation in both size and CPU usage. It holds all the state and landmark data and must be updated based on all the corresponding matrices. The code for this one is large, but it doesn't have to be: building this alone as a library would cut over 1,000 lines of code. Bayesian inference and estimating a joint probability distribution over the variables for each timeframe.
  • #17: Standard architecture; add IBM Ambari, etc.
  • #19: This is clearly a problem that announces itself in the big data space
  • #23: Kafka streaming
  • #37: The takeaway is that IBM is already in the IoT space and preparing for the next generation of smart cities. Our continued open source innovation is a part of that.
  • #38: The takeaway is that IBM is already in the IoT space and preparing for the next generation of smart cities. Our continued open source innovation is a part of that.