Train++: An Incremental ML Model Training
Algorithm to Create Self-Learning IoT
Devices
Bharath Sudharsan, Piyush Yadav, John G. Breslin,
Muhammad Intizar Ali
Introduction
To improve inference accuracy, edge devices log unseen data in the cloud. The initial model is re-trained there, then sent back as an OTA update
✓ Edge device hardware cost increases due to the added wireless module
✓ Increases cyber-security risks and power consumption
✓ Not self-contained ubiquitous systems, since they depend on cloud services for inference and re-training
✓ Latency, privacy, and other concerns
A typical IoT scenario: Edge devices (endpoints) depend on the cloud for inference and model updates
Introduction
▪ In the real-world, every new scene generates a fresh, unseen data pattern
▪ When the model deployed on edge devices sees such fresh patterns, it will
either not know how to react to that specific scenario or produce false or
less accurate results
▪ A model trained using data from one context will not produce the expected
results when deployed in another context
▪ It is not feasible to train multiple models for multiple environments and
contexts
[Figure: IoT edge devices exposed to new scenes/environments. A mobile edge device achieves 92 % accuracy in Scene A (input: 0 - 40 C, 5 - 400 Lux, 1.1 - 1.8 Bar) but only 54 % in Scene B (input: -10 - 1 C, 1 - 10 Lux, 2 - 4 Bar).]
Background
▪ To enable edge devices to learn offline after deployment, the ML model deployed and running on the edge
devices needs to be re-trained or updated at the device level
Top ML frameworks for running ML models on MCUs; none of them yet supports training models on MCUs
▪ ML frameworks like TensorFlow Micro, FeatherCNN, Open-NN, etc. do not yet enable training models directly on
MCUs, small CPUs and IoT FPGAs
▪ Currently, models are trained on GPUs, then deployed on IoT edge devices after deep compression/optimization
Background
▪ Bonsai: a non-linear tree-based classifier designed to solve traditional ML problems with 2 KB-sized models
▪ EMI-RNN: speeds up RNN inference by up to 72x compared to traditional implementations
▪ FastRNN & FastGRNN: can be used instead of LSTMs and GRUs; 35x smaller and faster, with sizes under 10 KB
▪ SeeDot: a floating-point to fixed-point quantization tool, including a new language and compiler
Microsoft EdgeML provides algorithms and tools to enable ML inference on edge devices
On Device ML Model Training
▪ Numerous papers, libraries, algorithms, and tools
exist to enable self-learning and re-training of
ML models on better-resourced devices like SBCs
✓ This is due to ML framework support, i.e., TF Lite
can run on SBCs but not on MCUs
▪ SEFR paper: a classification algorithm specifically
designed to perform both training and testing on
ultra-low-power devices
▪ ML in Embedded Sensor Systems paper: enables
real-time data analytics and continuous training
and re-training of the ML algorithm
[Figure: Powerful CPU + basic GPU based SBCs (single board computers), versus the MCUs and small CPUs used to design billions of IoT devices.]
Research Question
▪ After real-world deployment, can the highly resource-constrained MCU-based IoT devices
autonomously (offline) improve their intelligence by performing onboard re-training/updating
(self-learning) of the local ML model using the live IoT use-case data?
Create self-learning HVACs for
superior thermal comfort
▪ HVAC controllers are resource-constrained IoT edge
devices, for example, Google Nest, Ecobee, etc.
▪ Every building/infrastructure has differences. A
standard, one-size-fits-all strategy fails to provide a
superior level of thermal comfort for people
▪ The proposed algorithms can make HVAC controllers
learn the best strategy (offline) to perform tailored
control of the HVAC unit for any building type
▪ This eliminates the need to find and set distinct HVAC
control strategies for each building
Use case 1: Self-learning HVACs
▪ Data required for most research is sensitive in
nature, as it revolves around private individuals
▪ Resource-constrained medical devices like insulin-
delivery devices, electric steam sterilizers, BP
apparatus, etc., can use proposed algorithms to
train offline using the sensitive data
▪ Transmit models instead of the actual medical data
Use case 2: Sensitive Medical Data
Providing sensitive medical data for research
▪ The proposed algorithms can train models on MCUs. Thus,
even tiny IoT devices can self-learn data patterns
✓ Learn usual residential electricity consumption
patterns and raise alerts in the event of unusual
usage or overconsumption, thus reducing bills,
detecting leaks, etc.
✓ Learn by monitoring the contextual sensor data
corresponding to regular vibration patterns from the
pump's crosshead, cylinder and frame
✓ Generate alerts using the learned knowledge if
anomaly patterns are predicted or detected
Use case 3: Learning any Patterns
[Figure: Self-learning the data patterns. A "Train++ AI inside" smart meter learns to detect leaks and monitor efficiency & consumption; a "Train++ AI inside" pump controller learns to monitor vibrations on the pump's crosshead, cylinder and frame, and generates alerts if anomaly patterns are detected.]
Train++ Incremental Training Algorithm
▪ Train++ is designed for MCUs to perform on-board binary classifier
model training and inference. The Train++ algorithm's main highlights/benefits:
✓ Dataset size: The incremental training characteristic enables training
within limited Flash and SRAM footprints while also allowing use of the
full n samples of high-dimensional datasets
✓ Training speed and energy: Achieves significant energy
conservation due to its high-performance characteristics, i.e., it
trains and infers at much higher speeds than other methods
✓ Performance: Train++ trained classifiers show classification
accuracy close to that of Python scikit-learn classifiers trained
on high-resource CPUs
[Figure: Benefits of training and inference on MCUs using Train++: high accuracy and precision, fast training and inference, learning from large datasets, significant energy conservation.]
Train++: Pipeline
▪ Four steps to use our Train++ algorithm to enable edge
MCUs to self-train offline for any IoT use case
✓ Select the MCU of choice depending on the target use case
✓ Contextual sensor data corresponding to normal working
is collected as the local dataset and used for training.
The labels/ground truth for the collected training data rows
are computed using our method from Edge2Train
✓ Edge MCUs learn/train using our core Train++ algorithm
✓ The resultant MCU-trained models can start inference on
previously unseen data
Train++: Core Algorithm
▪ The full Train++ algorithm is shown. Its characteristics:
✓ Highly resource-friendly, low-complexity design
✓ Implementation is less than 100 lines of C++ (a minimal sketch follows this list)
✓ Does not depend on external or third-party libraries
✓ Executable even by 8-bit MCUs without FPU, KPU, or APU support
✓ Incremental data loading and training characteristics
✓ Compatible with datasets that have high feature dimensions
✓ Keeps reading live data -> incrementally learns from it ->
discards the data (no storage required)
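A minimal sketch of this incremental loop, for illustration only: it assumes a passive-aggressive-style update consistent with the equations detailed on the following slides, and the names (train_one, predict, N_FEATURES, C_PARAM) are illustrative, not from the Train++ source.

```cpp
// Minimal sketch of an incremental binary-classifier loop for MCUs.
// Assumes a passive-aggressive-style update (see the following slides);
// NOT the authors' exact implementation.
#include <cstddef>

constexpr std::size_t N_FEATURES = 4;  // e.g., the Iris dataset has 4 features
constexpr float C_PARAM = 0.5f;        // aggressiveness parameter C

static float w[N_FEATURES] = {0.0f};   // weight vector, easily fits in SRAM

// Learn incrementally from one labelled sample (y is -1 or +1); the caller
// can discard the sample afterwards -- no dataset is stored on the device.
void train_one(const float x[N_FEATURES], float y) {
    float dot = 0.0f, norm2 = 0.0f;
    for (std::size_t i = 0; i < N_FEATURES; ++i) {
        dot   += x[i] * w[i];
        norm2 += x[i] * x[i];
    }
    const float loss = 1.0f - y * dot;   // hinge loss l_t
    if (loss <= 0.0f) return;            // margin >= 1: no update needed
    const float tau = loss / (norm2 + 1.0f / (2.0f * C_PARAM));  // step size
    for (std::size_t i = 0; i < N_FEATURES; ++i)
        w[i] += tau * y * x[i];          // w_{t+1} = w_t + tau * y_t * x_t
}

// Inference: sign(x . w); the magnitude |x . w| is the confidence score.
int predict(const float x[N_FEATURES]) {
    float dot = 0.0f;
    for (std::size_t i = 0; i < N_FEATURES; ++i) dot += x[i] * w[i];
    return (dot >= 0.0f) ? +1 : -1;
}
```

On an Arduino-style MCU, train_one() would be called on each freshly read and labelled sensor sample, which is then discarded, so SRAM usage stays constant regardless of dataset size.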
Train++: Core Algorithm
▪ Input and output of the Train++ core algorithm. Eqn (1) defines the local dataset, which is based on the IoT use-case data
✓ X holds the dataset rows that contain the features of the input data samples
✓ Y holds the corresponding labels
✓ The time dimension (t) is considered indefinite, as real-time sensor data keeps arriving with indefinite length
First part: Dataset, Inputs and Output for Train++
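Eqn (1) appears only as an image in the original slide; a plausible form, consistent with the description above, is

$$ (x_t, y_t) \in \mathbb{R}^n \times \{-1, +1\}, \qquad t = 1, 2, 3, \ldots \qquad (1) $$

where each x_t carries the n features of a sample, y_t is its label, and t is unbounded because sensor data keeps arriving.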
Train++: Core Algorithm
Second part
▪ Initially the algorithm infers using a binary classification
function that updates from round to round: with the vector of
weights w ∈ R^n, the prediction takes the sign (x . w) form
✓ The magnitude |x . w| is the confidence score of the
prediction
✓ wt is the weight vector that Train++ uses on round t
✓ yt (xt . wt) is the margin obtained at round t
▪ Whenever the algorithm makes a correct prediction, sign (xt . wt) = yt
✓ After a correct prediction, we instruct Train++ to keep
predicting with a higher confidence score
✓ Hence, the goal becomes to achieve a margin of at least 1,
as frequently as possible
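In symbols (a reconstruction in the slide's own notation), the prediction, confidence and margin are

$$ \hat{y}_t = \operatorname{sign}(x_t \cdot w_t), \qquad \text{confidence} = \lvert x_t \cdot w_t \rvert, \qquad \text{margin} = y_t\,(x_t \cdot w_t) $$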
Train++: Core Algorithm
Third part
▪ Whenever yt (xt . wt) < 1, the hinge loss makes Train++ suffer
an instantaneous loss
✓ If the margin is at least 1, the loss is zero
✓ Else, it is 1 minus the margin
✓ We now require an update rule to modify the
weight vector in each round
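Written out, this is the standard hinge loss, matching the description above:

$$ l\big(w; (x_t, y_t)\big) = \begin{cases} 0, & y_t\,(x_t \cdot w) \ge 1 \\ 1 - y_t\,(x_t \cdot w), & \text{otherwise} \end{cases} $$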
Train++: Core Algorithm
Fourth part
▪ Update rule to modify the weight vector in each round. ξ is a slack variable; C is
the parameter that controls the influence of ξ
✓ Whenever a correct prediction occurs, the loss function is 0
✓ The argmin is then wt itself, hence the Train++ algorithm becomes permissive
✓ Whereas on rounds when misclassifications occur, the loss is positive and
Train++ offensively forces wt+1 to satisfy the constraint l (w; (xt , yt)) = 0
✓ Larger C values produce stronger offensiveness, which might increase
the risk of destabilization when the input data is noisy
✓ Lower C values improve adaptiveness
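The constrained update appears only as an image in the original slide; a plausible form with the slack variable ξ and parameter C as described, matching the soft-margin passive-aggressive formulation, is

$$ w_{t+1} = \operatorname*{arg\,min}_{w \in \mathbb{R}^n} \ \tfrac{1}{2}\,\lVert w - w_t \rVert^2 + C\,\xi \quad \text{s.t.} \quad l\big(w; (x_t, y_t)\big) \le \xi, \quad \xi \ge 0 $$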
Train++: Core Algorithm
Fourth part
▪ wt+1 is kept close to wt to retain the information learned in previous rounds. The
update rule in its simple closed form is wt+1 = wt + τt yt xt. Substituting τt, then substituting the loss
lt, we get Eqn (4)
✓ This update rule meets our expectations, since the weight vector is updated
with a value whose sign is determined by yt (and magnitude proportional to the error)
✓ During correct classification, the numerator of Eqn (4) becomes 0, so wt+1 = wt
✓ During misclassification, the weight vector moves towards yt xt
✓ After this movement, the dot product xt . wt+1 shifts towards the sign of yt,
so the input is subsequently classified correctly
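Eqn (4) appears only as an image in the original slide; a plausible reconstruction of the closed form described above, coinciding with the PA-II-style step size, is

$$ w_{t+1} = w_t + \tau_t\, y_t\, x_t, \qquad \tau_t = \frac{\max\{0,\ 1 - y_t\,(x_t \cdot w_t)\}}{\lVert x_t \rVert^2 + \frac{1}{2C}} \qquad (4) $$

With this form, a correct classification with margin ≥ 1 gives a zero numerator, so wt+1 = wt, exactly as stated above.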
Train++ Evaluation Setup
Datasets and MCUs used for Train++ evaluation
▪ Using Train++, for datasets D1-D7, we train a binary classifier on MCUs 1-5. This extensive
experimental evaluation across multiple datasets and MCUs aims to answer:
✓ Is Train++ compatible with different MCU boards, and can it train ML models on MCUs using various
datasets with various feature dimensions and sizes?
✓ Can Train++ load, train, and infer using high-feature-count, large datasets on limited-memory MCU
boards that have low hardware specifications and no FPU, APU, or KPU support?
Results: Train Time
▪ Training models on MCUs using Train++: comparing training set size vs
training time and accuracy for the selected datasets
✓ We show results only for MCUs 2 and 5, as the other MCUs trained within 2 ms
✓ The gap in the y-axis is the difference in train time between datasets
✓ Train time grows almost linearly with samples for all datasets
✓ For the Iris dataset, MCU2 took only 4 ms to train on 105 samples
✓ It took 83 ms to train on the Digits dataset
✓ MCU5 is the slowest since it only has a 48 MHz clock and no
FPU support. Hence it took 12 ms to train on 105 samples of the
Iris dataset (3x slower than MCU2) and 304 ms to train on
the Digits dataset (3.6x slower than MCU2)
Results: Infer Time and Memory
▪ Memory: Flash and SRAM consumption is calculated during compilation by the IDE
✓ For MCU1, the Iris dataset and Train++ in total used only 4.1 % of Flash and 3.4 % of SRAM. For
Digits, the same MCU1 requires 6.54 % and 29.93 %, respectively
✓ When using Edge2Train, for MCUs 2 and 5, we cannot train using the Digits dataset because
SRAM overflowed by +52.0 kB and +63.62 kB. Similarly, for Heart Disease, both the Flash and
SRAM requirements exceed the MCU's capacity
✓ Even where Edge2Train overflowed on MCU2, Train++'s incremental training characteristic could
load the dataset incrementally and complete training in 83 ms
✓ Similarly, even where Edge2Train overflowed on MCU5, Train++ was able to load the entire dataset
incrementally and train in 304 ms
▪ Inference time: For the selected datasets, on MCUs 1-5, the Train++ method infers for the entire test
set in less time than the Edge2Train model's single-inference time
Results: Accuracy
▪ Accuracy: The highest onboard classification accuracy is 97.33 % for Iris (D1), 82.08 % for Heart Disease (D2),
85.0 % for Breast Cancer (D3), and 98.0 % for Digits (D4)
▪ Accuracy comparison: comparing the accuracy of Train++ trained models with Edge2Train trained models
✓ For the Iris dataset, accuracy improved by 7.3 %, and by 5.15 % for the Digits dataset. This
improvement is because our training algorithm enabled incremental loading of the full dataset
✓ Other algorithms like SVMs work in batch mode, requiring the full training data to be available in the limited
MCU memory, which sets an upper bound on the training set size
✓ Train++ trained models achieved overall improved accuracy compared to the Edge2Train models, which
were trained with limited data (unable to load the full dataset due to memory constraints)
Results: Edge2Train vs Train++
▪ From the shown figure (y-axis in base-10 log scale), it is apparent
that Train++ consumed
✓ 34,000 - 65,000x less energy to train
✓ 34 - 66x less energy for unit inference
▪ For a given task completed using the same datasets on the same
MCUs, Train++ achieved such significant energy conservation due to
its high-performance characteristics (i.e., it trained and inferred
at much higher speeds)
▪ When Train++ is used, MCUs can perform onboard model training
and inference at the lowest power costs, thus enabling offline
learning and model inference without affecting the IoT edge
application routine and operating time of battery-powered devices
[Figure: Edge2Train vs Train++]
Demo
▪ Incremental, ultra-fast Train++: https://github.com/bharathsudharsan/Train_plus_plus
Train++ Summary
▪ Summary of Train++ benefits
✓ The proposed method reduces the onboard binary classifier training time by ≈ 10 - 226 sec across
various commodity MCUs
✓ Train++ can infer on MCUs for the entire test set in real time, within 1 ms
✓ Accuracy improved by 5.15 - 7.3 % since the incremental characteristic of Train++ enabled
loading of the full n samples of high-dimensional datasets, even on MCUs with only a few hundred kB
of memory
✓ Train++ is compatible with various MCU boards and multiple datasets
✓ Across various MCUs, Train++ consumed ≈ 34,000 - 65,000x less energy to train and
≈ 34 - 66x less energy for unit inference
Contact: Bharath Sudharsan
Email: bharath.sudharsan@insight-centre.org
www.confirm.ie