Feature Engineering for IoT
Darryl Ng
#ISSLearningFest
Rise of IoT
https://www.statista.com/statistics/1183457/iot-connected-devices-worldwide/
IoT Reference Architecture
https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/iot
Sense → Connect → Collect → Process → Act
Devices generate events
• Sent through the platform to the application
Insights based on data
• Derived by evaluating incoming device events
Actions based on insights
• Execute processes and workflows in the application
DATA
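The event, insight, action flow above can be sketched in a few lines. This is only an illustration: the `DeviceEvent` fields, the 30.0 limit, and the action strings are hypothetical and do not come from the reference architecture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceEvent:
    device_id: str      # hypothetical device identifier
    temperature: float  # hypothetical sensor reading


def derive_insight(event: DeviceEvent, limit: float = 30.0) -> Optional[str]:
    """Evaluate an incoming device event and return an insight, if any."""
    return "overheating" if event.temperature > limit else None


def act(event: DeviceEvent, insight: Optional[str]) -> str:
    """Turn an insight into an application-level action."""
    if insight == "overheating":
        return f"open maintenance ticket for {event.device_id}"
    return "no action"


# Devices generate events; the application evaluates them and acts.
events = [DeviceEvent("sensor-1", 24.5), DeviceEvent("sensor-2", 31.2)]
actions = [act(e, derive_insight(e)) for e in events]
```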
IoT and Cloud Providers
1. Capabilities added to the devices
a. Device side processing
• Real-time analytics, edge ML capabilities
2. Gateway to communicate with downstream, heterogeneous devices
3. Cloud services
a. Device management capabilities, e.g. device shadowing, provisioning, OTA updates, security
b. Stream processing
c. Big data stack
• Analytics and visualization
Cloud-centric vs. Device/Gateway-centric
Handling Data
Volume, Velocity, Variety, Veracity, Value
• Data reduction
• Data transformation
• Data integration
• Data cleaning
• Data discretization
Data Collection
• Data collection can be a significant effort
in machine learning
• Types of Data
• Historical Data (e.g. past weather)
• Generated data (e.g. weather from sensors)
• Manually collected data (e.g. observations or visual
inputs at different times of the day)
• Collect data to infer its probability
distribution
• Generate more data from the probability
distribution
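The last two bullets can be sketched as follows. This is a minimal illustration that assumes the readings are roughly Gaussian, which the slides do not state; the values are the first ten temp_max readings from the sample table used later in the deck.

```python
import random
import statistics

# First ten temp_max readings from the sample weather table.
temp_max = [12.8, 10.6, 11.7, 12.2, 8.9, 4.4, 7.2, 10.0, 9.4, 6.1]

# Step 1: infer the distribution from the collected data.
# Assumption: a Gaussian described by the sample mean and standard deviation.
mu = statistics.mean(temp_max)
sigma = statistics.stdev(temp_max)

# Step 2: generate more data from the fitted distribution.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
synthetic = [rng.gauss(mu, sigma) for _ in range(1000)]
```

Whether a Gaussian is a reasonable fit should be checked against the real data (for example with a histogram) before synthetic samples are used for training.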
Feature Engineering
• Extracting features from raw data and transforming them into a form
that a machine learning algorithm can use to learn a model
• Accuracy of a machine learning model depends on the quality of the data
used for learning
• Good Features => Model learns quickly
• Bad Features => Model doesn’t learn
Features, Samples and Label
date precipitation temp_max temp_min wind weather
1/1/2012 0 12.8 5 4.7 drizzle
1/2/2012 10.9 10.6 2.8 4.5 rain
1/3/2012 0.8 11.7 7.2 2.3 rain
1/4/2012 20.3 12.2 5.6 4.7 rain
1/5/2012 1.3 8.9 2.8 6.1 rain
1/6/2012 2.5 4.4 2.2 2.2 rain
1/7/2012 0 7.2 2.8 2.3 rain
1/8/2012 0 10 2.8 2 sun
1/9/2012 4.3 9.4 5 3.4 rain
1/10/2012 1 6.1 0.6 3.4 rain
1/11/2012 0 6.1 -1.1 5.1 sun
1/12/2012 0 6.1 -1.7 1.9 sun
1/13/2012 0 5 -2.8 1.3 sun
1/14/2012 0 16.1 1.7 4.3 sun
1/15/2012 0 21.1 7.2 4.1 sun
1/16/2012 0 20 6.1 2.1 sun
1/17/2012 0 14.4 3.9 3 sun
1/18/2012 0 18.3 4.4 4.3 sun
1/19/2012 0 25.6 12.8 2.2 drizzle
1/20/2012 0 18.9 13.9 2.8 drizzle
1/21/2012 0 22.2 13.3 1.7 drizzle
• Sample: one row of the table
• Features: the measurement columns
• Label: the weather column
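A small sketch of how samples, features and label map onto the table. It assumes the four measurement columns are the features and `weather` is the label; the `Sample` type and the names `X` and `y` are illustrative choices, not part of the slides.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    date: str
    precipitation: float
    temp_max: float
    temp_min: float
    wind: float
    weather: str  # the label


# Three rows (samples) copied from the table.
rows = [
    Sample("1/1/2012", 0.0, 12.8, 5.0, 4.7, "drizzle"),
    Sample("1/2/2012", 10.9, 10.6, 2.8, 4.5, "rain"),
    Sample("1/8/2012", 0.0, 10.0, 2.8, 2.0, "sun"),
]

# Assumption: date is kept as an identifier rather than used as a feature.
FEATURE_NAMES = ("precipitation", "temp_max", "temp_min", "wind")

X = [[getattr(r, name) for name in FEATURE_NAMES] for r in rows]  # features
y = [r.weather for r in rows]                                     # labels
```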
Imputation
Categories of missing data:
1. Missing at Random (MAR)
• Whether a value is missing depends only on other observed data,
so the gap can be estimated from those observations.
2. Missing Completely at Random (MCAR)
• No relationship exists between the missing values and
any other observations.
3. Missing Not at Random (MNAR)
• There is a reason why the values are missing, and the affected
records should be flagged.
Imputation depends on the data type:
• Numerical
• Categorical
Before imputation:
date precipitation temp_max temp_min wind
1/1/2012 0 12.8 5 4.7
1/2/2012 10.9 10.6 2.8 4.5
1/3/2012 0.8 11.7 7.2 2.3
1/4/2012 20.3 12.2 5.6 4.7
1/5/2012 1.3 8.9 2.8 6.1
1/6/2012 2.5 4.4 2.2 2.2
1/7/2012 0 7.2 2.8 2.3
1/8/2012 0 10 2.8 2
1/9/2012 4.3 9.4 5 3.4
1/10/2012 1 6.1 0.6 3.4
1/11/2012 0 6.1 -1.1 5.1
1/12/2012 0 6.1 -1.7 1.9
1/13/2012 0 5 -2.8 1.3
1/14/2012 0 16.1 1.7 4.3
1/15/2012 0 21.1 7.2 4.1
1/16/2012 (missing) 20 6.1 2.1
1/17/2012 (missing) 14.4 3.9 3
1/18/2012 (missing) 18.3 4.4 4.3
1/19/2012 0 25.6 12.8 2.2
1/20/2012 0 18.9 13.9 2.8
1/21/2012 0 22.2 13.3 1.7
After imputation:
date precipitation temp_max temp_min wind
1/1/2012 0 12.8 5 4.7
1/2/2012 10.9 10.6 2.8 4.5
1/3/2012 0.8 11.7 7.2 2.3
1/4/2012 20.3 12.2 5.6 4.7
1/5/2012 1.3 8.9 2.8 6.1
1/6/2012 2.5 4.4 2.2 2.2
1/7/2012 0 7.2 2.8 2.3
1/8/2012 0 10 2.8 2
1/9/2012 4.3 9.4 5 3.4
1/10/2012 1 6.1 0.6 3.4
1/11/2012 0 6.1 -1.1 5.1
1/12/2012 0 6.1 -1.7 1.9
1/13/2012 0 5 -2.8 1.3
1/14/2012 0 16.1 1.7 4.3
1/15/2012 0 21.1 7.2 4.1
1/16/2012 0 20 6.1 2.1
1/17/2012 0 14.4 3.9 3
1/18/2012 0 18.3 4.4 4.3
1/19/2012 0 25.6 12.8 2.2
1/20/2012 0 18.9 13.9 2.8
1/21/2012 0 22.2 13.3 1.7
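One possible way to fill the gaps, sketched with the standard library. The median for numerical columns and the most frequent value for categorical columns are common choices, not necessarily the ones used in the talk; the `weather` gap below is hypothetical and only illustrates the categorical case.

```python
import statistics

# Precipitation for 1/14 to 1/19; None marks the missing readings (1/16 to 1/18).
precipitation = [0.0, 0.0, None, None, None, 0.0]
# Hypothetical categorical column with one missing entry.
weather = ["sun", "sun", None, "sun", "drizzle", "drizzle"]


def missing_flags(values):
    """Flag missing entries so the information is kept after imputation."""
    return [v is None for v in values]


def impute_numerical(values, strategy=statistics.median):
    """Replace missing numbers with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


def impute_categorical(values):
    """Replace missing categories with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]


flags = missing_flags(precipitation)
precipitation_filled = impute_numerical(precipitation)
weather_filled = impute_categorical(weather)
```

Keeping the flags alongside the filled column is one way to handle the Missing Not at Random case, where the fact that a value was missing is itself informative.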
Handling Outliers
• Removal
• Replacing values
• Capping
• Discretization
• Binning
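Capping and binning can be sketched as below. The 1.5 × IQR fences are a common rule of thumb and the bin edges (8.0 and 16.0) are hypothetical; neither is taken from the slides.

```python
import bisect
import statistics

# Columns copied from the sample weather table (21 days).
precipitation = [0.0, 10.9, 0.8, 20.3, 1.3, 2.5, 0.0, 0.0, 4.3, 1.0, 0.0,
                 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
temp_max = [12.8, 10.6, 11.7, 12.2, 8.9, 4.4, 7.2, 10.0, 9.4, 6.1, 6.1,
            6.1, 5.0, 16.1, 21.1, 20.0, 14.4, 18.3, 25.6, 18.9, 22.2]


def cap_outliers(values, k=1.5):
    """Clamp values to [Q1 - k*IQR, Q3 + k*IQR] and return the fences too."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values], (lower, upper)


def bin_value(value, edges, labels):
    """Map a number to the label of the interval it falls into."""
    return labels[bisect.bisect_right(edges, value)]


capped, (lower, upper) = cap_outliers(precipitation)
temp_bins = [bin_value(t, [8.0, 16.0], ["cold", "mild", "warm"]) for t in temp_max]
```

Because most precipitation readings are zero, the IQR fences are tight for this column; on skewed sensor data a percentile cap or a domain-specific limit may be more appropriate.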
Feature Selection
• Select features that are highly correlated
with the target
• Pick the most representative features from
the existing features
• Among the selected features, look for sets of
features that are highly correlated with
each other
• In each set, keep the feature with the highest
correlation to the target
• Use the final selected features to train the
model
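The procedure above can be sketched as a greedy filter. Several things here are assumptions: the thresholds (0.3 and 0.8), the encoding of the target as a rain/no-rain indicator, and the greedy pass that stands in for "pick the best feature in each correlated set".

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_features(features, target, min_target_corr=0.3, max_pair_corr=0.8):
    """Keep relevant features, then drop ones redundant with a better feature."""
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    candidates = sorted(
        (name for name in relevance if relevance[name] >= min_target_corr),
        key=relevance.get,
        reverse=True,
    )
    selected = []
    for name in candidates:  # most relevant first
        if all(abs(pearson(features[name], features[kept])) < max_pair_corr
               for kept in selected):
            selected.append(name)
    return selected


# Columns copied from the sample weather table (21 days).
features = {
    "precipitation": [0.0, 10.9, 0.8, 20.3, 1.3, 2.5, 0.0, 0.0, 4.3, 1.0, 0.0,
                      0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "temp_max": [12.8, 10.6, 11.7, 12.2, 8.9, 4.4, 7.2, 10.0, 9.4, 6.1, 6.1,
                 6.1, 5.0, 16.1, 21.1, 20.0, 14.4, 18.3, 25.6, 18.9, 22.2],
    "temp_min": [5.0, 2.8, 7.2, 5.6, 2.8, 2.2, 2.8, 2.8, 5.0, 0.6, -1.1,
                 -1.7, -2.8, 1.7, 7.2, 6.1, 3.9, 4.4, 12.8, 13.9, 13.3],
    "wind": [4.7, 4.5, 2.3, 4.7, 6.1, 2.2, 2.3, 2.0, 3.4, 3.4, 5.1,
             1.9, 1.3, 4.3, 4.1, 2.1, 3.0, 4.3, 2.2, 2.8, 1.7],
}
weather = ["drizzle"] + ["rain"] * 6 + ["sun"] + ["rain"] * 2 + ["sun"] * 8 + ["drizzle"] * 3
is_rain = [1.0 if w == "rain" else 0.0 for w in weather]  # assumed target encoding

selected = select_features(features, is_rain)
print(selected)
```

Pearson correlation against a 0/1 indicator is only a rough relevance score for a categorical target; other scores could be swapped into `relevance` without changing the rest of the sketch.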
Selected features imply state
Pearson Correlation
• Measure of the extent to which two random variables change in
tandem
• Value between -1 and +1
• -1 indicates a perfect negative linear correlation
• 0 indicates no linear correlation
• +1 indicates a perfect positive linear correlation
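For reference, the standard sample form of the coefficient for paired observations $(x_i, y_i)$, $i = 1, \dots, n$, with sample means $\bar{x}$ and $\bar{y}$ (the notation is an assumption, it does not appear on the slides):

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator grows when the two variables deviate from their means in the same direction, and the denominator rescales the result into the range -1 to +1.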
Correlation between variables
Feature Extraction
• Analyse existing features to generate new features
• Dimension Reduction
• Reducing a 4D/3D space → 2D space
PCA Analysis
precipitation temp_max weather
0 12.8 drizzle
10.9 10.6 rain
0.8 11.7 rain
20.3 12.2 rain
1.3 8.9 rain
2.5 4.4 rain
0 7.2 rain
0 10 sun
4.3 9.4 rain
1 6.1 rain
0 6.1 sun
0 6.1 sun
0 5 sun
0 16.1 sun
0 21.1 sun
0 20 sun
0 14.4 sun
0 18.3 sun
0 25.6 drizzle
0 18.9 drizzle
0 22.2 drizzle
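A sketch of PCA reducing the four measurement columns to two dimensions, written with the standard library only. It uses power iteration with deflation on the covariance matrix, which is one simple way to obtain the leading components and not necessarily what the talk used; the data is only centred, while standardizing first is common when features have different scales.

```python
import math

# Columns copied from the sample weather table (21 days).
precipitation = [0.0, 10.9, 0.8, 20.3, 1.3, 2.5, 0.0, 0.0, 4.3, 1.0, 0.0,
                 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
temp_max = [12.8, 10.6, 11.7, 12.2, 8.9, 4.4, 7.2, 10.0, 9.4, 6.1, 6.1,
            6.1, 5.0, 16.1, 21.1, 20.0, 14.4, 18.3, 25.6, 18.9, 22.2]
temp_min = [5.0, 2.8, 7.2, 5.6, 2.8, 2.2, 2.8, 2.8, 5.0, 0.6, -1.1,
            -1.7, -2.8, 1.7, 7.2, 6.1, 3.9, 4.4, 12.8, 13.9, 13.3]
wind = [4.7, 4.5, 2.3, 4.7, 6.1, 2.2, 2.3, 2.0, 3.4, 3.4, 5.1,
        1.9, 1.3, 4.3, 4.1, 2.1, 3.0, 4.3, 2.2, 2.8, 1.7]
rows = list(zip(precipitation, temp_max, temp_min, wind))


def pca(rows, n_components=2, iterations=500):
    """Return (components, projected) for the top principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(n_components):
        v = [1.0 / math.sqrt(d)] * d  # starting direction
        for _ in range(iterations):   # power iteration
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                         for i in range(d))
        components.append(v)
        # Deflation: remove the found direction before the next search.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
                 for c in centered]
    return components, projected


components, projected = pca(rows, n_components=2)  # 4D → 2D
```

Each row of `projected` holds the two new features for one day; each entry of `components` gives the weights of the original four columns in that new feature.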
Feature Scaling
• Features in our dataset are on different scales
• Different techniques
• Normalization: min-max scaling
• Values in a column are bounded to the fixed range 0 to 1
• Standardization: Z-score normalization
• Values in a column are rescaled to zero mean and unit
variance; the shape of the distribution is unchanged
• Standardization
• Brings each feature to a similar scale for ease of
comparison
• Performed within each feature, not across features
• Shifting the dataset to the origin can help learning models learn
faster and better
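Both techniques are short enough to sketch directly. The function names are illustrative, and the population standard deviation is used for the z-score; using the sample standard deviation instead is an equally common convention.

```python
import statistics

# First five precipitation and wind readings from the sample table.
precipitation = [0.0, 10.9, 0.8, 20.3, 1.3]
wind = [4.7, 4.5, 2.3, 4.7, 6.1]


def min_max_scale(values):
    """Normalization: map the column onto the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: shift to zero mean and scale to unit variance."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


# Scaling is applied within each feature (column), never across features.
precipitation_scaled = min_max_scale(precipitation)
wind_standardized = standardize(wind)
```

In a train/test workflow the minimum, maximum, mean and standard deviation would normally be computed on the training data only and then reused for the test data.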
Implementing ML algorithm for IoT solution
• Sampling
• Split the dataset into a training dataset
(80%) and a test dataset (20%)
• Build ML model
• Feed the training dataset to the ML algorithm for
training
• Output: a trained model/predictor
• Test ML model
• Pass the test dataset to the
predictor/model
• Evaluate model
• Determine the accuracy of the model
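The four steps can be sketched end to end. The slides do not name an algorithm, so a nearest-centroid classifier is swapped in here purely because it fits in a few lines; the seed, the two chosen features, and all function names are assumptions.

```python
import math
import random

# Two feature columns and the label column from the sample weather table.
precipitation = [0.0, 10.9, 0.8, 20.3, 1.3, 2.5, 0.0, 0.0, 4.3, 1.0, 0.0,
                 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
temp_max = [12.8, 10.6, 11.7, 12.2, 8.9, 4.4, 7.2, 10.0, 9.4, 6.1, 6.1,
            6.1, 5.0, 16.1, 21.1, 20.0, 14.4, 18.3, 25.6, 18.9, 22.2]
weather = ["drizzle"] + ["rain"] * 6 + ["sun"] + ["rain"] * 2 + ["sun"] * 8 + ["drizzle"] * 3
samples = list(zip(zip(precipitation, temp_max), weather))


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Sampling: shuffle, then hold out a fraction for testing."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_nearest_centroid(train):
    """Build: the model is the mean feature vector of each label."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Test: predict the label whose centroid is closest."""
    return min(model, key=lambda label: math.dist(model[label], features))


def accuracy(model, test):
    """Evaluate: fraction of test samples predicted correctly."""
    hits = sum(predict(model, features) == label for features, label in test)
    return hits / len(test)


train, test = train_test_split(samples)      # 80% / 20%
model = train_nearest_centroid(train)
score = accuracy(model, test)
```

With only 21 rows the test set holds four samples, so the score is very coarse; on a real IoT dataset the same steps apply with far more data.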
Summary
• Data Cleaning
• Impute missing values
• Encode categorical features
• Data Transformation
• Transform and scale numerical variables
• Feature Extraction
• Perform discretization
• Remove outliers
• Feature selection
• Perform feature extraction from date and
time
• Create new features from existing ones
• Feature Iteration
• Feed to the ML algorithm to produce a trained
model
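Two of the summary items, encoding categorical features and extracting features from date and time, can be sketched as below. The date format (month/day/year) is inferred from the sample table, and the chosen derived features and one-hot encoding are illustrative assumptions.

```python
from datetime import datetime


def date_features(date_text):
    """Derive simple calendar features from a month/day/year string."""
    d = datetime.strptime(date_text, "%m/%d/%Y")
    return {
        "month": d.month,
        "day_of_week": d.weekday(),      # Monday is 0, Sunday is 6
        "is_weekend": d.weekday() >= 5,
    }


def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 indicators."""
    return [1 if value == category else 0 for category in categories]


categories = sorted({"drizzle", "rain", "sun"})  # labels seen in the table

first_day = date_features("1/1/2012")
encoded = one_hot("rain", categories)
```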
Question & Answer
Thank You!
darrylng@nus.edu.sg