Feature Engineering for
Machine Learning
Amanda Casari
Principal Product Manager + Data Scientist
Concur Labs @ SAP Concur
@amcasari
here to there via random walk
product + data @ SAP Concur
control systems engineering + robotics + legos
officer in US Navy
operations research analyst
wandering dirtbag + conservation volunteer
EE + applied math + complex systems
underwater robotics
consultant extraordinaire
stay at home mom
co-author
NASA Datanaut
data science is not magic…
…but it is a process (sometimes painful)
@MROGATI
it is easy to get turned around…
idea
research
exploration
hypotheses
model
outcomes
feedback
…and it is easy to get mixed up
xkcd #1838
…so let’s focus on getting from data to models
feature engineering goes here!
when we say…
DATA SCIENCE
• … the interdisciplinary intersection of methods, processes, algorithms and problem-solving techniques to extract knowledge from data [1]
MACHINE LEARNING [ML]
• … fitting mathematical models to data in order to derive insights or make predictions [2]
FEATURE
• … a numeric representation of an aspect of raw data [2]
FEATURE ENGINEERING
• … the act of extracting features from raw data and transforming them into formats that are suitable for the machine learning model [2]
hint: our community is well represented in Wikipedia
[n.b. ethics]
DATA
• … is an abstract representation of reality, not reality itself. Data is a part of the system of record, but not the actual system itself.
SOCIAL CONSTRUCT
• … “jointly constructed understandings of the world that form the basis for shared assumptions about reality” [1]
BIAS
• … results from unfair sampling of a population, or from an estimation process that does not give accurate results on average [2]
ACCOUNTABILITY
• … you are answerable for your decisions and obligated to be able to explain the resulting consequences [3]
hint: much more about this w/ @kjam at 14:30
how to choose?
1/ FRAME YOUR PROBLEM
• Can you frame your problem in a way that machine learning could be useful? e.g. prediction
2/ UNDERSTAND YOUR DATA
• What data will be most helpful for building a better understanding of this problem?
3/ FRAME YOUR FEATURE GOALS
• What are you optimizing for?
  • Iteration speed
  • Model performance
4/ TEST, ITERATE, TEST AGAIN
• Check your choices for robustness
• Validate, but realize this will still change
vector space
scalar: a single numeric feature
vector: an ordered list of scalars
Example:
1/ two-dimensional vector, v = [1, -1]
feature space
In data, abstract vectors take on actual meaning
Examples:
• 1/ a vector can represent a person’s preference for songs
  • Song = feature
  • +1: Thumbs-up
  • -1: Thumbs-down
• 2/ a song can represent individuals’ preferences in a group
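A minimal sketch of the two views above, assuming a tiny made-up ratings table (the names, songs, and ratings are hypothetical, not from the talk). Reading a row gives a person as a vector over songs; reading a column gives a song as a vector over people.

```python
# Hypothetical thumbs-up (+1) / thumbs-down (-1) data, for illustration only.
songs = ["song_a", "song_b", "song_c"]
ratings = {
    "ana":   [+1, -1, +1],
    "bruno": [-1, -1, +1],
    "carla": [+1, +1, -1],
}

def person_vector(name):
    """1/ One person as a vector: one coordinate per song (song = feature)."""
    return ratings[name]

def song_vector(song):
    """2/ One song as a vector: one coordinate per person in the group."""
    j = songs.index(song)
    return [ratings[name][j] for name in ratings]

print(person_vector("ana"))    # [1, -1, 1]
print(song_vector("song_c"))   # [1, 1, -1]
```

Same table, two readings: which one you pick depends on whether people or songs are the things being modeled.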
Counts: Fancy Tricks with Simple Numbers
counts: binarization
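A small sketch of binarization, assuming made-up play counts and a threshold of 0 (both are illustrative choices, not values from the talk): any count above the threshold becomes 1, everything else becomes 0.

```python
def binarize(counts, threshold=0):
    """Map each raw count to 1 if it exceeds the threshold, else 0."""
    return [1 if c > threshold else 0 for c in counts]

listen_counts = [0, 1, 3, 250, 0, 12]   # hypothetical play counts
print(binarize(listen_counts))          # [0, 1, 1, 1, 0, 1]
```

The 250 plays and the single play end up with the same value, which is the point when the raw count is a noisy measure of preference.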
counts: binning
counts: fixed width binning
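A sketch of two common fixed-width schemes, assuming hypothetical ages and counts: bins of equal linear width, and bins by order of magnitude for counts that span several powers of ten. The function names are illustrative.

```python
def fixed_width_bin(values, width):
    """Bin index = value // width, so every bin covers the same range."""
    return [value // width for value in values]

def power_of_ten_bin(values):
    """Bin positive integers by digit count: 1-9 -> 0, 10-99 -> 1, 100-999 -> 2."""
    return [len(str(int(value))) - 1 for value in values]

ages = [6, 15, 23, 37, 40, 59, 61, 88]     # hypothetical
print(fixed_width_bin(ages, 10))           # [0, 1, 2, 3, 4, 5, 6, 8]

counts = [7, 42, 310, 9850, 12000]         # hypothetical
print(power_of_ten_bin(counts))            # [0, 1, 2, 3, 4]
```

Fixed-width bins are cheap to compute, but with skewed data many bins can end up empty.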
counts: adaptive binning
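A sketch of adaptive binning using quantiles, assuming a small skewed sample (hypothetical data). The cut points are read off the sorted data, so bin edges follow the distribution instead of a fixed width. The cut-point rule here is one simple choice among several; library implementations may interpolate differently.

```python
from bisect import bisect_right

def quantile_cut_points(values, n_bins):
    """Pick n_bins - 1 cut points at evenly spaced ranks of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[k * n // n_bins] for k in range(1, n_bins)]

def adaptive_bin(values, n_bins):
    """Bin index = number of cut points that are <= the value."""
    cuts = quantile_cut_points(values, n_bins)
    return [bisect_right(cuts, v) for v in values]

counts = [1, 2, 2, 3, 5, 8, 40, 900]      # hypothetical, heavily skewed
print(quantile_cut_points(counts, 4))     # [2, 5, 40]
print(adaptive_bin(counts, 4))            # [0, 1, 1, 1, 2, 2, 3, 3]
```

With four fixed-width bins over the same range, every value except 900 would fall into the first bin.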
counts: log transform binning
log_a(a^x) = x, where a is a positive constant and x can be any positive number
a^0 = 1, log_a(1) = 0
tl;dr
the log function compresses the range of large numbers and expands the range of small numbers
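A sketch of a log transform on counts, assuming base 10 and a +1 offset so that zero counts stay defined (both are illustrative assumptions, not choices stated in the talk).

```python
import math

def log_transform(counts):
    """Apply log10(count + 1); the +1 is an assumption that keeps log(0) out."""
    return [math.log10(c + 1) for c in counts]

counts = [0, 9, 99, 999, 99999]            # hypothetical counts
print([round(x, 6) for x in log_transform(counts)])
# [0.0, 1.0, 2.0, 3.0, 5.0]
```

The gap between 999 and 99999 shrinks from 99000 to 2, while the gap between 0 and 9 grows relative to the rest of the range.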
What does scaling do for features?
normalization: feature scaling
proper scaling preserves underlying shape
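A sketch of three common scalings on a single feature column: min-max scaling, standardization, and L2 normalization. Whether these are exactly the three shown on the slides is an assumption. The sketch also assumes the feature is not constant, otherwise the divisions fail.

```python
import math

def min_max_scale(xs):
    """Squeeze values into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Shift to mean 0 and scale to (population) variance 1."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

def l2_normalize(xs):
    """Divide by the Euclidean norm so the column has length 1."""
    norm = math.sqrt(sum(x * x for x in xs))
    return [x / norm for x in xs]

feature = [0.0, 3.0, 4.0]                  # hypothetical feature column
print(min_max_scale(feature))              # [0.0, 0.75, 1.0]
print(l2_normalize(feature))               # [0.0, 0.6, 0.8]
```

In each case every value is shifted and divided by the same constants, so the ordering and relative spacing of the points stay the same; only the axis changes.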
Text: Flatten, Filter, Chunk
why text?
hedonometer.org
flatten: bag-of-words (BoW)
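A sketch of bag-of-words with a deliberately naive tokenizer (lowercase, letters and apostrophes only). The tokenizer and the sample sentence are assumptions for illustration; a real pipeline would use a library tokenizer.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Flatten a document into word -> count, discarding word order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

doc = "it is a puppy and it is extremely cute"   # hypothetical document
bow = bag_of_words(doc)
print(bow["it"], bow["puppy"], bow["kitten"])    # 2 1 0
```

Because order is thrown away, "cute puppy" and "puppy cute" map to the same bag.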
filter: frequency based filtering (stopwords)
These NLP libraries have both English + Portuguese corpora, models, etc.:
1/ spaCy
2/ NLTK
3/ OpenNLP
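A sketch of frequency-based filtering, assuming a tiny hand-picked stopword set and a minimum document frequency for rare words. Both the list and the threshold are illustrative; the libraries above ship curated stopword lists that would replace the hypothetical `STOPWORDS` here.

```python
from collections import Counter

STOPWORDS = {"a", "and", "is", "it", "the"}   # hypothetical, illustration only

def filter_tokens(docs, stopwords=STOPWORDS, min_doc_freq=1):
    """Drop stopwords and words found in fewer than min_doc_freq documents."""
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return [
        [w for w in doc if w not in stopwords and doc_freq[w] >= min_doc_freq]
        for doc in docs
    ]

docs = [
    "it is a puppy and it is cute".split(),
    "the puppy is asleep".split(),
    "the kitten is cute".split(),
]
print(filter_tokens(docs, min_doc_freq=2))
# [['puppy', 'cute'], ['puppy'], ['cute']]
```

Stopwords trim the very common end of the vocabulary and the document-frequency cutoff trims the very rare end.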
chunk: parts of speech matter
Pop Chart Lab, npr.org
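A sketch of chunking by part of speech, assuming the tokens are already tagged. The tags below are written by hand for illustration; in practice a tagger from an NLP library would supply them. The rule used here (zero or more adjectives followed by one or more nouns) is one simple noun-phrase pattern, not the talk's method.

```python
def noun_chunks(tagged):
    """Return maximal ADJ* NOUN+ runs from a list of (word, tag) pairs."""
    chunks, current, seen_noun = [], [], False
    # A sentinel token flushes the last chunk at the end of the sentence.
    for word, tag in tagged + [("", "END")]:
        if tag == "NOUN":
            current.append(word)
            seen_noun = True
        elif tag == "ADJ" and not seen_noun:
            current.append(word)
        else:
            if seen_noun:
                chunks.append(" ".join(current))
            current, seen_noun = [], False
            if tag == "ADJ":          # an adjective may start the next chunk
                current.append(word)
    return chunks

# Hand-tagged example sentence (tags are assumptions, not tagger output).
tagged = [
    ("the", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumps", "VERB"), ("over", "ADP"),
    ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"),
]
print(noun_chunks(tagged))   # ['quick brown fox', 'lazy dog']
```

Chunks like these keep multi-word units together that a plain bag-of-words would split apart.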
thank you
@RainyData
code repo
buy the book here!

Feature Engineering for Machine Learning at QConSP
