FeatureByte Inc. ©
Do you have too many meaningless features?
Semantic Feature Engineering may be the cure
Sergey Yurgenson
May 2023, Boston
What do we know about feature engineering?
Feature engineering or feature extraction or feature discovery is the process of using domain
knowledge to extract features (characteristics, properties, attributes) from raw data. The motivation is to
use these extra features to improve the quality of results from a machine learning process, compared
with supplying only the raw data to the machine learning process.
- Wikipedia
“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”
- Andrew Ng
“At the end of the day, some machine learning projects succeed and some fail. What makes the difference?
Easily the most important factor is the features used.”
- Prof. Pedro Domingos
Why do we need feature engineering?
Improves model performance: better features lead to more accurate models, which leads to better predictive power and, ultimately, better decisions.
Feature = X - Y
Do not force an algorithm to learn something that you already know.
[Figure with axes X and Y]
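A minimal sketch of the "Feature = X - Y" idea, under stated assumptions: the data is synthetic, the target is assumed to depend only on the difference X - Y, and `best_stump_accuracy` is a hypothetical helper standing in for the simplest possible learner (a single threshold). It is not taken from the talk; it only illustrates why supplying a known relationship as a feature can help.

```python
import random


def best_stump_accuracy(values, labels):
    """Accuracy of the best single-threshold rule `value > t` (or its negation)."""
    best = 0.0
    for t in sorted(set(values)):
        hits = sum((v > t) == y for v, y in zip(values, labels))
        # Allow the rule to be flipped, so count the better of the two directions.
        acc = max(hits, len(labels) - hits) / len(labels)
        best = max(best, acc)
    return best


random.seed(0)
xs = [random.uniform(0, 100) for _ in range(500)]
ys = [random.uniform(0, 100) for _ in range(500)]
# Assumption for illustration: the target depends only on the difference X - Y.
labels = [x - y > 0 for x, y in zip(xs, ys)]

acc_x = best_stump_accuracy(xs, labels)      # raw X alone
acc_y = best_stump_accuracy(ys, labels)      # raw Y alone
acc_diff = best_stump_accuracy([x - y for x, y in zip(xs, ys)], labels)

print(f"X only: {acc_x:.2f}, Y only: {acc_y:.2f}, X - Y: {acc_diff:.2f}")
```

With the engineered difference, a single threshold separates the classes exactly, while neither raw column can do so on its own.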
Current state of feature engineering - usual approach
Brute force: create a great many features and let the ML algorithm select the most predictive ones.
● Human - hunt for a “golden feature”
○ “For example, we create a “golden” feature called “mostly_dead”, that grouped insights from different biological functions or measures that were not compatible with life” (Nurit Cohen Inger, 2020)
○ “Golden Features: before SearchStrea.tsv filter by ObjectiveType=3: Sum of Objective==1 by SearchID, Sum of Objective==2 by SearchID, number of instances by Search ID. Combine these 3 features.” (Giba, Avito Ad Click competition)
● Machine - automated feature generation
■ Libraries: Featuretools, TSFresh, AutoFeat…
Current state of feature engineering - usual outcome
● Features that lack explainability and do not pass a common-sense filter
● Many correlated features, which require additional feature selection or dimensionality reduction (making them even less explainable)
● Increased danger of overfitting to the validation or even test sets
● Features are difficult to put into production, support and maintain
How do we solve this feature mess?
What if we create smarter, explainable features from the start?
● Easy to maintain, less prone to errors, easy to debug when an error occurs, easy to trace feature drift: data scientists and data engineers are happy
● Easy to understand and explain, models look more trustworthy: business stakeholders are happy
Systematic approach based on data semantics
Feature engineering through semantics
Data:
DT                    var23
02/04/2022 04:23:18   -5.37
03/05/2022 09:01:36   12.56
04/05/2022 11:39:02   38.10
04/08/2022 10:42:54   29.94
Who | Semantic | Feature
DS | Numeric | Aggregations: mean, min, max, std, number of records (?)
DS | Additive numeric (?) | + sum
Data owner, SME | Transaction amount [additive numeric] | Aggregations: mean, min, max, std, number of records, sum, mean/min/max/std for positive and negative separately…
Data owner, SME | Temperature at specific location [non-additive numeric] | Aggregations: mean, min, max, std [taking into account time between data points], number of records (?), number of days below specific temperature…
The more we know about a data field, the more meaningful and domain-specific the feature engineering is.
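A rough sketch of how the same four values could yield different features depending on the semantic assigned to the field. Several things here are assumptions rather than part of the talk: the date format is read as MM/DD/YYYY, the standard deviation is the population one, the "time between data points" adjustment is implemented as a simple step-weighted mean, only sums (not the full mean/min/max/std) are split by sign for brevity, and all function names are hypothetical.

```python
from datetime import datetime
from statistics import mean, pstdev

# Sample rows from the slide; the date format is ASSUMED to be MM/DD/YYYY.
rows = [
    ("02/04/2022 04:23:18", -5.37),
    ("03/05/2022 09:01:36", 12.56),
    ("04/05/2022 11:39:02", 38.10),
    ("04/08/2022 10:42:54", 29.94),
]
times = [datetime.strptime(t, "%m/%d/%Y %H:%M:%S") for t, _ in rows]
values = [v for _, v in rows]


def numeric_features(xs):
    """Aggregations that make sense for any numeric field."""
    return {"mean": mean(xs), "min": min(xs), "max": max(xs),
            "std": pstdev(xs), "count": len(xs)}


def additive_features(xs):
    """Additive numeric (e.g. transaction amount): sums are meaningful."""
    feats = numeric_features(xs)
    feats["sum"] = sum(xs)
    # Positive and negative amounts aggregated separately.
    feats["sum_positive"] = sum(x for x in xs if x > 0)
    feats["sum_negative"] = sum(x for x in xs if x < 0)
    return feats


def non_additive_features(ts, xs, threshold):
    """Non-additive numeric (e.g. temperature): no sum, time-aware mean."""
    # Weight each reading by the time until the next reading; the last
    # reading has no following gap, so zip() leaves it out.
    gaps = [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]
    weighted_mean = sum(x * g for x, g in zip(xs, gaps)) / sum(gaps)
    # Count distinct calendar days with at least one reading below threshold.
    days_below = len({t.date() for t, x in zip(ts, xs) if x < threshold})
    return {"time_weighted_mean": weighted_mean, "min": min(xs),
            "max": max(xs), "days_below_threshold": days_below}


amount_feats = additive_features(values)
temperature_feats = non_additive_features(times, values, threshold=0.0)
print(amount_feats)
print(temperature_feats)
```

The point of the sketch is the split itself: a sum appears only when the field is known to be additive, and the time-aware aggregations appear only when it is not.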
How do we organise domain knowledge and feature engineering knowledge?
Requirements:
● Easy to use and automate
● Easy to modify and expand
● Easy to share
Feature engineering through semantics
Sample of data semantic ontology (features attached to each semantic in parentheses):
Numeric (Mean, std, min, max…)
    Additive Numeric (Sum)
        Count of Items (Most frequent count)
            Number of Items in a basket (Is number > N)
            Number of patients
        Amount
    Non-Additive Numeric
        Long, Lat
Categorical (Most frequent value, entropy…)
    Nominal Categorical
        Codes
            Zip code
    Ordinal Categorical
Text
Main principle:
● Each data field has a specific semantic
● Each semantic has its own specific engineered features
● Features are also inherited from ancestors
● Engineered features also belong to the semantic ontology
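One possible way to express the "features are inherited from ancestors" principle in code. This is an illustrative sketch only: the `Semantic` class is hypothetical and is not the FeatureByte API, and the parent-child links are an assumed reading of the sample ontology diagram.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Semantic:
    """One node of the ontology: a name, its own features, and a parent."""
    name: str
    features: List[str] = field(default_factory=list)
    parent: Optional["Semantic"] = None

    def all_features(self) -> List[str]:
        """Own features plus everything inherited from ancestors, root first."""
        inherited = self.parent.all_features() if self.parent else []
        return inherited + self.features


# Node names follow the slide; the hierarchy is an assumed reading of it.
numeric = Semantic("Numeric", ["mean", "std", "min", "max"])
additive = Semantic("Additive Numeric", ["sum"], parent=numeric)
non_additive = Semantic("Non-Additive Numeric", parent=numeric)
count_of_items = Semantic("Count of Items", ["most frequent count"], parent=additive)

categorical = Semantic("Categorical", ["most frequent value", "entropy"])
nominal = Semantic("Nominal Categorical", parent=categorical)
codes = Semantic("Codes", parent=nominal)
zip_code = Semantic("Zip code", parent=codes)

print(count_of_items.all_features())
print(zip_code.all_features())
```

Adding a new domain-specific semantic then means adding one node with a parent; nothing else in the tree has to change.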
Feature engineering through semantics: zip code
Numeric (Mean, std, min, max)
    Non-Additive Numeric
        Circular
        Long, Lat
Categorical (Last known value, Most frequent value, Number of unique values, Fraction of most frequent value, Entropy)
    Nominal Categorical
        Codes
            Zip code
● Zip code
Feature engineering through semantics: zip code
As a nominal categorical code, zip code inherits the Categorical features:
● Last known zip
● Most frequent zip
● Number of unique zip
● Fraction of most frequent zip
● Entropy
Feature engineering through semantics: zip code
Zip code also maps to engineered features that are themselves part of the ontology (Long, Lat):
● Longitude
● Latitude
Feature engineering through semantics: zip code
Long, Lat are non-additive numeric, so they inherit the Numeric features (Last known value, Mean, std, min, max):
● Last known Longitude
● Last known Latitude
● Mean Longitude
● Mean Latitude
● STD Longitude
● STD Latitude
● Min Longitude
● Min Latitude
● Max Longitude
● Max Latitude
Feature engineering through semantics: zip code
Zip code: 17 features
● Longitude, Latitude
● Last known Longitude/Latitude
● Mean location (long/lat)
● Location boundaries (long/lat min/max)
● Location distribution (long/lat std)
● Last known zip code value
● Most frequent zip code value
● Number of unique zip code values
● Fraction of most frequent zip code value
● Entropy of zip codes
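The aggregate features in this list could be computed roughly as follows. This is a sketch under assumptions: the zip codes and their coordinates are made up for illustration, the helper names are hypothetical, the standard deviation is the population one, and the longitude mean is a plain arithmetic mean that ignores wrap-around at ±180°. It produces the 15 aggregate features; the slide's count of 17 also includes the plain Longitude and Latitude of the zip code itself.

```python
import math
from collections import Counter
from statistics import mean, pstdev

# Made-up codes and coordinates, for illustration only (longitude, latitude).
ZIP_TO_LONLAT = {"00001": (10.0, 50.0), "00002": (30.0, 40.0)}


def categorical_features(values):
    """Features inherited from the Categorical semantic."""
    counts = Counter(values)
    total = len(values)
    top_value, top_count = counts.most_common(1)[0]
    # Shannon entropy in bits of the observed value distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "zip_last_known": values[-1],
        "zip_most_frequent": top_value,
        "zip_n_unique": len(counts),
        "zip_fraction_most_frequent": top_count / total,
        "zip_entropy": entropy,
    }


def numeric_features(values, prefix):
    """Features inherited from the Numeric semantic (no sum: non-additive)."""
    return {
        f"{prefix}_last_known": values[-1],
        f"{prefix}_mean": mean(values),
        f"{prefix}_std": pstdev(values),
        f"{prefix}_min": min(values),
        f"{prefix}_max": max(values),
    }


# Time-ordered zip codes observed for one entity (oldest first).
history = ["00001", "00002", "00001", "00001"]
lons = [ZIP_TO_LONLAT[z][0] for z in history]
lats = [ZIP_TO_LONLAT[z][1] for z in history]

features = {}
features.update(categorical_features(history))
features.update(numeric_features(lons, "lon"))
features.update(numeric_features(lats, "lat"))
print(features)
```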
Feature engineering through semantics
Sample of data semantic ontology: the same ontology as before, now with two additional nodes:
● ICD-10-CM
● Sequence of codes
Benefits
● Facilitates communication between data SMEs and data science SMEs
● Supports knowledge continuity
● Allows data science collaboration and taps into the power of community
● Allows feature engineering automation based on data semantics and accumulated community knowledge
● Easy to extend domain-specific semantics without needing to change the rest of the taxonomy
● Facilitates a very structured way to deal with feature engineering
Current bottleneck:
● Manual process
● Requires domain expertise for domain-specific semantics
Can LLMs help?
Main problems when using LLMs for ML
(beyond IP, privacy, regulatory requirements…)
● Hallucinations
● Randomness
● LLMs may know little about areas where limited public information is available
Still, let's try…
Frame questions in a more deterministic way. Use the LLM as a classification algorithm.
Human in the loop: take LLM recommendations, draw your own conclusions.
Make the question more specific.
Lesson: LLMs could be helpful, but:
1. Human in the loop is needed
2. Reduce the degrees of freedom in the prompt
3. Many recommended features are too generic
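Lessons 1 and 2 might be combined along these lines. This is only a sketch: no LLM is called, the prompt wording and the helper names `build_prompt` and `parse_answer` are hypothetical, and the label set is borrowed from the sample ontology. The idea is to ask a closed question with a fixed menu of semantic types, and to pass anything that is not on the menu to a human reviewer instead of trusting it.

```python
# Fixed menu of answers, taken from the sample ontology in this deck.
SEMANTIC_LABELS = [
    "Additive Numeric",
    "Non-Additive Numeric",
    "Nominal Categorical",
    "Ordinal Categorical",
    "Text",
]


def build_prompt(column_name, description, labels=SEMANTIC_LABELS):
    """Frame the question as a closed classification task."""
    options = "\n".join(f"{i}. {label}" for i, label in enumerate(labels, start=1))
    return (
        "Classify the data field into exactly one of the semantic types below.\n"
        f"Field name: {column_name}\n"
        f"Description: {description}\n"
        f"Semantic types:\n{options}\n"
        "Answer with the number of one option and nothing else."
    )


def parse_answer(answer, labels=SEMANTIC_LABELS):
    """Return the chosen label, or None when the answer is off the menu."""
    text = answer.strip().rstrip(".")
    if text.isdecimal() and 1 <= int(text) <= len(labels):
        return labels[int(text) - 1]
    return None  # unexpected answer: leave the decision to a human reviewer


prompt = build_prompt("var23", "Temperature at a specific location")
print(prompt)
print(parse_answer(" 2. "))          # a valid menu choice
print(parse_answer("It depends"))    # free text is rejected
```

Keeping the answer space this small makes the model's output easy to validate, although it does not remove the need for a person to check the accepted labels.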
FeatureByte - open source feature engineering library
● Codifying a systematic and simple approach to feature engineering
● Embedding data semantics into the platform
● Semantic-based feature recommendation (coming soon)
● Building a community to extend domain-specific knowledge (coming soon)
https://github.com/featurebyte/featurebyte
www.featurebyte.com

Do you have too many meaningless features? — Featurebyte @ ODSC East 2023
