Do you have too many meaningless features? — Featurebyte @ ODSC East 2023

FeatureByte Inc. ©
Do you have too many meaningless features?
Semantic Feature Engineering may be the cure
Sergey Yurgenson
May 2023, Boston

What do we know about feature engineering?

Feature engineering or feature extraction or feature discovery is the process of using domain
knowledge to extract features (characteristics, properties, attributes) from raw data. The motivation is to
use these extra features to improve the quality of results from a machine learning process, compared
with supplying only the raw data to the machine learning process.
- Wikipedia
“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’
is basically feature engineering”.
- Andrew Ng
“At the end of the day, some machine learning projects succeed and some fail. What makes the difference?
Easily the most important factor is the features used.”
- Prof. Pedro Domingos

Why do we need feature engineering?
Improves model performance
Better features lead to more accurate models, which lead to better predictive power and, ultimately, better decisions.
Feature = X - Y
Do not force an algorithm to learn something that you already know.
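The X - Y point can be illustrated with a small synthetic experiment. The sketch below is not from the talk: it assumes a target that depends only on the sign of X - Y, and uses a hypothetical helper, best_stump_accuracy, that finds the best single-threshold rule on one feature. On either raw column such a rule stays far from perfect, while on the engineered difference it separates the classes exactly.

```python
import random

def best_stump_accuracy(values, labels):
    """Best accuracy of a one-threshold rule on a single feature.

    Assumes distinct feature values, which holds for the random floats below.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(label for _, label in pairs)
    best = max(total_pos, n - total_pos) / n  # always predict the majority class
    pos_seen = 0
    for i, (_, label) in enumerate(pairs, start=1):
        pos_seen += label
        # Rule: predict 0 for the i smallest values and 1 for the rest.
        correct = (i - pos_seen) + (total_pos - pos_seen)
        # The inverted rule gets every other row right.
        best = max(best, correct / n, (n - correct) / n)
    return best

rng = random.Random(0)
x = [rng.random() for _ in range(2000)]
y = [rng.random() for _ in range(2000)]
labels = [1 if a > b else 0 for a, b in zip(x, y)]  # target depends only on X - Y

acc_x = best_stump_accuracy(x, labels)
acc_y = best_stump_accuracy(y, labels)
acc_diff = best_stump_accuracy([a - b for a, b in zip(x, y)], labels)
print(f"X alone: {acc_x:.2f}, Y alone: {acc_y:.2f}, X - Y: {acc_diff:.2f}")
```

A real model can combine many thresholds, so the gap is usually smaller in practice, but it still has to spend capacity approximating a relationship that one subtraction states exactly.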

Current state of feature engineering - usual approach
Brute force: create a great many features and let the ML algorithm select the most predictive ones.
● Human - hunt for a “golden feature”
○ “For example, we create a “golden” feature called “mostly_dead”, that grouped insights from different biological functions or measures that were not compatible with life” (Nurit Cohen Inger, 2020)
○ “Golden Features: before SearchStrea.tsv filter by ObjectiveType=3: Sum of Objective==1 by SearchID, Sum of Objective==2 by SearchID, number of instances by Search ID. Combine these 3 features.” (Giba, Avito Ad Click competition)
● Machine - automated feature generation
○ Libraries: Featuretools, TSFresh, AutoFeat…

Current state of feature engineering - usual outcome
● Features that lack explainability and do not pass a common-sense filter
● A lot of correlated features
○ These require additional feature selection or dimensionality reduction (making the features even less explainable)
● Increased danger of overfitting to the validation or even test sets
● Features are difficult to put into production, support and maintain
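A minimal sketch of how brute-force generation produces correlated features. It does not use any of the libraries named above: the two raw columns, the operator sets and the 0.9 correlation threshold are all assumptions chosen for illustration.

```python
import itertools
import math
import random

def pearson(u, v):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

rng = random.Random(0)
# Two hypothetical raw columns with no particular meaning attached.
raw = {name: [rng.uniform(1, 10) for _ in range(500)] for name in ("a", "b")}

unary = {"log": math.log, "sqrt": math.sqrt, "square": lambda v: v * v}
binary = {
    "+": lambda p, q: p + q,
    "-": lambda p, q: p - q,
    "*": lambda p, q: p * q,
    "/": lambda p, q: p / q,
}

# Brute force: apply every operator to every column or pair of columns.
features = dict(raw)
for name, col in raw.items():
    for op, fn in unary.items():
        features[f"{op}({name})"] = [fn(v) for v in col]
for (n1, c1), (n2, c2) in itertools.combinations(raw.items(), 2):
    for op, fn in binary.items():
        features[f"{n1} {op} {n2}"] = [fn(p, q) for p, q in zip(c1, c2)]

# Count feature pairs that carry nearly the same information.
redundant = [
    (f, g)
    for (f, cf), (g, cg) in itertools.combinations(features.items(), 2)
    if abs(pearson(cf, cg)) > 0.9
]
print(f"{len(features)} features, {len(redundant)} highly correlated pairs")
```

Even with two columns and seven operators, several generated features are near-duplicates of each other, which is what forces the extra selection or dimensionality-reduction step.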

How to solve that feature mess problem?

What if we create smarter, explainable features from the start?
● Easy to maintain, less prone to errors, easy to debug when an error happens, and easy to trace feature drift: data scientists and data engineers are happy
● Easy to understand and explain, so models look more trustworthy: business stakeholders are happy

Systematic approach based on data semantics

Feature engineering through semantics
Data:
DT                     var23
02/04/2022 04:23:18    -5.37
03/05/2022 09:01:36    12.56
04/05/2022 11:39:02    38.10
04/08/2022 10:42:54    29.94

Who | Semantic | Feature
DS | Numeric | Aggregations: mean, min, max, std, number of records (?)
DS | Additive numeric (?) | + sum
Data owner, SME | Transaction amount [additive numeric] | Aggregations: mean, min, max, std, number of records, sum, mean/min/max/std for positive and negative separately…
Data owner, SME | Temperature at specific location [non-additive numeric] | Aggregations: mean, min, max, std [taking into account time between data points], number of records (?); number of days below specific temperature…

The more we know about a data field, the more meaningful and domain-specific feature engineering is.
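To make the contrast concrete, the sketch below treats the same four readings under both semantics. It assumes the timestamps are in MM/DD/YYYY format, and it assumes that "taking into account time between data points" means each reading is held until the next one arrives; the slide specifies neither, and time_weighted_mean is a hypothetical helper, not a library function.

```python
from datetime import datetime

def time_weighted_mean(readings):
    """Mean of (timestamp, value) readings, weighted by how long each value held.

    Assumption: a value stays in effect until the next reading, so the last
    reading only closes the final interval and carries no weight of its own.
    """
    if not readings:
        raise ValueError("no readings")
    ordered = sorted(readings)
    weighted_sum = 0.0
    total_seconds = 0.0
    for (t0, v0), (t1, _) in zip(ordered, ordered[1:]):
        dt = (t1 - t0).total_seconds()
        weighted_sum += v0 * dt
        total_seconds += dt
    if total_seconds == 0:
        return ordered[0][1]  # single reading (or identical timestamps)
    return weighted_sum / total_seconds

rows = [
    ("02/04/2022 04:23:18", -5.37),
    ("03/05/2022 09:01:36", 12.56),
    ("04/05/2022 11:39:02", 38.10),
    ("04/08/2022 10:42:54", 29.94),
]
# Assumed date format: month/day/year.
readings = [(datetime.strptime(ts, "%m/%d/%Y %H:%M:%S"), v) for ts, v in rows]
values = [v for _, v in readings]

# Additive numeric (e.g. transaction amount): a sum is meaningful.
print("sum:", sum(values))
# Non-additive numeric (e.g. temperature): a sum is not; weight the mean by time.
print("plain mean:", sum(values) / len(values))
print("time-weighted mean:", time_weighted_mean(readings))
```

The gaps between readings here range from days to a month, so the plain and time-weighted means differ noticeably; which one is right depends entirely on what var23 actually measures.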

How to organise domain knowledge and feature engineering knowledge?
Requirements:
● Easy to use and automate
● Easy to modify and expand
● Easy to share

Sample of data semantic ontology
[Diagram: ontology tree, reconstructed from the slide labels; features added at each node are in parentheses]
Numeric (mean, std, min, max…)
    Additive Numeric (sum)
        Count of Items (most frequent count)
            Number of Items in a basket (is number > N)
            Number of patients
        Amount
    Non-Additive Numeric
Categorical (most frequent value, entropy…)
    Nominal Categorical
        Codes
            Zip code → Long, Lat
    Ordinal Categorical
Text

Main principle:
● Each data field has a specific semantic
● Each semantic has its own specific engineered features
● Features are also inherited from ancestors
● Engineered features also belong to the semantic ontology
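One way to encode these principles is a parent-pointer table in which each semantic lists only the features it adds, and the full feature set is collected by walking up to the root. This is a sketch, not FeatureByte's implementation: the ONTOLOGY dict, features_for and register are hypothetical names, and the parent links (for example Codes under Nominal Categorical) are assumed from the diagram.

```python
# semantic -> (parent semantic, features introduced at this node)
ONTOLOGY = {
    "Numeric": (None, ["mean", "std", "min", "max"]),
    "Additive Numeric": ("Numeric", ["sum"]),
    "Non-Additive Numeric": ("Numeric", []),
    "Categorical": (None, ["most frequent value", "entropy"]),
    "Nominal Categorical": ("Categorical", []),
    "Ordinal Categorical": ("Categorical", []),
    "Codes": ("Nominal Categorical", []),
}

def features_for(semantic, ontology=ONTOLOGY):
    """Return inherited features first, then the node's own features."""
    chain = []
    node = semantic
    while node is not None:
        parent, own = ontology[node]
        chain.append(own)
        node = parent
    return [feature for own in reversed(chain) for feature in own]

def register(semantic, parent, features, ontology=ONTOLOGY):
    """Add a domain-specific semantic without touching existing nodes."""
    if parent not in ontology:
        raise KeyError(f"unknown parent semantic: {parent}")
    ontology[semantic] = (parent, list(features))

# Extending the ontology is a single new entry.
register("Amount", "Additive Numeric", [])
register("Zip code", "Codes", [])

print(features_for("Amount"))
print(features_for("Zip code"))
```

Because a new node only names its parent, adding a domain-specific semantic leaves the rest of the table unchanged.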

Feature engineering through semantics: zip code
[Diagram: ontology nodes involved in the zip code walk-through]
Numeric: mean, std, min, max
Categorical: last known value, most frequent value, number of unique values, fraction of most frequent value, entropy
Other nodes: Non-Additive Numeric, Nominal Categorical, Codes, Zip code, Long, Lat, Circular
Starting field:
● Zip code
Step 1: categorical features of the zip code
● Last known zip
● Most frequent zip
● Number of unique zip
● Fraction of most frequent zip
● Entropy

Step 2: fields engineered from the zip code
● Longitude
● Latitude

Step 3: numeric features (last known value, mean, std, min, max) of the engineered fields
● Last known Longitude
● Last known Latitude
● Mean Longitude
● Mean Latitude
● STD Longitude
● STD Latitude
● Min Longitude
● Min Latitude
● Max Longitude
● Max Latitude

Zip code: 17 features
● Longitude, Latitude
● Last known Longitude/Latitude
● Mean location (long/lat)
● Location boundaries (long/lat min/max)
● Location distribution (long/lat std)
● Last known zip code value
● Most frequent zip code value
● Number of unique zip code values
● Fraction of most frequent zip code value
● Entropy of zip codes
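The aggregated part of this list can be sketched in a few lines. The code below computes 15 of the 17 features from a history of zip codes; the remaining two are the raw Longitude and Latitude values themselves. Everything concrete here is an assumption: the placeholder codes Z1 to Z3, the made-up centroid coordinates, the use of population standard deviation, and natural-log entropy.

```python
import math
import statistics
from collections import Counter

# Hypothetical lookup: placeholder zip codes with made-up (longitude, latitude).
ZIP_CENTROIDS = {
    "Z1": (-71.0, 42.0),
    "Z2": (-72.0, 43.0),
    "Z3": (-70.0, 41.0),
}

def zip_code_features(history):
    """Aggregate a chronologically ordered list of zip codes into features."""
    counts = Counter(history)
    n = len(history)
    top_zip, top_count = counts.most_common(1)[0]
    # Categorical features inherited from the Categorical semantic.
    features = {
        "last_known_zip": history[-1],
        "most_frequent_zip": top_zip,
        "n_unique_zip": len(counts),
        "fraction_most_frequent_zip": top_count / n,
        "entropy_zip": -sum((c / n) * math.log(c / n) for c in counts.values()),
    }
    # Numeric features of the fields engineered from the zip code.
    longitudes = [ZIP_CENTROIDS[z][0] for z in history]
    latitudes = [ZIP_CENTROIDS[z][1] for z in history]
    for name, series in (("longitude", longitudes), ("latitude", latitudes)):
        features[f"last_known_{name}"] = series[-1]
        features[f"mean_{name}"] = statistics.fmean(series)
        features[f"std_{name}"] = statistics.pstdev(series)
        features[f"min_{name}"] = min(series)
        features[f"max_{name}"] = max(series)
    return features

history = ["Z1", "Z1", "Z2", "Z3"]
features = zip_code_features(history)
for name, value in features.items():
    print(name, value)
```

Each output has a direct reading (where the entity usually is, how far it moves, how concentrated its locations are), which is the explainability the semantic approach is after.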

Sample of data semantic ontology (extended)
[Diagram: the same ontology as in the earlier sample, with two nodes added: ICD-10-CM and Sequence of codes]

Benefits
● Facilitates communication between data SMEs and data science SMEs
● Supports knowledge continuity
● Allows data science collaboration and taps into the power of the community
● Allows feature engineering automation based on data semantics and accumulated community knowledge
● Easy to extend domain-specific semantics without needing to change the rest of the taxonomy
● Facilitates a very structured way to deal with feature engineering

Current bottleneck:
● Manual process
● Requires domain expertise for domain-specific semantics

Main problems when using LLMs for ML
(beyond IP, privacy, regulatory requirements…)
● Hallucinations
● Randomness
● LLM may have limited knowledge of areas with limited public
information available

Frame questions in a more deterministic way. Use the LLM as a classification algorithm.

Human in the loop - take LLM recommendations, make your own conclusions

Make the question more specific.

Lesson: LLMs could be helpful, but:
1. A human in the loop is needed
2. Reduce the degrees of freedom in the prompt
3. Many recommended features are too generic
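Lessons 1 and 2 can be combined in a thin wrapper that asks the model to pick from a closed list and rejects anything else. This is a sketch under stated assumptions: ask_llm stands for any function that takes a prompt string and returns the model's text reply (no specific LLM API is implied), and ALLOWED_SEMANTICS, build_prompt and classify_semantic are hypothetical names.

```python
# Closed label set, taken from the sample ontology; the prompt offers nothing else.
ALLOWED_SEMANTICS = (
    "Additive Numeric",
    "Non-Additive Numeric",
    "Nominal Categorical",
    "Ordinal Categorical",
    "Text",
)

def build_prompt(column_name, sample_values):
    """Frame the task as classification over a fixed list of labels."""
    options = "\n".join(f"- {label}" for label in ALLOWED_SEMANTICS)
    samples = ", ".join(str(v) for v in sample_values)
    return (
        f"Column name: {column_name}\n"
        f"Sample values: {samples}\n"
        "Choose exactly one semantic type from this list and reply with it verbatim:\n"
        f"{options}"
    )

def classify_semantic(column_name, sample_values, ask_llm):
    """Return a recommended label, or None when the reply is off-list.

    The result is only a recommendation; None means a human has to decide.
    """
    answer = ask_llm(build_prompt(column_name, sample_values)).strip()
    return answer if answer in ALLOWED_SEMANTICS else None

# Stand-ins for a real model, used only to show both outcomes.
well_behaved = lambda prompt: " Additive Numeric\n"
free_form = lambda prompt: "This looks like it could be a price of some kind."

print(classify_semantic("transaction_amount", [12.5, -3.0], well_behaved))
print(classify_semantic("transaction_amount", [12.5, -3.0], free_form))
```

Restricting the reply to a known label does not remove hallucinations or randomness, but it makes them detectable, and every accepted label still goes to a reviewer before it drives feature generation.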

Featurebyte - open source feature engineering library
● Codifying a systematic and simple approach to Feature Engineering
● Embedding data semantics into the platform
● Semantic-based feature recommendation (coming soon)
● Building a community to extend domain-specific knowledge (coming soon)
https://github.com/featurebyte/featurebyte
www.featurebyte.com
