FEATURE ENGINEERING FOR DIVERSE DATA TYPES
Alice Zheng
October 10, 2016
Seattle PyLadies Meetup
1
2
MY JOURNEY SO FAR
Shortage of expertise and good tools in the market.
Applied machine learning/data science
Build ML tools
Write a book
3
MACHINE LEARNING IS USEFUL!
Model data.
Make predictions.
Build intelligent applications.
Play chess and Go!
4
THE MACHINE LEARNING PIPELINE
Raw data (example: “It is a puppy and it is extremely cute.”)
Features
Models
Predictions
Deploy in production
Models
6
A SIMPLE MODEL
X  Y  X and Y
1  1  1
0  1  0
1  0  0
0  0  0

f(x, y) = 0.5 x + 0.5 y – 1
g(x, y) = 1 if f(x, y) > 0
          0 if f(x, y) <= 0
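The threshold model above can be checked in a few lines. This is a minimal sketch: `f` and `g` follow the slide, while `g_and` is an assumed variant added here for comparison. With the formula as written, f(1, 1) = 0, so the strict test f > 0 returns 0 for every input; a non-strict test f >= 0 reproduces the X and Y column.

```python
def f(x, y):
    """Linear score from the slide."""
    return 0.5 * x + 0.5 * y - 1


def g(x, y):
    """Threshold exactly as written on the slide: 1 only when f > 0."""
    return 1 if f(x, y) > 0 else 0


def g_and(x, y):
    """Assumed variant (not on the slide): non-strict threshold f >= 0."""
    return 1 if f(x, y) >= 0 else 0


inputs = [(1, 1), (0, 1), (1, 0), (0, 0)]
strict = [g(x, y) for x, y in inputs]
non_strict = [g_and(x, y) for x, y in inputs]
print(strict)      # output of the threshold as written
print(non_strict)  # output of the assumed non-strict threshold
```

Another way to get the same effect would be to keep the strict test and use a bias between -1 and -0.5; either choice is an assumption about what the slide intended.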
7
VISUALIZING A MODEL
[Figure: g(x, y) plotted over inputs X and Y, with axis ticks at 0 and 1.]
8
FROM SIMPLE TO COMPLEX
[Figure: layered functions. Inputs X1, X2, X3, …, Xn feed a first layer r1(X1, X2), r2(X2, X3), …, rm(X1, Xn), which feeds a second layer s1(r1, r2), s2(r1, r3), …, sm(rm-1, rm).]
Use more complicated functions
or
Stack layers of simple functions (e.g., deep neural nets)
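One way to see why stacking helps, as a sketch: a single thresholded linear function cannot compute XOR, but two layers of such functions can. The units `step`, `r_or`, `r_and`, and `s_xor`, and their weights, are illustrative choices made here and do not come from the slides.

```python
def step(z):
    """Threshold unit: 1 if the score is positive, else 0."""
    return 1 if z > 0 else 0


def r_or(x1, x2):
    """First-layer unit: fires when at least one input is 1."""
    return step(x1 + x2 - 0.5)


def r_and(x1, x2):
    """First-layer unit: fires only when both inputs are 1."""
    return step(x1 + x2 - 1.5)


def s_xor(x1, x2):
    """Second-layer unit: OR but not AND, i.e. exclusive or."""
    return step(r_or(x1, x2) - r_and(x1, x2) - 0.5)


table = {(x1, x2): s_xor(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
print(table)
```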
9
BETWEEN RAW DATA AND MODELS
• Mathematical models take numeric input
• Raw data are not numeric (or not the right kind of numeric)
• Featurization: the step in-between
• Feature space: multi-dimensional numeric space where modeling happens
Feature Generation
Feature: An individual measurable property of a phenomenon being observed.
⎯ Christopher Bishop, “Pattern Recognition and Machine Learning”
TEXT
12
TURNING TEXT INTO FEATURES
Raw text:
It is a puppy and it is extremely cute.

What are the important measures? Keywords? Verb tense? Subject, object?

Bag-of-words feature vector:
it         2
is         2
puppy      1
and        1
cat        0
aardvark   0
cute       1
extremely  1
…          …
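The bag-of-words vector on this slide can be reproduced with a short sketch. The helper name `bag_of_words` and the simple lowercase, letters-only tokenizer are assumptions made for illustration; the vocabulary is the set of words listed on the slide.

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Words absent from the text get an explicit 0, as on the slide.
    return {word: counts[word] for word in vocabulary}


vocabulary = ["it", "is", "puppy", "and", "cat", "aardvark", "cute", "extremely"]
vector = bag_of_words("It is a puppy and it is extremely cute.", vocabulary)
print(vector)
```

Word order is discarded: any rearrangement of the same words gives the same vector.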
13
VISUALIZING BAG-OF-WORDS
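[Figure: the sentence “It is a puppy and it is extremely cute” shown as a point in bag-of-words feature space, at 1 on the puppy axis and 1 on the cute axis.]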
14
CLASSIFYING BAG-OF-WORDS
[Figure: four sentences plotted on the axes puppy, cat, and have: “I have a puppy”, “I have a cat”, “I have a kitten”, and “I have a dog and I have a pen”. A decision surface is drawn through the space.]
Feature Cleaning and Transformation
16
AUTO-GENERATED FEATURES ARE NOISY
Rank  Word  Doc Count    Rank  Word  Doc Count
1     the   1,416,058    11    was   929,703
2     and   1,381,324    12    this  844,824
3     a     1,263,126    13    but   822,313
4     i     1,230,214    14    my    786,595
5     to    1,196,238    15    that  777,045
6     it    1,027,835    16    with  775,044
7     of    1,025,638    17    on    735,419
8     for   993,430      18    they  720,994
9     is    988,547      19    you   701,015
10    in    961,518      20    have  692,749
Most popular words in Yelp reviews dataset (~ 6M reviews).
17
AUTO-GENERATED FEATURES ARE NOISY
Rank     Word         Doc Count    Rank     Word             Doc Count
357,480  cmtk8xyqg    1            357,470  attractif        1
357,479  tangified    1            357,469  chappagetti      1
357,478  laaaaaaasts  1            357,468  herdy            1
357,477  bailouts     1            357,467  csmpus           1
357,476  feautred     1            357,466  costoso          1
357,475  résine       1            357,465  freebased        1
357,474  chilyl       1            357,464  tikme            1
357,473  cariottis    1            357,463  traditionresort  1
357,472  enfeebled    1            357,462  jallisco         1
357,471  sparklely    1            357,461  zoawan           1
Least popular words in Yelp reviews dataset (~ 6M reviews).
18
FEATURE CLEANING
• Popular words and rare words are not helpful
• Manually defined blacklist – stopwords
a b c d e f g h i
able be came definitely each far get had ie
about became can described edu few gets happens if
above because cannot despite eg fifth getting hardly ignored
according become cant did eight first given has immediately
accordingly becomes cause different either five gives have in
across becoming causes do else followed go having inasmuch
… … … … … … … … …
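A stopword filter is a set-membership test. The sketch below uses only a small excerpt of the table above as its blacklist; the names `STOPWORDS` and `remove_stopwords` and the sample sentence are hypothetical.

```python
# Small excerpt of the slide's stopword table; a real list would be much longer.
STOPWORDS = {
    "able", "about", "above", "be", "can", "cannot", "did", "do",
    "get", "had", "has", "have", "if", "in",
}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the manually defined blacklist."""
    return [t for t in tokens if t not in stopwords]


tokens = "we did have a puppy".split()
filtered = remove_stopwords(tokens)
print(filtered)
```

Because the list is fixed in advance, words it does not contain (here "we" and "a") pass through no matter how common they are in the data.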
19
FEATURE CLEANING
• Frequency-based pruning
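Frequency-based pruning can be sketched as a filter on document counts. The thresholds `min_docs` and `max_doc_fraction`, the function names, and the four sample reviews are all hypothetical; suitable cutoffs depend on the dataset.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def prune_vocabulary(docs, min_docs=2, max_doc_fraction=0.9):
    """Keep words that are neither too rare nor too common.

    min_docs and max_doc_fraction are hypothetical tuning knobs.
    """
    df = document_frequency(docs)
    max_docs = max_doc_fraction * len(docs)
    return sorted(w for w, n in df.items() if min_docs <= n <= max_docs)


docs = [
    "the food was great",
    "the service was slow",
    "the food was cold",
    "the pizza arrived laaaaaaasts",
]
kept = prune_vocabulary(docs)
print(kept)
```

Here "the" is dropped for appearing in every document, and the one-off words are dropped for appearing in only one.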
20
STOPWORDS VS. FREQUENCY FILTERS
Stopwords
• No training required
• Can be exhaustive
• Inflexible
Frequency filters
• Adapts to data
• Also deals with rare words
• Needs tuning, hard to control
Both require manual attention
21
FEATURE SCALING WITH TF-IDF
• Scaling “evens out” the features
• A soft filter
• Tf-idf = term frequency x inverse document frequency
• Tf = Number of times a term appears in a document
• Idf = log(# total docs / # docs containing word w)
• Large for uncommon words, small for popular words
• Discounts popular words, highlights rare words
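The definitions above translate directly into code. This is a minimal sketch using the unsmoothed formula from the slide; the function name `tf_idf`, the whitespace tokenizer, and the three-document corpus are assumptions. Library implementations often add smoothing or normalization, so their numbers may differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf-idf weight} dict per document.

    tf  = number of times the word appears in the document
    idf = log(total docs / docs containing the word)
    """
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf})
    return weights


docs = ["good food good price", "bad food", "good service"]  # made-up corpus
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "price" appears once but in only one of three documents, so it outweighs "food", which appears once but in two of three.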
22
VISUALIZING TF-IDF
[Figure: four sentences plotted on the axes puppy, cat, and have: “I have a puppy”, “I have a cat”, “I have a kitten”, and “I have a dog and I have a pen”.]
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0
23
VISUALIZING TF-IDF
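[Figure: the same sentences after tf-idf scaling, on the axes puppy, cat, and have. “I have a puppy” sits at log 4 on the puppy axis and “I have a cat” at log 4 on the cat axis; “I have a kitten” and “I have a dog and I have a pen” share a single point.]
tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0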
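The numbers on the two tf-idf slides can be checked with the four example sentences. This sketch restricts attention to the three plotted axes; the variable names and the whitespace tokenizer are assumptions.

```python
import math

docs = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]
axes = ["puppy", "cat", "have"]

tokenized = [doc.lower().split() for doc in docs]


def idf(word):
    """log(# total docs / # docs containing the word)."""
    containing = sum(1 for tokens in tokenized if word in tokens)
    return math.log(len(tokenized) / containing)


# Raw counts and tf-idf weights along the three plotted axes.
counts = [[tokens.count(w) for w in axes] for tokens in tokenized]
scaled = [[tokens.count(w) * idf(w) for w in axes] for tokens in tokenized]

for doc, before, after in zip(docs, counts, scaled):
    print(doc, before, after)
```

Since "have" occurs in all four sentences, its idf is log 1 = 0 and the have axis collapses. On these three axes the kitten sentence and the dog-and-pen sentence both land on the origin, while the puppy and cat sentences keep a weight of log 4 on their own axis.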
IMAGES
25
REPRESENTING IMAGES
What are the “semantic atoms” of images?
• Semantic atom = a unit of meaning
26
COLOR HISTOGRAM
[Figure: two color histograms over White and Blue, each with bars at 40% and 60%.]
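A color histogram records how much of each color an image contains and nothing about where it appears. The sketch below uses two hypothetical one-row "images" with named colors instead of RGB values; assigning 40% to white and 60% to blue is an assumption about the slide.

```python
from collections import Counter


def color_histogram(pixels):
    """Fraction of pixels of each color, ignoring where the pixels sit."""
    flat = [color for row in pixels for color in row]
    counts = Counter(flat)
    return {color: n / len(flat) for color, n in counts.items()}


# Two hypothetical 1 x 5 "images" with the same colors in different places.
image_a = [["white", "white", "blue", "blue", "blue"]]
image_b = [["blue", "white", "blue", "white", "blue"]]

hist_a = color_histogram(image_a)
hist_b = color_histogram(image_b)
print(hist_a, hist_b)
```

The two images differ pixel by pixel yet produce identical histograms, which is why a color histogram alone carries no information about structure.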
27
INFORMATION ABOUT STRUCTURE
Collection of local patches encapsulates global structure
28
IMAGE GRADIENTS AND ORIENTATION HISTOGRAM
• Color changes indicate edges, patterns, or texture
• Image gradient: direction of largest change in color, starting from a pixel
[Figure: gradient directions at -135º, -90º, -45º, 0º, 45º, 90º, 135º, and 180º.]
• Gradient orientation histogram: indicates the prominent directions of color change in a patch of pixels
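A gradient orientation histogram for one patch can be sketched in plain Python. This is an illustration only, not the SIFT or HOG procedure: the central-difference gradient, the eight 45-degree bins, magnitude-weighted voting, and the sample patch are all assumptions made here.

```python
import math


def orientation_histogram(patch, n_bins=8):
    """Histogram of gradient directions over the interior pixels of a patch.

    patch is a 2-D list of grayscale intensities. Gradients use central
    differences; each pixel votes with its gradient magnitude. Angles are
    measured in image coordinates, where rows increase downward.
    """
    bin_width = 360 / n_bins
    hist = [0.0] * n_bins
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            gx = (patch[r][c + 1] - patch[r][c - 1]) / 2
            gy = (patch[r + 1][c] - patch[r - 1][c]) / 2
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat region: no direction to vote for
            angle = math.degrees(math.atan2(gy, gx)) % 360
            # Center bin 0 on 0 degrees, bin 1 on 45 degrees, and so on.
            index = int((angle + bin_width / 2) // bin_width) % n_bins
            hist[index] += magnitude
    return hist


# Hypothetical patch: dark on the left, bright on the right (a vertical edge).
patch = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]
hist = orientation_histogram(patch)
print(hist)
```

For this patch every interior gradient points from dark to bright along the row, so all of the weight falls in the bin centered on 0 degrees.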
29
SIFT IMAGE FEATURE PIPELINE
Lowe, ICCV 1999
30
DEEP LEARNING APPROACH
• Stack multiple layers – combine local features to form global features
• Similar in spirit to SIFT/HOG
“AlexNet” – Krizhevsky et al., NIPS 2012
31
VISUALIZING ALEXNET
Weights of a trained AlexNet. Left: first layer; right: second layer.
32
FEATURIZATION CHALLENGES
Example text: It is a puppy and it is extremely cute.
                                  Image, Audio      Text
                                  “Human native”    Conceptually abstract
Semantic content in data          Low               High
Difficulty of feature generation  Higher            Lower
33
KEY TO FEATURE ENGINEERING
• Features sit in-between data and models
• Need to encapsulate necessary semantic information from raw data
• Distribution of data in feature space should be easily manageable by intended model
• Natural text and logs contain higher level semantic information
• Easier to featurize than images and audio
• Requires ingenuity and intuition!
@RainyData alicez@amazon.com
Amazon Ad Platform is hiring!


Editor's Notes

  • #5: Features sit between raw data and model. They can make or break an application.