How to Effectively Combine
Numerical Features and Categorical Features
June 14th, 2017
Liangjie Hong
Head of Data Science, Etsy Inc.
• Head of Data Science
- Etsy Inc. in NYC, NY (2016 – Present)
- Search & Discovery; Personalization and Recommendation; Computational Advertising
• Senior Manager of Research
- Yahoo Research in Sunnyvale, CA (2013 – 2016)
Leading science efforts for personalization and search
• Published papers in SIGIR, WWW, KDD, CIKM, AAAI, WSDM, RecSys and ICML
• 3 Best Paper Awards, 2000+ Citations with H-Index 18
• Program committee member for KDD, WWW, SIGIR, WSDM, AAAI, EMNLP, ICWSM, ACL, CIKM,
and IJCAI; reviewer for various journals
Liangjie Hong
About This Paper
• Authors
Qian Zhao, PhD Student from University of Minnesota
Yue Shi, Research Scientist at Facebook
Liangjie Hong, Head of Data Science at Etsy Inc.
• Paper Venue
Full Research Paper in The 26th International World Wide Web Conference, 2017 (WWW 2017)
High-Level Takeaways
• A new family of models to handle categorical features and numerical features well by combining
embedding models and tree-based models
• A simple learning algorithm that can be easily built on top of existing data mining and machine learning
toolkits
• State-of-the-art performance on major datasets
Why we need GB-CENT
Why we need GB-CENT
Motivations
• Real-World Data
Categorical features: user ids, item ids, words, document ids, ...
Numerical features: dwell time, average purchase prices, click-through-rate,...
• Ideas
Converting categorical features into numerical ones (e.g., statistics, embedding methods, topic models...)
Converting numerical features into categorical ones (e.g., bucketizing, binary codes, sigmoid transformation...); both directions are sketched below
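To make the two conversion directions concrete, here is a minimal pandas/numpy sketch on a toy DataFrame with one categorical and one numerical column; all column names and values are illustrative, not from the paper.

```python
import numpy as np
import pandas as pd

# Toy data: one categorical feature and one numerical feature (illustrative only).
df = pd.DataFrame({
    "item_id": ["a", "b", "a", "c"],
    "dwell_time": [1.2, 30.5, 4.7, 120.0],
})

# Categorical -> numerical: replace each category with a statistic of a
# numerical column (here, the mean dwell time per item).
df["item_mean_dwell"] = df.groupby("item_id")["dwell_time"].transform("mean")

# Numerical -> categorical: bucketize into equal-frequency bins,
# or squash through a sigmoid.
df["dwell_bucket"] = pd.qcut(df["dwell_time"], q=2, labels=False)
df["dwell_sigmoid"] = 1.0 / (1.0 + np.exp(-df["dwell_time"]))

print(df)
```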
Why we need GB-CENT
Motivations
Two Families of Powerful Practical Data Mining and Machine Learning Tools
• Tree-based Models
Decision Trees, Random Forest, Gradient Boosted Decision Trees…
• Matrix-based Embedding Models
Matrix Factorization, Factorization Machines…
Why we need GB-CENT: Tree-based Models
• Pros:
Interpretability for simple trees
Effectiveness in certain tasks: IR ranking models
Simple and easy to train
Handle numerical features well
…
• Cons:
Need one-hot-encoding to handle categorical features and therefore
cannot easily handle features with large cardinality*
For complex trees, features might appear multiple times in a tree – hard
to explain
…
* Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. KDD '16.
Why we need GB-CENT: Embedding-based Models
• Pros:
Predictive power
Effectiveness in certain tasks: recommender systems
Handle categorical features well through one-hot-encoding
…
• Cons:
Numerical features usually need preprocessing and are hard to handle directly
Hard to interpret in general
…
Why we need GB-CENT
Tree-based models are good at numerical features.
Embedding models are good at categorical features.
Why not combine the two?
What is GB-CENT
What is GB-CENT
In a nutshell, GB-CENT is Gradient Boosted Categorical Embedding and Numerical Trees, which combines
• Factorization Machines (a matrix-based embedding model)
Handle large-cardinality categorical features…
• Gradient Boosted Decision Trees (a tree-based model)
Handle numerical features…
What is GB-CENT
CAT-E (Factorization Machines)
• Bias term for each categorical feature
• Embedding for each categorical feature
• Interactions between meaningful categorical groups
e.g., users, items, age groups, gender...
No numerical features
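To make the CAT-E component concrete, here is a minimal numpy sketch of an FM-style scorer over categorical features only: one bias and one embedding per feature, plus dot products between the embeddings of active features. The paper restricts interactions to meaningful categorical groups; this sketch simplifies to all pairs, and all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 1000   # total cardinality across all categorical fields (illustrative)
dim = 20            # latent dimensionality, as in the experiments

bias = rng.normal(0, 0.01, size=n_features)          # one bias per categorical feature
embed = rng.normal(0, 0.01, size=(n_features, dim))  # one embedding per categorical feature

def cat_e_score(active):
    """FM-style CAT-E score for one instance.

    `active` holds the indices of the categorical features that are "on"
    (e.g., [user_id_index, item_id_index, genre_index]). The score is the
    sum of their biases plus pairwise dot products of their embeddings.
    The paper limits interactions to meaningful groups; we use all pairs.
    """
    s = bias[active].sum()
    for i in range(len(active)):
        for j in range(i + 1, len(active)):
            s += embed[active[i]] @ embed[active[j]]
    return s

print(cat_e_score([3, 42, 777]))
```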
What is GB-CENT
CAT-NT (Gradient Boosted Decision Trees)
• One tree per categorical feature (potentially)
• For each tree, the training data consists of all instances that contain this particular categorical
feature, with their numerical features as input.
No categorical features
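A minimal sketch of the CAT-NT idea with scikit-learn: one shallow regression tree per categorical feature, fit only on that feature's supporting instances, with numerical features as the only input. The residual target and the data here are toy placeholders, not the paper's exact procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy data: each instance has one categorical feature id and two numerical features.
cat_ids = rng.integers(0, 5, size=500)   # e.g., an item_id with 5 values (illustrative)
X_num = rng.normal(size=(500, 2))        # e.g., year, runtime
residual = rng.normal(size=500)          # residual target left over from CAT-E

trees = {}
for c in np.unique(cat_ids):
    support = cat_ids == c               # supporting instances of this feature
    if support.sum() < 50:               # minTreeSupport
        continue
    tree = DecisionTreeRegressor(max_depth=3, min_samples_split=50)  # maxTreeDepth, minNodeSplit
    tree.fit(X_num[support], residual[support])
    trees[c] = tree

def cat_nt_score(c, x_num):
    """Tree output for one instance's categorical feature (a single id here
    for simplicity; in general the outputs of all active features' trees sum)."""
    t = trees.get(c)
    return 0.0 if t is None else float(t.predict(x_num.reshape(1, -1))[0])

print(cat_nt_score(cat_ids[0], X_num[0]))
```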
What is GB-CENT
CAT-E (Factorization Machines)
• generalizes categorical features by embedding them into a low-dimensional space.
CAT-NT (Gradient Boosted Decision Trees)
• memorizes each categorical feature’s peculiarities.
Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado,
Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. Wide & Deep
Learning for Recommender Systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender
Systems (DLRS 2016). ACM, New York, NY, USA, 7-10.
What is GB-CENT
Different from GBDT:
• The number of trees in GB-CENT depends on the cardinality of categorical features in the data set, while GBDT has a
pre-specified number of trees M.
• Each tree in GB-CENT only takes numerical features as input while GBDT takes in both categorical and numerical
features.
• Learning a tree for GBDT uses all N instances in the data set while the tree for a categorical feature in GB-CENT only
involves its supporting instances.
What is GB-CENT
Training GB-CENT:
• Train the CAT-E part first using Stochastic Gradient Descent (SGD)
• Then train the CAT-NT part:
-- 1) Sort categorical features by their support (the number of data instances containing them)
-- 2) Check whether a feature meets minTreeSupport
-- 3) Fit a tree using maxTreeDepth and minNodeSplit
-- 4) Use minTreeGain to decide whether to keep the tree (see the sketch below)
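Putting the two stages together, a minimal end-to-end training sketch under the slide's hyperparameters. The CAT-E SGD step is elided (a zero predictor stands in for it), and the gain test is a plain squared-error reduction, which is an assumption for illustration rather than the paper's exact criterion.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hyperparameters from the slide.
MIN_TREE_SUPPORT, MIN_TREE_GAIN, MIN_NODE_SPLIT, MAX_TREE_DEPTH = 50, 0.0, 50, 3

# Toy data: one categorical feature per instance (illustrative).
cat_ids = rng.integers(0, 5, size=500)
X_num = rng.normal(size=(500, 2))
y = rng.normal(size=500)

# Stage 1: train the CAT-E part with SGD (elided; a zero predictor stands in).
cat_e_pred = np.zeros(500)
residual = y - cat_e_pred

# Stage 2: train the CAT-NT part.
# 1) sort categorical features by support, most supported first
support_counts = Counter(cat_ids)
trees = {}
for c, support in sorted(support_counts.items(), key=lambda kv: -kv[1]):
    # 2) check minTreeSupport (remaining features have even less support)
    if support < MIN_TREE_SUPPORT:
        break
    mask = cat_ids == c
    # 3) fit a tree with maxTreeDepth and minNodeSplit
    tree = DecisionTreeRegressor(max_depth=MAX_TREE_DEPTH, min_samples_split=MIN_NODE_SPLIT)
    tree.fit(X_num[mask], residual[mask])
    # 4) keep the tree only if it reduces squared error by more than minTreeGain
    before = np.square(residual[mask]).sum()
    after = np.square(residual[mask] - tree.predict(X_num[mask])).sum()
    if before - after > MIN_TREE_GAIN:
        trees[c] = tree
        residual[mask] -= tree.predict(X_num[mask])  # boosting: update residuals

print(f"kept {len(trees)} trees")
```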
How does GB-CENT perform
How does GB-CENT perform
• Datasets
MovieLens
Statistics: 240K users, 33K movies, 22M instances, 5-star ratings
Categorical features: user_id, item_id, genre, language, country, grade
Numerical features: year, runTime, imdbVotes, imdbRating, metaScore
RedHat
Statistics: 151K customers, 7 activity categories, 2M instances, binary response
Categorical features: people_id, activity_category
Numerical features: activity characteristics
How does GB-CENT perform
• Datasets
MovieLens
Evaluation Metric: Root Mean Squared Error (RMSE)
RedHat
Evaluation Metric: Area Under the Curve (AUC)
80% train / 10% validation / 10% test split
We also compare empirical training time.
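For reference, a minimal sketch of the two evaluation metrics with scikit-learn; the arrays are toy placeholders, not results from the paper.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

# MovieLens-style task: real-valued ratings -> RMSE
y_true_ratings = np.array([4.0, 3.5, 5.0, 2.0])
y_pred_ratings = np.array([3.8, 3.0, 4.6, 2.5])
rmse = np.sqrt(mean_squared_error(y_true_ratings, y_pred_ratings))

# RedHat-style task: binary response -> AUC
y_true_binary = np.array([1, 0, 1, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4])
auc = roc_auc_score(y_true_binary, y_score)

print(f"RMSE={rmse:.3f}  AUC={auc:.3f}")
```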
How does GB-CENT perform
• Baselines
GB-CENT variants:
1) CAT-E
2) CAT-NT
3) GB-CENT
GBDT variants:
1) GBDT-OH: GBDT + One-hot-encoding for categorical features
2) GBDT-CE: Fit CAT-E first and then feed its output into GBDT
FM variants:
1) FM-S: Transform numerical features by sigmoid and feed into FM
2) FM-D: Transform numerical features by discretizing them and feed into FM
SVDFeature variants:
1) SVDFeature-S: Transform numerical features by sigmoid and feed into SVDFeature
2) SVDFeature-D: Transform numerical features by discretizing them and feed into SVDFeature
Latent dimensionality is 20 for all embedding models. For GB-CENT, minTreeSupport = 50, minTreeGain = 0.0, minNodeSplit = 50 and
maxTreeDepth = 3.
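A minimal sketch of the numerical-feature transforms that the "-S" and "-D" baseline variants rely on. Standardizing before the sigmoid and using equal-frequency buckets are assumptions for illustration; the paper's exact choices may differ.

```python
import numpy as np
import pandas as pd

x = np.array([1.2, 30.5, 4.7, 120.0, 15.0])  # a numerical feature, e.g., dwell time

# "-S" variants: squash each numerical feature into (0, 1) with a sigmoid
# (standardizing first is our assumption), then feed it to FM / SVDFeature.
x_sigmoid = 1.0 / (1.0 + np.exp(-(x - x.mean()) / x.std()))

# "-D" variants: discretize into equal-frequency buckets and treat each
# bucket id as a categorical (one-hot) feature.
x_bucket = pd.qcut(x, q=5, labels=False)

print(x_sigmoid.round(3), x_bucket)
```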
How does GB-CENT perform
[Result tables and charts from the slides are not reproduced in this transcript]
Main takeaway: learn many shallow, small trees
Summary
GB-CENT
• Combines Factorization Machines (handling categorical features) and GBDT (handling numerical features)
• Combines interpretable results with high predictive power
• Achieves high performance on real-world datasets
Questions