Andrii Belas "Modern approaches to working with categorical data in machine learning"
Andrii Belas, AI Solution Architect, SMART business
• Machine learning expert and public speaker.
• Founder and mentor of the SMART Data Science Academy; responsible for the technical development of the data science team and the architecture of all SMART business data science projects.
• Microsoft Certified Professional in:
  • Big Data and Advanced Analytics
  • Cloud Data Science with Azure Machine Learning
  • Developing SQL Data Models
Experience:
• Deep Learning
• Computer Vision
• AI in Forecasting
• AI in Marketing
• Risk management
• Business Intelligence
Learn more:
http://smart-it.com/ai
http://smart-it.com/ru/ai
Agenda
1. Overview
2. Machine Learning approach
3. Deep Learning approach
4. Some software and practical examples
Tabular data
• The basic type of data: spreadsheets, relational databases, financial reports…
• Credit scoring
• Pricing
• Recommendation systems
• Sales forecasting
• Customer churn
• Fraud detection
Let’s assume that preparation is done
• Business Understanding: identify project objectives
• Data Understanding: collect and review data
• Data Preparation: select and cleanse data
• Modelling: manipulate data and draw conclusions
• Evaluation: evaluate model and conclusions
• Deployment: apply conclusions to business
Feature types
• Numeric – can be any number (age, salary, …)
• Categorical – takes one value from a small set of possibilities (gender, occupation)
• Other types (text, images, audio, …)
Let’s start modeling…
• But machine learning models can (mostly) only learn from numeric values
• Random forest – supports only a limited number of categories
• XGBoost – numeric input only
• So should we drop non-numeric features?
• No! They potentially have predictive power.
Machine Learning: Classical
Does not really rely on the data
Machine Learning: Classical
Assumes that frequency is important
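The figure for this slide is not preserved. Assuming it showed frequency (count) encoding, a minimal sketch in plain Python could look like the following; the `cities` column and the function name `frequency_encode` are hypothetical, not from the talk.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each category with its relative frequency in `values`."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


# Hypothetical toy column, not from the talk.
cities = ["Kyiv", "Lviv", "Kyiv", "Odesa", "Kyiv", "Lviv"]
encoded = frequency_encode(cities)
```

The encoding only helps if how often a category occurs is actually related to the target, and categories with the same frequency become indistinguishable.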
Machine Learning: Modern
• Label encoding imposes an arbitrary order that has no correlation with the target
• Trees struggle with high-cardinality categorical variables because tree depth is limited
• We want to use the target to generate features – target encoding
• Mean encoding is the most common variant
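As an illustration of mean (target) encoding, here is a minimal sketch in plain Python: each category is replaced by the average target value observed for it. The toy `occupation` and `churned` data and the helper name are assumptions made for the example.

```python
from collections import defaultdict


def mean_target_encode(categories, targets):
    """Map each category to the mean target over the rows that carry it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    # Return the encoded column and the mapping (to reuse on new data).
    return [means[cat] for cat in categories], means


# Hypothetical toy data: binary target, so the mean is the positive rate.
occupation = ["nurse", "driver", "nurse", "clerk", "driver", "nurse"]
churned = [1, 0, 1, 0, 1, 0]
encoded, mapping = mean_target_encode(occupation, churned)
```

Note that "clerk" is encoded from a single row, which is exactly the situation where this naive version overfits.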
Machine Learning: Modern
For classification
Machine Learning: Modern
• Easy to overfit; use regularization or specialized packages
• Try some modern libraries with built-in encodings:
  • LightGBM
  • CatBoost
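One common way to regularize mean encoding is additive smoothing, which pulls the estimate for rare categories toward the global mean. The sketch below is only one such option and not necessarily the one meant in the talk; the smoothing weight `m` and the toy data are assumptions.

```python
def smoothed_target_encode(categories, targets, m=5.0):
    """Mean encoding shrunk toward the global mean.

    encoding = (sum_of_targets + m * global_mean) / (count + m)
    A larger m means stronger shrinkage for small categories.
    """
    global_mean = sum(targets) / len(targets)
    stats = {}
    for cat, y in zip(categories, targets):
        total, count = stats.get(cat, (0.0, 0))
        stats[cat] = (total + y, count + 1)
    return {
        cat: (total + m * global_mean) / (count + m)
        for cat, (total, count) in stats.items()
    }


# Hypothetical toy data: "a" appears once, "b" four times.
cats = ["a", "b", "b", "b", "b"]
ys = [1, 0, 0, 1, 0]
enc = smoothed_target_encode(cats, ys, m=1.0)
```

Here the raw mean for "a" would be 1.0 from a single row; with smoothing it moves toward the global mean of 0.4. Computing the encoding out-of-fold (on rows not used to fit it) is another frequently used safeguard.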
What about Neural Networks?
Deep Learning: Classical
Does not rely on the data; produces very sparse matrices
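The figure for this slide is missing. Assuming it showed one-hot encoding, the sparsity can be seen in a small plain-Python sketch (the `cities` column is hypothetical):

```python
def one_hot_encode(values):
    """Return one 0/1 row per value, with one column per distinct category."""
    vocab = sorted(set(values))
    index = {v: i for i, v in enumerate(vocab)}
    rows = []
    for v in values:
        row = [0] * len(vocab)
        row[index[v]] = 1
        rows.append(row)
    return rows, vocab


# Hypothetical toy column, not from the talk.
cities = ["Kyiv", "Lviv", "Odesa", "Dnipro", "Kyiv"]
matrix, vocab = one_hot_encode(cities)

# Share of non-zero cells: exactly one per row, so 1 / number_of_categories.
nonzero_share = sum(map(sum, matrix)) / (len(matrix) * len(vocab))
```

With thousands of categories the non-zero share becomes tiny, and the columns say nothing about how categories relate to each other or to the target.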
Embeddings
Deep Learning: Embedding
• Inspired by NLP (word embeddings, word2vec), but not yet covered in textbooks
• We will use an embedding layer to handle categorical variables!
• This approach allows relationships between categories to be captured
• There may be patterns for cities that are geographically near each other, and for cities of similar socio-economic status, etc.
• Much lower dimensionality
Deep Learning: Word2Vec
• Note the difference between the first two rows and the rest
• The first dimension captures something related to being a dog, and the second dimension captures youthfulness
• We definitely won’t one-hot encode a vocabulary today
Deep Learning: Embedding
• Much smaller
• Learned from data
• Latent (hidden) features – can be visualized
• Can then be reused as pretrained embeddings (for shops, for example) – transfer learning
Embeddings - practical
Deep Learning: Embedding only
• Doesn’t cover interactions with other variables
• Multiple categorical variables can cause problems
• Solution – multimodal (multi-input) neural networks!
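The multi-input idea can be sketched as a forward pass in plain Python: each categorical input gets its own embedding table, the looked-up vectors are concatenated with the numeric features, and a shared dense layer sees everything together. All names, sizes, and the random (untrained) weights below are assumptions for illustration; in practice such a model would typically be built and trained with a framework such as the Keras functional API.

```python
import random

random.seed(0)  # reproducible toy weights


def make_embedding(n_categories, dim):
    """A table with one small random vector per category (untrained)."""
    return [
        [random.uniform(-0.1, 0.1) for _ in range(dim)]
        for _ in range(n_categories)
    ]


def dense(inputs, weights, bias):
    """One linear unit: dot product plus bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


# Hypothetical inputs: 4 cities and 10 shops, plus two numeric features.
city_emb = make_embedding(n_categories=4, dim=2)
shop_emb = make_embedding(n_categories=10, dim=3)


def forward(city_id, shop_id, numeric_features, weights, bias):
    # Each categorical input is looked up in its own embedding table...
    merged = city_emb[city_id] + shop_emb[shop_id] + list(numeric_features)
    # ...and the concatenated vector feeds a shared dense layer, so the
    # layer can model interactions between all inputs.
    return dense(merged, weights, bias)


n_inputs = 2 + 3 + 2  # two embedding widths plus two numeric features
weights = [random.uniform(-1, 1) for _ in range(n_inputs)]
prediction = forward(
    city_id=1, shop_id=7, numeric_features=[0.5, 1.2],
    weights=weights, bias=0.0,
)
```

In a real network the embedding tables and the dense weights are learned jointly by backpropagation rather than drawn at random.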
Multimodal learning
OpenAI Five Model Architecture
Deep Learning: Embeddings
• Embedding – looking something up in an array (an array lookup is mathematically identical to a matrix product with a one-hot encoded matrix, but much more efficient)
• Embeddings are amazing!
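The equivalence between an array lookup and a one-hot matrix product can be checked with a small plain-Python sketch. The 3×2 `embedding` table holds made-up numbers and the helper names are hypothetical.

```python
def one_hot(index, size):
    """Row vector with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat(vec, matrix):
    """Row vector times matrix, written out explicitly."""
    n_cols = len(matrix[0])
    return [
        sum(vec[i] * matrix[i][j] for i in range(len(matrix)))
        for j in range(n_cols)
    ]


# Made-up embedding table: 3 categories, 2 dimensions each.
embedding = [
    [0.9, 0.1],
    [0.8, 0.7],
    [0.1, 0.2],
]

category_id = 1
by_lookup = embedding[category_id]  # direct indexing
by_matmul = vec_mat(one_hot(category_id, len(embedding)), embedding)
```

Both routes select the same row; the lookup simply skips all the multiplications by zero.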
Useful links
• https://ru.coursera.org/learn/competitive-data-science - Coursera course on modern ML
• http://contrib.scikit-learn.org/categorical-encoding/ - Python package
• https://github.com/bfgray3/cattonum - R library
• https://keras.io/getting-started/functional-api-guide/ - Keras functional API
• https://www.kaggle.com/colinmorris/embedding-layers - full example
Questions?