AutoML for user segmentation
Ilya Boytsov
Rambler&Co
Rambler&Co: the largest media holding in Russia
About Rambler&Co AdTech
Projects:
• Data management platform (user
segmentation)
• Recommender systems
• “Lumiere” (forecasting offline cinema traffic)
• Computer vision
In this talk:
• DMP and user segmentation tasks explained
• Key components of the AutoML pipeline for user segmentation
• Problems we faced while maintaining the pipeline
• Feature engineering for machine learning at scale
• Optimization of pipeline tasks
Data management platform (DMP): a powerful AdTech
solution
- Collect user behavior data from various sources
- Integrate data to create a complete customer view
- Store and manage audience segments
- Target audience segments in online ad campaigns
Types of data:
1st party data – raw event logs (visited websites)
2nd party data – customer journey data
3rd party data – data collected from partners

Sources of data:
Media resources
Products and services
Data from ad campaigns, behavioral factors
Other sources
DMP AutoML pipeline: solution for any user
segmentation task
About 1,000 models are fitted on a daily basis
Every model is applied to 300 million test samples daily
ML problems:
• binary/multiclass classification
• Look-alike → binary classification (segment vs. random)
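The look-alike framing above can be sketched as follows: members of the target segment become positive examples, and a random sample of other users becomes the negative class. Function name and the negative-sampling ratio are illustrative, not from the talk.

```python
import random

def build_lookalike_train_set(segment_users, all_users, neg_ratio=1.0, seed=0):
    """Label segment members 1 and a random sample of other users 0."""
    rng = random.Random(seed)
    segment = set(segment_users)
    candidates = [u for u in all_users if u not in segment]
    n_neg = min(len(candidates), int(len(segment) * neg_ratio))
    negatives = rng.sample(candidates, n_neg)
    return [(u, 1) for u in segment_users] + [(u, 0) for u in negatives]

rows = build_lookalike_train_set(["a", "b"], ["a", "b", "c", "d", "e", "f"])
```

The resulting labeled pairs feed a standard binary classifier.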
Retargeting_973
Look-alike 0.0-1.0%
Look-alike 1.0-5.0%
Look-alike 5.0-10.0%
Retargeting_1069
Look-alike 0.0-1.0%
Look-alike 1.0-5.0%
Look-alike 5.0-10.0%
Examples: Look-alike modelling boosts CTR
General principles of DMP AutoML
All models share a similar structure of fit and apply stages
Adding models and operating them is done through a web interface
ML developers do not need to support routine key operations
Felix
Backoffice and web
interface for AutoML
pipeline
Create new models, add new segments,
visualize model performance and more
AutoML for user segmentation: how to match millions of users with hundreds of segments every day
AutoML pipeline daily workflow (orchestrated through Felix):
- Compute features
- Create train table
- Train models
- Compute pivots
- Load pivots
- Apply and slice predictions
- Compute metrics
- Load models
Workflow manager: Apache Airflow
• Run a series of tasks as a DAG (directed acyclic graph)
• Express task dependencies
• Handle failures
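The daily workflow above forms such a DAG. A minimal pure-Python sketch of the same dependency-ordering idea using the standard library (the exact edges between tasks are an assumption based on the workflow slide; Airflow expresses them with operators and `>>`):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
deps = {
    "create_train_table": {"compute_features"},
    "train_models": {"create_train_table"},
    "load_models": {"train_models"},
    "compute_pivots": {"train_models"},
    "load_pivots": {"compute_pivots"},
    "apply_and_slice_predictions": {"load_models", "load_pivots"},
    "compute_metrics": {"apply_and_slice_predictions"},
}

# A valid execution order: every task runs after its dependencies.
order = list(TopologicalSorter(deps).static_order())
```

Airflow additionally schedules the DAG on an interval and retries failed tasks.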
Train and apply DAGs
Train DAG interval: every 4 hours
Apply DAG interval: every hour
Problem:
Some target segments (labels) finish computing slower than others.
Solution:
While some models wait for their target segments, the other models keep training
Key problems we faced
• Data collection delay
• Out of memory issues
• High cardinality feature matrices
• Too much time to map predictions to label thresholds
• Some models are being applied more often than others
Data collection delay: do not wait too much
• Use an Airflow sensor to wait up to MAX_FEATURE_DELAY
• If exceeded, fill the missing parts of the features table with the last computed day
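The fallback can be sketched as follows (function and table names are illustrative): days whose feature partition has not arrived yet are filled with the most recent computed partition.

```python
from datetime import date, timedelta

def fill_missing_partitions(partitions, start, end):
    """Return a features table per day; days without a computed
    partition fall back to the last computed day."""
    filled, last = {}, None
    day = start
    while day <= end:
        if day in partitions:
            last = partitions[day]
        filled[day] = last  # carries the last computed partition forward
        day += timedelta(days=1)
    return filled

parts = {date(2020, 1, 1): "t1", date(2020, 1, 2): "t2"}
result = fill_missing_partitions(parts, date(2020, 1, 1), date(2020, 1, 4))
```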
Feature Engineering (FE): overcoming high-cardinality feature matrices
Main rule:
New features must be applicable to a majority of models
Key techniques
• Counting based FE
• Distance based FE
Feature matrix of shape (N, 10000):
id   Feature_1  ...  Feature_10000
1    42         ...  542
...  ...        ...  ...
N    89         ...  0
Distance based FE: Cluster distance
Algorithm:
1) Reduce the dimensionality of the feature matrix if needed (we use SVD decomposition)
2) Fit a KMeans clustering algorithm with K clusters on the given data
3) Calculate the distance from each sample point to each of the K cluster centroids
4) Use the distances as the feature representation of the sample row
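A minimal sketch of steps 2-4, assuming a toy Lloyd's-iteration KMeans in plain NumPy (in the pipeline a library implementation, plus the SVD step, would be used):

```python
import numpy as np

def kmeans_fit(X, k, iters=20, seed=0):
    """A toy KMeans: returns a (k, d) matrix of centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each sample to its nearest centroid, then recompute means.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def cluster_distance_features(X, centroids):
    """Shape (N, K): distance from each sample to each centroid."""
    return np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
F = cluster_distance_features(X, kmeans_fit(X, k=2))
```

The distance matrix F replaces the raw high-cardinality features as model input.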
Feature matrix of shape (N, K):
id   dist_to_1st_cluster  ...  dist_to_Kth_cluster
1    0.6757               ...  0.0942
...  ...                  ...  ...
N    0.342                ...  0.6113
Problem:
It may take a long time for KMeans to converge and compute distances for every model…
Solution: “Global” cluster distance feature
• Fit KMeans only once on a representative unlabeled sample to extract general information, and use it for all models
Experimental results:
• Replacing per-model distance features with the “global” feature does not harm model quality
• Combining both feature representations improves the ROC AUC score by about 1%
Counting based FE
Traditional approaches:
1) Feature hashing
2) One-hot encoding

User  Domain            Count
Bob   news.rambler.ru   5
Bob   auto.ru           11
Bob   mercedes-benz.ru  15

General problems:
1) High cardinality
2) Efficient only with linear models
Counting based FE: DRACULA
Domain Robust Algorithm for Counting Based Learning
Source: http://www.slideshare.net/SessionsEvents/misha-bilenko-principal-researcher-microsoft
Algorithm
01 Compute a counts table from all train data
02 Compute P(label | feature) for every unique feature
03 Aggregate the list of probabilities to get a low-cardinality data representation
Counts of visited domains for a single user:
User  Domain            Count
Bob   news.rambler.ru   5
Bob   auto.ru           11
Bob   mercedes-benz.ru  15

Total counts in train data:
Domain            Count
news.rambler.ru   95859
auto.ru           31040
mercedes-benz.ru  1386
Counts table:
Domain            Total count  Count(label=0)  Count(label=1)
news.rambler.ru   95859        41268           54591
auto.ru           31040        26809           4231
mercedes-benz.ru  1386         1120            266
Total             128285       69131           59154
General formula: probability of label given domain

N_label = frequency of the domain with the given label
N_pos_aprior = smoothing constant * prior class probability
N_neg_aprior = 1 - N_pos_aprior
N = N_pos + N_neg

P(label | domain) = (N_label + N_pos_aprior) / (N + N_pos_aprior + N_neg_aprior)
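Plugging the counts table into the smoothed estimate can be sketched as follows; the smoothing constant of 2 and prior of 0.5 are illustrative choices, not values from the talk.

```python
def p_label_given_domain(n_label, n_total, smoothing=2.0, prior=0.5):
    """Smoothed estimate of P(label | domain) from the counts table."""
    n_pos_aprior = smoothing * prior   # pseudo-count for the positive label
    n_neg_aprior = 1.0 - n_pos_aprior  # as defined on the slide
    return (n_label + n_pos_aprior) / (n_total + n_pos_aprior + n_neg_aprior)

# Counts table from the slides, for label = 0
p_news = p_label_given_domain(41268, 95859)
p_auto = p_label_given_domain(26809, 31040)
p_mb = p_label_given_domain(1120, 1386)
```

These reproduce the per-domain probabilities used on the following slide (0.43, 0.86, 0.81 after rounding).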
Compute data representation for a single user:
P(label=0 | domain = “news.rambler.ru”) = 0.43
P(label=0 | domain = “auto.ru”) = 0.86
P(label=0 | domain = “mercedes-benz.ru”) = 0.81

N = 10
Bins = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Histogram = [0, 0, 0, 0, 1, 0, 0, 0, 2, 0]

Interpretation: the concentration of nonzero elements in the histogram represents an estimate of P(label | user)
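The binning step above can be sketched as follows (function name is illustrative):

```python
def probability_histogram(probs, n_bins=10):
    """Bin per-feature probabilities into a fixed-length histogram,
    giving a low-cardinality representation of one user."""
    hist = [0] * n_bins
    for p in probs:
        # p = 1.0 falls into the last bin rather than overflowing.
        hist[min(int(p * n_bins), n_bins - 1)] += 1
    return hist

hist = probability_histogram([0.43, 0.86, 0.81])
```

The fixed length of the histogram is what makes the representation usable across models, regardless of how many domains a user visited.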
Advantages of the algorithm
• Scalable (add new features, recompute probabilities)
• Adaptive (works for binary and multiclass classification as well as regression)
• Efficient for gradient boosting decision trees due to low cardinality
• Features can be computed in a distributed manner (MapReduce)
• The counts table can be stored with a count-min sketch
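The last point can be illustrated with a toy count-min sketch; the width, depth, and hash construction below are illustrative, not the pipeline's actual parameters.

```python
import hashlib

class CountMinSketch:
    """Approximate counts in fixed memory: queries may overestimate
    a count (due to hash collisions) but never underestimate it."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # A distinct hash per row, derived by salting the key.
        h = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def query(self, key):
        # The minimum across rows bounds the collision noise.
        return min(self.table[row][self._index(key, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
cms.add("news.rambler.ru", 95859)
cms.add("auto.ru", 31040)
```

Memory stays O(width × depth) no matter how many distinct domains are counted.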
Compute models pivots task
Approach to approximate label thresholds
Problems:
• How to select thresholds for labels?
• How to do it computationally fast?
Desired solution:
• Heuristics to approximate label thresholds
Compute models pivots task
Approach to approximate label thresholds
Precision vs. threshold for binary classification:
what probability threshold optimizes the given quality metric?
Compute models pivots task
Approach to approximate label thresholds
Algorithm:
• Take a sample of the apply data (we use 5%, about 15 million samples)
• Compute a probabilities histogram for this sample
• Use the Nth percentile as an estimate of the label threshold
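The percentile heuristic can be sketched as follows, assuming a plain sort-based percentile (names are illustrative):

```python
def percentile_threshold(probs, percentile):
    """Use the Nth percentile of predicted probabilities on a sample
    of apply data as the label threshold."""
    ranked = sorted(probs)
    idx = min(int(len(ranked) * percentile / 100.0), len(ranked) - 1)
    return ranked[idx]

# e.g. keep roughly the top 10% highest-scored users
probs = [i / 100.0 for i in range(100)]  # 0.00 .. 0.99
threshold = percentile_threshold(probs, 90)
```

Sorting a 5% sample is far cheaper than sweeping thresholds over the full 300 million apply rows.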
Apply model task
Task interval: every hour
Number of models per run: 200
General problem:
• Some models are being applied more often than others
Priority scheme for applying models
1) Request all models
2) Filter out models that are not yet trained
3) Sort by the date the model was added (descending)
4) Sort by the date of last apply (ascending)
5) Take the N top-priority models
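The five steps can be sketched as follows (field names are illustrative); two stable sorts apply the secondary key first, so the final order is longest-unapplied first, with newer models breaking ties.

```python
def prioritize(models, n):
    """Pick the n models to apply next, per the priority scheme."""
    trained = [m for m in models if m.get("trained")]
    # Secondary key first: date added, descending.
    trained.sort(key=lambda m: m["added"], reverse=True)
    # Primary key last: date of last apply, ascending (stable sort).
    trained.sort(key=lambda m: m["last_apply"])
    return trained[:n]

models = [
    {"id": 1, "trained": True, "added": 3, "last_apply": 5},
    {"id": 2, "trained": True, "added": 1, "last_apply": 2},
    {"id": 3, "trained": False, "added": 4, "last_apply": 0},
    {"id": 4, "trained": True, "added": 2, "last_apply": 2},
]
chosen = prioritize(models, 2)
```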
Key notes:
1) Think of a scalable approach
2) Implement monitoring of pipeline performance
3) Run experiments
In lieu of conclusion...
Pipeline automation...is full of fun
Questions? Contact me!
https://www.facebook.com/ieboytsov
i.boytsov@rambler-co.ru