AutoML for user segmentation
Ilya Boytsov
Rambler&Co
Rambler&Co: the largest media holding in Russia
About Rambler&Co AdTech
Projects:
• Data management platform (user
segmentation)
• Recommender systems
• “Lumiere” (forecasting offline cinema traffic)
• Computer vision
In this talk:
• DMP and user segmentation tasks explained
• Key components of the AutoML pipeline for user segmentation
• Problems we faced while maintaining the pipeline
• Feature engineering for machine learning at scale
• Optimization of pipeline tasks
Data management platform (DMP): a powerful AdTech
solution
- Collect user behavior data from various sources
- Integrate data to create a complete customer view
- Store and manage audience segments
- Target audience segments in online ad campaigns
Types of data:
1st party data – raw event logs (visited websites)
2nd party data – customer journey data
3rd party data – data collected from partners

Sources of data:
Media resources
Products and services
Data from ad campaigns, behavioral factors
Other sources
DMP AutoML pipeline: solution for any user
segmentation task
About 1,000 models are fitted on a daily basis
Every model is applied to 300 million test samples daily
ML problems:
• binary/multiclass classification
• Look-alike → binary classification (segment vs. random)
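The look-alike framing above can be sketched as follows: members of the target segment become positive examples, and a random sample of other users becomes the negative class. Function name and the negative-sampling ratio are illustrative, not from the talk.

```python
import random

def build_lookalike_train_set(segment_users, all_users, neg_ratio=1.0, seed=0):
    """Label segment members 1 and a random sample of other users 0."""
    rng = random.Random(seed)
    segment = set(segment_users)
    candidates = [u for u in all_users if u not in segment]
    n_neg = min(len(candidates), int(len(segment) * neg_ratio))
    negatives = rng.sample(candidates, n_neg)
    return [(u, 1) for u in segment_users] + [(u, 0) for u in negatives]

rows = build_lookalike_train_set(["a", "b"], ["a", "b", "c", "d", "e", "f"])
```

The resulting labeled pairs feed a standard binary classifier.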
Retargeting_973
Look-alike 0.0-1.0%
Look-alike 1.0-5.0%
Look-alike 5.0-10.0%
Retargeting_1069
Look-alike 0.0-1.0%
Look-alike 1.0-5.0%
Look-alike 5.0-10.0%
Examples: Look-alike modelling boosts CTR
General principles of DMP AutoML
All models share a similar structure of fit and apply stages
Adding models and operating them is done through a web interface
ML developers do not need to support routine key operations
Felix
Backoffice and web
interface for AutoML
pipeline
Create new models, add new segments,
visualize model performance and more
AutoML for user segmentation: how to match millions of users with hundreds of segments every day
AutoML pipeline daily workflow (orchestrated through Felix):
- Compute features
- Create train table
- Train models
- Compute pivots
- Load pivots
- Apply and slice predictions
- Compute metrics
- Load models
Workflow manager: Apache Airflow
• Run a series of tasks as a DAG (directed acyclic graph)
• Express task dependencies
• Handle failures
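The daily workflow above forms such a DAG. A minimal pure-Python sketch of the same dependency-ordering idea using the standard library (the exact edges between tasks are an assumption based on the workflow slide; Airflow expresses them with operators and `>>`):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
deps = {
    "create_train_table": {"compute_features"},
    "train_models": {"create_train_table"},
    "load_models": {"train_models"},
    "compute_pivots": {"train_models"},
    "load_pivots": {"compute_pivots"},
    "apply_and_slice_predictions": {"load_models", "load_pivots"},
    "compute_metrics": {"apply_and_slice_predictions"},
}

# A valid execution order: every task runs after its dependencies.
order = list(TopologicalSorter(deps).static_order())
```

Airflow additionally schedules the DAG on an interval and retries failed tasks.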
Train and apply DAGs
Train DAG interval: every 4 hours
Apply DAG interval: every hour
Problem:
Some target segments (labels) finish computing slower than others.
Solution:
While some models wait for their target segments, the other models keep training
Key problems we faced
• Data collection delay
• Out of memory issues
• High cardinality feature matrices
• Too much time to map predictions to label thresholds
• Some models are being applied more often than others
Data collection delay: do not wait too much
• Use an Airflow sensor to wait up to MAX_FEATURE_DELAY
• If exceeded, fill the missing parts of the features table with the last computed day
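The fallback can be sketched as follows (function and table names are illustrative): days whose feature partition has not arrived yet are filled with the most recent computed partition.

```python
from datetime import date, timedelta

def fill_missing_partitions(partitions, start, end):
    """Return a features table per day; days without a computed
    partition fall back to the last computed day."""
    filled, last = {}, None
    day = start
    while day <= end:
        if day in partitions:
            last = partitions[day]
        filled[day] = last  # carries the last computed partition forward
        day += timedelta(days=1)
    return filled

parts = {date(2020, 1, 1): "t1", date(2020, 1, 2): "t2"}
result = fill_missing_partitions(parts, date(2020, 1, 1), date(2020, 1, 4))
```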
Feature Engineering (FE): overcoming high-cardinality feature matrices
Main rule:
New features must be applicable to a majority of models
Key techniques
• Counting based FE
• Distance based FE
Feature matrix of shape (N, 10000):
id   Feature_1  ...  Feature_10000
1    42         ...  542
...  ...        ...  ...
N    89         ...  0
Distance based FE: Cluster distance
Algorithm:
1) Reduce the dimensionality of the feature matrix if needed (we use SVD decomposition)
2) Fit a KMeans clustering algorithm with K clusters on the given data
3) Calculate the distance from each sample point to each of the K cluster centroids
4) Use the distances as the feature representation of the sample row
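A minimal sketch of steps 2-4, assuming a toy Lloyd's-iteration KMeans in plain NumPy (in the pipeline a library implementation, plus the SVD step, would be used):

```python
import numpy as np

def kmeans_fit(X, k, iters=20, seed=0):
    """A toy KMeans: returns a (k, d) matrix of centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each sample to its nearest centroid, then recompute means.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def cluster_distance_features(X, centroids):
    """Shape (N, K): distance from each sample to each centroid."""
    return np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
F = cluster_distance_features(X, kmeans_fit(X, k=2))
```

The distance matrix F replaces the raw high-cardinality features as model input.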
Feature matrix of shape (N, K):
id   dist_to_1st_cluster  ...  dist_to_Kth_cluster
1    0.6757               ...  0.0942
...  ...                  ...  ...
N    0.342                ...  0.6113
Problem:
It may take a long time for KMeans to converge and compute distances for every model…
Solution: “Global” cluster distance feature
• Fit KMeans only once on a representative unlabeled sample to extract general information, and use it for all models
Experimental results:
• Replacing per-model distance features with the “global” feature does not harm model quality
• Combining both feature representations improves the ROC AUC score by about 1%
Counting based FE
Traditional approaches:
1) Feature hashing
2) One-hot encoding

User  Domain            Count
Bob   news.rambler.ru   5
Bob   auto.ru           11
Bob   mercedes-benz.ru  15

General problems:
1) High cardinality
2) Efficient only with linear models
Counting based FE: DRACULA
Domain Robust Algorithm for Counting Based Learning
Source: http://www.slideshare.net/SessionsEvents/misha-bilenko-principal-researcher-microsoft
Algorithm
01 Compute a counts table from all train data
02 Compute P(label | feature) for every unique feature
03 Aggregate the list of probabilities to get a low-cardinality data representation
Counts of visited domains for a single user:
User  Domain            Count
Bob   news.rambler.ru   5
Bob   auto.ru           11
Bob   mercedes-benz.ru  15

Total counts in train data:
Domain            Count
news.rambler.ru   95859
auto.ru           31040
mercedes-benz.ru  1386
Counts table:
Domain            Total count  Count(label=0)  Count(label=1)
news.rambler.ru   95859        41268           54591
auto.ru           31040        26809           4231
mercedes-benz.ru  1386         1120            266
Total             128285       69131           59154
General formula: probability of label given domain

N_label = frequency of the domain with the given label
N_pos_aprior = smoothing constant * prior class probability
N_neg_aprior = 1 - N_pos_aprior
N = N_pos + N_neg

P(label | domain) = (N_label + N_pos_aprior) / (N + N_pos_aprior + N_neg_aprior)
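Plugging the counts table into the smoothed estimate can be sketched as follows; the smoothing constant of 2 and prior of 0.5 are illustrative choices, not values from the talk.

```python
def p_label_given_domain(n_label, n_total, smoothing=2.0, prior=0.5):
    """Smoothed estimate of P(label | domain) from the counts table."""
    n_pos_aprior = smoothing * prior   # pseudo-count for the positive label
    n_neg_aprior = 1.0 - n_pos_aprior  # as defined on the slide
    return (n_label + n_pos_aprior) / (n_total + n_pos_aprior + n_neg_aprior)

# Counts table from the slides, for label = 0
p_news = p_label_given_domain(41268, 95859)
p_auto = p_label_given_domain(26809, 31040)
p_mb = p_label_given_domain(1120, 1386)
```

These reproduce the per-domain probabilities used on the following slide (0.43, 0.86, 0.81 after rounding).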
Compute data representation for a single user:
P(label=0 | domain = “news.rambler.ru”) = 0.43
P(label=0 | domain = “auto.ru”) = 0.86
P(label=0 | domain = “mercedes-benz.ru”) = 0.81

N = 10
Bins = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Histogram = [0, 0, 0, 0, 1, 0, 0, 0, 2, 0]

Interpretation: the concentration of nonzero elements in the histogram represents an estimate of P(label | user)
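The binning step above can be sketched as follows (function name is illustrative):

```python
def probability_histogram(probs, n_bins=10):
    """Bin per-feature probabilities into a fixed-length histogram,
    giving a low-cardinality representation of one user."""
    hist = [0] * n_bins
    for p in probs:
        # p = 1.0 falls into the last bin rather than overflowing.
        hist[min(int(p * n_bins), n_bins - 1)] += 1
    return hist

hist = probability_histogram([0.43, 0.86, 0.81])
```

The fixed length of the histogram is what makes the representation usable across models, regardless of how many domains a user visited.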
Advantages of the algorithm
• Scalable (add new features, recompute probabilities)
• Adaptive (works for binary and multiclass classification as well as regression)
• Efficient for gradient boosting decision trees due to low cardinality
• Features can be computed in a distributed manner (MapReduce)
• The counts table can be stored with a count-min sketch
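The last point can be illustrated with a toy count-min sketch; the width, depth, and hash construction below are illustrative, not the pipeline's actual parameters.

```python
import hashlib

class CountMinSketch:
    """Approximate counts in fixed memory: queries may overestimate
    a count (due to hash collisions) but never underestimate it."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # A distinct hash per row, derived by salting the key.
        h = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def query(self, key):
        # The minimum across rows bounds the collision noise.
        return min(self.table[row][self._index(key, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
cms.add("news.rambler.ru", 95859)
cms.add("auto.ru", 31040)
```

Memory stays O(width × depth) no matter how many distinct domains are counted.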
Compute models pivots task
Approach to approximate label thresholds
Problems:
• How to select thresholds for labels?
• How to do it computationally fast?
Desired solution:
• Heuristics to approximate label thresholds
Compute models pivots task
Approach to approximate label thresholds
Precision vs. threshold for binary classification:
what probability threshold optimizes the given quality metric?
Compute models pivots task
Approach to approximate label thresholds
Algorithm:
• Take a sample of the apply data (we use 5%, about 15 million samples)
• Compute a probabilities histogram for this sample
• Use the Nth percentile as an estimate of the label threshold
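The percentile heuristic can be sketched as follows, assuming a plain sort-based percentile (names are illustrative):

```python
def percentile_threshold(probs, percentile):
    """Use the Nth percentile of predicted probabilities on a sample
    of apply data as the label threshold."""
    ranked = sorted(probs)
    idx = min(int(len(ranked) * percentile / 100.0), len(ranked) - 1)
    return ranked[idx]

# e.g. keep roughly the top 10% highest-scored users
probs = [i / 100.0 for i in range(100)]  # 0.00 .. 0.99
threshold = percentile_threshold(probs, 90)
```

Sorting a 5% sample is far cheaper than sweeping thresholds over the full 300 million apply rows.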
Apply model task
Task interval: every hour
Number of models per run: 200
General problem:
• Some models are being applied more often than others
Priority scheme for applying models
1) Request all models
2) Filter out models that are not yet trained
3) Sort by the date the model was added (descending)
4) Sort by the date of last apply (ascending)
5) Take the N top-priority models
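The five steps can be sketched as follows (field names are illustrative); two stable sorts apply the secondary key first, so the final order is longest-unapplied first, with newer models breaking ties.

```python
def prioritize(models, n):
    """Pick the n models to apply next, per the priority scheme."""
    trained = [m for m in models if m.get("trained")]
    # Secondary key first: date added, descending.
    trained.sort(key=lambda m: m["added"], reverse=True)
    # Primary key last: date of last apply, ascending (stable sort).
    trained.sort(key=lambda m: m["last_apply"])
    return trained[:n]

models = [
    {"id": 1, "trained": True, "added": 3, "last_apply": 5},
    {"id": 2, "trained": True, "added": 1, "last_apply": 2},
    {"id": 3, "trained": False, "added": 4, "last_apply": 0},
    {"id": 4, "trained": True, "added": 2, "last_apply": 2},
]
chosen = prioritize(models, 2)
```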
Key notes:
1) Think of a scalable approach
2) Implement monitoring of pipeline performance
3) Run experiments
In lieu of conclusion...
Pipeline automation...is full of fun
Questions? Contact me!
https://www.facebook.com/ieboytsov
i.boytsov@rambler-co.ru