Semi-Supervised Learning In An Adversarial Environment

Uber Risk Team
Semi-supervised Learning In An
Adversarial Environment
Karthik Ramasamy
Gaurav Agarwal

About us
• Data Science Manager and founding member of Account Security and Privacy
• Previously:
‐ 2 years at my own startup
‐ 4.5 years at LinkedIn (founding member of the security team)
• Interests:
‐ Applying ML, recently deep learning, in security and fraud problems
‐ Building scalable infra for ML systems
Karthik Ramasamy

About us
• Senior engineer and founding member of Account security team
• 2 years at Uber working on fraud problems
• Previously:
‐ 4 years at Microsoft working on NLP Question answering systems
• Interests:
‐ Intersection of distributed systems and ML
‐ Natural language processing
Gaurav Agarwal

Focus of the talk
• Data science (DS) algorithms and process
• Features
‐ Not covered in this talk
‐ Arbitrary names will be used to describe features
• Engineering architecture
• Engineering challenges

Semi-Supervised Learning
Classification Clustering

Adversarial Problem
Not-Hotdog Classifier vs. Fake Account Classifier

Adversarial Problem
Plots: recall for a specific model; % of null features

Account Takeovers (ATO)
Diagram: attack types placed along an axis from "Easy to detect" to "Hard to detect"
• Single IP
• IPs in same class C/B
• Targeted attacks on specific accounts
• Phishing and malware attacks
• Massive botnet with >100K IPs
• Proxy IPs across the world

Clustering: K-Means
Credits: www.naftaliharris.com
https://youtu.be/vY5q20NXVJM

Clustering: DBSCAN
Credits: www.naftaliharris.com
https://youtu.be/h53WMIImUuc

Clustering Algorithm: DBSCAN
1. Method 1 - use labels to tune hyperparameters
a. Solves the hyperparameter tuning issue
2. Method 2 - use labels as constraints in the clustering algorithm
a. Challenges
i. Working with weak labels
ii. Scalability
Semi-Supervised Clustering Approach
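
Method 1 can be illustrated with a small sketch. This is not the speakers' code: the grid values, the synthetic data, the function name `tune_dbscan`, and the choice of adjusted Rand index as the score are all assumptions made for illustration. The idea is to cluster every point, then score each hyperparameter setting only on the few points that carry a label.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score


def tune_dbscan(X, labels, eps_grid, min_samples_grid):
    """Grid-search DBSCAN, scoring only on points that carry a label.

    `labels` uses -1 for "unlabeled"; any other integer is a (weak) class label.
    Returns (best_score, best_eps, best_min_samples).
    """
    labeled = labels != -1
    best = None
    for eps in eps_grid:
        for min_samples in min_samples_grid:
            # Cluster all points, labeled or not.
            assignments = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            # Compare cluster ids with known labels on the labeled subset.
            # DBSCAN noise (-1) is treated as just another group here.
            score = adjusted_rand_score(labels[labeled], assignments[labeled])
            if best is None or score > best[0]:
                best = (score, eps, min_samples)
    return best


# Synthetic stand-in data: two tight blobs, with 10 labeled points in each.
rng = np.random.default_rng(0)
good = rng.normal(loc=0.0, scale=0.05, size=(100, 2))
bad = rng.normal(loc=1.0, scale=0.05, size=(100, 2))
X = np.vstack([good, bad])
labels = np.full(200, -1)
labels[:10] = 0
labels[100:110] = 1

score, eps, min_samples = tune_dbscan(
    X, labels, eps_grid=[0.01, 0.1, 0.3], min_samples_grid=[5, 10]
)
print(score, eps, min_samples)
```

Scoring on the labeled subset keeps the clustering itself unsupervised, so unlabeled attack patterns can still form their own clusters.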

Login Clusters 2016
*PCA used for visualization

Login Clusters 2017
*PCA used for visualization

ML Challenges
• Feature Selection
‐ Manual feature selection
‐ Aggregate features are better
‐ Feature normalization is very important
▪ Features like trip fare and #trips are bad features
▪ % of UberX trips for a user is a good feature
▪ All features having same scale like % of X
• Feature evolution in adversarial environment
• Scalability
‐ DBSCAN on a large dataset (tens of millions of points) takes a long time to fit
• No online DBSCAN
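
The normalization point above can be sketched as follows. The helper name `trip_share_features` and the product list are hypothetical (the talk notes that feature names are arbitrary); the sketch only shows why a share such as "% of UberX trips" is scale-free while a raw trip count is not.

```python
from collections import Counter

# Hypothetical product list, used only for illustration.
PRODUCTS = ["uberx", "pool", "black"]


def trip_share_features(trip_products, products):
    """Return the fraction of a user's trips taken on each product.

    Every output lies in [0, 1], so all features share one scale and a
    distance-based algorithm such as DBSCAN is not dominated by raw counts.
    """
    counts = Counter(trip_products)
    total = len(trip_products)
    if total == 0:
        return {f"pct_{p}": 0.0 for p in products}
    return {f"pct_{p}": counts[p] / total for p in products}


light_user = ["uberx", "uberx", "pool", "uberx"]   # 4 trips
heavy_user = light_user * 50                        # 200 trips, same mix

light = trip_share_features(light_user, PRODUCTS)
heavy = trip_share_features(heavy_user, PRODUCTS)
print(light)
print(heavy)
```

Both users get identical feature vectors even though their trip counts differ by a factor of 50, which is the behavior a raw #trips feature would not have.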

Engineering Architecture

Goals
● Minimum human input for unknown anomalies
● Automatic action support
● Low latency for sync scenarios, e.g., login
● Easily reusable for other use cases, e.g., bot signups
● Good ML library support

Query Flow
Diagram components: Login Client, Risk Gateway, Feature Gathering, ML Models, Rule Engine, Actions, Other Stores, Streaming Counters, Clustering Features
Call types: Sync call, Parallel calls, Async call
Responses: Error code, Challenge thrown
Challenges:
● 2FA
● Captcha
● ...

Feature Computation
Diagram components: Login Attempts (Kafka), Challenge Feedback Signals (Kafka), Offline Features, Feature Normalization, Categorical Feature Transformation, Attempt Thresholding, Spark, Spark MLlib (k-means), DBSCAN (sklearn), Parameter Tuning, Clustering Features (Cassandra)
Pipelines: Streaming pipeline, Hourly job pipeline
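
For the streaming side, the sketch below shows a decayed mini-batch centroid update of the general kind used by streaming k-means. It is plain Python rather than Spark, and the function names, the `decay` value, and the sample points are assumptions for illustration, not details from the talk.

```python
def nearest(centers, point):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(centers[i], point)),
    )


def update_centers(centers, weights, batch, decay=0.5):
    """Fold one mini-batch into the centers, discounting older data.

    Each center becomes the weighted mean of its previous position
    (weight = old weight * decay) and the batch points assigned to it.
    """
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in batch:
        i = nearest(centers, point)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += point[d]

    new_centers, new_weights = [], []
    for i in range(k):
        old_weight = weights[i] * decay
        total = old_weight + counts[i]
        if total == 0:
            # No history and no new points: leave the center where it is.
            new_centers.append(list(centers[i]))
            new_weights.append(0.0)
            continue
        new_centers.append(
            [(centers[i][d] * old_weight + sums[i][d]) / total for d in range(dim)]
        )
        new_weights.append(total)
    return new_centers, new_weights


centers = [[0.0, 0.0], [10.0, 10.0]]
weights = [4.0, 4.0]
batch = [[2.0, 2.0], [2.0, 2.0]]
centers, weights = update_centers(centers, weights, batch, decay=0.5)
print(centers, weights)
```

A smaller `decay` forgets old traffic faster, which matters when attackers change behavior between windows.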

Engineering Challenges
● Python vs. Scala performance for the streaming case
● DBSCAN limitations in Spark
● Window aggregations limited to 30 minutes
● Joins with feedback signals in real time

Production Setup
● Batch: 7 days' worth of data; DBSCAN runs hourly
● Streaming: 60-minute moving window; streaming k-means runs over it
● Feedback-signal success ratios are used to mark clusters as good, bad, or unknown
● Bad clusters: always throw a challenge
● Good clusters: challenge a small % of attempts
● Unknown clusters: challenge X% of attempts
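
The cluster policy above might look roughly like this. All thresholds and challenge rates are hypothetical placeholders (the talk gives only "small %" and "X%"), and "success ratio" is assumed here to mean the fraction of thrown challenges that were passed.

```python
import random

# Hypothetical thresholds and rates; the talk does not give concrete values.
BAD_MAX_SUCCESS = 0.2    # at or below this pass ratio, a cluster is "bad"
GOOD_MIN_SUCCESS = 0.8   # at or above this pass ratio, a cluster is "good"
MIN_FEEDBACK = 20        # too little feedback means "unknown"
CHALLENGE_RATE = {"bad": 1.0, "good": 0.01, "unknown": 0.25}


def label_cluster(passed, failed):
    """Mark a cluster good, bad, or unknown from challenge feedback counts."""
    total = passed + failed
    if total < MIN_FEEDBACK:
        return "unknown"
    ratio = passed / total
    if ratio <= BAD_MAX_SUCCESS:
        return "bad"
    if ratio >= GOOD_MIN_SUCCESS:
        return "good"
    return "unknown"


def should_challenge(label, rng=random):
    """Decide whether to throw a challenge for one login attempt."""
    # rng.random() is in [0, 1), so a rate of 1.0 always challenges.
    return rng.random() < CHALLENGE_RATE[label]


print(label_cluster(passed=1, failed=99))
print(label_cluster(passed=95, failed=5))
print(label_cluster(passed=3, failed=2))
```

Sampling a small share of good and unknown clusters keeps feedback flowing, so a cluster's label can change if its behavior does.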

Results
Plots: Good Clusters, Bad Clusters, Unknown Clusters

GPU DBSCAN using Faiss from FB
