Uber Risk Team
Semi-Supervised Learning in an Adversarial Environment
Karthik Ramasamy
Gaurav Agarwal
About us
• Data Science Manager and founding member of Account Security and Privacy
• Previously:
‐ 2 years at my own startup
‐ 4.5 years at LinkedIn (founding member of the security team)
• Interests:
‐ Applying ML, recently deep learning, in security and fraud problems
‐ Building scalable infra for ML systems
Karthik Ramasamy
About us
• Senior engineer and founding member of Account security team
• 2 years at Uber working on fraud problems
• Previously:
‐ 4 years at Microsoft working on NLP Question answering systems
• Interests:
‐ Intersection of distributed systems and ML
‐ Natural language processing
Gaurav Agarwal
Focus of the talk
• DS algorithms and process
• Features
‐ Not covered in this talk
‐ Arbitrary names will be used to describe features
• Engineering architecture
• Engineering challenges
Semi-Supervised Learning
Classification Clustering
Adversarial Problem
Not-Hotdog Classifier Fake Account Classifier
Adversarial Problem
[Chart: recall for a specific model; % of null features]
Account Takeovers (ATO)
Attack types (the slide arranges them from easy to detect to hard to detect):
• Single IP
• IPs in the same class C/B range
• Targeted attacks on specific accounts
• Phishing and malware attacks
• Massive botnet with >100K IPs
• Proxy IPs across the world
Clustering: K-Means
Credits: www.naftaliharris.com
https://youtu.be/vY5q20NXVJM
Clustering: DBSCAN
Credits: www.naftaliharris.com
https://youtu.be/h53WMIImUuc
Clustering Algorithm: DBSCAN
1. Method 1 - use labels to tune hyperparameters
a. Solves the hyperparameter tuning issue
2. Method 2 - use labels as constraints in the clustering algorithm
a. Challenges
i. Working with weak labels
ii. Scalability
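Method 1 can be illustrated with a minimal sketch. It assumes scikit-learn's DBSCAN, synthetic two-blob data, and the adjusted Rand index as the agreement score; none of these come from the talk, and the helper name `tune_eps` and the `-1` "unlabeled" convention are hypothetical. The idea is only that a small labeled subset can choose `eps` while the clustering itself stays unsupervised.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score


def tune_eps(X, labels, eps_grid, min_samples=5):
    """Pick the eps whose clustering agrees best with the known labels.

    `labels` uses -1 for "unlabeled"; only labeled rows are scored.
    (Hypothetical helper, not the method used in the talk.)
    """
    labeled = labels != -1
    best_eps, best_score = None, -np.inf
    for eps in eps_grid:
        # Cluster ALL rows, labeled or not.
        assignments = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # Score agreement on the labeled rows only.
        score = adjusted_rand_score(labels[labeled], assignments[labeled])
        if score > best_score:
            best_eps, best_score = eps, score
    return best_eps, best_score


rng = np.random.default_rng(0)
# Two synthetic, well-separated groups standing in for two kinds of logins.
X = np.vstack([rng.normal(0.0, 0.1, (200, 2)), rng.normal(1.0, 0.1, (200, 2))])
truth = np.repeat([0, 1], 200)
# Pretend only about 10% of rows carry a label.
labels = np.where(rng.random(400) < 0.1, truth, -1)

eps_grid = [0.01, 0.05, 0.1, 0.2, 0.5, 2.0]
best_eps, best_score = tune_eps(X, labels, eps_grid)
print(best_eps, round(best_score, 3))
```

Very small `eps` values mark most points as noise and very large ones merge everything into a single cluster, so both extremes score poorly against the labels. With weak labels the score is noisy, which is one reason to keep the grid coarse.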
Semi-Supervised Clustering Approach
Login Clusters 2016
*PCA used for visualization
Login Clusters 2017
*PCA used for visualization
ML Challenges
• Feature Selection
‐ Manual feature selection
‐ Aggregate features are better
‐ Feature normalization is very important
▪ Features like trip fare and #trips are bad features
▪ % of UberX trips for a user is a good feature
▪ All features should share the same scale, e.g. % of X
• Feature evolution in adversarial environment
• Scalability
‐ DBSCAN on a large dataset (tens of millions of points) takes a long time to fit
• No online DBSCAN
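The feature normalization point above (raw counts are bad features, percentages are good ones) can be sketched as follows. This is an illustration only: the record layout, the product names, and the helper `product_share_features` are hypothetical, in line with the talk's note that feature names are arbitrary.

```python
from collections import Counter


def product_share_features(trips, products):
    """Turn a user's raw trip list into per-product shares in [0, 1].

    Hypothetical helper: every output feature is a fraction of the user's
    own trips, so all features end up on the same scale.
    """
    counts = Counter(trip["product"] for trip in trips)
    total = sum(counts.values())
    if total == 0:
        # No trips at all: emit zeros rather than dividing by zero.
        return {f"pct_{p}": 0.0 for p in products}
    return {f"pct_{p}": counts.get(p, 0) / total for p in products}


# Two made-up users with very different activity levels.
light_user = [{"product": "uberx"}, {"product": "pool"}]
heavy_user = [{"product": "uberx"}] * 50 + [{"product": "pool"}] * 50

products = ["uberx", "pool"]
light = product_share_features(light_user, products)
heavy = product_share_features(heavy_user, products)
print(light, heavy)
```

Both users map to the same feature vector even though their raw trip counts differ by a factor of 50. A distance-based method such as DBSCAN then compares behavior patterns rather than activity volume.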
Engineering Architecture
Goals
● Minimum human input for unknown anomalies
● Automatic action support
● Low latency for sync scenarios, e.g. login
● Easily reusable for other use cases, e.g. bot signups
● Good ML library support
Query Flow
[Diagram; components and labels recoverable from the slide:]
● Login Client
● Risk Gateway
● Feature Gathering
● ML Models
● Rule Engine
● Actions
● Other Stores
● Streaming Counters
● Clustering Features
● Call types: sync call, async call, parallel calls
● Other labels: error code, challenge thrown
● Challenges: 2FA, Captcha, ...
Feature Computation
[Diagram; components and labels recoverable from the slide:]
● Data sources and stores: Login Attempts (Kafka), Challenge Feedback Signals (Kafka), Offline Features, Clustering Features (Cassandra)
● Processing steps: Feature Normalization, Categorical Feature Transformation, Attempt Thresholding, Parameter Tuning
● Clustering: Spark MLlib (k-means), Spark, DBSCAN (sklearn)
● Pipelines: streaming pipeline, hourly job pipeline
Engineering Challenges
● Python vs. Scala performance for the streaming case
● DBSCAN limitations in Spark
● Window aggregations limited to 30 minutes
● Joins with feedback signals in real time
Production Setup
● Batch: 7 days' worth of data, run DBSCAN hourly
● Streaming: 60-minute moving window, run streaming k-means
● Used feedback signal success ratios to mark clusters as good, bad or unknown
● Bad clusters: always throw a challenge
● Good clusters: challenge a small % of attempts
● Unknown clusters: challenge X% of attempts
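The cluster-marking policy above can be sketched as below. It assumes "success ratio" means the fraction of challenged attempts in a cluster that passed the challenge. The thresholds, the minimum feedback count, and the challenge rates are placeholders, not values from the talk, and the function names are hypothetical.

```python
import random


def label_cluster(passed, failed, good_min=0.9, bad_max=0.2, min_feedback=20):
    """Label a cluster from challenge feedback counts.

    All thresholds are illustrative placeholders, not values from the talk.
    """
    total = passed + failed
    if total < min_feedback:
        # Too little feedback to judge the cluster either way.
        return "unknown"
    ratio = passed / total
    if ratio >= good_min:
        return "good"
    if ratio <= bad_max:
        return "bad"
    return "unknown"


# Placeholder rates standing in for "always", "small %" and "X%".
CHALLENGE_RATE = {"bad": 1.0, "good": 0.01, "unknown": 0.25}


def should_challenge(cluster_label, rng=random):
    """Decide whether to throw a challenge for one login attempt."""
    # rng.random() is in [0, 1), so a rate of 1.0 always challenges.
    return rng.random() < CHALLENGE_RATE[cluster_label]


print(label_cluster(95, 5), label_cluster(2, 98), label_cluster(50, 50))
print(should_challenge("bad"))
```

Challenging a small share of attempts in good clusters keeps feedback flowing for them, so a cluster that attackers later learn to imitate can still be re-marked.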
Results
Good Clusters
Bad Clusters
Unknown Clusters
GPU DBSCAN using Faiss from FB
Thank you!
