Email.cz workshop
Vít Listík @tivwvit
Email stats
● 60M emails per day
● 3M users daily, 6M monthly
● 2 PB
Email delivery process
Antispam
● Fighting with the bad guys
Antispam sources
● Content
○ Text
○ Images
○ Attachments
○ Links
○ Headers
● Metadata
○ Traffic
○ Historic data (reputation)
○ Blacklists
○ Rules (DKIM, DMARC, SPF)
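A minimal sketch of turning the rule checks (SPF, DKIM, DMARC) into features. It assumes an upstream mail server has already recorded the check results in an `Authentication-Results` header; the sample message, the host names and the `auth_results` helper are hypothetical and are not taken from the Email.cz pipeline.

```python
import re
from email import message_from_string

# Hypothetical raw message; the header values are made up for illustration.
RAW_MESSAGE = (
    "Authentication-Results: mx.example.com;\n"
    " spf=pass smtp.mailfrom=sender.example;\n"
    " dkim=fail header.d=sender.example;\n"
    " dmarc=fail header.from=sender.example\n"
    "From: someone@sender.example\n"
    "Subject: Hello\n"
    "\n"
    "Body\n"
)


def auth_results(raw_message):
    """Return a dict such as {'spf': 'pass', 'dkim': 'fail', 'dmarc': 'fail'}."""
    msg = message_from_string(raw_message)
    results = {}
    # A message may carry several Authentication-Results headers.
    for header in msg.get_all("Authentication-Results", []):
        pairs = re.findall(r"\b(spf|dkim|dmarc)=([a-z]+)", header, flags=re.IGNORECASE)
        for method, result in pairs:
            results[method.lower()] = result.lower()
    return results


print(auth_results(RAW_MESSAGE))
```

The resulting dictionary could be fed to a classifier as categorical features next to the content-based signals.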
Vít Listík - Email.cz workshop
Grey email
Graymail is solicited bulk email that doesn't fit the definition of email spam (e.g., the recipient "opted into" receiving it). Recipient interest in this type of mailing tends to diminish over time, increasing the likelihood that recipients will report graymail as spam. In some cases, graymail can account for up to 82 percent of the average user's email inbox.
Antispam stats again
ML in antispam
● Topic
● Unsubscribe
● Phishing
● Domain keywords
● Images
● Personalized filter
● Link naturalness
Examples
https://github.com/tivvit/ML-Prague-2016-email-workshop
Tools
● Jupyter
○ Visualizations
○ State
● HDF5
● Pandas
● IPython cluster
● Cluster storage
Let's go (to) Jupyter
Topic categorization
● 16 categories
● Manually labeled dataset
● 2 languages (2 models)
● 7th version
● Overlapping classes
NLP
● Bag of words
● Lemmatization
● Stop words
(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.
[
"John",
"likes",
"to",
"watch",
"movies",
"also",
"football",
"games",
"Mary",
"too"
]
(1) [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
(2) [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
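The two count vectors above can be reproduced with a short sketch. The vocabulary order is copied from the slide; the tokenizer (a simple letters-only regex) and the stop-word set are assumptions made for illustration, and lemmatization is left out because it needs a language-specific library.

```python
import re
from collections import Counter

# Vocabulary in the same order as on the slide.
VOCABULARY = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]

# Hypothetical stop-word list, only to show the filtering step.
STOP_WORDS = {"to", "too", "also"}


def tokenize(text):
    """Split text into alphabetic tokens (punctuation is dropped)."""
    return re.findall(r"[A-Za-z]+", text)


def bag_of_words(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John also likes to watch football games."

print(bag_of_words(doc1))  # [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
print(bag_of_words(doc2))  # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

# Dropping stop words shortens the vocabulary and therefore the vectors.
filtered_vocabulary = [w for w in VOCABULARY if w not in STOP_WORDS]
print(bag_of_words(doc1, filtered_vocabulary))  # [1, 2, 1, 2, 0, 0, 1]
```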
SVM
● Classification
● Best split for classes
● Linear classifier (kernels)
Multiclass version
● One vs. all
● Winner takes all
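A rough sketch of the one-vs.-all, winner-takes-all decision: one linear scorer per class, and the class with the highest decision value wins. The class names, weights and biases below are hypothetical placeholders; in practice each (weights, bias) pair would come from training a binary SVM for that class against all the others.

```python
from typing import Dict, List, Tuple

# Hypothetical per-class linear models: label -> (weights, bias).
MODELS: Dict[str, Tuple[List[float], float]] = {
    "travel": ([1.0, -0.5, 0.0], -0.2),
    "finance": ([-0.3, 1.2, 0.1], -0.1),
    "social": ([0.0, -0.2, 0.9], 0.0),
}


def decision_value(weights, bias, features):
    """Signed score of a linear classifier: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict(features):
    """Score the sample with every one-vs.-all model and pick the top score."""
    scores = {
        label: decision_value(weights, bias, features)
        for label, (weights, bias) in MODELS.items()
    }
    winner = max(scores, key=scores.get)
    return winner, scores


label, scores = predict([2.0, 0.0, 1.0])
print(label, scores)
```

With overlapping classes, the full score dictionary may be more useful than the single winner, since several classes can score above zero for the same message.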
Topic categorization
Image categorization
● Classes: spam vs. ham
● Based on user reaction
● Links analysis
● Low level image features
○ Size
○ DPI
○ Histograms
○ EXIF
○ Compression
● Raw pixels
Spam roulette
User reactions
● Noisy
● Inconsistent
● Bots
● Low ratio
Image topics
● Caffe
● Pretrained network
● Same classes as for words
● Cleaned dataset of images from classified emails
● 400k images
● Slow on CPU
Loan non-bank, Pharmacy, Discount, Ebola
Distributed learning
● Spark
● SparkNet (Caffe)
● Elephas (Keras)
Image types
● Trivial
○ Animated
○ Monitoring
○ Border
● Photo
● Graphics
● Photo with graphics
Graphics
Photo
Image features
Extraction
● PIL
● OpenCV
● ImageMagick
Features (142)
● Channel stats
○ Min, max, mean
○ Standard deviation
○ Skewness
○ Entropy
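The channel statistics listed above can be sketched in plain Python. This assumes one colour channel is already available as a flat list of integer pixel values (for example, extracted with PIL or OpenCV); the `channel_stats` helper, the population (not sample) standard deviation and the moment-based skewness are illustrative choices, not necessarily what the workshop code uses.

```python
import math
from collections import Counter


def channel_stats(values):
    """Min, max, mean, standard deviation, skewness and entropy of one channel."""
    n = len(values)
    if n == 0:
        raise ValueError("channel is empty")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    third_moment = sum((v - mean) ** 3 for v in values) / n
    # A constant channel has no spread, so skewness is defined as 0 here.
    skewness = third_moment / std ** 3 if std > 0 else 0.0
    # Shannon entropy (in bits) of the pixel-value histogram.
    histogram = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in histogram.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": std,
        "skewness": skewness,
        "entropy": entropy,
    }


# Half black, half white: symmetric distribution, exactly one bit of entropy.
print(channel_stats([0, 0, 255, 255]))
```

Computing these per channel (and per colour space) is one way a feature vector of this kind grows to a hundred or more entries.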
Learning
● scikit-learn - Decision Trees
● Keras (TensorFlow, Theano)
● 30k manually labeled samples
Trees vs. neurons
Message
● Gray email
● Explore (visualize) your data (in Jupyter)
● Use libraries
● Simple subtasks (boosting) may help
● Store intermediate results
● Store test results with the model
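One possible way to keep test results next to the model, sketched with the standard library only. The file names, the `save_model_with_results` and `load_model_with_results` helpers, and the metric values are all hypothetical placeholders, not results or code from the talk.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_model_with_results(model, test_results, directory):
    """Write the pickled model and its test metrics into the same directory."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "model.pkl").write_bytes(pickle.dumps(model))
    (directory / "test_results.json").write_text(
        json.dumps(test_results, indent=2, sort_keys=True)
    )


def load_model_with_results(directory):
    """Read back the model and the metrics stored alongside it."""
    directory = Path(directory)
    model = pickle.loads((directory / "model.pkl").read_bytes())
    test_results = json.loads((directory / "test_results.json").read_text())
    return model, test_results


# Placeholder "model" and metrics, used only to demonstrate the round trip.
with tempfile.TemporaryDirectory() as tmp:
    save_model_with_results({"weights": [0.1, 0.2]}, {"accuracy": 0.5}, tmp)
    print(load_model_with_results(tmp))
```

Keeping the metrics in a readable JSON file means model versions can be compared without unpickling anything.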
