Deep Learning and Text Mining
Will Stanton
Ski Hackathon Kickoff Ceremony, Feb 28, 2015
We have a problem
● At Return Path, we process billions of emails
a year, from tons of senders
● We want to tag and cluster senders
○ Industry verticals (e-commerce, apparel, travel, etc.)
○ Type of customers they sell to (luxury, soccer moms,
etc.)
○ Business model (daily deals, flash sales, etc.)
● It’s too much to do by hand!
What to do?
● Standard approaches aren’t great
○ Bag of words classification model (document-term matrix, LSA, LDA)
■ Have to manually label lots of cases first
■ Difficult with lots of data (especially LDA)
○ Bag of words clustering
■ Can’t easily put one company into multiple categories (i.e. more
general tagging)
■ Needs lots of tuning
● How about deep learning neural networks?
○ Very trendy. Let’s try it!
Neural Networks
[Figure: feed-forward network with an input layer, first hidden layer, second hidden layer, and output layer; inputs x = (x1, x2, x3), output y]
● Machine learning algorithms
loosely modeled after the way neurons
in the human brain work
● Learn patterns and structure by
passing training data through
“neurons”
● Useful for classification,
regression, feature extraction, etc.
Deep Learning
● Neural networks with lots of hidden layers
(hundreds)
● State of the art for machine translation, facial
recognition, text classification, speech
recognition
○ Tasks with real deep structure, that humans do
automatically but computers struggle with
○ Should be good for company tagging!
Distributed Representations
[Figure: a collection of faces represented in layers: pixels, edges, shapes, typical facial types (features)]
● Human brain uses distributed representations
● We can use deep learning to do the same thing with
words (letters -> words -> phrases -> sentences -> …)
Deep Learning Challenges
● Computationally difficult to train (i.e. slow)
○ Each hidden layer means more parameters
○ Each feature means more parameters
● Real human-generated text has a near-infinite
number of features and data
○ i.e. slow training would be a problem
● Solution: use word2vec
word2vec
● Published by scientists at Google in 2013
● Python implementation in 2014
○ gensim library
● Learns distributed vector representations
of words (“word to vec”) using a neural net
○ NOTE for hardcore experts: word2vec does not strictly or necessarily train a deep neural
net, but it uses deep learning technology (distributed representations, backpropagation,
stochastic gradient descent, etc.) and is based on a series of deep learning papers
What is the output?
● Distributed vector representations of words
○ each word is encoded as a vector of floats
○ vec_queen = (0.2, -0.3, 0.7, 0, … , 0.3)
○ vec_woman = (0.1, -0.2, 0.6, 0.1, … , 0.2)
○ length of the vectors = dimension of the word
representation
○ key concept of word2vec: words with similar
vectors have a similar meaning (context)
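"Similar vectors" is usually measured with cosine similarity. A minimal sketch, assuming made-up 4-dimensional vectors (the numbers and the word "truck" are hypothetical and chosen only for illustration; real word2vec vectors typically have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, not taken from any trained model.
vec_queen = [0.2, -0.3, 0.7, 0.0]
vec_woman = [0.1, -0.2, 0.6, 0.1]
vec_truck = [-0.5, 0.4, 0.0, 0.3]

# Vectors pointing in a similar direction score close to 1.
sim_queen_woman = cosine_similarity(vec_queen, vec_woman)
# Vectors pointing in different directions score near or below 0.
sim_queen_truck = cosine_similarity(vec_queen, vec_truck)
```

With these toy numbers, "queen" scores much closer to "woman" than to "truck".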
word2vec Features
● Very fast and scalable
○ Google trained it on 100’s of billions of words
● Uncovers deep latent structure of word
relationships
○ Can solve analogies like King::Man as Queen::? or
Paris::France as Berlin::?
○ Can solve “one of these things is not like another”
○ Can be used for machine translation or automated
sentence completion
How does it work?
● Feed the algorithm (lots of) sentences
○ totally unsupervised learning
● word2vec trains a neural net that encodes
the context of words within sentences
○ “Skip-grams”: what is the probability that the word
“queen” appears 1 word after “woman”, 2 words
after, etc.
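The skip-gram idea can be sketched by listing the (center word, context word) pairs that would be fed to the network as training examples. This is only an illustration of how the pairs are formed, assuming a fixed window of 2 words on each side; the function name `skipgram_pairs` and the example sentence are hypothetical, and the actual training step is not shown:

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical example sentence.
sentence = "the queen is a woman".split()
pairs = skipgram_pairs(sentence, window=2)
```

Words that keep showing up in each other's windows across many sentences end up with similar vectors.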
word2vec at Return Path
● At Return Path, we implemented word2vec
on data from our Consumer Data Stream
○ billions of email subject lines from millions of users
○ fed 30 million unique subject lines (300m words) and
sending domains into word2vec (using Python)
[Figure: lots of subject lines -> word2vec -> word vectors -> insights]
Grouping companies with word2vec
● Find daily deals sites like Groupon
[word for (word, score) in model.most_similar('groupon.com', topn=100) if '.com' in word]
['grouponmail.com.au', 'specialicious.com', 'livingsocial.com', 'deem.com',
'hitthedeals.com', 'grabone-mail-ie.com', 'grabone-mail.com', 'kobonaty.com',
'deals.com.au', 'coupflip.com', 'ouffer.com', 'wagjag.com']
● Find apparel sites like Gap
[word for (word, score) in model.most_similar('gap.com', topn=100) if '.com' in word]
['modcloth.com', 'bananarepublic.com', 'shopjustice.com', 'thelimited.com',
'jcrew.com', 'gymboree.com', 'abercrombie-email.com', 'express.com',
'hollister-email.com', 'abercrombiekids-email.com', 'thredup.com',
'neimanmarcusemail.com']
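The queries above rank the whole vocabulary by similarity to one sending domain, then keep only entries that look like domains. A rough pure-Python sketch of that idea, assuming a tiny hypothetical vocabulary with made-up 3-dimensional vectors; the `most_similar` function here is a stand-in written for illustration, not gensim's implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(vectors, key, topn=10):
    """Rank every other vocabulary entry by cosine similarity to `key`."""
    query = vectors[key]
    scored = [(word, cosine(query, vec)) for word, vec in vectors.items() if word != key]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical vectors: sending domains and ordinary words share one vocabulary.
toy_vectors = {
    "groupon.com":      [0.90, 0.10, 0.00],
    "livingsocial.com": [0.80, 0.20, 0.10],
    "deal":             [0.85, 0.15, 0.05],
    "gap.com":          [0.10, 0.90, 0.00],
    "jcrew.com":        [0.20, 0.80, 0.10],
}

# Same filter as on the slide: keep only entries containing '.com'.
similar_domains = [word for (word, score)
                   in most_similar(toy_vectors, "groupon.com", topn=100)
                   if ".com" in word]
```

The ordinary word "deal" is the nearest neighbor here, which is why the '.com' filter is needed to get back a list of companies.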
More word2vec applications
● Find relationships between products
○ model.most_similar(positive=['apple', 'galaxy'], negative=['iphone']) → top result: 'samsung'
○ i.e. iphone::apple as galaxy::? samsung!
● Distinguish different companies
○ model.doesnt_match(['sheraton', 'westin', 'aloft', 'walmart']) → 'walmart'
○ i.e. Walmart does not match Sheraton, Westin, and Aloft hotels
● Other possibilities
○ Find different companies with similar marketing copy
○ Automatically construct high-performing subject lines
○ Many more...
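Both tricks come down to vector arithmetic. For iphone::apple as galaxy::?, the target is apple - iphone + galaxy, and the answer is the nearest remaining word. For the odd one out, pick the word least similar to the average of the group. A sketch under stated assumptions: the 5-dimensional vectors are invented for illustration, and `analogy` and `doesnt_match` are simplified stand-ins, not gensim's code:

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def analogy(vectors, positive, negative):
    """Add the positive word vectors, subtract the negative ones, return the nearest other word."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    target = unit(target)
    used = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in used]
    return max(candidates, key=lambda w: dot(target, unit(vectors[w])))

def doesnt_match(vectors, words):
    """Return the word least similar to the mean of the group."""
    units = [unit(vectors[w]) for w in words]
    mean = [sum(column) / len(units) for column in zip(*units)]
    return min(words, key=lambda w: dot(mean, unit(vectors[w])))

# Hypothetical vectors, hand-built so brand, product, hotel and retail occupy separate dimensions.
toy_vectors = {
    "apple":    [1.0, 0.0, 0.0, 0.0, 0.0],
    "samsung":  [0.0, 1.0, 0.0, 0.0, 0.0],
    "iphone":   [1.0, 0.0, 1.0, 0.0, 0.0],
    "galaxy":   [0.0, 1.0, 1.0, 0.0, 0.0],
    "sheraton": [0.0, 0.0, 0.0, 1.0, 0.1],
    "westin":   [0.0, 0.0, 0.0, 1.0, 0.0],
    "aloft":    [0.0, 0.0, 0.0, 0.9, 0.2],
    "walmart":  [0.0, 0.0, 0.0, 0.1, 1.0],
}

maker = analogy(toy_vectors, positive=["apple", "galaxy"], negative=["iphone"])
odd_one_out = doesnt_match(toy_vectors, ["sheraton", "westin", "aloft", "walmart"])
```

In a trained model nobody designs the dimensions by hand; the point of the sketch is only that the queries reduce to adding, subtracting and comparing vectors.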
Try it yourself
● C implementation exists, but I recommend
Python
○ gensim library: https://radimrehurek.com/gensim/
○ tutorial: http://radimrehurek.com/gensim/models/word2vec.html
○ webapp to try it out as part of tutorial
○ Pretrained Google News and Freebase models:
https://code.google.com/p/word2vec/
○ Only takes 10 lines of code to get started!
Thanks for listening!
● Many thanks to:
○ Data Science Association and Level 3
○ Michael Walker for organizing
● Slides posted on http://will-stanton.com/
● Email me at will@will-stanton.com
● Return Path is hiring! Voted #2 best
midsized company to work for in the country
http://careers.returnpath.com/

