|
Presented by: Sujit Pal, Elsevier Labs
Date: July 6, 2017
Embed, Encode, Attend, Predict – applying the 4 step NLP recipe
for text classification and similarity
| 2
INSPIRATION
| 3
AGENDA
• NLP Pipelines before Deep Learning
• Deconstructing the “Embed, Encode, Attend, Predict” pipeline
• Example #1: Document Classification
• Example #2: Document Similarity
• Example #3: Sentence Similarity
| 4
NLP PIPELINES BEFORE DEEP LEARNING
• Document Collection centric
• Based on Information Retrieval
• Document collection to matrix
• Densify using feature reduction
• Feed into SVM for classification, etc.
| 5
NLP PIPELINES BEFORE DEEP LEARNING
• Idea borrowed from Machine Learning (ML)
• Represent categorical variables (words) as 1-hot vectors
• Represent sentences as matrix of 1-hot word vectors
• No distributional semantics.
| 6
WORD EMBEDDINGS
• Word2Vec – predict a word from its
context (CBOW) or the context from a
word (skip-gram).
• Trained on large corpora and publicly
available.
• Other embeddings – GloVe, FastText.
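As a quick illustration of what these pre-trained embeddings give you, here is a minimal sketch using gensim; the file name is an assumption – substitute whichever pre-trained vectors (word2vec, GloVe, fastText) you have locally.

```python
# Minimal sketch (assumed setup): load pre-trained word2vec vectors with gensim
# and inspect the dense vector and nearest neighbours for a word.
from gensim.models import KeyedVectors

# File name is an assumption -- point this at your local pre-trained vectors.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(w2v["king"].shape)                  # (300,) dense vector for one word
print(w2v.most_similar("king", topn=5))   # distributional neighbours
```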
| 7
STEP #1: EMBED
• Replace 1-hot vectors with 3rd party embeddings.
• Embeddings encode distributional semantics
• Sentence represented as sequence of dense word vectors
• Converts from word ID to word vector
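A minimal Keras sketch of the Embed step (not the exact code from the repo; sizes and the `embedding_matrix` placeholder are assumptions): an Embedding layer initialised from a pre-trained matrix turns a sequence of word IDs into a sequence of dense word vectors.

```python
# Minimal sketch of the Embed step: word IDs -> dense word vectors.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embed_dim, max_words = 50000, 300, 60
# Placeholder; in practice this matrix is filled row by row from word2vec/GloVe.
embedding_matrix = np.random.rand(vocab_size, embed_dim)

word_ids = layers.Input(shape=(max_words,), dtype="int32")
word_vectors = layers.Embedding(
    input_dim=vocab_size,
    output_dim=embed_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False)(word_ids)            # (batch, max_words, embed_dim)
```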
| 8
STEP #2: ENCODE
• Bag of words – concatenate word vectors together.
• Encode step computes a representation of the sentence as a matrix.
• Each row of sentence matrix encodes the meaning of each word in the context of
the sentence.
• Use either LSTM (Long Short Term Memory) or GRU (Gated Recurrent Unit)
• Bidirectional encoders process words left to right and right to left and concatenate the results.
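A minimal sketch of the Encode step under the same assumptions: a bidirectional LSTM with `return_sequences=True` turns the embedded sentence into a matrix with one contextual vector per word.

```python
# Minimal sketch of the Encode step: word vectors -> sentence matrix.
from tensorflow.keras import layers

max_words, embed_dim, lstm_units = 60, 300, 128
word_vectors = layers.Input(shape=(max_words, embed_dim))   # output of the Embed step
sentence_matrix = layers.Bidirectional(
    layers.LSTM(lstm_units, return_sequences=True))(word_vectors)
# sentence_matrix: (batch, max_words, 2 * lstm_units), one row per word in context
```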
| 9
STEP #3: ATTEND
• Reduction operation – simpler alternatives are Sum or Global Average/Max Pooling.
• Attention can take an auxiliary context vector as input.
• Attention tells what to keep during reduction to minimize information loss.
• Different kinds – matrix, matrix + vector (learned), matrix + vector (provided), matrix + matrix.
| 10
ATTENTION: MATRIX
• Proposed by Raffel, et al
• Intuition: select most important
element from each timestep
• Learnable weights W and b
depending on target.
• Code on Github
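A minimal tf.keras sketch of this idea (a re-implementation under assumptions, not the layer from the linked repo): score each timestep with learnable W and b, softmax over time, and return the weighted sum of the rows.

```python
import tensorflow as tf
from tensorflow.keras import layers

class AttentionM(layers.Layer):
    """Reduce (batch, time, dim) to (batch, dim) with learned per-timestep scores."""
    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.W = self.add_weight(name="W", shape=(dim, 1), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(1,), initializer="zeros")

    def call(self, x):                                    # x: (batch, time, dim)
        scores = tf.tanh(tf.matmul(x, self.W) + self.b)   # (batch, time, 1)
        alphas = tf.nn.softmax(scores, axis=1)            # attention weights over time
        return tf.reduce_sum(alphas * x, axis=1)          # weighted sum of timesteps
```

Applied to the encoder output above, `AttentionM()(sentence_matrix)` yields a single sentence vector.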
| 11
ATTENTION: MATRIX + VECTOR (LEARNED)
• Proposed by Lin, et al
• Intuition: select most important
element from each timestep and
weight with another learned vector u.
• Code on Github
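A sketch along the same lines (again an assumption, not the repo's exact layer): the tanh-projected timesteps are scored against an extra learned context vector u before the softmax.

```python
import tensorflow as tf
from tensorflow.keras import layers

class AttentionMVLearned(layers.Layer):
    """Matrix + learned context vector u -> (batch, dim)."""
    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.W = self.add_weight(name="W", shape=(dim, dim), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(dim,), initializer="zeros")
        self.u = self.add_weight(name="u", shape=(dim, 1), initializer="glorot_uniform")

    def call(self, x):                                    # x: (batch, time, dim)
        v = tf.tanh(tf.matmul(x, self.W) + self.b)        # (batch, time, dim)
        scores = tf.matmul(v, self.u)                     # (batch, time, 1)
        alphas = tf.nn.softmax(scores, axis=1)
        return tf.reduce_sum(alphas * x, axis=1)
```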
| 12
ATTENTION: MATRIX + VECTOR (PROVIDED)
• Proposed by Cho, et al
• Intuition: select most important
element from each timestep and
weight it with a learned multiple of
a provided context vector
• Code on Github
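A sketch under the same assumptions: here the context vector is supplied as a second input (for example, a document vector when attending over sentence encodings), and the learned matrix W scales it before scoring the timesteps.

```python
import tensorflow as tf
from tensorflow.keras import layers

class AttentionMVProvided(layers.Layer):
    """Matrix + provided context vector -> (batch, dim)."""
    def build(self, input_shape):
        seq_shape, ctx_shape = input_shape
        dim, ctx_dim = int(seq_shape[-1]), int(ctx_shape[-1])
        self.W = self.add_weight(name="W", shape=(ctx_dim, dim),
                                 initializer="glorot_uniform")

    def call(self, inputs):
        x, c = inputs                                     # x: (batch, time, dim), c: (batch, ctx_dim)
        proj = tf.matmul(c, self.W)                       # learned multiple of the context
        scores = tf.matmul(x, tf.expand_dims(proj, -1))   # (batch, time, 1)
        alphas = tf.nn.softmax(scores, axis=1)
        return tf.reduce_sum(alphas * x, axis=1)
```

Called as `AttentionMVProvided()([sentence_matrix, doc_vector])`.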
| 13
ATTENTION: MATRIX + MATRIX
• Proposed by Parikh, et al
• Intuition: build alignment (similarity) matrix
by multiplying learned vectors from each
matrix, compute context vectors from the
alignment matrix, and mix with original
signal.
• Code on Github
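A simplified sketch of the soft alignment at the core of this attention (assumption: raw dot products, without the feed-forward projection used in the paper): each matrix is summarized with respect to the other through the alignment matrix, and the aligned contexts are mixed back with the original signal downstream (e.g. by concatenation).

```python
import tensorflow as tf

def soft_align(a, b):
    """a: (batch, Ta, dim), b: (batch, Tb, dim) -> aligned contexts for both sides."""
    e = tf.matmul(a, b, transpose_b=True)             # (batch, Ta, Tb) alignment scores
    beta = tf.matmul(tf.nn.softmax(e, axis=2), b)     # b summarized for each timestep of a
    alpha = tf.matmul(tf.nn.softmax(e, axis=1), a,
                      transpose_a=True)               # a summarized for each timestep of b
    return beta, alpha
```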
| 14
STEP #4: PREDICT
• Convert reduced vector to a label.
• Generally uses a shallow fully connected network.
• Can also be modified to have a regression head (return the probabilities from the
softmax activation).
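A minimal sketch of the Predict step (sizes are assumptions): a shallow fully connected head maps the reduced vector to class probabilities.

```python
from tensorflow.keras import layers

num_classes = 20                                # assumed, e.g. 20 newsgroups
doc_vector = layers.Input(shape=(256,))         # stand-in for the Attend step output
x = layers.Dense(64, activation="relu")(doc_vector)
x = layers.Dropout(0.2)(x)
predictions = layers.Dense(num_classes, activation="softmax")(x)
```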
| 15
DOCUMENT CLASSIFICATION EXAMPLE – ITERATION #1
• Embed, Predict
• Bag of Words idea
• Sentence = bag of words
• Document = bag of sentences
• Code on Github
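A hedged sketch of this Embed + Predict baseline (sizes illustrative, not the repo's exact configuration): word vectors are averaged into sentence vectors, sentence vectors into a document vector, and a softmax head classifies.

```python
from tensorflow.keras import layers, models

max_sents, max_words, vocab_size, embed_dim, num_classes = 40, 60, 50000, 300, 20

doc_in = layers.Input(shape=(max_sents, max_words), dtype="int32")
embedded = layers.Embedding(vocab_size, embed_dim)(doc_in)    # (batch, sents, words, dim)
sent_vecs = layers.TimeDistributed(
    layers.GlobalAveragePooling1D())(embedded)                # bag of words per sentence
doc_vec = layers.GlobalAveragePooling1D()(sent_vecs)          # bag of sentences per document
outputs = layers.Dense(num_classes, activation="softmax")(doc_vec)

model = models.Model(doc_in, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```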
| 16
DOCUMENT CLASSIFICATION EXAMPLE – ITERATION #2
• Embed, Encode, Predict
• Hierarchical Encoding
• Sentence Encoder: converts
sequence of word vectors to
sentence vector.
• Document Encoder: converts
sequence of sentence vectors
to document vector.
• Sentence encoder network is
embedded inside the document
network.
• Code on Github
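A hedged sketch of the hierarchical Embed + Encode + Predict model (sizes assumed): the sentence encoder is a small Keras model wrapped in TimeDistributed so it runs over every sentence, and a second bidirectional LSTM encodes the resulting sentence vectors into a document vector.

```python
from tensorflow.keras import layers, models

max_sents, max_words = 40, 60
vocab_size, embed_dim, units, num_classes = 50000, 300, 128, 20

# Sentence encoder: sequence of word IDs -> sentence vector
sent_in = layers.Input(shape=(max_words,), dtype="int32")
s = layers.Embedding(vocab_size, embed_dim)(sent_in)
s = layers.Bidirectional(layers.LSTM(units))(s)
sentence_encoder = models.Model(sent_in, s)

# Document encoder: sequence of sentences -> document vector -> label
doc_in = layers.Input(shape=(max_sents, max_words), dtype="int32")
d = layers.TimeDistributed(sentence_encoder)(doc_in)      # (batch, sents, 2 * units)
d = layers.Bidirectional(layers.LSTM(units))(d)
outputs = layers.Dense(num_classes, activation="softmax")(d)
doc_model = models.Model(doc_in, outputs)
```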
| 17
DOCUMENT CLASSIFICATION EXAMPLE – ITERATION #3 (a, b, c)
• Embed, Encode, Attend,
Predict
• Encode step returns a matrix – one
vector for each time step.
• Attend reduces matrix to
vector.
• 3 types of attention (all
except Matrix + Matrix) applied
to different versions of the model.
• Code on Github – (a), (b), (c)
| 18
DOCUMENT CLASSIFICATION EXAMPLE – RESULTS
| 19
DOCUMENT SIMILARITY EXAMPLE
• Hierarchical model (word to sentence and
sentence to document)
• Tried without attention, with attention for sentence
encoding, and with attention for both sentence
encoding and the document comparison (see the
sketch below)
• Code in Github – (a), (b), (c)
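A hedged sketch of the pairwise setup (simplified to a flat word-level encoder rather than the full hierarchical model, and a two-class similar/not-similar head; both simplifications are assumptions): one shared encoder processes both documents and a small dense head scores the pair.

```python
from tensorflow.keras import layers, models

max_len, vocab_size, embed_dim, units = 500, 50000, 300, 128

# Shared encoder: identical weights applied to the left and right documents
enc_in = layers.Input(shape=(max_len,), dtype="int32")
e = layers.Embedding(vocab_size, embed_dim)(enc_in)
e = layers.Bidirectional(layers.LSTM(units))(e)
doc_encoder = models.Model(enc_in, e)

left = layers.Input(shape=(max_len,), dtype="int32")
right = layers.Input(shape=(max_len,), dtype="int32")
merged = layers.Concatenate()([doc_encoder(left), doc_encoder(right)])
merged = layers.Dense(64, activation="relu")(merged)
score = layers.Dense(2, activation="softmax")(merged)     # similar / not similar
sim_model = models.Model([left, right], score)
```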
| 20
SENTENCE SIMILARITY EXAMPLE
• Hierarchical Model (Word to Sentence and
sentence to document).
• Used Matrix + Matrix attention for the
comparison
• Code in Github – without attention, with
attention
| 21
SUMMARY
• 4-step recipe is a principled approach to NLP with Deep Learning
• Embed step leverages availability of many pre-trained embeddings.
• Encode step generally uses a bidirectional LSTM to create position-sensitive
features; a CNN can also be used here.
• Attention comes in 3 main forms – matrix to vector (with or without an implicit
context), matrix and vector to vector, and matrix and matrix to vector. It computes a
summary of the input with respect to the context, if one is provided.
• Predict step converts vector to probability distribution via softmax, usually with a
Fully Connected (Dense) network.
• Interesting pipelines can be composed using complete or partial subsequences of
the 4 step recipe.
| 22
REFERENCES
• Honnibal, M. (2016, November 10). Embed, encode, attend, predict: The new
deep learning formula for state-of-the-art NLP models.
• Liao, R. (2016, December 26). Text Classification, Part 3 – Hierarchical attention
network.
• Leonardblier, P. (2016, January 20). Attention Mechanism
• Raffel, C., & Ellis, D. P. (2015). Feed-forward networks with attention can solve
some long-term memory problems. arXiv preprint arXiv:1512.08756.
• Yang, Z., et al. (2016). Hierarchical attention networks for document classification.
In Proceedings of NAACL-HLT (pp. 1480-1489).
• Cho, K., et al. (2015). Describing multimedia content using attention-based
encoder-decoder networks. IEEE Transactions on Multimedia, 17(11), 1875-1886.
• Parikh, A. P., et al. (2016). A decomposable attention model for natural language
inference. arXiv preprint arXiv:1606.01933.
| 23
THANK YOU
• Code: https://github.com/sujitpal/eeap-examples
• Slides: https://www.slideshare.net/sujitpal/presentation-slides-77511261
• Email: sujit.pal@elsevier.com