|
Embed, Encode, Attend, Predict – applying the 4-step NLP recipe
for text classification and similarity
Presented by Sujit Pal, Elsevier Labs
July 6, 2017
| 2
INSPIRATION
| 3
ACKNOWLEDGEMENTS
| 4
AGENDA
• NLP Pipelines before Deep Learning
• Deconstructing the “Embed, Encode, Attend, Predict” pipeline.
• Example #1: Document Classification
• Example #2: Document Similarity
• Example #3: Sentence Similarity
| 5
NLP PIPELINES BEFORE DEEP LEARNING
• Document Collection centric
• Based on Information Retrieval
• Document collection to matrix
• Densify using feature reduction
• Feed into SVM for classification, etc.
| 6
NLP PIPELINES BEFORE DEEP LEARNING
• Idea borrowed from Machine Learning (ML)
• Represent categorical variables (words) as 1-hot vectors
• Represent sentences as matrices of 1-hot word vectors
• No distributional semantics at the word level.
| 7
WORD EMBEDDINGS
• Word2Vec – predicts a word from its context (CBOW) or the context from a word
(skip-gram).
• Other embeddings – GloVe, FastText.
• Pretrained models available
• Encode word “meanings”.
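As a small illustration of what a pretrained embedding gives you, the vectors can be loaded and queried with gensim. This is a hedged sketch: the file name below is an assumption, and any word2vec-format file works the same way.

```python
from gensim.models import KeyedVectors

# Assumes a local copy of the pretrained GoogleNews word2vec vectors;
# any word2vec-format file (e.g. GloVe converted with glove2word2vec) works identically.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(wv["king"].shape)                 # (300,) – a dense vector encoding the word's "meaning"
print(wv.most_similar("king", topn=3))  # nearest neighbours in embedding space
```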
| 8
STEP #1: EMBED
• Converts from word ID to word vector
• Change: replace 1-hot vectors with 3rd party embeddings.
• Embeddings encode distributional semantics
• Sentence represented as a sequence of dense word vectors (sketch below)
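A minimal Keras sketch of the Embed step, seeding an Embedding layer with a pretrained matrix. The sizes and the random placeholder matrix are illustrative assumptions, not values from the talk.

```python
import numpy as np
from tensorflow.keras.layers import Embedding

# E stands for a (vocab_size x embed_dim) matrix of pretrained word vectors
# (e.g. GloVe or word2vec), with row i holding the vector for word ID i.
vocab_size, embed_dim = 50000, 300
E = np.random.randn(vocab_size, embed_dim).astype("float32")   # placeholder for real vectors

embed = Embedding(input_dim=vocab_size, output_dim=embed_dim,
                  weights=[E], trainable=False)   # freeze the pretrained vectors
# A batch of word-ID sequences (batch x timesteps) comes out as (batch x timesteps x embed_dim).
```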
| 9
STEP #2: ENCODE
• Converts sequence of vectors (word vectors) to a matrix (sentence matrix).
• Bag of words – concatenate word vectors together.
• Each row of sentence matrix encodes the meaning of each word in the context of
the sentence.
• Generally use LSTM (Long Short Term Memory) or GRU (Gated Recurrent Unit)
• A bidirectional encoder processes words left to right and right to left and concatenates the two outputs (sketch below).
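A minimal Keras sketch of the Encode step with a bidirectional LSTM; all sizes are illustrative assumptions.

```python
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM
from tensorflow.keras.models import Model

max_words, vocab_size, embed_dim, lstm_dim = 40, 50000, 300, 100   # illustrative sizes

inp = Input(shape=(max_words,), dtype="int32")
emb = Embedding(vocab_size, embed_dim)(inp)                        # (batch, 40, 300)
enc = Bidirectional(LSTM(lstm_dim, return_sequences=True))(emb)    # (batch, 40, 200)
# return_sequences=True keeps one output vector per timestep (the sentence matrix);
# the forward and backward passes are concatenated, giving 2 * lstm_dim features per word.
encoder = Model(inp, enc)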
| 10
STEP #3: ATTEND
• Reduces matrix (sentence matrix) to a vector (sentence vector)
• Non-attention mechanisms – sum, or average/max pooling (see the sketch after this list).
• Attention tells what to keep during reduction to minimize information loss.
• Different kinds – matrix, matrix + vector (learned), matrix + vector (provided),
matrix + matrix.
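For contrast, the non-attention reductions are just pooling layers. A hedged Keras sketch, with illustrative shapes standing in for the encoder output:

```python
from tensorflow.keras.layers import Input, GlobalAveragePooling1D, GlobalMaxPooling1D

# sent_matrix stands in for the encoder output, e.g. 40 timesteps x 200 features.
sent_matrix = Input(shape=(40, 200))
avg_vec = GlobalAveragePooling1D()(sent_matrix)   # (batch, 200): every timestep weighted equally
max_vec = GlobalMaxPooling1D()(sent_matrix)       # (batch, 200): strongest activation per feature
```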
| 11
ATTENTION: MATRIX
• Proposed by Raffel, et al
• Intuition: select most important
element from each timestep
• Weights W and b are learned jointly with the rest of the network for the target task.
• Code on Github
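A minimal NumPy sketch of this scoring scheme (the full Keras layer is in the linked repo); shapes and parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.special import softmax

def attend_matrix(H, w, b):
    """Raffel-style attention: reduce a sentence matrix H (timesteps x d) to a d-vector.
    w (d,) and b (scalar) would be learned jointly with the rest of the network."""
    scores = np.tanh(H @ w + b)      # one unnormalized relevance score per timestep
    alphas = softmax(scores)         # attention weights, sum to 1 over timesteps
    return alphas @ H                # weighted sum of the timestep vectors

H = np.random.randn(30, 200)         # e.g. 30 timesteps of a 200-dim BiLSTM encoding
w, b = np.random.randn(200), 0.0
sent_vec = attend_matrix(H, w, b)    # shape (200,)
```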
| 12
ATTENTION: MATRIX + VECTOR (LEARNED)
• Proposed by Lin, et al
• Intuition: select most important
element from each timestep and
weight with another learned vector u.
• Code on Github
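A NumPy sketch of the same idea with a learned context vector u (the Keras version is in the linked repo); all shapes are illustrative assumptions.

```python
import numpy as np
from scipy.special import softmax

def attend_matrix_learned_vector(H, W, b, u):
    """Attention with a learned context vector u: each timestep of H (T x d) is
    projected through W (d x k) and b (k,), then scored against u (k,).
    W, b and u would all be learned during training."""
    U = np.tanh(H @ W + b)           # projected timesteps, shape (T, k)
    alphas = softmax(U @ u)          # similarity of each timestep to the learned vector u
    return alphas @ H                # weighted sum of the original timesteps

H = np.random.randn(30, 200)
W, b, u = np.random.randn(200, 100), np.zeros(100), np.random.randn(100)
sent_vec = attend_matrix_learned_vector(H, W, b, u)   # shape (200,)
```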
| 13
ATTENTION: MATRIX + VECTOR (PROVIDED)
• Proposed by Cho, et al
• Intuition: select most important
element from each timestep and
weight it with a learned multiple of
a provided context vector
• Code on Github
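A NumPy sketch of one common additive formulation of this idea; the exact parameterization varies, and shapes here are illustrative assumptions.

```python
import numpy as np
from scipy.special import softmax

def attend_matrix_provided_vector(H, c, W_h, W_c, v):
    """Attention conditioned on a provided context vector c (e.g. the encoding of
    another sentence). W_h (d x k), W_c (dc x k) and v (k,) are learned; H is (T x d)."""
    scores = np.tanh(H @ W_h + c @ W_c) @ v   # each timestep scored against the context
    alphas = softmax(scores)
    return alphas @ H

H = np.random.randn(30, 200)                  # sentence matrix
c = np.random.randn(200)                      # provided context vector
W_h, W_c, v = np.random.randn(200, 100), np.random.randn(200, 100), np.random.randn(100)
vec = attend_matrix_provided_vector(H, c, W_h, W_c, v)   # shape (200,)
```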
| 14
ATTENTION: MATRIX + MATRIX
• Proposed by Parikh, et al
• Intuition: build alignment (similarity) matrix
by multiplying learned vectors from each
matrix, compute context vectors from the
alignment matrix, and mix with original
signal.
• Code on Github
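A NumPy sketch of the align-and-compare idea. The paper first transforms the rows with a small feed-forward network F; that is omitted here to keep the sketch short, so this aligns the raw rows.

```python
import numpy as np
from scipy.special import softmax

def align_and_compare(A, B):
    """Decomposable-attention style alignment between sentence matrices A (Ta x d), B (Tb x d)."""
    E = A @ B.T                                   # alignment (similarity) matrix, Ta x Tb
    beta = softmax(E, axis=1) @ B                 # for each row of A, its soft-aligned summary of B
    alpha = softmax(E, axis=0).T @ A              # for each row of B, its soft-aligned summary of A
    A_cmp = np.concatenate([A, beta], axis=1)     # mix the aligned context back with the original signal
    B_cmp = np.concatenate([B, alpha], axis=1)
    return A_cmp, B_cmp

A, B = np.random.randn(12, 200), np.random.randn(15, 200)
A_cmp, B_cmp = align_and_compare(A, B)            # shapes (12, 400) and (15, 400)
```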
| 15
STEP #4: PREDICT
• Convert reduced vector to a label.
• Generally uses shallow fully connected networks such as the one shown.
• Can also be modified to have a regression head (return the probabilities
from the softmax activation).
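A minimal Keras sketch of such a prediction head; layer sizes and the dropout rate are illustrative assumptions.

```python
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Model

num_classes = 20                                      # e.g. the 20 newsgroups labels
vec = Input(shape=(200,))                             # the reduced sentence/document vector
x = Dense(50, activation="relu")(vec)                 # shallow fully connected layer
x = Dropout(0.2)(x)
probs = Dense(num_classes, activation="softmax")(x)   # probability distribution over labels
head = Model(vec, probs)
head.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```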
| 16
DOCUMENT CLASSIFICATION EXAMPLE – ITERATION #1
• 20 newsgroups dataset
• 40k training records
• 10k test records
• 20 classes
• Embed, Predict
• Bag of Words idea
• Sentence = bag of words
• Document = bag of sentences
• Code on Github
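A minimal Keras sketch of the Embed + Predict idea. For brevity it averages word vectors over a single flat word-ID sequence rather than the sentence/document bag-of-bags layout described above; all sizes are illustrative assumptions.

```python
from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D, Dense
from tensorflow.keras.models import Model

max_len, vocab_size, embed_dim, num_classes = 500, 50000, 300, 20   # illustrative sizes

inp = Input(shape=(max_len,), dtype="int32")          # document as one flat sequence of word IDs
emb = Embedding(vocab_size, embed_dim)(inp)           # Embed
doc = GlobalAveragePooling1D()(emb)                   # bag of words: average the word vectors
out = Dense(num_classes, activation="softmax")(doc)   # Predict
model = Model(inp, out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```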
| 17
DOCUMENT CLASSIFICATION EXAMPLE – ITERATION #2
• Embed, Encode, Predict
• Hierarchical Encoding
• Sentence Encoder: converts
sequence of word vectors to
sentence vector.
• Document Encoder: converts
sequence of sentence vectors
to document vector.
• The sentence encoder network is
embedded inside the document
network.
• Code on Github
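A minimal Keras sketch of the hierarchical layout, with the sentence encoder wrapped in TimeDistributed so it is applied to every sentence of the document; all sizes are illustrative assumptions.

```python
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, TimeDistributed, Dense
from tensorflow.keras.models import Model

max_sents, max_words, vocab_size, embed_dim, lstm_dim, num_classes = 40, 60, 50000, 300, 100, 20

# Sentence encoder: sequence of word IDs -> sentence vector.
sent_in = Input(shape=(max_words,), dtype="int32")
x = Embedding(vocab_size, embed_dim)(sent_in)
sent_vec = Bidirectional(LSTM(lstm_dim))(x)            # (batch, 2 * lstm_dim)
sent_encoder = Model(sent_in, sent_vec)

# Document encoder: the sentence encoder is applied to every sentence via TimeDistributed.
doc_in = Input(shape=(max_sents, max_words), dtype="int32")
sent_vecs = TimeDistributed(sent_encoder)(doc_in)      # (batch, max_sents, 2 * lstm_dim)
doc_vec = Bidirectional(LSTM(lstm_dim))(sent_vecs)     # (batch, 2 * lstm_dim)
out = Dense(num_classes, activation="softmax")(doc_vec)
doc_model = Model(doc_in, out)
```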
| 18
DOCUMENT CLASSIFICATION EXAMPLE – ITERATION #3 (a, b, c)
• Embed, Encode, Attend,
Predict
• Encode step returns a matrix – one
vector for each timestep.
• Attend reduces the matrix to a
vector.
• 3 types of attention (all
except matrix + matrix) applied
to different versions of the model.
• Code on Github – (a), (b), (c)
| 19
DOCUMENT CLASSIFICATION EXAMPLE – RESULTS
| 20
DOCUMENT SIMILARITY EXAMPLE
• Data derived from 20 newsgroups
• Hierarchical model (word to sentence,
sentence to document)
• Tried without attention, with attention for sentence
encoding, and with attention for both sentence
encoding and the document comparison (sketch below)
• Code on Github – (a), (b), (c)
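A minimal Keras sketch of the comparison setup without attention: a shared hierarchical encoder produces one vector per document and a small fully connected head predicts similar/not similar. All names and sizes are illustrative assumptions.

```python
from tensorflow.keras.layers import (Input, Embedding, Bidirectional, LSTM,
                                     TimeDistributed, Concatenate, Dense)
from tensorflow.keras.models import Model

max_sents, max_words, vocab_size, embed_dim, lstm_dim = 40, 60, 50000, 300, 100

# Shared hierarchical document encoder (a compressed version of the one sketched earlier).
sent_in = Input(shape=(max_words,), dtype="int32")
sent_vec = Bidirectional(LSTM(lstm_dim))(Embedding(vocab_size, embed_dim)(sent_in))
sent_encoder = Model(sent_in, sent_vec)

doc_in = Input(shape=(max_sents, max_words), dtype="int32")
doc_vec = Bidirectional(LSTM(lstm_dim))(TimeDistributed(sent_encoder)(doc_in))
doc_encoder = Model(doc_in, doc_vec)

# Siamese-style comparison: the same encoder (shared weights) is applied to both documents.
doc_left = Input(shape=(max_sents, max_words), dtype="int32")
doc_right = Input(shape=(max_sents, max_words), dtype="int32")
merged = Concatenate()([doc_encoder(doc_left), doc_encoder(doc_right)])
out = Dense(2, activation="softmax")(merged)           # similar / not similar
sim_model = Model([doc_left, doc_right], out)
```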
| 21
SENTENCE SIMILARITY EXAMPLE
• 2012 Semantic Similarity Task dataset.
• Hierarchical model (word to sentence,
sentence to document).
• Used matrix + matrix attention for the
comparison.
• Code on Github – without attention, with
attention
| 22
SUMMARY
• 4-step recipe is a principled approach to NLP with Deep Learning
• Embed step leverages availability of many pre-trained embeddings.
• Encode step generally uses a bidirectional LSTM to create position-sensitive
features; a CNN can also be used here.
• Attention comes in 3 main types – matrix to vector (with or without an implicit
learned context), matrix + vector to vector, and matrix + matrix to vector. Each
computes a summary of the input with respect to the context, if one is provided.
• Predict step converts vector to probability distribution via softmax, usually with a
Fully Connected (Dense) network.
• Interesting pipelines can be composed using complete or partial subsequences of
the 4-step recipe.
| 23
REFERENCES
• Honnibal, M. (2016, November 10). Embed, encode, attend, predict: The new
deep learning formula for state-of-the-art NLP models.
• Liao, R. (2016, December 26). Text Classification, Part 3 – Hierarchical attention
network.
• Blier, L. (2016, January 20). Attention Mechanism [blog post].
• Raffel, C., & Ellis, D. P. (2015). Feed-forward networks with attention can solve
some long-term memory problems. arXiv preprint arXiv:1512.08756.
• Yang, Z., et al. (2016). Hierarchical attention networks for document classification.
In Proceedings of NAACL-HLT (pp. 1480-1489).
• Cho, K., et al. (2015). Describing multimedia content using attention-based
encoder-decoder networks. IEEE Transactions on Multimedia, 17(11), 1875-1886.
• Parikh, A. P., et al. (2016). A decomposable attention model for natural language
inference. arXiv preprint arXiv:1606.01933.
| 24
THANK YOU
• Code: https://github.com/sujitpal/eeap-examples
• Slides: https://www.slideshare.net/sujitpal/presentation-slides-77511261
• Email: sujit.pal@elsevier.com
• Twitter: @palsujit
50% off on EBook
Discount Code EBDEEP50
Valid till Oct 31 2017