50 Shades of Text - Leveraging Natural Language Processing
Alessandro Panebianco
Agenda
• About Me

• Natural Language Processing

• Vectorization Techniques

• Word Embeddings

• Sentence Embeddings

• Demo

• Lessons Learned
About me
• Computer Engineering

• Data Science Consultancy

• E-commerce

• Energy & Utilities

email: ale.panebianco@me.com
Natural Language Processing
Language is the method of human communication, either
spoken or written, consisting of the use of words in a
structured and conventional way
Natural Language Processing
The goal of Natural Language Processing is for
computers to achieve human-like comprehension of
texts/languages
Natural Language Processing
Why?
https://youtu.be/lXUQ-DdSDoE?t=81
Natural Language Processing
Applications
• Machine translation (Google Translate)

• Natural language generation (Reddit bot)

• Sentiment analysis (Cambridge Analytica)

• Lexical semantics (Thesaurus)

• Web and application search (Amazon)

• Question answering (chatbots)

… and many others
How do we enable machines to
interpret language?
Transforming raw text into numerical features
Vector techniques covered: Bag of words, Hashing Trick, TF-IDF, Word2Vec, GloVe, FastText
Vectorization Techniques
Bag of words
How to go from words to vectors?

S1: Without music life would be a mistake
S2: Radiohead are a great music band

     without  music  life  would  be  a  mistake  Radiohead  are  great  band
S1   1        1      1     1      1   1  1        0          0    0      0
S2   0        1      0     0      0   1  0        1          1    1      1
๏ Dictionary size
๏ Sparsity
๏ Word order absence
✓ Easy to implement
✓ Fast
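
A minimal sketch of the bag-of-words step, using only the Python standard library. The function name `bag_of_words` and the choice of lowercasing and whitespace tokenization are assumptions made for illustration; the slide does not say how the text was tokenized.

```python
def bag_of_words(sentences):
    """Return the vocabulary and one count vector per sentence."""
    # Assumption: lowercase and split on whitespace; real pipelines often do more.
    tokenized = [sentence.lower().split() for sentence in sentences]

    # Build the vocabulary in order of first appearance.
    vocabulary = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)

    # One column per vocabulary word, one row per sentence.
    vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]
    return vocabulary, vectors


sentences = [
    "Without music life would be a mistake",
    "Radiohead are a great music band",
]
vocabulary, vectors = bag_of_words(sentences)
print(vocabulary)
print(vectors[0])  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(vectors[1])  # [0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1]
```

The vector length equals the vocabulary size, which is why dictionary size and sparsity are listed as drawbacks.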
Vectorization Techniques (II)
Hashing Trick
๏ Hash is one-way (the original word cannot be recovered from its index)

๏ Collisions: different inputs can produce the same output
✓ Same input -> Same output

✓ Range is always fixed (vector size)
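
The properties above can be sketched in a few lines. This is an illustration under assumptions: the helper name `hashed_vector`, the vector size of 8 and the use of SHA-256 are not from the slides. `hashlib` is used rather than the built-in `hash()` because Python salts string hashes per process, which would break the "same input -> same output" property across runs.

```python
import hashlib


def hashed_vector(text, size=8):
    """Map a text to a fixed-size count vector without storing a dictionary."""
    vector = [0] * size
    for token in text.lower().split():
        # A stable hash of the token, reduced to an index in [0, size).
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % size] += 1
    return vector


s1 = hashed_vector("Without music life would be a mistake")
s2 = hashed_vector("Radiohead are a great music band")
print(len(s1), len(s2))  # both vectors have the same fixed length
```

With only 8 buckets, several words are likely to share an index; a larger `size` makes collisions rarer at the cost of longer vectors.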
Vectorization Techniques (III)
TF-IDF
Term Frequency - Inverse Document Frequency
Weight rare words higher than common words

     without  music  life  would  be   a  mistake  Radiohead  are  great  band
S1   0.3      0      0.3   0.3    0.3  0  0.3      0          0    0      0
S2   0        0      0     0      0    0  0        0.3        0.3  0.3    0.3
๏ Dictionary size
๏ Sparsity
๏ Word order absence
✓ Easy to implement
✓ Fast
✓ Weight words
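
The 0.3 values in the table are consistent with one common weighting: raw term count multiplied by log10(N / df), where N is the number of sentences and df is the number of sentences containing the word. The slide does not state its formula, so this choice is an assumption, and the function name `tf_idf` is hypothetical. Library implementations often add smoothing and normalization and would give different numbers.

```python
import math


def tf_idf(sentences):
    """Return the vocabulary and one TF-IDF vector per sentence."""
    tokenized = [sentence.lower().split() for sentence in sentences]

    vocabulary = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)

    n_docs = len(tokenized)
    # Document frequency: in how many sentences each word appears.
    doc_freq = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocabulary}

    # Assumed weighting: term count * log10(N / df).
    vectors = [
        [tokens.count(w) * math.log10(n_docs / doc_freq[w]) for w in vocabulary]
        for tokens in tokenized
    ]
    return vocabulary, vectors


sentences = [
    "Without music life would be a mistake",
    "Radiohead are a great music band",
]
vocabulary, vectors = tf_idf(sentences)
rounded = [[round(value, 1) for value in row] for row in vectors]
print(rounded[0])  # [0.3, 0.0, 0.3, 0.3, 0.3, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0]
```

"music" and "a" appear in both sentences, so log10(2 / 2) = 0 and their weight drops to zero.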
Word Embeddings
Word2Vec
• The goal of word embeddings is to generate vectors that encode semantics

• Word2Vec does this by starting from randomly initialized vectors and training them so that words sharing a context window end up close to each other (high cosine similarity)
Context Window
Without music life would be a mistake
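
To make the context window concrete, here is a sketch of how (center word, context word) training pairs could be extracted from the example sentence. The function name `context_pairs` and the window size of 2 are assumptions; the slides do not give the window used, and this shows only the pair extraction, not the training itself.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "without music life would be a mistake".split()
pairs = context_pairs(tokens, window=2)
life_pairs = [pair for pair in pairs if pair[0] == "life"]
print(life_pairs)
# [('life', 'without'), ('life', 'music'), ('life', 'would'), ('life', 'be')]
```

A model trained on such pairs is pushed to give similar vectors to words that keep showing up in each other's windows.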
Word Embeddings
Word2Vec (II)
[Figure: word vectors for King, Man, Woman and Queen]
• Analogies

• Synonyms

• Syntactic-Semantic vectors

• Part-of-speech tagging

• Named entity recognition
King - Man + Woman = Queen
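
The analogy can be reproduced with vector arithmetic and cosine similarity. The 2-D vectors below are hand-made for illustration (first dimension roughly "royalty", second roughly "gender"); they are hypothetical, whereas real embeddings are learned and have many more dimensions, so the result there is only approximately Queen.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical toy embeddings, not taken from any trained model.
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 1.0],
}


def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


result = analogy("king", "man", "woman", vectors)
print(result)  # queen
```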
Word Embeddings
GloVe
• It differs from Word2Vec in being a count-based model rather than a predictive one

• Dimensionality reduction on the co-occurrence counts matrix

• It uses cosine similarity to compare the resulting vectors
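
A sketch of the counting step only: building co-occurrence counts within a window. The function name `cooccurrence_counts` and the window size are assumptions for illustration. The actual GloVe method goes further (it weights and fits these counts to produce dense vectors), which is not shown here.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


counts = cooccurrence_counts([
    "Without music life would be a mistake",
    "Radiohead are a great music band",
])
print(counts[("music", "life")])       # 1
print(counts[("music", "band")])       # 1
print(counts[("without", "mistake")])  # 0, the words are too far apart
```

On a real corpus this table is very large and sparse, which is what motivates the dimensionality reduction mentioned above.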
Word Embeddings
FastText
FastText : Word Embeddings = XGBoost : Random Forest
Sentence Embeddings
What if we want to represent more than a single word?

Many techniques have been used:

• Common aggregation operations (avg, sum, concatenation, etc.)

• Doc2Vec

• Neural Networks (CNN, LSTM, etc.)
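
The aggregation operations in the first bullet can be sketched as follows. The word vectors are hypothetical 2-D values chosen so the arithmetic is easy to follow; in practice they would come from a trained embedding such as Word2Vec or GloVe.

```python
def average(vectors):
    """Element-wise mean of the word vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def total(vectors):
    """Element-wise sum of the word vectors."""
    return [sum(column) for column in zip(*vectors)]


def concatenate(vectors):
    """Word vectors joined end to end; length grows with sentence length."""
    return [value for vector in vectors for value in vector]


# Hypothetical toy embeddings, not from a trained model.
word_vectors = {
    "great": [1.0, 0.0],
    "music": [0.0, 2.0],
    "band": [2.0, 1.0],
}
sentence = [word_vectors[word] for word in "great music band".split()]

print(average(sentence))      # [1.0, 1.0]
print(total(sentence))        # [3.0, 3.0]
print(concatenate(sentence))  # [1.0, 0.0, 0.0, 2.0, 2.0, 1.0]
```

Average and sum give a fixed-size vector but discard word order; concatenation keeps order but its size depends on the number of words.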
Sentence Embeddings (II)
Doc2Vec
Every paragraph is mapped to a unique vector

The paragraph token can be thought of as another word. It
acts as a memory that remembers what is missing from the
current context — or the topic of the paragraph
Sentence Embeddings (III)
CNN
• Stacking word vectors together creates a matrix (like an image)

• Filters act like word scans (e.g. catching misspellings)

• Max pooling highlights the most important words (e.g. which item a query is about)

• The LSTM layer keeps the word order
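
A toy illustration of the filter and max pooling steps in plain Python, with no deep learning library. The sentence matrix and the filter weights are made-up numbers; in a trained network the filter weights are learned, and there are many filters rather than one.

```python
def conv1d(sentence_matrix, kernel):
    """Slide a filter spanning len(kernel) consecutive words over the sentence."""
    width = len(kernel)
    outputs = []
    for start in range(len(sentence_matrix) - width + 1):
        window = sentence_matrix[start:start + width]
        # Dot product between the filter and the current window of word vectors.
        score = sum(
            w * k
            for word_row, kernel_row in zip(window, kernel)
            for w, k in zip(word_row, kernel_row)
        )
        outputs.append(score)
    return outputs


def max_pool(values):
    """Keep only the strongest filter response across the sentence."""
    return max(values)


# Hypothetical 4-word sentence with 2-dimensional word vectors (one row per word).
sentence_matrix = [[1, 0], [0, 1], [2, 1], [0, 3]]
# Hypothetical filter looking at 2 consecutive words.
kernel = [[1, 0], [0, 1]]

feature_map = conv1d(sentence_matrix, kernel)
print(feature_map)            # [2, 1, 5]
print(max_pool(feature_map))  # 5
```

The highest response comes from the last two words, which is the sense in which max pooling picks out the most relevant part of the text for that filter.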
Sentence Embeddings (IV)
LSTM
• RNNs resemble how we process language, one word after another (e.g. Google searches)

• The LSTM layer generates a new encoding of the original input that takes word order into account (return_sequences=True)

• The convolution layer filters the most important local features (e.g. which item a query is about)
Demo
Training Data:

https://www.kaggle.com/c/home-depot-product-search-relevance/data

GloVe vectors:

http://nlp.stanford.edu/data/glove.6B.zip
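
Pre-trained GloVe vectors are distributed as plain text, one word per line followed by its space-separated values. A minimal parser might look like the sketch below; the function name `parse_glove` and the two sample lines are made up for illustration and are not real GloVe values.

```python
def parse_glove(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Made-up sample in the same layout as the real files.
sample = [
    "music 0.1 -0.2 0.3",
    "band 0.4 0.5 -0.6",
]
embeddings = parse_glove(sample)
print(embeddings["music"])  # [0.1, -0.2, 0.3]
```

With the real download, the same function could be given an open file handle for one of the unzipped text files, since iterating over a file yields its lines.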
Lessons Learned
• NLP is one of the most mature research fields in the AI space

• Make your own word embeddings using an ad hoc vocabulary

• With a large corpus, try FastText

• With short texts (e.g. user queries), experiment with finer text granularity (n-grams, characters)

• Explore sentence embeddings through Neural Networks
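
As an example of finer text granularity, a word can be broken into character n-grams. The sketch below uses angle brackets as word-boundary markers, a convention also used by FastText; the function name `char_ngrams` and the choice of n = 3 are assumptions for illustration.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, with < and > marking its boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("band"))  # ['<ba', 'ban', 'and', 'nd>']
print(char_ngrams("bnad"))  # a misspelling still shares no trigram here, but longer words usually do
```

Because related or misspelled words tend to share some n-grams, this granularity can help with short, noisy texts such as user queries.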
Questions?

