Language Modelling
(Part II)
Dr. Deptii Chaudhari
Assistant Professor, Department of Computer Engineering
Hope Foundation’s
International Institute of Information Technology, Hinjawadi, Pune
deptiic@isquareit.edu.in, www.isquareit.edu.in
LinkedIn: https://www.linkedin.com/in/deptii-chaudhari
GitHub: https://github.com/DeptiiC
Google Scholar: https://scholar.google.co.in/citations?hl=en&user=H_tb1lEAAAAJ
Topics
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
▪ Vector Semantics
▪ Word2Vec Models
▪ BERT
▪ Graph-based Language Models
Word Embeddings / Vector Semantics
▪ Vectors for representing words are called embeddings.
▪ The idea of vector semantics is to represent a word as a point in a
multidimensional semantic space that is derived from the
distributions of word neighbors.
▪ Vector semantics is the standard way to represent word meaning in
NLP, helping us model many of the aspects of word meaning.
▪ For example, suppose you didn’t know the meaning of the word
ongchoi but you see it in the following contexts:
▪ Ongchoi is delicious sauteed with garlic.
▪ Ongchoi is superb over rice.
▪ ...ongchoi leaves with salty sauces...
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word Embeddings / Vector Semantics
▪ And suppose that you had seen many of these context words in other
contexts:
▪ ...spinach sauteed with garlic over rice...
▪ ...chard stems and leaves are delicious...
▪ ...collard greens and other salty leafy greens
▪ The fact that ongchoi occurs with words like rice and garlic and
delicious and salty, as do words like spinach, chard, and collard greens
might suggest that ongchoi is a leafy green similar to these other leafy
greens.
▪ We can do the same thing computationally by just counting words in
the context of ongchoi.
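As a minimal illustration of this counting idea, here is a hypothetical Python sketch; the tiny corpus and window size are assumptions, not part of the original slides:

```python
from collections import Counter

# Toy corpus: sentences in which the unknown word "ongchoi" appears.
corpus = [
    "ongchoi is delicious sauteed with garlic",
    "ongchoi is superb over rice",
    "ongchoi leaves with salty sauces",
]

window = 3  # count words within 3 positions of the target word
target = "ongchoi"
context_counts = Counter()

for sentence in corpus:
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        neighbours = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        context_counts.update(neighbours)

print(context_counts.most_common(5))
```

Comparing these context counts with those of words such as spinach or chard is the intuition behind distributional (vector) semantics.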
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Reasoning with Word Vectors
▪ It has been found that the learned word representations in fact
capture meaningful syntactic and semantic regularities in a very simple
way.
▪ Specifically, the regularities are observed as constant vector offsets
between pairs of words sharing a particular relationship.
▪ For Example: Singular and Plural Relations
▪ If we denote the vector for word i as x_i and focus on the
singular/plural relation, we observe that
▪ x_apple – x_apples ≈ x_car – x_cars ≈ x_family – x_families
▪ and so on…
▪ Word vectors are also good at answering analogy questions such as
▪ a is to b as c is to ?
▪ man is to woman as uncle is to ? (aunt)
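A hedged sketch of how such analogy queries are typically run with pretrained vectors; it assumes gensim and its downloadable "word2vec-google-news-300" model (a large download on first use), which are not part of the original slides:

```python
import gensim.downloader as api

# Load pretrained Word2Vec vectors (downloaded on first use).
vectors = api.load("word2vec-google-news-300")

# "man is to woman as uncle is to ?"  ->  'aunt' is expected to rank first
result = vectors.most_similar(positive=["woman", "uncle"], negative=["man"], topn=1)
print(result)
```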
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Reasoning with Word Vectors
▪ Vector offset for the Gender relation
▪ Vector offset for the Singular–Plural relation
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Analogy Testing using Word Vectors
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word Embeddings / Vector Semantics
▪ Word Vectors
▪ One Hot Encoding
▪ Bag of Words
▪ TF-IDF
▪ Word2Vec
▪ Doc2Vec
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – Learning word vectors
▪ BoW and TF-IDF consist of a set of words (vocabulary) and a metric like
frequency or term frequency-inverse document frequency (TF-IDF) to describe
each word’s value in the corpus.
▪ That means BoW and TF-IDF can result in sparse matrices and high
dimensional vectors that consume a lot of computer resources if the vocabulary
is very large.
▪ Developed by a team of researchers at Google, word2vec attempts to solve the
issues with the BoW approach:
▪ High-dimension vectors
▪ Words assumed completely independent of each other
Basic Idea
Instead of capturing co-occurrence counts directly, predict the
surrounding words of every word.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Why use Word2Vec ?
▪ The Word2Vec model preserves the relationships between words.
▪ It can easily deal with the introduction of new words into the
vocabulary.
▪ It has shown better results in many deep learning models.
▪ It uses a context-based probabilistic approach that is simple and
scalable.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – Two Variations of Model
▪ Using a neural network with only a couple layers, word2vec tries to learn
relationships between words and embeds them in a lower-dimensional vector
space.
▪ To do this, word2vec trains words against other words that neighbor them in
the input corpus, capturing some of the meaning in the sequence of words.
▪ The researchers devised two novel approaches:
▪ Continuous bag of words (CBoW)
▪ Skip-gram
▪ The CBoW architecture predicts the current word based on the context while
the skip-gram predicts surrounding words given the current word.
▪ CBoW: Given a set of (neighbouring) words, guess the single word that
potentially occurs along with this set of words.
▪ Skip-gram: Guess potential neighbouring words based on the single word
being analyzed.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – Two Variations of Model
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – CBOW
▪ Let us consider a piece of text as follows.
▪ “The recently introduced continuous skip gram model is an efficient method for
learning high quality distributed vector representations that capture a large
number of syntactic and semantic word relationships.”
▪ Imagine a sliding window over the text that includes the central word
currently in focus, together with the four words that precede it and the four
words that follow it:
… an efficient method for learning high quality distributed vector …
(focus word: “learning”; context: “an efficient method for” and “high quality distributed vector”)
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – CBOW
▪ The context words form the input layer. Each word is encoded in one-hot form.
There is a single hidden layer and a single output layer.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – CBOW Training Objective
▪ The training objective is to maximize the conditional probability of observing
the actual output word (the focus word) given the input context words, with
respect to the weights.
▪ In our example, given the input (“an”, “efficient”, “method”, “for”, “high”,
“quality”, “distributed”, “vector”), we want to maximize the probability of
getting “learning” as output.
▪ CBOW: Input to Hidden Layer
▪ Since our input vectors are one-hot, multiplying an input vector by the weight
matrix W1 amounts to simply selecting a row from W1
▪ Given C input word vectors, the activation
function for the hidden layer h amounts to
simply summing the corresponding ‘hot’
rows in W1, and dividing by C to take their
average.
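In standard notation, with x_1, …, x_C the one-hot context vectors and W_1 the input-to-hidden weight matrix, this average is:

$$\mathbf{h} = \frac{1}{C}\, W_1^{\top}\left(\mathbf{x}_1 + \mathbf{x}_2 + \cdots + \mathbf{x}_C\right)$$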
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – CBOW Training Objective
▪ CBOW: Hidden Layer to Output Layer
▪ From hidden layer to the output layer, the second weight matrix W2
can be used to compute a score for each word in the vocabulary,
and softmax can be used to obtain the posterior distribution of
words.
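Written out, with v'_{w_j} denoting the column of W_2 associated with word w_j and V the vocabulary size, the score and posterior take the standard form:

$$u_j = \mathbf{v}'^{\top}_{w_j}\,\mathbf{h}, \qquad p(w_j \mid \text{context}) = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})}$$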
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – CBOW Working Example
▪ Sentence: “Hope can set you free.” (vocabulary size 5; hidden layer with 3 nodes)
▪ Inputs: 5×1 one-hot vectors for the context words “Hope” and “set”.
▪ Each one-hot input is multiplied by the shared 3×5 input-to-hidden weight matrix; the 5×3
hidden-to-output matrix then produces scores over the vocabulary, and an output layer with
a softmax function yields a predicted 5×1 vector for the focus word “can”.
▪ The prediction is compared with the actual one-hot vector of “can”, and the weights are updated.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – Skip Gram Model
▪ Inverse of CBOW: predict the context from a given word.
▪ The input is a single focus word, and the targets are the context words.
▪ The objective is to minimize the summed prediction error across all
context words in the output layer.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – Skip Gram Model Training
▪ The activation function for the hidden layer simply amounts to
copying the corresponding row from the weights matrix W1
(linear).
▪ At the output layer, we output C multinomial distributions instead
of just one.
▪ The training objective is to minimize the summed prediction error
across all context words in the output layer (equivalently, to maximize
the probability of the actual context words).
▪ In our example, the input would be “learning” and we hope to get
(“an”, “efficient”, “method”, “for”, “high”, “quality”, “distributed”,
“vector”) at the output layer.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – Skip Gram Model Training
▪ Predict the surrounding words in a window of length c of each
word.
▪ Objective Function: Maximize the log probability of any context
word given the current center word.
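In the notation of the original skip-gram paper, this objective, averaged over a corpus of T words with window size c, is:

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p\!\left(w_{t+j} \mid w_t\right)$$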
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – Skip-gram Working Example
▪ Sentence: “Hope can set you free.” (vocabulary size 5; hidden layer with 3 nodes)
▪ Input: the 5×1 one-hot vector of the focus word “can”.
▪ The 3×5 input-to-hidden weight matrix maps it to the hidden layer; the 5×3 hidden-to-output
matrix then produces predicted 5×1 vectors for the context words “Hope” and “set”.
▪ The predictions are compared with the actual target one-hot vectors, and the weights are updated.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
When to use CBOW and Skip-Gram Models?
▪ CBOW
▪ With a small corpus, CBOW is faster; it is much faster to train than skip-gram.
▪ CBOW is low on memory; it does not have huge RAM requirements.
▪ Slightly better accuracy for frequent words.
▪ Skip-gram
▪ It is slower to train, especially with a larger corpus and higher dimensions.
▪ The skip-gram model works well with a small amount of training data.
▪ It also performs well when words and phrases are rare.
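A minimal gensim sketch contrasting the two variants; the toy sentences and parameter values are assumptions, and parameter names follow gensim 4.x:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus (a real corpus would be much larger).
sentences = [
    ["hope", "can", "set", "you", "free"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# sg=0 selects CBOW; sg=1 selects skip-gram.
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv["hope"][:5])                    # first dimensions of the CBOW vector
print(skipgram_model.wv.most_similar("hope", topn=3))  # nearest neighbours under skip-gram
```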
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Doc2Vec Model
▪ Doc2Vec is a model that represents each document as a vector of
numbers, similar to how Word2Vec represents each word as a
vector.
▪ Doc2Vec can capture the meaning and context of a document and
can be used for tasks such as document similarity, clustering, or
classification.
▪ Doc2Vec works by training a neural network on a large corpus of
documents, where each document has a unique identifier or tag.
▪ The network learns to predict the words in a document given its
tag, and vice versa.
▪ The tag vector becomes the document vector, and the word vectors
are shared with Word2Vec.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Doc2Vec Model
Prof. Deptii Chaudhari (I2IT, Pune)
▪ There are two main variants of the Doc2Vec approach:
▪ Distributed Memory Model of Paragraph Vectors (PV-DM)
▪ Distributed Bag of Words (DBOW)
▪ The Distributed-Memory Model closely resembles the CBOW
model of Word2vec.
▪ This model tries to predict a target word given its surrounding
context words with the addition of a paragraph ID.
▪ The Distributed Bag-of-Words model is based on the Word2vec skip-
gram model, with one exception: instead of using the target word as
the input, it takes the document ID as the input and tries to predict
randomly sampled words from the document.
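A hedged gensim sketch of both variants; the toy documents and parameter values are assumptions, and in gensim dm=1 corresponds to PV-DM while dm=0 corresponds to DBOW:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    "hope can set you free",
    "word vectors capture meaning from context",
]
# Each document gets a unique tag, as described above.
tagged = [TaggedDocument(words=doc.split(), tags=[str(i)]) for i, doc in enumerate(raw_docs)]

pv_dm = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40, dm=1)  # PV-DM
dbow = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40, dm=0)   # DBOW

print(pv_dm.dv["0"][:5])                                  # learned vector for document 0
print(pv_dm.infer_vector("hope sets you free".split()))   # vector for an unseen document
```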
Doc2Vec Model
▪ Distributed Memory (DM): Randomly sample consecutive words from a
paragraph and predict a center word from the randomly sampled set of words,
taking the context words and the paragraph ID as input.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Doc2Vec Model
▪ The DBOW model “ignores the context words in the input, but force the model
to predict words randomly sampled from the paragraph in the output.”
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT
▪ BERT stands for Bidirectional Encoder Representations from
Transformers.
▪ Google developed BERT to serve as a bidirectional transformer
model that examines words within text by considering both left-to-
right and right-to-left contexts.
▪ BERT makes use of Transformer, an attention mechanism that
learns contextual relations between words (or sub-words) in a text.
▪ As opposed to directional models, which read the text input
sequentially (left-to-right or right-to-left), the Transformer encoder
reads the entire sequence of words at once.
▪ This characteristic allows the model to learn the context of a word
based on all of its surroundings (left and right of the word).
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Features of BERT
▪ Transformer Architecture: BERT is based on the Transformer architecture,
which is a type of neural network that uses self-attention mechanisms.
▪ It’s designed to generate a language model, so only the encoder mechanism is
used.
▪ Bidirectional Approach: Traditional language models process text
sequentially, either from left to right or right to left.
▪ This method limits the model’s awareness to the immediate context preceding
the target word.
▪ BERT uses a bidirectional approach, considering both the left and right context
of words in a sentence.
▪ Pre-training and Fine-tuning: The BERT model undergoes a two-step process:
Pre-training on large amounts of unlabeled text to learn contextual
embeddings, and fine-tuning on labeled data for specific NLP tasks.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Features of BERT
▪ Contextual Embeddings: During the pre-training phase, BERT
learns contextual embeddings, which are the representations of
words that take into account their surrounding context in a
sentence.
▪ Fine-Tuning: After the pre-training phase, the BERT model is then
fine-tuned for specific natural language processing (NLP) tasks.
▪ This step tailors the model to more targeted applications by
adapting its general language understanding to the nuances of the
particular task.
▪ MathBERT: There’s also a variant of BERT called MathBERT, which
is jointly trained with mathematical formulas and their
corresponding contexts.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT – Pretraining and Finetuning
[1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arxiv.org)
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT – Pretraining and Finetuning
▪ Pre-Training on Large Data
▪ BERT is pre-trained on a large amount of unlabeled text data.
▪ The model learns contextual embeddings, which are the
representations of words that take into account their surrounding
context in a sentence.
▪ BERT engages in various unsupervised pre-training tasks.
▪ For instance, it might learn to predict missing words in a sentence
(Masked Language Model or MLM task), understand the
relationship between two sentences, or predict the next sentence in
a pair.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT – Pretraining and Finetuning
▪ Fine-Tuning on Labeled Data
▪ After the pre-training phase, the BERT model, armed with its
contextual embeddings, is then fine-tuned for specific natural language
processing (NLP) tasks.
▪ This step tailors the model to more targeted applications by adapting
its general language understanding to the nuances of the particular
task.
▪ BERT is fine-tuned using labeled data specific to the downstream tasks
of interest.
▪ These tasks could include sentiment analysis, question-answering,
named entity recognition, or any other NLP application.
▪ The model’s parameters are adjusted to optimize its performance for
the particular requirements of the task at hand.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Working of BERT
▪ BERT is designed to generate a language model, so only the encoder
mechanism is used.
▪ A sequence of tokens is fed to the Transformer encoder.
▪ These tokens are first embedded into vectors and then processed in
the neural network.
▪ The output is a sequence of vectors, each corresponding to an input
token, providing contextualized representations.
▪ Traditional models predict the next word in a sequence, which is a
directional approach and may limit context learning. BERT addresses
this challenge with two innovative training strategies:
▪ Masked Language Model (MLM)
▪ Next Sentence Prediction (NSP)
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Masked Language Model (MLM)
▪ In BERT’s pre-training process, a portion of words in each input sequence
is masked and the model is trained to predict the original values of these
masked words based on the context provided by the surrounding words.
▪ Masking words: Before BERT learns from sentences, it hides some words
(about 15%) and replaces them with a special symbol, like [MASK].
▪ Guessing Hidden Words: BERT’s job is to figure out what these hidden
words are by looking at the words around them. It’s like a game of
guessing where some words are missing, and BERT tries to fill in the
blanks.
▪ BERT adds a special layer on top of its learning system to make these
guesses. It then checks how close its guesses are to the actual hidden
words.
▪ It does this by converting its guesses into probabilities.
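As an illustration only, masked-word prediction can be tried with the Hugging Face transformers library; this sketch assumes transformers is installed and downloads the pretrained bert-base-uncased weights on first use:

```python
from transformers import pipeline

# A fill-mask pipeline wraps BERT's masked language modelling head.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

predictions = unmasker("The cat sat on the [MASK].")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))  # top guesses with their probabilities
```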
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Next Sentence Prediction (NSP)
▪ BERT predicts if the second sentence is connected to the first.
▪ This is done by transforming the output of the [CLS] token into a 2×1
shaped vector using a classification layer, and then calculating the
probability of whether the second sentence follows the first using
SoftMax.
▪ In the training process, BERT learns to understand the relationship
between pairs of sentences, predicting if the second sentence follows the
first in the original document.
▪ 50% of the input pairs have the second sentence as the subsequent
sentence in the original document, and the other 50% have a randomly
chosen sentence.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Next Sentence Prediction (NSP)
▪ To help the model distinguish between connected and disconnected
sentence pairs, the input is processed before entering the model:
▪ A [CLS] token is inserted at the beginning of the first sentence, and a
[SEP] token is added at the end of each sentence.
▪ A sentence embedding indicating Sentence A or Sentence B is added to
each token.
▪ A positional embedding indicates the position of each token in the
sequence.
▪ BERT predicts if the second sentence is connected to the first. This is
done by transforming the output of the [CLS] token into a 2×1 shaped
vector using a classification layer, and then calculating the probability of
whether the second sentence follows the first using SoftMax.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Training
▪ The model is trained on both previously mentioned tasks
simultaneously. This is made possible by clever usage of inputs and
outputs.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Training
▪ BERT relies on a Transformer (the attention mechanism that learns contextual
relationships between words in a text).
▪ A basic Transformer consists of an encoder to read the text input and a decoder to
produce a prediction for the task.
▪ Since BERT’s goal is to generate a language representation model, it only needs the
encoder part. The input to the encoder for BERT is a sequence of tokens, which are
first converted into vectors and then processed in the neural network.
▪ But before processing can start, BERT needs the input to be decorated with some extra
metadata:
▪ Token embeddings: A [CLS] token is added to the input word tokens at the
beginning of the first sentence and a [SEP] token is inserted at the end of each
sentence.
▪ Segment embeddings: A marker indicating Sentence A or Sentence B is added to
each token. This allows the encoder to distinguish between sentences.
▪ Positional embeddings: A positional embedding is added to each token to indicate
its position in the sentence.
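The [CLS] and [SEP] tokens and the segment markers can be inspected with the Hugging Face tokenizer; a small sketch, assuming transformers is installed and using two made-up example sentences (positional embeddings are added inside the model, not by the tokenizer):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The man went to the store.", "He bought a gallon of milk.")

# Token embeddings: [CLS] opens the input, [SEP] closes each sentence.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))

# Segment embeddings: 0 marks Sentence A tokens, 1 marks Sentence B tokens.
print(encoded["token_type_ids"])
```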
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Training
▪ To predict if the second sentence is connected to the first one or not, basically
the complete input sequence goes through the Transformer based model, the
output of the [CLS] token is transformed into a 2×1 shaped vector using a
simple classification layer, and the IsNext-Label is assigned using softmax.
▪ The model is trained with both Masked LM and Next Sentence Prediction
together. This is to minimize the combined loss function of the two strategies —
“together is better”.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Architectures
▪ There are two types of pre-trained versions of BERT depending on
the scale of the model architecture:
▪ BERT-BASE has 12 layers in the Encoder stack while BERT-LARGE
has 24 layers in the Encoder stack.
▪ BERT architectures (BASE and LARGE) also have larger hidden sizes
(768 and 1024 hidden units respectively) and more attention heads
(12 and 16 respectively) than the Transformer architecture suggested
in the original paper, which uses 512 hidden units and 8 attention heads.
▪ BERT-BASE contains 110M parameters while BERT-LARGE has
340M parameters.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Architectures
▪ There are two types of pre-trained versions of BERT depending on the scale of
the model architecture:
Fun fact: BERT-Base was trained on 4 cloud TPUs for 4 days and BERT-Large was
trained on 16 TPUs for 4 days!
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Graph based language models
▪ A graph is a mathematical structure consisting of:
▪ Nodes (Vertices): Represent entities (e.g., words, sentences, or
documents).
▪ Edges: Represent relationships between entities (e.g., co-occurrence,
syntactic dependency, semantic similarity).
▪ In language modeling, nodes could be words, and edges could represent how
often two words occur together or their grammatical relationship.
▪ Graph-based models can be used for language modeling by representing
linguistic data, such as words, phrases, or sentences, as nodes and their
relationships as edges within a graph.
▪ This approach leverages the structural and relational properties of natural
language, enabling more sophisticated and context-aware representations.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Graph based language models
▪ Why Use Graphs in Language Modeling?
▪ Language is complex and often non-linear. Graphs can help to
▪ Capture Relationships: Words or phrases aren't just sequential—they have
rich relationships (e.g., syntactic or semantic).
▪ Integrate Knowledge: Graphs like knowledge graphs (e.g., WordNet,
ConceptNet) can provide background knowledge about word meanings and
relationships.
▪ Handle Global Context: Graphs can encode both local (sentence-level) and
global (corpus-level) context.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Graph Representations in Language Modeling
▪ Word Co-occurrence Graphs:
▪ Nodes: Words.
▪ Edges: Represent co-occurrence
within a window size in text.
▪ Application: Keyword extraction,
sentence similarity.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
▪ Syntactic Dependency Graphs:
▪ Nodes: Words.
▪ Edges: Represent grammatical
relationships (e.g., subject-object).
▪ Application: Parsing, sentiment
analysis.
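To make the word co-occurrence graph above concrete, here is a minimal sketch with networkx; the example sentence and window size are assumptions:

```python
import networkx as nx

tokens = "the cat sat on the mat near the cat".split()
window = 2  # link words that appear within 2 positions of each other

G = nx.Graph()
for i, u in enumerate(tokens):
    for v in tokens[i + 1: i + 1 + window]:
        if u == v:
            continue
        if G.has_edge(u, v):
            G[u][v]["weight"] += 1   # strengthen an existing co-occurrence edge
        else:
            G.add_edge(u, v, weight=1)

print(sorted(G.edges(data="weight")))  # weighted co-occurrence edges
```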
Graph Representations in Language Modeling
▪ Semantic Graphs:
▪ Nodes: Concepts or words.
▪ Edges: Represent semantic relationships
(e.g., synonyms, hypernyms).
▪ Application: Word-sense
disambiguation, machine translation.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
▪ Knowledge Graphs:
▪ Nodes: Entities (e.g., "Einstein").
▪ Edges: Relationships (e.g., "was a
scientist").
▪ Application: Question answering,
fact-checking.
Graph Neural Networks (GNNs): Modern Graph-based Language Models
▪ Graph Neural Networks (GNNs) are a class of neural networks designed to
operate on graph-structured data, capturing the relationships and
interactions between entities.
▪ At their core, GNNs use a message-passing framework:
▪ Node Representation (Features): Each node starts with an initial feature
vector (e.g., a word embedding in NLP or an atom's properties in a
molecule).
▪ Neighborhood Aggregation: Each node aggregates information from its
neighbors, capturing the local graph structure.
▪ Updating Node Representations: Using the aggregated information, the
node updates its representation to reflect its context in the graph.
▪ This process is repeated for multiple iterations (or layers), allowing nodes to
gather information from progressively larger neighborhoods.
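A minimal numpy sketch of one round of this message passing, in the style of a graph convolutional layer; the tiny graph, feature matrix, and weights are made-up values:

```python
import numpy as np

# Adjacency matrix of a 4-node graph and 2-dimensional node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = np.random.rand(4, 2)          # initial node representations
W = np.random.rand(2, 2)          # weight matrix (learned in practice, random here)

A_hat = A + np.eye(4)             # add self-loops so a node keeps its own features
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalisation

H_next = np.maximum(0, A_norm @ H @ W)     # aggregate neighbours, transform, apply ReLU
print(H_next)
```

Stacking several such layers lets each node gather information from progressively larger neighbourhoods, as described above.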
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Types of GNNs
▪ Graph Convolutional Networks (GCNs)
▪ Graph Attention Networks (GATs)
▪ GraphSAGE
▪ Message Passing Neural Networks (MPNNs)
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Applications of GNNs
▪ Text Summarization: Represent sentences or documents as nodes, and use
GNNs to identify important content.
▪ Knowledge Graph Completion: Predict missing edges or relationships in a
knowledge graph.
▪ Semantic Parsing: Use dependency graphs of sentences for syntactic analysis.
▪ Social Networks:
▪ Node classification (e.g., user profiling).
▪ Link prediction (e.g., recommending friends).
Log Linear Models
▪ A log-linear model is a type of statistical model used to predict probabilities.
▪ It combines features (important information from the data) and weights
(importance of each feature) to make predictions.
▪ Features are pieces of information used to make predictions. In a language
model, features might include the previous words or context.
▪ Example: For predicting the next word in "The cat sat on the ___", features
could be "The cat" or "sat on".
▪ Each feature is given a weight that tells the model how important it is. Some
features might be more important than others.
▪ Example: The feature "sat on" might be more useful for predicting the next
word than "The cat".
▪ A score is calculated by multiplying each feature's value by its corresponding
weight and summing the results.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Log Linear Models
▪ The scores are then turned into probabilities using a function called softmax.
▪ The softmax function converts scores into values between 0 and 1, where
higher scores lead to higher probabilities.
▪ Example: The score for "mat" might be 4.3, while "hat" has a score of 3.9.
After applying softmax, the model will predict "mat" as the next word
because it has the higher probability.
▪ The formula for softmax is:
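In its standard form, where s_i is the score of candidate word w_i and V is the vocabulary size:

$$P(w_i) = \frac{\exp(s_i)}{\sum_{j=1}^{V} \exp(s_j)}$$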
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Log Linear Model - Example
▪ Let us predict the next word in the sentence: "The cat sat on the ___."
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
▪ Step 1: Selecting Features
▪ Features could be:
▪ Word History: Previous words in the sentence ("The cat sat on the").
▪ Syntactic Patterns: Certain words follow others (e.g., "on the" is commonly
followed by a noun).
▪ Context Word Frequencies: How often certain words appear in similar
contexts.
▪ Step 2: Assigning Weights
▪ Each feature is assigned a weight to indicate its importance. Suppose the weights are:
▪ “Previous word is ‘the’” → weight 1.5
▪ “Context suggests a noun” → weight 2.0
▪ “Word has been seen often” → weight 0.8
Log Linear Model - Example
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
▪ Step 3: Compute Scores
▪ For each possible word (e.g., "mat," "hat," and "dog"), the model computes a score
by combining feature values and weights.
• For "mat":
• Previous word is "the" → Feature value = 1, Weight = 1.5 → Contribution = 1×1.5=1.5
• Context suggests a noun → Feature value = 1, Weight = 2.0 → Contribution = 1×2.0=2.0
• Word seen often → Feature value = 1, Weight = 0.8 → Contribution = 1×0.8=0.8
• Total Score = 1.5 + 2.0 + 0.8 = 4.3
• For "hat":
• Previous word is "the" → 1×1.5=1.5
• Context suggests a noun → 1×2.0=2.0
• Word seen less often → Feature value = 0.5, Weight = 0.8 → Contribution = 0.5×0.8=0.4
• Total Score = 1.5 + 2.0 + 0.4 = 3.9
• For "dog":
• Previous word is "the" → 1×1.5=1.5
• Context suggests a noun → 1×2.0=2.0
• Word seen rarely → Feature value = 0.2, Weight = 0.8 → Contribution = 0.2×0.8=0.16
• Total Score = 1.5 + 2.0 + 0.16 = 3.66
Log Linear Model - Example
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
▪ Step 4: Convert Scores to Probabilities
▪ The scores are converted into probabilities using the softmax function shown
earlier, which ensures that all probabilities add up to 1.
For the scores:
"mat" = 4.3
"hat" = 3.9
"dog" = 3.66
▪ Step 5: Predict
▪ The model predicts "mat" as the most probable next word because it has
the highest probability (roughly 45%).