Language Modelling
(Part II)
Dr. Deptii Chaudhari
Assistant Professor, Department of Computer Engineering
Hope Foundation’s
International Institute of Information Technology, Hinjawadi, Pune
deptiic@isquareit.edu.in, www.isquareit.edu.in
LinkedIn: https://www.linkedin.com/in/deptii-chaudhari
GitHub: https://github.com/DeptiiC
Google Scholar: https://scholar.google.co.in/citations?hl=en&user=H_tb1lEAAAAJ
Topics
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
▪ Vector Semantics
▪ Word2Vec Models
▪ BERT
▪ Graph-based Language Models
Word Embeddings / Vector Semantics
▪ Vectors for representing words are called embeddings.
▪ The idea of vector semantics is to represent a word as a point in a
multidimensional semantic space that is derived from the
distributions of word neighbors.
▪ Vector semantics is the standard way to represent word meaning in
NLP, helping us model many of the aspects of word meaning.
▪ For example, suppose you didn’t know the meaning of the word
ongchoi but you see it in the following contexts:
▪ Ongchoi is delicious sauteed with garlic.
▪ Ongchoi is superb over rice.
▪ ...ongchoi leaves with salty sauces...
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word Embeddings / Vector Semantics
▪ And suppose that you had seen many of these context words in other
contexts:
▪ ...spinach sauteed with garlic over rice...
▪ ...chard stems and leaves are delicious...
▪ ...collard greens and other salty leafy greens
▪ The fact that ongchoi occurs with words like rice and garlic and
delicious and salty, as do words like spinach, chard, and collard greens
might suggest that ongchoi is a leafy green similar to these other leafy
greens.
▪ We can do the same thing computationally by just counting words in
the context of ongchoi.
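As a minimal illustration of this counting idea, here is a hypothetical Python sketch; the tiny corpus and window size are assumptions, not part of the original slides:

```python
from collections import Counter

# Toy corpus: sentences in which the unknown word "ongchoi" appears.
corpus = [
    "ongchoi is delicious sauteed with garlic",
    "ongchoi is superb over rice",
    "ongchoi leaves with salty sauces",
]

window = 3  # count words within 3 positions of the target word
target = "ongchoi"
context_counts = Counter()

for sentence in corpus:
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        neighbours = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        context_counts.update(neighbours)

print(context_counts.most_common(5))
```

Comparing these context counts with those of words such as spinach or chard is the intuition behind distributional (vector) semantics.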
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Reasoning with Word Vectors
▪ It has been found that the learned word representations in fact
capture meaningful syntactic and semantic regularities in a very simple
way.
▪ Specifically, the regularities are observed as constant vector offsets
between pairs of words sharing a particular relationship.
▪ For Example: Singular and Plural Relations
▪ If we denote the vector for word i as x_i and focus on the
singular/plural relation, we observe that
▪ x_apple – x_apples ≈ x_car – x_cars ≈ x_family – x_families
▪ and so on…
▪ Word vectors are also good at answering analogy questions such as
▪ a is to b as c is to ?
▪ man is to woman as uncle is to ? (aunt)
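A hedged sketch of how such analogy queries are typically run with pretrained vectors; it assumes gensim and its downloadable "word2vec-google-news-300" model (a large download on first use), which are not part of the original slides:

```python
import gensim.downloader as api

# Load pretrained Word2Vec vectors (downloaded on first use).
vectors = api.load("word2vec-google-news-300")

# "man is to woman as uncle is to ?"  ->  'aunt' is expected to rank first
result = vectors.most_similar(positive=["woman", "uncle"], negative=["man"], topn=1)
print(result)
```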
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Reasoning with Word Vectors
▪ Vector offset for the Gender relation
▪ Vector offset for the Singular–Plural relation
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Analogy Testing using Word Vectors
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word Embeddings / Vector Semantics
▪ Word Vectors
▪ One Hot Encoding
▪ Bag of Words
▪ TF-IDF
▪ Word2Vec
▪ Doc2Vec
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – Learning word vectors
▪ BoW and TF-IDF consist of a set of words (vocabulary) and a metric like
frequency or term frequency-inverse document frequency (TF-IDF) to describe
each word’s value in the corpus.
▪ That means BoW and TF-IDF can result in sparse matrices and high
dimensional vectors that consume a lot of computer resources if the vocabulary
is very large.
▪ Developed by a team of researchers at Google, word2vec attempts to solve the
issues with the BoW approach:
▪ High-dimension vectors
▪ Words assumed completely independent of each other
Basic Idea
Instead of capturing co-occurrence counts directly, predict the
surrounding words of every word.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Why use Word2Vec ?
▪ The Word2Vec model preserves the relationships between words.
▪ It can easily deal with the introduction of new words into the
vocabulary.
▪ It has shown better results in many deep learning models.
▪ It uses a context-based probabilistic approach that is simple and
scalable.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – Two Variations of Model
▪ Using a neural network with only a couple layers, word2vec tries to learn
relationships between words and embeds them in a lower-dimensional vector
space.
▪ To do this, word2vec trains words against other words that neighbor them in
the input corpus, capturing some of the meaning in the sequence of words.
▪ The researchers devised two novel approaches:
▪ Continuous bag of words (CBoW)
▪ Skip-gram
▪ The CBoW architecture predicts the current word based on the context while
the skip-gram predicts surrounding words given the current word.
▪ CBoW: Given a set of (neighbouring) words, guess the single word that
potentially occurs along with this set of words.
▪ Skip-gram: Guess potential neighbouring words based on the single word
being analyzed.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – Two Variations of Model
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – CBOW
▪ Let us consider a piece of text as follows.
▪ “The recently introduced continuous skip gram model is an efficient method for
learning high quality distributed vector representations that capture a large
number of syntactic and semantic word relationships.”
▪ Imagine a sliding window over the text that includes the central word
currently in focus, together with the four words that precede it and the four
words that follow it:
… an efficient method for learning high quality distributed vector …
(focus word: “learning”; context: “an efficient method for” and “high quality distributed vector”)
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – CBOW
▪ The context words form the input layer. Each word is encoded in one-hot form.
There is a single hidden layer and a single output layer.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – CBOW Training Objective
▪ The training objective is to maximize the conditional probability of observing
the actual output word (the focus word) given the input context words, with
respect to the weights.
▪ In our example, given the input (“an”, “efficient”, “method”, “for”, “high”,
“quality”, “distributed”, “vector”), we want to maximize the probability of
getting “learning” as output.
▪ CBOW: Input to Hidden Layer
▪ Since our input vectors are one-hot, multiplying an input vector by the weight
matrix W1 amounts to simply selecting a row from W1
▪ Given C input word vectors, the activation
function for the hidden layer h amounts to
simply summing the corresponding ‘hot’
rows in W1, and dividing by C to take their
average.
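In standard notation, with x_1, …, x_C the one-hot context vectors and W_1 the input-to-hidden weight matrix, this average is:

$$\mathbf{h} = \frac{1}{C}\, W_1^{\top}\left(\mathbf{x}_1 + \mathbf{x}_2 + \cdots + \mathbf{x}_C\right)$$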
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – CBOW Training Objective
▪ CBOW: Hidden Layer to Output Layer
▪ From hidden layer to the output layer, the second weight matrix W2
can be used to compute a score for each word in the vocabulary,
and softmax can be used to obtain the posterior distribution of
words.
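Written out, with v'_{w_j} denoting the column of W_2 associated with word w_j and V the vocabulary size, the score and posterior take the standard form:

$$u_j = \mathbf{v}'^{\top}_{w_j}\,\mathbf{h}, \qquad p(w_j \mid \text{context}) = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})}$$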
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – CBOW Working Example
▪ Sentence: “Hope can set you free.” (vocabulary size 5; hidden layer with 3 nodes)
▪ Inputs: 5×1 one-hot vectors for the context words “Hope” and “set”.
▪ Each one-hot input is multiplied by the shared 3×5 input-to-hidden weight matrix; the 5×3
hidden-to-output matrix then produces scores over the vocabulary, and an output layer with
a softmax function yields a predicted 5×1 vector for the focus word “can”.
▪ The prediction is compared with the actual one-hot vector of “can”, and the weights are updated.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – Skip Gram Model
▪ Inverse of CBOW: predict the context from a given word.
▪ The input is a single focus word, and the targets are the context words.
▪ The objective is to minimize the summed prediction error across all
context words in the output layer.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – Skip Gram Model Training
▪ The activation function for the hidden layer simply amounts to
copying the corresponding row from the weights matrix W1
(linear).
▪ At the output layer, we output C multinomial distributions instead
of just one.
▪ The training objective is to minimize the summed prediction error
across all context words in the output layer (equivalently, to maximize
the probability of the actual context words).
▪ In our example, the input would be “learning” and we hope to get
(“an”, “efficient”, “method”, “for”, “high”, “quality”, “distributed”,
“vector”) at the output layer.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – Skip Gram Model Training
▪ Predict the surrounding words in a window of length c of each
word.
▪ Objective Function: Maximize the log probability of any context
word given the current center word.
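In the notation of the original skip-gram paper, this objective, averaged over a corpus of T words with window size c, is:

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p\!\left(w_{t+j} \mid w_t\right)$$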
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Word2Vec – Skip-gram Working Example
▪ Sentence: “Hope can set you free.” (vocabulary size 5; hidden layer with 3 nodes)
▪ Input: the 5×1 one-hot vector of the focus word “can”.
▪ The 3×5 input-to-hidden weight matrix maps it to the hidden layer; the 5×3 hidden-to-output
matrix then produces predicted 5×1 vectors for the context words “Hope” and “set”.
▪ The predictions are compared with the actual target one-hot vectors, and the weights are updated.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
When to use CBOW and Skip-Gram Models?
▪ CBOW
▪ With a small corpus, CBOW is faster; it is much faster to train than skip-gram.
▪ CBOW is low on memory; it does not have huge RAM requirements.
▪ Slightly better accuracy for frequent words.
▪ Skip-gram
▪ It is slower to train, especially with a larger corpus and higher dimensions.
▪ The skip-gram model works well with a small amount of training data.
▪ It also performs well when words and phrases are rare.
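A minimal gensim sketch contrasting the two variants; the toy sentences and parameter values are assumptions, and parameter names follow gensim 4.x:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus (a real corpus would be much larger).
sentences = [
    ["hope", "can", "set", "you", "free"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# sg=0 selects CBOW; sg=1 selects skip-gram.
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv["hope"][:5])                    # first dimensions of the CBOW vector
print(skipgram_model.wv.most_similar("hope", topn=3))  # nearest neighbours under skip-gram
```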
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Doc2Vec Model
▪ Doc2Vec is a model that represents each document as a vector of
numbers, similar to how Word2Vec represents each word as a
vector.
▪ Doc2Vec can capture the meaning and context of a document and
can be used for tasks such as document similarity, clustering, or
classification.
▪ Doc2Vec works by training a neural network on a large corpus of
documents, where each document has a unique identifier or tag.
▪ The network learns to predict the words in a document given its
tag, and vice versa.
▪ The tag vector becomes the document vector, and the word vectors
are shared with Word2Vec.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Doc2Vec Model
Prof. Deptii Chaudhari (I2IT, Pune)
▪ There are two main variants of the Doc2Vec approach:
▪ Distributed Memory Model of Paragraph Vectors (PV-DM)
▪ Distributed Bag of Words (DBOW)
▪ The Distributed-Memory Model closely resembles the CBOW
model of Word2vec.
▪ This model tries to predict a target word given its surrounding
context words with the addition of a paragraph ID.
▪ The Distributed Bag-of-Words model is based on the Word2vec skip-
gram model, with one exception: instead of using the target word as
the input, it takes the document ID as the input and tries to predict
randomly sampled words from the document.
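A hedged gensim sketch of both variants; the toy documents and parameter values are assumptions, and in gensim dm=1 corresponds to PV-DM while dm=0 corresponds to DBOW:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    "hope can set you free",
    "word vectors capture meaning from context",
]
# Each document gets a unique tag, as described above.
tagged = [TaggedDocument(words=doc.split(), tags=[str(i)]) for i, doc in enumerate(raw_docs)]

pv_dm = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40, dm=1)  # PV-DM
dbow = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40, dm=0)   # DBOW

print(pv_dm.dv["0"][:5])                                  # learned vector for document 0
print(pv_dm.infer_vector("hope sets you free".split()))   # vector for an unseen document
```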
Doc2Vec Model
▪ Distributed Memory (DM): Randomly sample consecutive words from a
paragraph and predict a center word from the randomly sampled set of words,
taking the context words and the paragraph ID as input.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Doc2Vec Model
▪ The DBOW model “ignores the context words in the input, but force the model
to predict words randomly sampled from the paragraph in the output.”
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT
▪ BERT stands for Bidirectional Encoder Representations from
Transformers.
▪ Google developed BERT to serve as a bidirectional transformer
model that examines words within text by considering both left-to-
right and right-to-left contexts.
▪ BERT makes use of Transformer, an attention mechanism that
learns contextual relations between words (or sub-words) in a text.
▪ As opposed to directional models, which read the text input
sequentially (left-to-right or right-to-left), the Transformer encoder
reads the entire sequence of words at once.
▪ This characteristic allows the model to learn the context of a word
based on all of its surroundings (left and right of the word).
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Features of BERT
▪ Transformer Architecture: BERT is based on the Transformer architecture,
which is a type of neural network that uses self-attention mechanisms.
▪ It’s designed to generate a language model, so only the encoder mechanism is
used.
▪ Bidirectional Approach: Traditional language models process text
sequentially, either from left to right or right to left.
▪ This method limits the model’s awareness to the immediate context preceding
the target word.
▪ BERT uses a bidirectional approach, considering both the left and right context
of words in a sentence.
▪ Pre-training and Fine-tuning: The BERT model undergoes a two-step process:
Pre-training on large amounts of unlabeled text to learn contextual
embeddings, and fine-tuning on labeled data for specific NLP tasks.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Features of BERT
▪ Contextual Embeddings: During the pre-training phase, BERT
learns contextual embeddings, which are the representations of
words that take into account their surrounding context in a
sentence.
▪ Fine-Tuning: After the pre-training phase, the BERT model is then
fine-tuned for specific natural language processing (NLP) tasks.
▪ This step tailors the model to more targeted applications by
adapting its general language understanding to the nuances of the
particular task.
▪ MathBERT: There’s also a variant of BERT called MathBERT, which
is jointly trained with mathematical formulas and their
corresponding contexts.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT – Pretraining and Finetuning
[1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arxiv.org)
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT – Pretraining and Finetuning
▪ Pre-Training on Large Data
▪ BERT is pre-trained on a large amount of unlabeled text data.
▪ The model learns contextual embeddings, which are the
representations of words that take into account their surrounding
context in a sentence.
▪ BERT engages in various unsupervised pre-training tasks.
▪ For instance, it might learn to predict missing words in a sentence
(Masked Language Model or MLM task), understand the
relationship between two sentences, or predict the next sentence in
a pair.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT – Pretraining and Finetuning
▪ Fine-Tuning on Labeled Data
▪ After the pre-training phase, the BERT model, armed with its
contextual embeddings, is then fine-tuned for specific natural language
processing (NLP) tasks.
▪ This step tailors the model to more targeted applications by adapting
its general language understanding to the nuances of the particular
task.
▪ BERT is fine-tuned using labeled data specific to the downstream tasks
of interest.
▪ These tasks could include sentiment analysis, question-answering,
named entity recognition, or any other NLP application.
▪ The model’s parameters are adjusted to optimize its performance for
the particular requirements of the task at hand.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Working of BERT
▪ BERT is designed to generate a language model, so only the encoder
mechanism is used.
▪ A sequence of tokens is fed to the Transformer encoder.
▪ These tokens are first embedded into vectors and then processed in
the neural network.
▪ The output is a sequence of vectors, each corresponding to an input
token, providing contextualized representations.
▪ Traditional models predict the next word in a sequence, which is a
directional approach and may limit context learning. BERT addresses
this challenge with two innovative training strategies:
▪ Masked Language Model (MLM)
▪ Next Sentence Prediction (NSP)
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Masked Language Model (MLM)
▪ In BERT’s pre-training process, a portion of words in each input sequence
is masked and the model is trained to predict the original values of these
masked words based on the context provided by the surrounding words.
▪ Masking words: Before BERT learns from sentences, it hides some words
(about 15%) and replaces them with a special symbol, like [MASK].
▪ Guessing Hidden Words: BERT’s job is to figure out what these hidden
words are by looking at the words around them. It’s like a game of
guessing where some words are missing, and BERT tries to fill in the
blanks.
▪ BERT adds a special layer on top of its learning system to make these
guesses. It then checks how close its guesses are to the actual hidden
words.
▪ It does this by converting its guesses into probabilities.
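As an illustration only, masked-word prediction can be tried with the Hugging Face transformers library; this sketch assumes transformers is installed and downloads the pretrained bert-base-uncased weights on first use:

```python
from transformers import pipeline

# A fill-mask pipeline wraps BERT's masked language modelling head.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

predictions = unmasker("The cat sat on the [MASK].")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))  # top guesses with their probabilities
```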
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Next Sentence Prediction (NSP)
▪ BERT predicts if the second sentence is connected to the first.
▪ This is done by transforming the output of the [CLS] token into a 2×1
shaped vector using a classification layer, and then calculating the
probability of whether the second sentence follows the first using
SoftMax.
▪ In the training process, BERT learns to understand the relationship
between pairs of sentences, predicting if the second sentence follows the
first in the original document.
▪ 50% of the input pairs have the second sentence as the subsequent
sentence in the original document, and the other 50% have a randomly
chosen sentence.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Next Sentence Prediction (NSP)
▪ To help the model distinguish between connected and disconnected
sentence pairs, the input is processed before entering the model:
▪ A [CLS] token is inserted at the beginning of the first sentence, and a
[SEP] token is added at the end of each sentence.
▪ A sentence embedding indicating Sentence A or Sentence B is added to
each token.
▪ A positional embedding indicates the position of each token in the
sequence.
▪ BERT predicts if the second sentence is connected to the first. This is
done by transforming the output of the [CLS] token into a 2×1 shaped
vector using a classification layer, and then calculating the probability of
whether the second sentence follows the first using SoftMax.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Training
▪ The model is trained on both previously mentioned tasks
simultaneously. This is made possible by clever usage of inputs and
outputs.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Training
▪ BERT relies on a Transformer (the attention mechanism that learns contextual
relationships between words in a text).
▪ A basic Transformer consists of an encoder to read the text input and a decoder to
produce a prediction for the task.
▪ Since BERT’s goal is to generate a language representation model, it only needs the
encoder part. The input to the encoder for BERT is a sequence of tokens, which are
first converted into vectors and then processed in the neural network.
▪ But before processing can start, BERT needs the input to be decorated with some extra
metadata:
▪ Token embeddings: A [CLS] token is added to the input word tokens at the
beginning of the first sentence and a [SEP] token is inserted at the end of each
sentence.
▪ Segment embeddings: A marker indicating Sentence A or Sentence B is added to
each token. This allows the encoder to distinguish between sentences.
▪ Positional embeddings: A positional embedding is added to each token to indicate
its position in the sentence.
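The [CLS] and [SEP] tokens and the segment markers can be inspected with the Hugging Face tokenizer; a small sketch, assuming transformers is installed and using two made-up example sentences (positional embeddings are added inside the model, not by the tokenizer):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The man went to the store.", "He bought a gallon of milk.")

# Token embeddings: [CLS] opens the input, [SEP] closes each sentence.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))

# Segment embeddings: 0 marks Sentence A tokens, 1 marks Sentence B tokens.
print(encoded["token_type_ids"])
```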
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Training
▪ To predict if the second sentence is connected to the first one or not, basically
the complete input sequence goes through the Transformer based model, the
output of the [CLS] token is transformed into a 2×1 shaped vector using a
simple classification layer, and the IsNext-Label is assigned using softmax.
▪ The model is trained with both Masked LM and Next Sentence Prediction
together. This is to minimize the combined loss function of the two strategies —
“together is better”.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Architectures
▪ There are two types of pre-trained versions of BERT depending on
the scale of the model architecture:
▪ BERT-BASE has 12 layers in the Encoder stack while BERT-LARGE
has 24 layers in the Encoder stack.
▪ BERT architectures (BASE and LARGE) also have larger hidden sizes
(768 and 1024 hidden units respectively) and more attention heads
(12 and 16 respectively) than the Transformer architecture suggested
in the original paper, which uses 512 hidden units and 8 attention heads.
▪ BERT-BASE contains 110M parameters while BERT-LARGE has
340M parameters.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
BERT - Architectures
▪ There are two types of pre-trained versions of BERT depending on the scale of
the model architecture:
Fun fact: BERT-Base was trained on 4 cloud TPUs for 4 days and BERT-Large was
trained on 16 TPUs for 4 days!
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Graph based language models
▪ A graph is a mathematical structure consisting of:
▪ Nodes (Vertices): Represent entities (e.g., words, sentences, or
documents).
▪ Edges: Represent relationships between entities (e.g., co-occurrence,
syntactic dependency, semantic similarity).
▪ In language modeling, nodes could be words, and edges could represent how
often two words occur together or their grammatical relationship.
▪ Graph-based models can be used for language modeling by representing
linguistic data, such as words, phrases, or sentences, as nodes and their
relationships as edges within a graph.
▪ This approach leverages the structural and relational properties of natural
language, enabling more sophisticated and context-aware representations.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Graph based language models
▪ Why Use Graphs in Language Modeling?
▪ Language is complex and often non-linear. Graphs can help to
▪ Capture Relationships: Words or phrases aren't just sequential—they have
rich relationships (e.g., syntactic or semantic).
▪ Integrate Knowledge: Graphs like knowledge graphs (e.g., WordNet,
ConceptNet) can provide background knowledge about word meanings and
relationships.
▪ Handle Global Context: Graphs can encode both local (sentence-level) and
global (corpus-level) context.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Graph Representations in Language Modeling
▪ Word Co-occurrence Graphs:
▪ Nodes: Words.
▪ Edges: Represent co-occurrence
within a window size in text.
▪ Application: Keyword extraction,
sentence similarity.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
▪ Syntactic Dependency Graphs:
▪ Nodes: Words.
▪ Edges: Represent grammatical
relationships (e.g., subject-object).
▪ Application: Parsing, sentiment
analysis.
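To make the word co-occurrence graph above concrete, here is a minimal sketch with networkx; the example sentence and window size are assumptions:

```python
import networkx as nx

tokens = "the cat sat on the mat near the cat".split()
window = 2  # link words that appear within 2 positions of each other

G = nx.Graph()
for i, u in enumerate(tokens):
    for v in tokens[i + 1: i + 1 + window]:
        if u == v:
            continue
        if G.has_edge(u, v):
            G[u][v]["weight"] += 1   # strengthen an existing co-occurrence edge
        else:
            G.add_edge(u, v, weight=1)

print(sorted(G.edges(data="weight")))  # weighted co-occurrence edges
```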
Graph Representations in Language Modeling
▪ Semantic Graphs:
▪ Nodes: Concepts or words.
▪ Edges: Represent semantic relationships
(e.g., synonyms, hypernyms).
▪ Application: Word-sense
disambiguation, machine translation.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
▪ Knowledge Graphs:
▪ Nodes: Entities (e.g., "Einstein").
▪ Edges: Relationships (e.g., "was a
scientist").
▪ Application: Question answering,
fact-checking.
Graph Neural Networks (GNNs): Modern Graph-based Language Models
▪ Graph Neural Networks (GNNs) are a class of neural networks designed to
operate on graph-structured data, capturing the relationships and
interactions between entities.
▪ At their core, GNNs use a message-passing framework:
▪ Node Representation (Features): Each node starts with an initial feature
vector (e.g., a word embedding in NLP or an atom's properties in a
molecule).
▪ Neighborhood Aggregation: Each node aggregates information from its
neighbors, capturing the local graph structure.
▪ Updating Node Representations: Using the aggregated information, the
node updates its representation to reflect its context in the graph.
▪ This process is repeated for multiple iterations (or layers), allowing nodes to
gather information from progressively larger neighborhoods.
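A minimal numpy sketch of one round of this message passing, in the style of a graph convolutional layer; the tiny graph, feature matrix, and weights are made-up values:

```python
import numpy as np

# Adjacency matrix of a 4-node graph and 2-dimensional node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = np.random.rand(4, 2)          # initial node representations
W = np.random.rand(2, 2)          # weight matrix (learned in practice, random here)

A_hat = A + np.eye(4)             # add self-loops so a node keeps its own features
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalisation

H_next = np.maximum(0, A_norm @ H @ W)     # aggregate neighbours, transform, apply ReLU
print(H_next)
```

Stacking several such layers lets each node gather information from progressively larger neighbourhoods, as described above.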
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Types of GNNs
▪ Graph Convolutional Networks (GCNs)
▪ Graph Attention Networks (GATs)
▪ GraphSAGE
▪ Message Passing Neural Networks (MPNNs)
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Applications of GNNs
▪ Text Summarization: Represent sentences or documents as nodes, and use
GNNs to identify important content.
▪ Knowledge Graph Completion: Predict missing edges or relationships in a
knowledge graph.
▪ Semantic Parsing: Use dependency graphs of sentences for syntactic analysis.
▪ Social Networks:
▪ Node classification (e.g., user profiling).
▪ Link prediction (e.g., recommending friends).
Log Linear Models
▪ A log-linear model is a type of statistical model used to predict probabilities.
▪ It combines features (important information from the data) and weights
(importance of each feature) to make predictions.
▪ Features are pieces of information used to make predictions. In a language
model, features might include the previous words or context.
▪ Example: For predicting the next word in "The cat sat on the ___", features
could be "The cat" or "sat on".
▪ Each feature is given a weight that tells the model how important it is. Some
features might be more important than others.
▪ Example: The feature "sat on" might be more useful for predicting the next
word than "The cat".
▪ A score is calculated by multiplying each feature's value by its corresponding
weight and summing the results.
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Log Linear Models
▪ The scores are then turned into probabilities using a function called softmax.
▪ The softmax function converts scores into values between 0 and 1, where
higher scores lead to higher probabilities.
▪ Example: The score for "mat" might be 4.3, while "hat" has a score of 3.9.
After applying softmax, the model will predict "mat" as the next word
because it has the higher probability.
▪ The formula for softmax is:
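In its standard form, where s_i is the score of candidate word w_i and V is the vocabulary size:

$$P(w_i) = \frac{\exp(s_i)}{\sum_{j=1}^{V} \exp(s_j)}$$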
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
Log Linear Model - Example
▪ Let us predict the next word in the sentence: "The cat sat on the ___."
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
▪ Step 1: Selecting Features
▪ Features could be:
▪ Word History: Previous words in the sentence ("The cat sat on the").
▪ Syntactic Patterns: Certain words follow others (e.g., "on the" is commonly
followed by a noun).
▪ Context Word Frequencies: How often certain words appear in similar
contexts.
▪ Step 2: Assigning Weights
▪ Each feature is assigned a weight to indicate its importance. Suppose the weights are:
▪ “Previous word is ‘the’” → weight 1.5
▪ “Context suggests a noun” → weight 2.0
▪ “Word has been seen often” → weight 0.8
Log Linear Model - Example
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
▪ Step 3: Compute Scores
▪ For each possible word (e.g., "mat," "hat," and "dog"), the model computes a score
by combining feature values and weights.
• For "mat":
• Previous word is "the" → Feature value = 1, Weight = 1.5 → Contribution = 1×1.5=1.5
• Context suggests a noun → Feature value = 1, Weight = 2.0 → Contribution = 1×2.0=2.0
• Word seen often → Feature value = 1, Weight = 0.8 → Contribution = 1×0.8=0.8
• Total Score = 1.5 + 2.0 + 0.8 = 4.3
• For "hat":
• Previous word is "the" → 1×1.5=1.5
• Context suggests a noun → 1×2.0=2.0
• Word seen less often → Feature value = 0.5, Weight = 0.8 → Contribution = 0.5×0.8=0.4
• Total Score = 1.5 + 2.0 + 0.4 = 3.9
• For "dog":
• Previous word is "the" → 1×1.5=1.5
• Context suggests a noun → 1×2.0=2.0
• Word seen rarely → Feature value = 0.2, Weight = 0.8 → Contribution = 0.2×0.8=0.16
• Total Score = 1.5 + 2.0 + 0.16 = 3.66
Log Linear Model - Example
Guest Session on “Language Modelling” – Dr. Deptii Chaudhari (I2IT, Pune)
▪ Step 4: Convert Scores to Probabilities
▪ The scores are converted into probabilities using the softmax function shown
earlier, which ensures that all probabilities add up to 1.
For the scores:
"mat" = 4.3
"hat" = 3.9
"dog" = 3.66
▪ Step 5: Predict
▪ The model predicts "mat" as the most probable next word because it has
the highest probability (roughly 45%).