2. Outline
The journey of Gen AI
The foundation of Gen AI
Sequential models and their limitations
Introduction of the Transformer
Transformer Architecture
Vision Transformer
Discussion
3. The journey of Gen AI
Data Analysis: The Foundation
Businesses used data analysis tools to track customer purchasing behavior and optimize their
marketing strategies.
In healthcare, it powered descriptive analytics to monitor patient health trends and reduce risks.
Data Mining: Digging Deeper
In retail, Amazon’s "People Who Bought This Also Bought" recommendations were powered by
data mining to identify product associations.
In fraud detection, banks analyzed customer transaction patterns to uncover fraudulent
activities in real time.
Machine Learning: Enabling Predictive Intelligence
In e-commerce, Netflix revolutionized content recommendations with its machine learning-
based personalization engine.
In transportation, Uber used demand prediction algorithms to optimize ride availability and
pricing.
In finance, machine learning models were deployed for credit scoring and portfolio risk analysis.
4. The journey continues
Deep Learning: Unlocking Complex Representations
In healthcare, deep learning models helped detect diseases like cancer through medical image
analysis.
In automotive, Tesla leveraged deep learning to enhance self-driving car systems.
In communication, Google Translate used deep learning to provide near-human translation for over
100 languages.
Artificial Intelligence: Broadening Horizons
In education, AI-powered tools like Duolingo revolutionized personalized learning experiences.
In healthcare, IBM’s Watson analyzed millions of medical papers to assist doctors with accurate
diagnoses and treatments.
In urban planning, smart cities used AI to optimize energy consumption, reduce traffic congestion, and
improve public safety.
Generative AI: Creating Intelligence
In media, tools like ChatGPT and DALL-E generate human-like text, images, and creative art for content
creators and marketers.
In entertainment, AI-generated scripts, music, and video game environments are reshaping the
creative process.
In medicine, generative AI is used to design new drugs by simulating and optimizing molecular
structures.
5. What is next after Gen AI?
Will machines/AI replace
humans?
6. The foundation of Gen AI
Key Goals for Generating Anything
Understanding the Characteristics of Data
Identifying patterns, relationships, and features unique to the data points.
Example: Understanding linguistic structures in text, pitch and tone in audio,
or object motion in videos.
Sequential Nature of Real-World Data:
Text: Words in a sentence depend on the sequence for meaning.
Audio: Speech depends on phoneme order and timing.
Videos: Frames rely on temporal order to convey events.
Contextual understanding:
Capturing the relationship between data points to maintain coherence.
Example: Predicting the next word in a sentence or summarizing a video
segment based on surrounding context.
8. Recurrent Neural Network (RNN)
A recurrent neural network is able to ‘memorize’ parts of its inputs and use them to make accurate predictions.
These networks are at the heart of speech recognition, translation, and more.
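The recurrence described above can be sketched in a few lines. This is a minimal illustration, not a production model: the dimensions (input size 4, hidden size 3, sequence length 5) and the random weights are toy assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3))   # input-to-hidden weights
W_hh = rng.normal(size=(3, 3))   # hidden-to-hidden ("memory") weights
b_h = np.zeros(3)

def rnn_step(x_t, h_prev):
    """One time step: the new hidden state mixes the current input
    with the previous state, which is how the RNN 'memorizes'."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Run a sequence of 5 toy inputs through the recurrence.
h = np.zeros(3)
for x_t in rng.normal(size=(5, 4)):
    h = rnn_step(x_t, h)

print(h.shape)  # (3,) — the final state summarizes the whole sequence
```

Note that the same weights are reused at every time step; the only thing that carries information forward is the hidden state `h`.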
10. Problem with Standard RNN
The simplest RNN model has a major drawback, called the vanishing
gradient problem, which prevents it from being accurate.
This means that the network has difficulty
remembering words from far back in the sequence and makes
predictions based mainly on the most recent ones.
Solutions:
RNN variants: LSTM, GRU
Attention models
Application-specific solutions
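The vanishing gradient can be seen numerically. This is a hedged toy demonstration, not a real backward pass: backpropagation through time multiplies one Jacobian-like factor per step, and with tanh activations and small recurrent weights each factor has norm below 1, so the product shrinks exponentially with distance. The matrix size, weight scale, and the representative tanh derivative of 0.5 are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
W_hh = rng.normal(scale=0.2, size=(3, 3))  # toy recurrent weight matrix

grad = np.eye(3)   # gradient flowing back from the last time step
norms = []
for _ in range(50):
    grad = grad @ (0.5 * W_hh)   # 0.5 stands in for a typical tanh'(z)
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])  # the contribution from 50 steps back is ~0
```

LSTM and GRU cells mitigate exactly this: their gating lets gradients flow along an additive path instead of being squeezed through this product at every step.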
12. Attention Model
An attention mechanism is a part of a neural network.
At each decoder step, it decides which source parts are more important. In this
setting, the encoder does not have to compress the whole source into a single
vector: it provides representations for all source tokens (for example, all RNN
states instead of only the last one).
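One decoder step with attention can be sketched as follows. The encoder states and decoder state here are random toy vectors (6 source tokens, dimension 8) standing in for real RNN states; the dot-product scoring is one common choice among several.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))   # one state per source token
decoder_state = rng.normal(size=8)         # current decoder state

scores = encoder_states @ decoder_state    # relevance score per source token
weights = softmax(scores)                  # importance of each source part
context = weights @ encoder_states         # weighted sum: the attention output

print(weights.round(2), context.shape)     # weights sum to 1; context has dim 8
```

The key point matches the slide: instead of a single compressed vector, the decoder gets a fresh, differently-weighted mixture of all encoder states at every step.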
15. Transformer: Attention based model
NLP
For NLP applications, attention is often described as the
relationship between words (tokens) in a sentence.
CV
In a computer vision application, attention looks at the
relationships between patches (tokens) in an image.
16. What is the Transformer?
A self-attention based architecture that uses no
recurrent, sequence-based components1
It extracts features for each word using a self-attention
mechanism to figure out how important all the other
words in the sentence are with respect to that
word.
Attention:
The attention that the token pays to other tokens
The attention that a set of tokens pays to the token
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia
Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017).
17. Major differences between seq2seq
and the Transformer
Input values
Architecture
Encoder and decoder stacks
20. Input Embedding and Positional
Embedding
Embedding: Converts input tokens into numerical form
Positional embedding: Maintains the order of the words in the
given sentence
Output shape:
samples x sequence length x embedding size
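The output shape above can be reproduced with a small sketch. It uses the sinusoidal positional encoding from the original Transformer paper (even dimensions sine, odd dimensions cosine); the batch size, sequence length, and embedding size are toy values chosen for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional embedding: each position gets a unique,
    deterministic pattern of sines and cosines across dimensions."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# Toy batch: 2 samples, sequence length 10, embedding size 16.
batch, seq_len, d_model = 2, 10, 16
token_embeddings = np.random.default_rng(0).normal(size=(batch, seq_len, d_model))

# Positional information is simply added to the token embeddings
# (broadcast over the batch dimension), preserving the shape.
x = token_embeddings + positional_encoding(seq_len, d_model)
print(x.shape)  # (2, 10, 16): samples x sequence length x embedding size
```

Because the encoding depends only on position, the model can distinguish "dog bites man" from "man bites dog" even though self-attention itself is order-agnostic.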
22. Self-Attention
The self-attention mechanism allows the inputs to interact with each other (“self”) and
find out who they should pay more attention to (“attention”).
Transformer's encoder can be thought of as a sequence of reasoning steps (layers).
At each step, tokens look at each other (this is where we need attention - self-attention),
exchange information and try to understand each other better in the context of the
whole sentence. This happens in several layers.
In each decoder layer, tokens of the prefix also interact with each other via a
self-attention mechanism, but additionally, they look at the encoder states (without this, no
translation can happen, right?).
The self-attention mechanism first computes three vectors (query, key, and value) for each
word in the sentence.
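The query/key/value computation and the token-to-token interaction can be sketched together as scaled dot-product self-attention. The sequence length, embedding size, and random projection matrices are toy assumptions; a real model learns the projections and uses multiple heads.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention: every token's query is scored
    against every token's key, and the scores weight the values."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) relevance matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # each output mixes all values

# Toy setup: 4 tokens, embedding size 8, head size 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8): one context-mixed vector per token
```

Every row of the weight matrix sums to 1, so each output token is a convex mixture of all value vectors — this is the "tokens look at each other" step the slide describes, and stacking several such layers gives the encoder its sequence of reasoning steps.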
23. Actors of attention
Two actors:
Attender
Attendee
In the Transformer, any token can attend to any other token, including itself.
24. Query, Key, Value
Query, Key and Value? (is it related to IR (Information Retrieval)?)
• Query: The query is a representation of the current word used to score against all
the other words (using their keys). We only care about the query of the token we’re
currently processing.
• Key: Key vectors are like labels for all the words in the segment. They’re what we
match against in our search for relevant words.
• Value: Value vectors are actual word representations. Once we’ve scored how
relevant each word is, these are the values we add up to represent the current word.
25. Query vector
The query vector represents the element of interest or the context that you
want to obtain information about.
The query vector is used to determine the similarity or relevance between this
context and other elements in the input sequence, specifically the key vectors.
Suppose you’re translating a sentence from English to French, and you’re at
a particular word in the English sentence (the query). The keys are
representations of all words in the English sentence, and the values are their
corresponding translations in French. Example:
“apple” (the word you want to translate)
26. Key vector
The key vector, like the query vector, is a projection of the input data and is
associated with each element in the input sequence.
The key vectors are used to compute how relevant each element in the input
sequence is to the query.
This relevance is often calculated using a dot product or another similarity
measure between the query and key vectors.
[“cat”, “apple”, “tree”, “juice”] (representations of words in English)
27. Value vector
The value vector is also a projection of the input data and is associated with
each element in the input sequence, just like the key vector.
The value vectors store the actual information that will be used to update the
representation of the query. These values are weighted by the attention
scores (computed from the query-key interaction) to determine how much
each element contributes to the final output.
The attention scores, computed based on the query and key, are used to
weight the value vectors. Higher attention scores mean that the
corresponding values are more important for the output.
Example: [“chat”, “pomme”, “arbre”, “jus”] (corresponding French translations)
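The retrieval analogy from slides 25–27 can be made concrete with the slide's own word lists. The one-hot "embeddings" below are made up purely for illustration; real queries and keys are dense learned vectors, and real attention returns the weighted mixture rather than a single best value.

```python
import numpy as np

# Keys: representations of the English words (toy one-hot vectors).
keys = {"cat": [1, 0, 0, 0], "apple": [0, 1, 0, 0],
        "tree": [0, 0, 1, 0], "juice": [0, 0, 0, 1]}
# Values: the corresponding French translations.
values = ["chat", "pomme", "arbre", "jus"]

query = np.array(keys["apple"])                  # asking for information
K = np.array(list(keys.values()))                # saying they have information
scores = K @ query                               # query-key dot products
weights = np.exp(scores) / np.exp(scores).sum()  # softmax attention scores
best = values[int(np.argmax(weights))]           # highest-weighted value wins
print(best)  # pomme
```

Because the query matches the "apple" key exactly, that key gets the highest score, and the corresponding value "pomme" dominates the output — the query asks, the keys advertise, the values deliver.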
28. In simple words:
• query - asking for information;
• key - saying that it has some information;
• value - giving the information.