2. Outline
The journey of Gen AI
The foundation of Gen AI
Sequential models and their limitations
Introduction of the Transformer
Transformer Architecture
Vision Transformer
Discussion
3. The journey of Gen AI
Data Analysis: The Foundation
Businesses used data analysis tools to track customer purchasing behavior and optimize their
marketing strategies.
In healthcare, it powered descriptive analytics to monitor patient health trends and reduce risks.
Data Mining: Digging Deeper
In retail, Amazon’s "People Who Bought This Also Bought" recommendations were powered by
data mining to identify product associations.
In fraud detection, banks analyzed customer transaction patterns to uncover fraudulent
activities in real time.
Machine Learning: Enabling Predictive Intelligence
In e-commerce, Netflix revolutionized content recommendations with its machine learning-
based personalization engine.
In transportation, Uber used demand prediction algorithms to optimize ride availability and
pricing.
In finance, machine learning models were deployed for credit scoring and portfolio risk analysis.
4. The journey continues
Deep Learning: Unlocking Complex Representations
In healthcare, deep learning models helped detect diseases like cancer through medical image
analysis.
In automotive, Tesla leveraged deep learning to enhance self-driving car systems.
In communication, Google Translate used deep learning to provide near-human translation for over
100 languages.
Artificial Intelligence: Broadening Horizons
In education, AI-powered tools like Duolingo revolutionized personalized learning experiences.
In healthcare, IBM’s Watson analyzed millions of medical papers to assist doctors with accurate
diagnoses and treatments.
In urban planning, smart cities used AI to optimize energy consumption, reduce traffic congestion, and
improve public safety.
Generative AI: Creating Intelligence
In media, tools like ChatGPT and DALL-E generate human-like text, images, and creative art for content
creators and marketers.
In entertainment, AI-generated scripts, music, and video game environments are reshaping the
creative process.
In medicine, generative AI is used to design new drugs by simulating and optimizing molecular
structures.
5. What is next after Gen AI?
Will machines/AI replace
humans?
6. The foundation of Gen AI
Key Goals for Generating Anything
Understanding the Characteristics of Data
Identifying patterns, relationships, and features unique to the data points.
Example: Understanding linguistic structures in text, pitch and tone in audio,
or object motion in videos.
Sequential Nature of Real-World Data:
Text: Words in a sentence depend on the sequence for meaning.
Audio: Speech depends on phoneme order and timing.
Videos: Frames rely on temporal order to convey events.
Contextual understanding:
Capturing the relationship between data points to maintain coherence.
Example: Predicting the next word in a sentence or summarizing a video
segment based on surrounding context.
8. Recurrent Neural Network (RNN)
A recurrent neural network is able to ‘memorize’ parts of its inputs and use them to make accurate predictions.
These networks are at the heart of speech recognition, translation, and more.
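The recurrence described above can be sketched in a few lines. This is a minimal illustration, not a production model: the dimensions (input size 4, hidden size 3, sequence length 5) and the random weights are toy assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3))   # input-to-hidden weights
W_hh = rng.normal(size=(3, 3))   # hidden-to-hidden ("memory") weights
b_h = np.zeros(3)

def rnn_step(x_t, h_prev):
    """One time step: the new hidden state mixes the current input
    with the previous state, which is how the RNN 'memorizes'."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Run a sequence of 5 toy inputs through the recurrence.
h = np.zeros(3)
for x_t in rng.normal(size=(5, 4)):
    h = rnn_step(x_t, h)

print(h.shape)  # (3,) — the final state summarizes the whole sequence
```

Note that the same weights are reused at every time step; the only thing that carries information forward is the hidden state `h`.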
10. Problem with Standard RNN
The simplest RNN model has a major drawback, called the vanishing
gradient problem, which prevents it from being accurate.
This means that the network has difficulty
remembering words from far back in the sequence and makes
predictions based mainly on the most recent ones.
Solutions:
RNN variants: LSTM, GRU
Attention models
Application-specific solutions
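The vanishing gradient can be seen numerically. This is a hedged toy demonstration, not a real backward pass: backpropagation through time multiplies one Jacobian-like factor per step, and with tanh activations and small recurrent weights each factor has norm below 1, so the product shrinks exponentially with distance. The matrix size, weight scale, and the representative tanh derivative of 0.5 are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
W_hh = rng.normal(scale=0.2, size=(3, 3))  # toy recurrent weight matrix

grad = np.eye(3)   # gradient flowing back from the last time step
norms = []
for _ in range(50):
    grad = grad @ (0.5 * W_hh)   # 0.5 stands in for a typical tanh'(z)
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])  # the contribution from 50 steps back is ~0
```

LSTM and GRU cells mitigate exactly this: their gating lets gradients flow along an additive path instead of being squeezed through this product at every step.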
12. Attention Model
An attention mechanism is a part of a neural network.
At each decoder step, it decides which source parts are more important. In this
setting, the encoder does not have to compress the whole source into a single
vector: it provides representations for all source tokens (for example, all RNN
states instead of only the last one).
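One decoder step with attention can be sketched as follows. The encoder states and decoder state here are random toy vectors (6 source tokens, dimension 8) standing in for real RNN states; the dot-product scoring is one common choice among several.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))   # one state per source token
decoder_state = rng.normal(size=8)         # current decoder state

scores = encoder_states @ decoder_state    # relevance score per source token
weights = softmax(scores)                  # importance of each source part
context = weights @ encoder_states         # weighted sum: the attention output

print(weights.round(2), context.shape)     # weights sum to 1; context has dim 8
```

The key point matches the slide: instead of a single compressed vector, the decoder gets a fresh, differently-weighted mixture of all encoder states at every step.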
15. Transformer: Attention based model
NLP
For NLP applications, attention is often described as the
relationship between words (tokens) in a sentence.
CV
In a computer vision application, attention looks at the
relationships between patches (tokens) in an image.
16. What is the Transformer?
A self-attention based architecture that uses no
recurrent, sequence-based components1
It extracts features for each word using a self-attention
mechanism to figure out how important all the other
words in the sentence are with respect to that
word.
Attention:
The attention that the token pays to other tokens
The attention that a set of tokens pays to the token
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia
Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017).
17. Major differences between seq2seq
and the Transformer
Input values
Architecture
Encoder and decoder stacks
20. Input Embedding and Positional
Embedding
Embedding: Converts input tokens into numerical form
Positional embedding: Maintains the order of the words in the
given sentence
Output shape:
samples x sequence length x embedding size
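The output shape above can be reproduced with a small sketch. It uses the sinusoidal positional encoding from the original Transformer paper (even dimensions sine, odd dimensions cosine); the batch size, sequence length, and embedding size are toy values chosen for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional embedding: each position gets a unique,
    deterministic pattern of sines and cosines across dimensions."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# Toy batch: 2 samples, sequence length 10, embedding size 16.
batch, seq_len, d_model = 2, 10, 16
token_embeddings = np.random.default_rng(0).normal(size=(batch, seq_len, d_model))

# Positional information is simply added to the token embeddings
# (broadcast over the batch dimension), preserving the shape.
x = token_embeddings + positional_encoding(seq_len, d_model)
print(x.shape)  # (2, 10, 16): samples x sequence length x embedding size
```

Because the encoding depends only on position, the model can distinguish "dog bites man" from "man bites dog" even though self-attention itself is order-agnostic.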
22. Self-Attention
The self-attention mechanism allows the inputs to interact with each other (“self”) and
find out who they should pay more attention to (“attention”).
Transformer's encoder can be thought of as a sequence of reasoning steps (layers).
At each step, tokens look at each other (this is where we need attention - self-attention),
exchange information and try to understand each other better in the context of the
whole sentence. This happens in several layers.
In each decoder layer, tokens of the prefix also interact with each other via a
self-attention mechanism, but additionally, they look at the encoder states (without this, no
translation can happen, right?).
The self-attention mechanism first computes three vectors (query, key, and value) for each
word in the sentence.
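The query/key/value computation and the token-to-token interaction can be sketched together as scaled dot-product self-attention. The sequence length, embedding size, and random projection matrices are toy assumptions; a real model learns the projections and uses multiple heads.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention: every token's query is scored
    against every token's key, and the scores weight the values."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) relevance matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # each output mixes all values

# Toy setup: 4 tokens, embedding size 8, head size 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8): one context-mixed vector per token
```

Every row of the weight matrix sums to 1, so each output token is a convex mixture of all value vectors — this is the "tokens look at each other" step the slide describes, and stacking several such layers gives the encoder its sequence of reasoning steps.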
23. Actors of attention
Two actors:
Attender
Attendee
In the Transformer, any token can attend to any other token, including itself.
24. Query, Key, Value
Query, Key and Value? (is it related to IR (Information Retrieval)?)
• Query: The query is a representation of the current word used to score against all
the other words (using their keys). We only care about the query of the token we’re
currently processing.
• Key: Key vectors are like labels for all the words in the segment. They’re what we
match against in our search for relevant words.
• Value: Value vectors are actual word representations. Once we’ve scored how
relevant each word is, these are the values we add up to represent the current word.
25. Query vector
The query vector represents the element of interest or the context that you
want to obtain information about.
The query vector is used to determine the similarity or relevance between this
context and other elements in the input sequence, specifically the key vectors.
Suppose you’re translating a sentence from English to French, and you’re at
a particular word in the English sentence (the query). The keys are
representations of all words in the English sentence, and the values are their
corresponding translations in French. Example:
“apple” (the word you want to translate)
26. Key vector
The key vector, like the query vector, is a projection of the input data and is
associated with each element in the input sequence.
The key vectors are used to compute how relevant each element in the input
sequence is to the query.
This relevance is often calculated using a dot product or another similarity
measure between the query and key vectors.
[“cat”, “apple”, “tree”, “juice”] (representations of words in English)
27. Value vector
The value vector is also a projection of the input data and is associated with
each element in the input sequence, just like the key vector.
The value vectors store the actual information that will be used to update the
representation of the query. These values are weighted by the attention
scores (computed from the query-key interaction) to determine how much
each element contributes to the final output.
The attention scores, computed based on the query and key, are used to
weight the value vectors. Higher attention scores mean that the
corresponding values are more important for the output.
Example: [“chat”, “pomme”, “arbre”, “jus”] (corresponding French translations)
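The retrieval analogy from slides 25–27 can be made concrete with the slide's own word lists. The one-hot "embeddings" below are made up purely for illustration; real queries and keys are dense learned vectors, and real attention returns the weighted mixture rather than a single best value.

```python
import numpy as np

# Keys: representations of the English words (toy one-hot vectors).
keys = {"cat": [1, 0, 0, 0], "apple": [0, 1, 0, 0],
        "tree": [0, 0, 1, 0], "juice": [0, 0, 0, 1]}
# Values: the corresponding French translations.
values = ["chat", "pomme", "arbre", "jus"]

query = np.array(keys["apple"])                  # asking for information
K = np.array(list(keys.values()))                # saying they have information
scores = K @ query                               # query-key dot products
weights = np.exp(scores) / np.exp(scores).sum()  # softmax attention scores
best = values[int(np.argmax(weights))]           # highest-weighted value wins
print(best)  # pomme
```

Because the query matches the "apple" key exactly, that key gets the highest score, and the corresponding value "pomme" dominates the output — the query asks, the keys advertise, the values deliver.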
28. In simple words:
• query - asking for information;
• key - saying that it has some information;
• value - giving the information.