Natural language processing
with Deep Learning
BigData Course 2024
Dr. Artuur Leeuwenberg and Dr. Ruurd Kuiper
Today
9:00-12:00 Lecture
▪ The research field and some applications
▪ Language properties and theory
▪ Language modeling with neural networks
▪ Evaluation
2
13:00-16:00 Fun practical!
▪ Tokenization
▪ Word embeddings
▪ Transformers
The research field
3
Natural language processing
4
Linguistics Computer science
Classic* NLP Tasks
• Parsing (SIGPARSE, since 1989)
• Information extraction (Many, since 1997)
• Machine translation (MT summit/EAMT/AMTA, since 1989)
• Speech synthesis & speech recognition (INTERSPEECH, since 1989)
• Text summarization (DUC, since 2000)
• Question answering (TREC-QA, 1999)
• Chatbots (SIGdial, since 1998)
• Language generation (INLG, since 1990)
Flagship conferences
• ACL + NAACL, EACL since 1979
• EMNLP since 1996
Language resources and annotation:
• LREC since 1998
Evaluation:
• SemEval since 1998
5
NLP in Medicine
In clinical care
- Summarization
o Of patient records
- Question answering
o By patients about their care
o By physicians about an EHR record
- Automatic note generation
o Discharge summaries
o Consultation preparation
- Prediction in electronic medical records
In medical research
- Efficient search in systematic reviews
o Semi-automation of screening
o Automatic data extraction
(e.g., drug-drug interactions, PICO elements, risk of bias)
- Extracting information from clinical notes
o (Rare) case identification
o Predictors / risk factors / exposures
o Study outcomes
- Scientific writing
o Editing
6
What is language?
And what are some of its properties?
One step back
Because ...
This is what language is to you! 這就是電腦語言! ("This is computer language!")
8
Oh !@#$
Our absolute tool for efficient information transfer
Language
Language
10
• Spoken (sound): "invented" 50,000–150,000 years ago
• Written (symbols): ± 5,000 years ago
Language has been there for a while.
Sequential information transfer
11
Meaning ≠ use ≠ form
12
Syntax: Words, Sentences
Phonology: Speech
Semantics:
Objects
Properties
Relations
Events
Timings
Locations
Pragmatics:
Relevance,
Implicature
Colorless green ideas sleep furiously
(grammatical, but no meaning)
A: will you join the party? B: I have to work.
(implicature: no, I will not join the party)
H. P. Grice
N. Chomsky
Meaning ≠ use ≠ form
14
! This is what we generally have (raw text data)
Shannon's noisy channel (A Mathematical Theory of Communication, 1948)
15
Shannon-information:
1. An event with probability 100% is perfectly unsurprising and yields no information.
2. The less probable an event is, the more surprising it is and the more information it yields.
Efficient transfer of information with limited/no errors:
1. Efficient encoding: using fewer symbols/code for frequent events (e.g., Huffman encoding)
2. Some redundancy in your message (i.e., predictability from context)
Shannon's noisy channel (A Mathematical Theory of Communication, 1948)
16
Interesting quote to test your understanding
"If the redundancy is zero any sequence of letters is a reasonable text in the language
and any two-dimensional array of letters forms a crossword puzzle."
Uniform information density (Fenk & Fenk, 1980)
Uniform information den...y means that each part of . sentence
carries more or …. the same amuont of infromatoin.
So, depsite some niose, … inofrmatoin trnasfer remians qiuet susccsufl!
17
Why can we read this?
Uniform information density (Fenk & Fenk, 1980)
18
Why can we read this?
Across the sequence, missing parts are approximately equally predictable from the context.
Uniform information density (Fenk & Fenk, 1980)
統一資..訊否認意味著. 句子
攜帶更多或…。 同樣數量的信息。 號
因此,儘管有一些麻煩,......... 訊息傳輸仍然成功!
(The same scrambled message, this time in Chinese.)
19
Why can we read this?
Across the sequence, missing parts are approximately equally predictable from the context.
By us humans! Because we live and breathe language!
Important properties of language
Compositionality: combination of parts
(e.g., word structure, sentence structure, language context)
Linguistic variation: the same meaning, expressed differently
(e.g., synonyms, abbreviations, regional/individual variation, rare words)
Ambiguity: the same expression, different meaning
(e.g., lexical/word level, sentence level, ...)
Incompleteness: world knowledge is needed for interpretation/production
(e.g., laws of physics, social norms, world facts)
20
Compositionality
On many levels
• Word-level, phrase-level, sentence-level
• Finite symbols to communicate infinite meanings
21
22
"A large bacterium is cycling in the desert."
Compositionality
Efficient encoding
Chomsky Hierarchy
• How do natural languages compose?
• What consequences does this have for computation?
• Memory complexity
• Time complexity
(w.r.t. parsing: determining if a sequence is
grammatically well formed, i.e. is it part of the
language)
24
(1956)
25
LSTMs and Transformers
Compositionality
! But not all language is compositional (idioms, proper names)
Under the weather
The elephant in the room
Golden Gate Bridge
26
Ambiguity and linguistic variation
Many to many:
- One form with multiple meanings
- One meaning expressible in multiple forms
27
Often resolvable in context.
Language models
30
A language model
Probability("let", "me", "send", "you", "a", "mail")
❖ How likely is it that we observe this utterance?
31
Shannon, C.E., 1951. Prediction and entropy of printed English. Bell
system technical journal, 30(1), pp.50-64.
A language model
PLM("let", "me", "send", "you", "a", "mail") =
PLM("mail" | "let", "me", "send", "you", "a")
* PLM("a" | "let", "me", "send", "you")
* PLM("you" | "let", "me", "send")
* PLM("send" | "let", "me")
* PLM("me" | "let")
* PLM("<S>", "let")
32
Shannon, C.E., 1951. Prediction and entropy of printed English. Bell
system technical journal, 30(1), pp.50-64.
! Probability distribution
• The vocabulary V should be defined.
• Probabilities should sum to 1.
33
ID     Token
1      the
2      a
3      mail
4      few
5      several
6      is
...    ...
N      send

PLM( ? | "let", "me", "send", "you", "a")

?        P
the      0.001
a        0.001
mail     0.451
few      0.052
several  0.001
is       0.001
...
send     0.001
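To make the "probabilities should sum to 1" point concrete, here is a minimal sketch (with a toy vocabulary and made-up scores, purely illustrative) of how a model's raw scores over the vocabulary are turned into a next-word probability distribution with softmax:

```python
import numpy as np

# Minimal sketch (illustrative only): a language model's output layer turns
# arbitrary scores (logits) over the vocabulary into a probability
# distribution via softmax, so the probabilities sum to 1.
vocab = ["the", "a", "mail", "few", "several", "is", "send"]
logits = np.array([0.1, 0.1, 6.2, 4.0, 0.1, 0.1, 0.1])  # made-up scores for P(? | "let me send you a")

probs = np.exp(logits - logits.max())   # subtract max for numerical stability
probs /= probs.sum()

for token, p in zip(vocab, probs):
    print(f"{token:8s} {p:.3f}")
print("sum =", probs.sum())  # 1.0
```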
What is one of the first uses of language models?
34
First use: automatic speech recognition
35
PLM(let me send you a mail)
?
PLM(Lett mi scent you a mail)
Katz, S., 1987. Estimation of probabilities
from sparse data for the language model
component of a speech recognizer. IEEE
transactions on acoustics, speech, and
signal processing, 35(3), pp.400-401.
Chen, S.F. and Goodman, J., 1999. An
empirical study of smoothing techniques for
language modeling. Computer Speech &
Language, 13(4), pp.359-394.
Ney, H., Essen, U. and Kneser, R., 1994. On
structuring probabilistic dependences in
stochastic language modelling. Computer
Speech & Language, 8(1), pp.1-38.
Nadas, A., 1984. Estimation of probabilities
in the language model of the IBM speech
recognition system. IEEE Transactions on
Acoustics, Speech, and Signal Processing,
32(4), pp.859-861.
Generating text, using a language model
Which words give high probability?
37
PLM( ? | "let me send you a")
?        P
the      0.001
a        0.001
mail     0.451
few      0.052
several  0.001
is       0.001
...
send     0.001
Generation as search
38
PLM("nice woman" | "The") =
0.5 * 0.4
PLM("dog has" | "The") =
0.4 * 0.9
(beams=2)
(beams=1)
https://huggingface.co/blog/how-to-generate
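As an illustration of generation as search, the following sketch implements greedy and beam search over a hand-crafted toy distribution chosen to mirror the example above; the words and probabilities are made up, and this is not the Hugging Face implementation:

```python
import heapq

# Toy next-word distributions (probabilities made up to mirror the slide's example).
P = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.7, "drives": 0.3},
}

def beam_search(prefix, steps, beams):
    """Keep the `beams` most probable continuations at each step (beams=1 is greedy)."""
    hypotheses = [(1.0, prefix)]
    for _ in range(steps):
        candidates = []
        for prob, seq in hypotheses:
            for word, p in P.get(seq, {}).items():
                candidates.append((prob * p, seq + (word,)))
        hypotheses = heapq.nlargest(beams, candidates, key=lambda x: x[0])
    return hypotheses

print(beam_search(("The",), steps=2, beams=1))  # greedy: ('The', 'nice', 'woman'), p = 0.20
print(beam_search(("The",), steps=2, beams=2))  # beam=2 finds ('The', 'dog', 'has'), p = 0.36
```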
Generation via sampling
Top-k sampling
Sampling from the k words with the highest probabilities
Top-p sampling (aka. 'Nucleus sampling')
Sampling from the smallest set of most likely words whose cumulative probability reaches p
https://huggingface.co/blog/how-to-generate
* sorted from high to low

?        P*
mail     0.441
letter   0.348
card     0.031
any      0.005
several  0.001
is       0.001
...
send     0.000

K=3
p=0.8
Generation via sampling: the temperature parameter
Increasing the temperature (↑) makes the distribution flatter (compare with the distribution above):

?        P*
mail     0.151
letter   0.142
card     0.024
any      0.012
several  0.010
is       0.003
...
send     0.003

K=3
p=0.8
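Below is a small, self-contained sketch of top-k sampling, top-p (nucleus) sampling, and temperature scaling, applied to the (renormalized) example distribution above; the function names are illustrative, not a library API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative next-word distribution from the slide (sorted high to low);
# the "..." rest of the vocabulary is dropped, so we renormalize.
words = ["mail", "letter", "card", "any", "several", "is", "send"]
probs = np.array([0.441, 0.348, 0.031, 0.005, 0.001, 0.001, 0.000])
probs = probs / probs.sum()

def top_k_sample(words, probs, k):
    # Keep only the k most probable words, renormalize, and sample.
    idx = np.argsort(probs)[::-1][:k]
    p = probs[idx] / probs[idx].sum()
    return rng.choice(np.array(words)[idx], p=p)

def top_p_sample(words, probs, p_threshold):
    # Keep the smallest set of most probable words whose cumulative
    # probability reaches p_threshold (nucleus sampling), then sample.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p_threshold) + 1
    idx = order[:cutoff]
    p = probs[idx] / probs[idx].sum()
    return rng.choice(np.array(words)[idx], p=p)

def apply_temperature(probs, T):
    # Higher T flattens the distribution, lower T sharpens it.
    logits = np.log(probs + 1e-12)
    scaled = np.exp(logits / T)
    return scaled / scaled.sum()

print(top_k_sample(words, probs, k=3))
print(top_p_sample(words, probs, p_threshold=0.8))
print(np.round(apply_temperature(probs, T=2.0), 3))
```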
N-gram language model
41
Pbigram("let", "me", "send", "you", "a", "mail") =
Pbigram("mail" | "a") "let" , "me", "send", "you",
* Pbigram("a" | "you") "let", "me", "send",
* Pbigram("you" | "send") "let", "me",
* Pbigram("send" | "me") "let",
* Pbigram("me" | "let")
* Pbigram("<S>", "let")
P( w2 | w1) = N(w1, w2) / N(w1)
By counting!
Markov assumption: next word
only depends on limited history,
for bigrams history of 1 word.
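A minimal sketch of estimating a bigram model by counting, following P(w2 | w1) = N(w1, w2) / N(w1); the toy corpus is made up for illustration:

```python
from collections import Counter

# Toy corpus; in practice the counts come from a large text collection.
corpus = [
    ["<S>", "let", "me", "send", "you", "a", "mail"],
    ["<S>", "let", "me", "send", "you", "a", "letter"],
    ["<S>", "let", "me", "know"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        unigram_counts[w1] += 1
        bigram_counts[(w1, w2)] += 1

def p_bigram(w2, w1):
    # Maximum-likelihood estimate: P(w2 | w1) = N(w1, w2) / N(w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def p_sentence(sentence):
    p = 1.0
    for w1, w2 in zip(sentence, sentence[1:]):
        p *= p_bigram(w2, w1)
    return p

print(p_bigram("me", "let"))   # 1.0 in this toy corpus
print(p_sentence(["<S>", "let", "me", "send", "you", "a", "mail"]))  # (2/3) * (1/2) ≈ 0.33
```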
Bigram counts are zero? Smoothing
42
There are many
• Kneser-Ney LMs
• Laplace LMs
• Skip-gram LMs
• Cache-based LMs
• Topic LMs
• Sentence-Mixture LMs
• Cluster-based LMs
…
Overview:
• Goodman, Joshua T. "A bit of progress in language modeling, extended version." Microsoft Research Technical Report MSR-TR-2001-72 (2001).
Back-off / interpolation
e.g., interpolate with an (n-1)-gram model:
Pbigram-interpolated("mail" | "a") = (1-λ) Pbigram("mail" | "a") + λ Punigram("mail")
Laplace smoothing
Pbigram-laplace("mail" | "a") = ( N("a", "mail") + 1 ) / ( N("a") + |V| )
Evaluation
Held out test set:
• Word error rate (proportion of deletions, insertions, substitutions)
• Perplexity
Caution (when comparing results)!
• Was the same test set used?
• Do the models use the same target vocabulary?
43
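For reference, perplexity on a held-out test set is the exponential of the average negative log-probability per token. A rough sketch, where lm_prob is a placeholder for whatever language model is being evaluated:

```python
import math

def perplexity(test_sentences, lm_prob):
    # lm_prob(word, history) must return a probability > 0 (hence smoothing).
    log_prob_sum, n_tokens = 0.0, 0
    for sentence in test_sentences:
        history = ["<S>"]
        for word in sentence:
            p = lm_prob(word, history)
            log_prob_sum += math.log(p)
            n_tokens += 1
            history.append(word)
    return math.exp(-log_prob_sum / n_tokens)

# Example with a (useless) uniform model over a 10,000-word vocabulary:
print(perplexity([["let", "me", "send", "you", "a", "mail"]],
                 lambda w, h: 1 / 10_000))   # perplexity = 10000.0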
44
A small detour, what is deep learning?
45
Deep Learning
One artificial neuron
46
Prediction
Input vector
Parameters
One artificial neuron
47
Activation functions
48
Logistic regression, as one artificial neuron
49
A feed forward neural network
50
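A minimal numpy sketch of the building blocks above: one artificial neuron (a weighted sum of the inputs plus a bias, passed through an activation function, i.e., logistic regression) and a small feed-forward network; the sizes and values are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One artificial neuron (= logistic regression).
x = np.array([0.5, -1.2, 3.0])   # input vector
w = np.array([0.4, 0.1, -0.7])   # parameters (weights)
b = 0.2                          # parameter (bias)
prediction = sigmoid(w @ x + b)
print(prediction)

# A tiny feed-forward network: the same idea, applied layer by layer.
W1, b1 = np.random.randn(4, 3), np.zeros(4)   # hidden layer (4 neurons)
W2, b2 = np.random.randn(1, 4), np.zeros(1)   # output layer (1 neuron)
hidden = np.tanh(W1 @ x + b1)
output = sigmoid(W2 @ hidden + b2)
print(output)
```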
Error (aka. Loss) function
51
Prediction Reference
Error (aka. Loss) function
Prediction vs. Reference
Parameter update: wi(t+1) = wi(t) − μ · ∂E/∂wi (with learning rate μ)
Finding the lowest point (i.e., the model parameters that result in the lowest error)
… but there is thick fog …
53
Gradient descent
i.e., walking downhill in the steepest direction
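The update rule above, as a bare-bones sketch: repeatedly step against the gradient of a made-up one-parameter error function (purely illustrative):

```python
# Gradient descent update w <- w - μ * dE/dw, minimizing E(w) = (w - 3)^2.
def error(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)

w = -5.0   # (random) initialization
mu = 0.1   # learning rate μ
for step in range(50):
    w = w - mu * gradient(w)

print(w, error(w))   # w ends up close to 3, the minimum
```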
No global optimum guaranteed
In contrast to fitting generalized linear models
• Current training procedures (generally based on gradient descent) for deep neural
nets do not guarantee identifying the overall best parameters for the
training data.
• So, initialization of the parameters matters!
• And, repeatedly training a network may result in different models!
54
?
Deep learning
56
Deep learning (having more than 1 hidden layer)
57
Now, back to neural language models
58
Language as input: one-hot encoding
59
"let", "me", "send", "you", "a"
ID      Token
1       the
2       a
3       me
4       few
5       you
6       let
...     ...
32194   send

One-hot vectors (one per input token, dimension = vocabulary size):
"let"   → [0 0 0 0 0 1 ... 0]   (1 at ID 6)
"me"    → [0 0 1 0 0 0 ... 0]   (1 at ID 3)
"send"  → [0 0 0 0 0 0 ... 1]   (1 at ID 32194)
"you"   → [0 0 0 0 1 0 ... 0]   (1 at ID 5)
"a"     → [0 1 0 0 0 0 ... 0]   (1 at ID 2)
Limitation:
• Dimensionality is as high
as the vocabulary size
(which can be around
10-100k tokens).
• Sparse vectors
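A small sketch of one-hot encoding with a toy vocabulary; the IDs and size are illustrative, and a real vocabulary would have tens of thousands of entries:

```python
import numpy as np

# Toy vocabulary mapping tokens to indices.
vocab = {"the": 0, "a": 1, "me": 2, "few": 3, "you": 4, "let": 5, "send": 6}

def one_hot(token):
    v = np.zeros(len(vocab), dtype=int)   # dimensionality = vocabulary size
    v[vocab[token]] = 1                   # sparse: a single 1, rest zeros
    return v

for token in ["let", "me", "send", "you", "a"]:
    print(token, one_hot(token))
```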
Word embedding
60
"let", "me", "send", "you", "a"
"let"   → [0.88 0.21 0.87 0.23 0.01 0.02 ... 0.01]
"me"    → [0.00 0.99 0.89 0.12 0.01 0.02 ... 0.01]
"send"  → [0.78 0.22 0.09 0.11 0.67 0.01 ... 0.02]
"you"   → [0.02 0.80 0.90 0.03 0.04 0.01 ... 0.09]
"a"     → [0.01 0.02 0.70 0.11 0.01 0.02 ... 0.04]
Dimensions represent word properties (not unique words), e.g., "Acts like a verb?"
Pro's:
• Lower dimensionality (parameter sharing)
• Dense continuous vectors
Con's:
• Interpretation (often the vectors are learned / latent).
Word embedding
61
"let", "me", "send", "you", "a"
"let"   → [0.88 0.21 0.87 0.23 0.01 0.02 ... 0.01]
"me"    → [0.00 0.99 0.02 0.12 0.01 0.02 ... 0.01]
"send"  → [0.78 0.03 0.09 0.11 0.67 0.01 ... 0.02]
"you"   → [0.02 0.80 0.90 0.03 0.04 0.01 ... 0.09]
"a"     → [0.01 0.02 0.70 0.11 0.01 0.02 ... 0.04]
Another property dimension: "Could start a sentence?"
Skip-gram based word embeddings
• Start with a model with one hidden
layer (activations from this layer are the word
embeddings)
• Predict, for each word, the n closest words in its surrounding context window (here, n = 2)
Assumption: words with similar contexts have similar 'meaning' (i.e., informative properties).
62
Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
Image:
https://techs0uls.wordpress.com/2020/03/16/
word-similarity-and-analogy-with-skip-gram/
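One common way to train skip-gram embeddings in practice is gensim's Word2Vec with sg=1; a small sketch with a toy corpus (real training data would be much larger, and the hyperparameters here are arbitrary):

```python
from gensim.models import Word2Vec

# Toy corpus; each "sentence" is a list of tokens.
sentences = [
    ["let", "me", "send", "you", "a", "mail"],
    ["let", "me", "send", "you", "a", "letter"],
    ["she", "sent", "him", "a", "letter"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word embeddings
    window=2,         # context window: the 2 closest words on each side
    sg=1,             # 1 = skip-gram, 0 = CBOW
    min_count=1,
)

print(model.wv["mail"][:5])            # the learned embedding (first 5 dimensions)
print(model.wv.most_similar("mail"))   # nearest words by cosine similarity
```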
'Early' neural language models
64
Mikolov, Tomas, et al. "Recurrent neural
network based language model."
Interspeech. Vol. 2. No. 3. 2010.
Elman, Jeffrey L. "Finding
structure in time." Cognitive
science 14.2 (1990): 179-
211.
Werbos, Paul J.
"Backpropagation through
time: what it does and how
to do it." Proceedings of the
IEEE 78.10 (1990): 1550-
1560.
Bengio, Yoshua, et al. "A neural probabilistic
language model." Advances in neural information
processing systems. Vol. 13. 2000.
'Early' neural language models
65
Rolled out
'Early' neural language models
66
* Both still got the best results when combined with n-gram backoff models!
LSTMs and GRUs
Motivation:
• vanishing gradients in RNNs
67
Image: https://aiml.com/compare-the-
different-sequence-models-rnn-lstm-gru-
and-transformers/
LSTMs and GRUs use "gates" to learn what information to maintain along the sequence.
Hochreiter, Sepp, and Jürgen Schmidhuber.
"Long short-term memory." Neural
computation 9.8 (1997): 1735-1780.
Chung, Junyoung, et al. "Empirical evaluation
of gated recurrent neural networks on
sequence modeling." NeurIPS (2014).
Attention is all you need
Motivation:
• Better incorporation of history
• Computation
Components
• Attention
• Position encoding
69
Vaswani, Ashish et al.
"Attention is all you
need." NeurIPS (2017).
Attention
Main idea: the representation of each token can be transformed based on the context, without inherently prioritizing closer words (as LSTMs / GRUs / RNNs do).
70
Image:
https://medium.com/@vmirly/tutorial-on-
scaled-dot-product-attention-with-pytorch-
implementation-from-scratch-66ed898bf817
Implementation:
• Each word carries three vectors:
o A Query (the word's own vector, to be updated)
o A Key (representing which words should 'match')
o A Value (how words should transform other words)
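A numpy sketch of the scaled dot-product attention described above, Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V, with random toy matrices standing in for learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # how well each query matches each key
    weights = softmax(scores, axis=-1)    # one attention distribution per token
    return weights @ V                    # context-dependent mix of the values

# Toy example: 5 tokens ("let me send you a"), embedding dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # token representations
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))  # stand-in projections
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)                                     # (5, 8): one updated vector per token
```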
Encoders vs. Decoders
Encoders:
- Use left and right context
Decoders:
- Use only left context
(which allows for generation)
71
Vaswani, Ashish et al.
"Attention is all you
need." NeurIPS (2017).
Encoder
Decoder
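One way to picture the encoder/decoder difference is through the attention mask; a purely illustrative sketch of which positions may attend to which:

```python
import numpy as np

# Encoders see both sides; decoders mask out the right (future) context.
n = 5  # sequence length
encoder_mask = np.ones((n, n), dtype=int)            # every token sees every token
decoder_mask = np.tril(np.ones((n, n), dtype=int))   # token i sees tokens 0..i only
print(decoder_mask)
# In attention, masked-out scores are set to -inf before the softmax,
# so a decoder can be used left-to-right for generation.
```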
GPT-1
Training:
1. Pretraining (i.e., the language modeling part)
2. Task specific training
72
"Our language model achieves a very low token level
perplexity of 18.4"
GPT-1
73
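For the practical, generation with a pretrained decoder-style model can be tried through the Hugging Face transformers pipeline; a small sketch, where gpt2 is used only as an example model name and the sampling settings are arbitrary:

```python
from transformers import pipeline

# Sketch: load a small pretrained decoder model and sample a continuation.
generator = pipeline("text-generation", model="gpt2")

out = generator(
    "Let me send you a",
    max_new_tokens=20,
    do_sample=True,     # sample instead of greedy/beam search
    top_k=50,           # top-k sampling
    top_p=0.95,         # nucleus sampling
    temperature=0.8,    # < 1 sharpens, > 1 flattens the distribution
)
print(out[0]["generated_text"])
```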
Large language models, there are many ...
Proprietary
• Easy to plug and play in the browser
• Seemingly very good performance
• Internet connection required
• Not transparent
• Not in own control (pro and con)
74
Open source
• Requires a bit of coding to set up
• Seemingly mixed performance
• No internet required
• Transparent
• In own control (pro and con)
Opportunities and risks
75
August 7, 2024
Nov 13, 2023
Hallucinations
• A hallucination ... is a response generated by AI which contains false or misleading information presented as fact (Wikipedia; accessed 13-08-2024).
• In automatic summarization: "A summary S of a document D contains a factual
hallucination if it contains information not found in D that is factually correct." [Maynez,
Joshua, et al. "On Faithfulness and Factuality in Abstractive Summarization." Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics. 2020.]
76
Generally defined at the meaning level, not at the form level.
Using LMs for generation: what can we expect?
Remember the first ASR application:
- the LM was used to rerank ASR candidates
1) it did not generate text fully on its own
2) and it was about language form, not meaning
77
Reasonable assumption to consider
• Likely responses (based on data) are meaningful responses.
78
Usage policies
79
Regulations
80
AI Act | Medical Device Regulation
A final note on application in healthcare research
81
Evaluate first
1. On your own data
2. Using the metrics that matter to you
A model may be great at one
application or textual domain but fail
completely for another.
And vice versa!
You made it to the end! Questions?
82