2. Today
9:00-12:00 Lecture
▪ The research field and some applications
▪ Language properties and theory
▪ Language modeling with neural networks
▪ Evaluation
13:00-16:00 Fun practical!
▪ Tokenization
▪ Word embeddings
▪ Transformers
5. Classic* NLP Tasks
• Parsing (SIGPARSE, since 1989)
• Information extraction (Many, since 1997)
• Machine translation (MT summit/EAMT/AMTA, since 1989)
• Speech synthesis & speech recognition (INTERSPEECH, since 1989)
• Text summarization (DUC, since 2000)
• Question answering (TREC-QA, 1999)
• Chatbots (SIGdial, since 1998)
• Language generation (INLG, since 1990)
Flagship conferences
• ACL + NAACL, EACL since 1979
• EMNLP since 1996
Language resources and annotation:
• LREC since 1998
Evaluation:
• SemEval since 1998
6. NLP in Medicine
In clinical care
- Summarization
o Of patient records
- Question answering
o By patients about their care
o By physicians about an EHR record
- Automatic note generation
o Discharge summaries
o Consultation preparation
- Prediction in electronic medical records
In medical research
- Efficient search in systematic reviews
o Semi-automation of screening
o Automatic data extraction
(e.g., drug-drug interactions, PICO elements, risk of bias)
- Extracting information from clinical notes
o (Rare) case identification
o Predictors / risk factors / exposures
o Study outcomes
- Scientific writing
o Editing
12. Meaning ≠ use ≠ form
Syntax: words, sentences
Phonology: speech
Semantics: objects, properties, relations, events, timings, locations
Pragmatics: relevance, implicature
Colorless green ideas sleep furiously
(grammatical, but no meaning)
A: will you join the party? B: I have to work.
(implicature: no, I will not join the party)
H. P. Grice
N. Chomsky
14. Meaning ≠ use ≠ form
Syntax: words, sentences
Phonology: speech
Semantics: objects, properties, relations, events, timings, locations
Pragmatics: relevance, implicature
! This is what we generally have (raw text data)
15. Shannon's noisy channel (A Mathematical Theory of Communication, 1948)
Shannon information:
1. An event with probability 100% is perfectly unsurprising and yields no information.
2. The less probable an event is, the more surprising it is and the more information it yields.
Efficient transfer of information with limited/no errors:
1. Efficient encoding: using fewer symbols/shorter codes for frequent events (e.g., Huffman encoding)
2. Some redundancy in your message (i.e., predictability from context)
16. Shannon's noisy channel (A Mathematical Theory of Communication, 1948)
Shannon information:
1. An event with probability 100% is perfectly unsurprising and yields no information.
2. The less probable an event is, the more surprising it is and the more information it yields.
Efficient transfer of information with limited/no errors:
1. Efficient encoding: using fewer symbols for frequent events (e.g., Huffman encoding)
2. Some redundancy in your message (i.e., predictability from context)
Interesting quote to test your understanding
"If the redundancy is zero any sequence of letters is a reasonable text in the language
and any two-dimensional array of letters forms a crossword puzzle."
17. Uniform information density (Fenk & Fenk, 1980)
Uniform information den...y means that each part of . sentence
carries more or …. the same amuont of infromatoin.
So, depsite some niose, … inofrmatoin trnasfer remians qiuet susccsufl!
Why can we read this?
18. Uniform information density (Fenk & Fenk, 1980)
Uniform information den...y means that each part of . sentence
carries more or …. the same amuont of infromatoin.
So, depsite some niose, … inofrmatoin trnasfer remians qiuet susccsufl!
Why can we read this?
Across the sequence, missing parts are approximately equally predictable from the context.
19. Uniform information density (Fenk & Fenk, 1980)
統一資..訊否認意味著. 句子
攜帶更多或…。 同樣數量的信息。 號
因此,儘管有一些麻煩,......... 訊息傳輸仍然成功!
Why can we read this?
Across the sequence, missing parts are approximately equally predictable from the context.
By us humans! Because we live and breathe language!
20. Important properties of language
Compositionality: combination of parts
(e.g., word structure, sentence structure, language context)
Linguistic variation: the same meaning, expressed differently
(e.g., synonyms, abbreviations, regional/individual variation, rare words)
Ambiguity: the same expression, different meaning
(e.g., lexical/word level, sentence level, ...)
Incompleteness: world knowledge is needed for interpretation/production
(e.g., laws of physics, social norms, world facts)
22.
"A large bacterium is cycling in the desert."
Compositionality
Efficient encoding
24. Chomsky Hierarchy
• How do natural languages compose?
• What consequences does this have for computation?
• Memory complexity
• Time complexity
(w.r.t. parsing: determining whether a sequence is grammatically well-formed, i.e., whether it is part of the language)
(1956)
25. Chomsky Hierarchy
• How do natural languages compose?
• What consequences does this have for computation?
• Memory complexity
• Time complexity
(w.r.t. parsing: determining whether a sequence is grammatically well-formed, i.e., whether it is part of the language)
LSTMs and Transformers
(1956)
26. Compositionality
! But not all language is compositional (idioms, proper names)
Under the weather
The elephant in the room
Golden Gate Bridge
27. Ambiguity and linguistic variation
Many to many:
- One form with multiple meanings
- One meaning expressible in multiple forms
29. Ambiguity and linguistic variation
Many to many:
- One form with multiple meanings
- One meaning expressible in multiple forms
Often resolvable in context.
31. A language model
Probability("let", "me", "send", "you", "a", "mail")
❖ How likely is it that we observe this utterance?
Shannon, C.E., 1951. Prediction and entropy of printed English. Bell System Technical Journal, 30(1), pp. 50-64.
32. A language model
PLM("let", "me", "send", "you", "a", "mail") =
PLM("mail" | "let", "me", "send", "you", "a")
* PLM("a" | "let", "me", "send", "you")
* PLM("you" | "let", "me", "send")
* PLM("send" | "let", "me")
* PLM("me" | "let")
* PLM("<S>", "let")
Shannon, C.E., 1951. Prediction and entropy of printed English. Bell System Technical Journal, 30(1), pp. 50-64.
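A minimal Python sketch of this chain-rule factorization; cond_prob is a hypothetical stand-in for any language model that returns P(token | history), here just a toy uniform model over a tiny vocabulary.

```python
# Minimal sketch of the chain-rule factorization above. `cond_prob` is a
# hypothetical stand-in for any language model returning P(token | history).
def cond_prob(token, history):
    vocab = {"let", "me", "send", "you", "a", "mail"}
    return 1.0 / len(vocab) if token in vocab else 0.0

def sequence_probability(tokens, start_symbol="<S>"):
    # P(w1..wn) = P(w1 | <S>) * P(w2 | <S>, w1) * ... * P(wn | <S>, w1..wn-1)
    prob, history = 1.0, [start_symbol]
    for token in tokens:
        prob *= cond_prob(token, tuple(history))
        history.append(token)
    return prob

print(sequence_probability(["let", "me", "send", "you", "a", "mail"]))
```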
33. ! Probability distribution
• The vocabulary V should be defined.
• Probabilities should sum to 1.
ID    Token
1     the
2     a
3     mail
4     few
5     several
6     is
...   ...
N     send

?         P
the       0.001
a         0.001
mail      0.451
few       0.052
several   0.001
is        0.001
...       ...
send      0.001
PLM( ? | "let", "me", "send", "you", "a")
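A minimal sketch (not from the slides) of how arbitrary model scores over a fixed vocabulary V are turned into a probability distribution that sums to 1 (softmax); the scores below are made up.

```python
import math

# Minimal sketch: turning made-up model scores over a fixed vocabulary into a
# probability distribution that sums to 1 (softmax).
vocab = ["the", "a", "mail", "few", "several", "is", "send"]
scores = [0.1, 0.1, 6.2, 4.0, 0.1, 0.1, 0.1]   # hypothetical scores for P(? | "let", "me", "send", "you", "a")

exp_scores = [math.exp(s) for s in scores]
total = sum(exp_scores)
p = {w: e / total for w, e in zip(vocab, exp_scores)}

print(p)                 # e.g., "mail" gets most of the probability mass
print(sum(p.values()))   # 1.0 (up to floating-point error)
```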
34. What is one of the first uses of language models?
35. First use: automatic speech recognition
PLM(let me send you a mail)
?
PLM(Lett mi scent you a mail)
Katz, S., 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), pp. 400-401.
Chen, S.F. and Goodman, J., 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4), pp. 359-394.
Ney, H., Essen, U. and Kneser, R., 1994. On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language, 8(1), pp. 1-38.
Nadas, A., 1984. Estimation of probabilities in the language model of the IBM speech recognition system. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(4), pp. 859-861.
36. Generating text, using a language model
Which words give high probability?
PLM( ? | "let me send you a")
?         P
the       0.001
a         0.001
mail      0.451
few       0.052
several   0.001
is        0.001
...       ...
send      0.001
38. Generation via sampling
Top-k sampling
Sampling from the top-k highest probabilities
Top-p sampling (aka. 'Nucleus sampling')
Sampling from the smallest set of most likely words whose cumulative probability reaches p
https://huggingface.co/blog/how-to-generate
* sorted from high to low
?         P*
mail      0.441
letter    0.348
card      0.031
any       0.005
several   0.001
is        0.001
...       ...
send      0.000
K=3

?         P*
mail      0.441
letter    0.348
card      0.031
any       0.005
several   0.001
is        0.001
...       ...
send      0.000
p=0.8
39. Generation via sampling
Top-k sampling
Sampling from the top-k highest probabilities
Top-p sampling (aka. 'Nucleus sampling')
Sampling from the smallest set of most likely words whose cumulative probability reaches p
https://huggingface.co/blog/how-to-generate
?         P*
mail      0.441
letter    0.348
card      0.031
any       0.005
several   0.001
is        0.001
...       ...
send      0.000

* sorted from high to low

Temperature parameter ↑ (the same distribution, flattened by a higher temperature; p=0.8, K=3):

?         P*
mail      0.151
letter    0.142
card      0.024
any       0.012
several   0.010
is        0.003
...       ...
send      0.003
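A minimal Python sketch of temperature scaling, top-k and top-p sampling over a toy next-token distribution like the tables above; the probabilities and parameter values are illustrative only.

```python
import random

# Minimal sketch of temperature, top-k and top-p (nucleus) sampling on a toy
# next-token distribution.
probs = {"mail": 0.441, "letter": 0.348, "card": 0.031, "any": 0.005,
         "several": 0.001, "is": 0.001, "send": 0.000}

def apply_temperature(probs, temperature):
    # Higher temperature flattens the distribution, lower temperature sharpens it.
    scaled = {w: p ** (1.0 / temperature) for w, p in probs.items() if p > 0}
    z = sum(scaled.values())
    return {w: p / z for w, p in scaled.items()}

def top_k_sample(probs, k):
    # Sample only from the k most probable words.
    top = sorted(probs.items(), key=lambda x: x[1], reverse=True)[:k]
    words, weights = zip(*top)
    return random.choices(words, weights=weights)[0]

def top_p_sample(probs, p):
    # Sample from the smallest set of most probable words reaching cumulative mass p.
    ranked = sorted(probs.items(), key=lambda x: x[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for word, prob in ranked:
        nucleus.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    words, weights = zip(*nucleus)
    return random.choices(words, weights=weights)[0]

flat = apply_temperature(probs, temperature=2.0)
print(top_k_sample(probs, k=3), top_p_sample(probs, p=0.8), top_k_sample(flat, k=3))
```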
40. N-gram language model
Pbigram("let", "me", "send", "you", "a", "mail") =
Pbigram("mail" | "a") "let" , "me", "send", "you",
* Pbigram("a" | "you") "let", "me", "send",
* Pbigram("you" | "send") "let", "me",
* Pbigram("send" | "me") "let",
* Pbigram("me" | "let")
* Pbigram("<S>", "let")
P( w2 | w1) = N(w1, w2) / N(w1)
By counting!
Markov assumption: the next word depends only on a limited history; for bigrams, a history of 1 word.
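A minimal sketch of estimating bigram probabilities by counting, on a made-up two-sentence corpus (see below).

```python
from collections import Counter

# Minimal sketch: estimating bigram probabilities by counting on a toy corpus,
# following P(w2 | w1) = N(w1, w2) / N(w1).
corpus = [["<S>", "let", "me", "send", "you", "a", "mail"],
          ["<S>", "let", "me", "send", "you", "a", "letter"]]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        unigram_counts[w1] += 1
        bigram_counts[(w1, w2)] += 1

def p_bigram(w2, w1):
    return bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] else 0.0

print(p_bigram("mail", "a"))    # 0.5 in this toy corpus
print(p_bigram("card", "a"))    # 0.0 -> motivates smoothing (next slide)
```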
41. Bigram counts are zero? Smoothing
There are many
• Kneser-Ney LMs
• Laplace LMs
• Skip-gram LMs
• Cache-based LMs
• Topic LMs
• Sentence-Mixture LMs
• Cluster-based LMs
…
Overview:
• Goodman, Joshua T. "A bit of progress in language modeling, extended version." Microsoft Research Technical Report MSR-TR-2001-72 (2001).
Back-off model
e.g., interpolate with an (n-1)-gram model:
Pbigram-backed-off("mail" | "a") = (1 - λ) · Pbigram("mail" | "a") + λ · Punigram("mail")
Laplace smoothing
Pbigram-laplace("mail" | "a") = (N("a", "mail") + 1) / (N("a") + |V|), with |V| the vocabulary size
42. Evaluation
Held out test set:
• Word error rate (proportion of deletions, insertions, substitutions)
• Perplexity
Caution (when comparing results)!
• Was the same test set used?
• Do the models use the same target vocabulary?
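A minimal sketch of perplexity on a held-out sequence; cond_prob stands in for any smoothed language model (not a specific implementation from the slides).

```python
import math

# Minimal sketch: perplexity of a language model on a held-out token sequence.
# `cond_prob` stands in for any smoothed model returning P(token | history) > 0.
def perplexity(tokens, cond_prob, start_symbol="<S>"):
    history, log_prob = [start_symbol], 0.0
    for token in tokens:
        log_prob += math.log(cond_prob(token, tuple(history)))
        history.append(token)
    return math.exp(-log_prob / len(tokens))

# A uniform model over a 1000-word vocabulary has perplexity exactly 1000.
print(perplexity(["let", "me", "send"], lambda token, history: 1 / 1000))
```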
51. Error (aka. Loss) function
Prediction vs. reference
wi(t+1) = wi(t) - μ · ∂E/∂wi (with learning rate μ)
52. Finding the lowest point (i.e., model parameters that result in the lowest error)
… but there is thick fog …
Gradient descent, i.e., walking downhill in the steepest direction
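A minimal sketch of gradient descent on a toy one-parameter error function (E(w) = (w - 3)^2, chosen only for illustration); in a real network the gradient comes from backpropagation over all parameters.

```python
# Minimal sketch of gradient descent on a toy error function E(w) = (w - 3)^2,
# whose gradient is dE/dw = 2 * (w - 3).
def grad(w):
    return 2 * (w - 3)

w, mu = 0.0, 0.1           # initialization and learning rate
for step in range(50):
    w = w - mu * grad(w)   # w(t+1) = w(t) - mu * dE/dw

print(w)                   # close to the minimum at w = 3
```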
53. No global optimum guaranteed
In contrast to fitting generalized linear models
• Current training procedures (generally based on gradient descent) for deep neural nets do not guarantee identifying the overall best parameters for the training data.
• So, initialization of the parameters matters!
• And, repeatedly training a network may result in different models!
57. Language as input: one-hot encoding
"let", "me", "send", "you", "a"
ID      Token
1       the
2       a
3       me
4       few
5       you
6       let
...     ...
32194   send

[One-hot vectors for "let", "me", "send", "you", "a": each is a vector of vocabulary length with a 1 at that token's ID and 0 everywhere else.]

Limitation:
• Dimensionality is as high as the vocabulary size (which can be around 10-100k tokens).
• Sparse vectors
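A minimal sketch of one-hot encoding against a toy vocabulary (the vocabulary and IDs are illustrative).

```python
# Minimal sketch: one-hot encoding tokens against a toy vocabulary.
vocab = ["the", "a", "me", "few", "you", "let", "send"]
token_to_id = {token: i for i, token in enumerate(vocab)}

def one_hot(token):
    vec = [0] * len(vocab)          # a vector as long as the vocabulary ...
    vec[token_to_id[token]] = 1     # ... with a single 1 at the token's ID
    return vec

sentence = ["let", "me", "send", "you", "a"]
print([one_hot(t) for t in sentence])
```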
60. Skip-gram based word embeddings
• Start with a model with one hidden layer (the activations from this layer are the word embeddings)
• Predict, for each word, the closest n words in its surrounding context window (here, 2)
Assumption: words with similar contexts have similar 'meaning' (i.e., informative properties).
Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
61. Skip-gram based word embeddings
• Start with a model with one hidden layer (the activations from this layer are the word embeddings)
• Predict, for each word, the closest n words in its surrounding context window (here, 2)
Assumption: words with similar contexts have similar 'meaning' (informative properties).
Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
Image: https://techs0uls.wordpress.com/2020/03/16/word-similarity-and-analogy-with-skip-gram/
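A minimal sketch of how (centre word, context word) training pairs for a skip-gram model can be generated with a context window of 2; the sentence is a toy example.

```python
# Minimal sketch: generating (centre word, context word) training pairs for a
# skip-gram model with a context window of 2.
sentence = ["let", "me", "send", "you", "a", "mail"]
window = 2

pairs = []
for i, centre in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((centre, sentence[j]))

print(pairs[:6])
# A network with one hidden layer is then trained to predict the context word
# from the centre word; the hidden-layer activations become the word embeddings.
```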
62. 'Early' neural language models
Mikolov, Tomas, et al. "Recurrent neural network based language model." Interspeech, Vol. 2, No. 3, 2010.
Elman, Jeffrey L. "Finding structure in time." Cognitive Science 14.2 (1990): 179-211.
Werbos, Paul J. "Backpropagation through time: what it does and how to do it." Proceedings of the IEEE 78.10 (1990): 1550-1560.
Bengio, Yoshua, et al. "A neural probabilistic language model." Advances in Neural Information Processing Systems, Vol. 13, 2000.
63. 'Early' neural language models
Mikolov, Tomas, et al. "Recurrent neural network based language model." Interspeech, Vol. 2, No. 3, 2010.
Elman, Jeffrey L. "Finding structure in time." Cognitive Science 14.2 (1990): 179-211.
Werbos, Paul J. "Backpropagation through time: what it does and how to do it." Proceedings of the IEEE 78.10 (1990): 1550-1560.
Rolled out
64. 'Early' neural language models
Mikolov, Tomas, et al. "Recurrent neural network based language model." Interspeech, Vol. 2, No. 3, 2010.
Elman, Jeffrey L. "Finding structure in time." Cognitive Science 14.2 (1990): 179-211.
Werbos, Paul J. "Backpropagation through time: what it does and how to do it." Proceedings of the IEEE 78.10 (1990): 1550-1560.
Bengio, Yoshua, et al. "A neural probabilistic language model." Advances in Neural Information Processing Systems, Vol. 13, 2000.
* Both still got the best results when combined with n-gram backoff models!
65. LSTMs and GRUs
Motivation:
• vanishing gradients in RNNs
Image: https://aiml.com/compare-the-different-sequence-models-rnn-lstm-gru-and-transformers/
LSTMs and GRUs use "gates" to learn what information to maintain along the sequence.
Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780.
Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling." NeurIPS (2014).
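A minimal sketch (assuming PyTorch is available) of an LSTM reading a short token sequence and scoring the next token; all sizes are arbitrary and only meant to illustrate the data flow.

```python
import torch
import torch.nn as nn

# Minimal sketch (assuming PyTorch): an LSTM over a short token sequence plus a
# linear layer that scores the next token; sizes are arbitrary.
vocab_size, embed_dim, hidden_dim = 1000, 32, 64

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)

token_ids = torch.randint(0, vocab_size, (1, 5))   # batch of 1, sequence of 5
outputs, (h_n, c_n) = lstm(embedding(token_ids))   # gates decide what to keep/forget
next_token_logits = to_vocab(outputs[:, -1, :])    # scores for the next token
print(next_token_logits.shape)                     # torch.Size([1, 1000])
```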
66. Attention is all you need
Motivation:
• Better incorporation of history
• Computation
Components
• Attention
• Position encoding
Vaswani, Ashish, et al. "Attention is all you need." NeurIPS (2017).
67. Attention
Main idea: the representation of each token can be transformed based on the context (without prioritizing closer words, as LSTMs / GRUs / RNNs do).
Image: https://medium.com/@vmirly/tutorial-on-scaled-dot-product-attention-with-pytorch-implementation-from-scratch-66ed898bf817
Implementation:
• Each word carries three vectors:
o A Query (the word vector itself, which is to be updated)
o A Key (representing which words should 'match')
o A Value (how words should transform other words)
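A minimal sketch (assuming PyTorch) of scaled dot-product attention with random projection weights, purely for illustration of how queries, keys, and values interact.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (assuming PyTorch): scaled dot-product attention over a toy
# sequence of 5 token vectors; projection weights are random, for illustration.
d_model = 16
x = torch.randn(5, d_model)                  # 5 token representations

W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v          # queries, keys, values

scores = Q @ K.T / d_model ** 0.5            # how well each query matches each key
weights = F.softmax(scores, dim=-1)          # attention weights per token
output = weights @ V                         # context-transformed representations
print(output.shape)                          # torch.Size([5, 16])
```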
68. Encoders vs. Decoders
Encoders:
- Use left and right context
Decoders
- Use only left context
(allow for generation)
Vaswani, Ashish, et al. "Attention is all you need." NeurIPS (2017).
[Figure: the Transformer encoder and decoder]
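A minimal sketch (assuming PyTorch) of the causal mask that restricts a decoder to left context; the sequence length is arbitrary.

```python
import torch

# Minimal sketch (assuming PyTorch): the causal mask that restricts a decoder to
# left context; position i may only attend to positions <= i.
seq_len = 5
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(mask)
# In attention, scores at masked-out (False) positions are set to -inf before the
# softmax, so a token never attends to words to its right. An encoder omits this
# mask and uses both left and right context.
```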
69. GPT-1
Training:
1. Pretraining (i.e., the language modeling part)
2. Task-specific training
"Our language model achieves a very low token level
perplexity of 18.4"
71. Large language models, there are many ...
Proprietary
• Easy to plug and play in the browser
• Seemingly very good performance
• Internet connection required
• Not transparent
• Not in own control (pro and con)
Open source
• Requires a bit of coding to setup
• Seemingly mixed performance
• No internet required
• Transparent
• In own control (pro and con)
(Model sizes shown: 2B, 7B, 7B, 3-300B, 7B, 7B, 7B, 1.8TB, 135B, 1.5B, 40B, 13B, 130B, 7B, 1.6TB)
73. Hallucinations
• A hallucination ... is a response generated by AI which contains false or misleading information presented as fact (Wikipedia; 13-08-2024).
• In automatic summarization: "A summary S of a document D contains a factual hallucination if it contains information not found in D that is factually correct." [Maynez, Joshua, et al. "On Faithfulness and Factuality in Abstractive Summarization." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.]
Generally defined on the level of meaning, not of form.
74. Using LMs for generation: what can we expect?
Remember the first ASR application:
- the LM was used to rerank ASR candidates
1) it did not fully generate on its own
2) it was about language form
75. Reasonable assumption to consider
• Likely responses (based on data) are meaningful responses.
78. A final note on application in healthcare research
Evaluate first:
1. On your own data
2. Using the metrics that matter to you
A model may be great at one application or textual domain but fail completely for another.
And vice versa!