CS 886 Deep Learning for Biotechnology
Ming Li
Jan. 21, 2022
CONTENT
01. Word2Vec
02. Attention / Transformers
03. Pretraining: GPT / BERT
04. Deep learning applications in proteomics
05. Student presentations begin
03 LECTURE THREE
Pretraining: GPT-2 and BERT
Avoiding Information bottleneck
03 GPT-2, BERT
Last time we introduced the transformer
03 GPT-2, BERT
Transformers, GPT-2, and BERT
03
1. A transformer uses an Encoder stack to model the input, and a Decoder stack
to model the output (using information from the encoder side).
2. But if we have no separate input and just want to model the "next word", we
can drop the Encoder side of the transformer and emit the next word
one at a time. This gives us GPT.
3. If we are only interested in training a language model of the input for
some other tasks, then we do not need the Decoder of the transformer; that
gives us BERT. (A minimal sketch contrasting the two follows below.)
GPT-2, BERT
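To make the contrast concrete, here is a minimal sketch in PyTorch (illustrative layer sizes, not the actual GPT or BERT configurations): the same self-attention layer behaves GPT-like when a causal mask hides future tokens, and BERT-like when no mask is applied.

```python
import torch
import torch.nn as nn

# The same Transformer layer used two ways (illustrative sizes only).
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
x = torch.randn(1, 10, 768)                  # (batch, sequence length, model dim)

# BERT-like: every position attends to both left and right context.
bert_like = layer(x)

# GPT-like: an additive causal mask (-inf above the diagonal) blocks attention
# to future positions, so each position sees only the tokens to its left.
causal_mask = torch.triu(torch.full((10, 10), float('-inf')), diagonal=1)
gpt_like = layer(x, src_mask=causal_mask)
```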
GPT-2, BERT
03
GPT-2, BERT
03
GPT-2 model sizes: 117M, 345M, 762M, and 1542M parameters
GPT released June 2018
GPT-2 released Nov. 2019 with 1.5B parameters
GPT-3: 175B parameters trained on 45TB texts
GPT-2 in action
GPT-2, BERT
03
(Figure: GPT-2 generating output one token at a time: "not injure a human being".)
Byte Pair Encoding (BPE)
GPT-2, BERT
03
Word embeddings are sometimes too high-level, and pure character embeddings too
low-level. For example, if we have learned
old older oldest
we might also wish the computer to infer
smart smarter smartest
But at the whole-word level this is not so direct. The idea is therefore to
break words up into pieces such as er and est, and embed frequent fragments of
words.
GPT adopts this BPE scheme.
Byte Pair Encoding (BPE)
GPT-2, BERT
03
GPT uses the BPE scheme. The subwords are computed as follows (a minimal code sketch follows the example below):
1. Split each word into a sequence of characters (and append a </w> symbol).
2. Merge the most frequent adjacent pair of symbols.
3. Repeat step 2 until a pre-defined maximum number of subwords
or iterations is reached.
Example (5, 2, 6, 3 are number of occurrences)
{‘l o w </w>’: 5, ‘l o w e r </w>’: 2, ‘n e w e s t </w>’: 6, ‘w i d e s t </w>’: 3 }
{‘l o w </w>’: 5, ‘l o w e r </w>’: 2, ‘n e w es t </w>’: 6, ‘w i d es t </w>’: 3 }
{‘l o w </w>’: 5, ‘l o w e r </w>’: 2, ‘n e w est </w>’: 6, ‘w i d est </w>’: 3 } (est freq. 9)
{‘lo w </w>’: 5, ‘lo w e r </w>’: 2, ‘n e w est</w>’: 6, ‘w i d est</w>’: 3 } (lo freq 7)
…..
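Below is a minimal sketch of this merge loop in Python, following the classic BPE reference procedure rather than GPT-2's actual byte-level tokenizer; it reproduces the counts in the example above (tie-breaking among equally frequent pairs may differ).

```python
import collections
import re

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the given symbol pair into a single symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for step in range(10):                   # pre-defined number of merges
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)     # highest-frequency adjacent pair
    vocab = merge_vocab(best, vocab)
    print(step, best, pairs[best])
```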
Masked Self-Attention (to compute more efficiently)
03 GPT-2, BERT
Masked Self-Attention
03 GPT-2, BERT
Note: encoder-decoder attention block is gone
Masked Self-Attention Calculation
03 GPT-2, BERT
Note: encoder-decoder attention block is gone
Reuse previous computation results: at each step we only
need the q, k, v for the new output
token; there is no need to recompute the others. The additional
computation per step is linear instead of quadratic.
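A minimal sketch of this reuse (single attention head, no batching; names and shapes are illustrative, not GPT-2's actual code): each step computes q, k, v only for the newest token and appends the new key/value to a cache.

```python
import math
import torch

def decode_step(x_new, W_q, W_k, W_v, cache):
    """One masked self-attention step for a single new token.
    x_new: (1, d) embedding of the newest token; cache holds K, V of earlier tokens."""
    q = x_new @ W_q                          # only the new token's q, k, v are computed
    k = x_new @ W_k
    v = x_new @ W_v
    K = torch.cat([cache["K"], k], dim=0)    # reuse cached keys and values
    V = torch.cat([cache["V"], v], dim=0)
    cache["K"], cache["V"] = K, V
    # The new token may attend to itself and to everything before it,
    # so no explicit mask is needed at this step; the cost is linear per step.
    weights = torch.softmax(q @ K.T / math.sqrt(q.size(-1)), dim=-1)
    return weights @ V, cache

d = 64
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
cache = {"K": torch.zeros(0, d), "V": torch.zeros(0, d)}
for _ in range(5):                           # five decoding steps
    out, cache = decode_step(torch.randn(1, d), W_q, W_k, W_v, cache)
```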
The GPT-2 fully connected (feed-forward) network has two layers (example for GPT-2
small)
03 GPT-2, BERT
768 is the model dimension of GPT-2 small
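A minimal sketch of this block for GPT-2 small (768 to 3072 and back to 768; the inner layer is 4x the model size, and GPT-2 uses the GELU activation):

```python
import torch.nn as nn

# Two-layer feed-forward block with GPT-2 small sizes. GPT-2's own code wraps
# these affine maps in a "Conv1D" class, but they are equivalent to nn.Linear.
ffn = nn.Sequential(
    nn.Linear(768, 4 * 768),   # expand 768 -> 3072
    nn.GELU(),
    nn.Linear(4 * 768, 768),   # project back 3072 -> 768
)
```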
GPT-2 has a top-k parameter: we sample each output word from the k
words with the highest softmax probability
03 GPT-2, BERT
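A minimal sketch of top-k sampling (illustrative function; k = 40 is a commonly used value in the GPT-2 demos):

```python
import torch

def sample_top_k(logits, k=40):
    """Sample the next token id from the k highest-probability candidates."""
    topk_logits, topk_ids = torch.topk(logits, k)      # keep only the top k logits
    probs = torch.softmax(topk_logits, dim=-1)          # renormalize over those k
    choice = torch.multinomial(probs, num_samples=1)    # draw one of them
    return topk_ids[choice].item()

# With k = 1 this reduces to greedy decoding (always the argmax),
# which produces the repetitive text shown on the next slide.
```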
If k = 1 (greedy decoding), the output degenerates into repetition, for example:
03 GPT-2, BERT
The first time I saw the new version of the game, I was
so excited. I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
GPT Training
03 GPT-2, BERT
GPT-2 uses an unsupervised learning approach
to train the language model.
There is no task-specific training for GPT-2 and no
separation of pre-training and fine-tuning
as in BERT.
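Concretely, the objective is next-token prediction on raw text: shift the sequence by one position and minimize cross-entropy. A minimal sketch (shapes are illustrative):

```python
import torch.nn.functional as F

def lm_loss(logits, ids):
    """logits: (batch, seq_len, vocab) from the decoder stack;
    ids: (batch, seq_len) token ids of raw, unannotated text."""
    vocab = logits.size(-1)
    # Predict token t+1 from positions <= t: drop the last prediction
    # and the first target, then apply cross-entropy.
    return F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                           ids[:, 1:].reshape(-1))
```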
A story generated by GPT-2
03 GPT-2, BERT
“The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-
horned, silver-white unicorns were previously unknown to science.
Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally
solved.
Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions,
were exploring the Andes Mountains when they found a small valley, with no other animals or
humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded
by two peaks of rock and silver snow.
Pérez and the others then ventured further into the valley. `By the time we reached the top of
one peak, the water looked blue, with some crystals on top,’ said Pérez.
Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen
from the air without having to move too much to see them – they were so close they could
touch their horns."
Transformer / GPT prediction
03 GPT-2, BERT
GPT-2 Application: Translation
03 GPT-2, BERT
GPT-2 Application: Summarization
03 GPT-2, BERT
Using Wikipedia data
03 GPT-2, BERT
BERT (Bidirectional Encoder Representations from Transformers)
03 GPT-2, BERT
Model input length: 512 tokens
Input and output vector sizes
03 GPT-2, BERT
BERT pretraining
03 GPT-2, BERT
ULM-FiT (2018): pre-training ideas, transfer learning in NLP.
ELMo: bidirectional training (with LSTMs).
Transformer decoder: uses the context from the left, but the context
from the right is still missing.
GPT: uses the Transformer Decoder half.
BERT: switches from the Decoder to the Encoder, so that it can use
both sides of the context in training, and invents a corresponding training
task: the masked language model.
BERT Pretraining Task 1: masked words
03 GPT-2, BERT
15% of the input tokens are selected for prediction. Out of this 15%:
80% are replaced with [MASK],
10% are replaced with random words, and
10% are kept as the original words (a minimal sketch of this rule follows).
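A minimal sketch of this masking rule (the real BERT pipeline works on WordPiece ids and batches; here tokens are plain strings and vocab is a list of candidate replacement words):

```python
import random

def mask_tokens(tokens, vocab, mask_token='[MASK]', select_prob=0.15):
    """Pick ~15% of positions to predict; of those, 80% -> [MASK],
    10% -> a random word, 10% -> left unchanged."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:
            labels[i] = tok                       # the model must predict the original here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)  # 10%: replace with a random word
            # else: 10%: keep the original token
    return inputs, labels
```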
BERT Pretraining Task 2: two sentences
03 GPT-2, BERT
BERT Pretraining Task 2: two sentences
03 GPT-2, BERT
50% true second sentences
50% random second sentences
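A minimal sketch of how such sentence pairs can be built (illustrative helper, not BERT's actual data pipeline):

```python
import random

def make_nsp_example(doc, sentence_pool):
    """doc: list of consecutive sentences from one document;
    sentence_pool: sentences drawn from other documents."""
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        return sent_a, doc[i + 1], 1                   # true next sentence (IsNext)
    return sent_a, random.choice(sentence_pool), 0     # random sentence (NotNext)
```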
Fine-tuning BERT for other specific tasks
03 GPT-2, BERT
SST (Stanford Sentiment Treebank): 215k phrases with fine-grained
sentiment labels in the parse trees of 11k sentences.
MNLI (multi-genre natural language inference)
QQP (Quora Question Pairs: semantic equivalence)
QNLI (natural language inference dataset)
STS-B (textual similarity)
MRPC (paraphrase, Microsoft)
RTE (textual entailment)
SWAG (commonsense inference)
SST-2 (sentiment)
CoLA (linguistic acceptability)
SQuAD (question answering)
(A minimal sketch of a classification fine-tuning head follows below.)
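For the classification tasks in this list, fine-tuning typically adds only a small classifier over the [CLS] output vector, trained jointly with the pretrained encoder. A minimal sketch (hypothetical class name; BERT-base hidden size of 768 assumed):

```python
import torch.nn as nn

class ClsHead(nn.Module):
    """Linear classifier over the [CLS] position of BERT's final layer."""
    def __init__(self, hidden=768, num_labels=2):
        super().__init__()
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, sequence_output):        # (batch, seq_len, hidden)
        cls_vec = sequence_output[:, 0]        # vector at the [CLS] position
        return self.classifier(self.dropout(cls_vec))
```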
NLP Tasks: Multi-Genre Natural Lang. Inference
03 GPT-2, BERT
MNLI: 433k pairs of examples, labeled as entailment, neutral, or contradiction.
NLP Tasks (SQuAD -- Stanford Question Answering Dataset):
03 GPT-2, BERT
Sample: Super Bowl 50 was an American football game to
determine the champion of the National Football League (NFL)
for the 2015 season. The American Football Conference (AFC)
champion Denver Broncos defeated the National Football
Conference (NFC) champion Carolina Panthers 24–10 to earn
their third Super Bowl title. The game was played on February 7,
2016, at Levi's Stadium in the San Francisco Bay Area at Santa
Clara, California. As this was the 50th Super Bowl, the league
emphasized the "golden anniversary" with various gold-themed
initiatives, as well as temporarily suspending the tradition of
naming each Super Bowl game with Roman numerals (under
which the game would have been known as "Super Bowl L"), so
that the logo could prominently feature the Arabic numerals 50.
Which NFL team represented the
AFC at Super Bowl 50?
Ground Truth Answers: Denver
Broncos
Which NFL team represented the
NFC at Super Bowl 50?
Ground Truth Answers: Carolina
Panthers
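For SQuAD, the fine-tuned model instead predicts a start and an end position within the passage. A minimal sketch of such a span head (hypothetical class name; BERT-base hidden size assumed); the predicted answer is the span whose start and end logits score highest together:

```python
import torch.nn as nn

class SpanHead(nn.Module):
    """Per-token start/end logits for extractive question answering."""
    def __init__(self, hidden=768):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden, 2)    # one logit for "start", one for "end"

    def forward(self, hidden_states):             # (batch, seq_len, hidden)
        logits = self.qa_outputs(hidden_states)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
```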
Add indices for sentences and paragraphs
SegaTron/SegaBERT
H. Bai, S. Peng, J. Lin, L. Tan, K. Xiong, W. Gao, M. Li: Segatron: Segment-aware transformer for language modeling
and understanding. AAAI 2021.
Convergence speed is much faster:
Testing on GLUE dataset
H. Bai, S. Peng, J. Lin, L. Tan, K. Xiong, W. Gao, M. Li: Segatron: Segment-aware transformer for language modeling
and understanding. AAAI 2021.
Reading comprehension – SQuAD tasks
F1 = 2PR / (P + R), where P is precision and R is recall (both in percent); EM = exact match
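For example, with hypothetical values P = 85 and R = 90 (percent), F1 = 2(85 × 90)/(85 + 90) ≈ 87.4, while EM counts only predictions that match a ground-truth answer exactly.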
Improving Transformer-XL
Looking at Attention
Looking at Attention
Feature Extraction
03 GPT-2, BERT
We start with independent word embeddings at the first layer.
We end up with a contextualized embedding for each word that
depends on the whole current input.
Feature Extraction: which embedding to use?
03 GPT-2, BERT
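A minimal sketch of extracting such contextual embeddings with the HuggingFace transformers package (an illustration, not the lecture's code; one common recipe concatenates the last four encoder layers):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

inputs = tokenizer("The valley had a natural fountain", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states               # embedding layer + 12 encoder layers
last_four = torch.cat(hidden_states[-4:], dim=-1)   # (1, seq_len, 4 * 768) per-token features
```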
What we have learned
03 GPT-2, BERT
1. Model size matters (345 million parameters are better than 110
million parameters).
2. With enough training data, more training steps imply higher
accuracy.
3. Key innovation: learning from unannotated data.
4. In biotechnology, we also have a lot of such data (for example,
metagenomes).
Literature & Resources for Transformers
03
Resources:
OpenAI GPT-2 implementation: https://github.com/openai/gpt-2
BERT paper: J. Devlin et al., BERT: Pre-training of deep bidirectional
transformers for language understanding. Oct. 2018.
ELMo paper: M. Peters et al., Deep contextualized word representations. 2018.
ULM-FiT paper: J. Howard, S. Ruder, Universal language model fine-tuning for
text classification. 2018.
Jay Alammar, The Illustrated GPT-2, https://jalammar.github.io/illustrated-gpt2/
GPT-2, BERT
Editor's Notes
  • #36: CoLA: linguistic acceptability; SST: sentiment; MRPC: paraphrase; STS-B: textual similarity; QQP: question paraphrase; RTE, MNLI: textual entailment; QNLI: question entailment.