LLM Reasoning: Key Ideas and Limitations
Denny Zhou
Google DeepMind
September 2024, CS294/194-196: LLM Agents @ UC Berkeley
What do you expect from AI?
Solve the hardest math problems that humans cannot solve?
Discover new scientific theories?
Achieve AGI?
…
My little expectation for AI
AI should be able to learn from just
a few examples, as humans usually do
Does ML meet this expectation?
Semi-supervised learning
Bayesian nonparametrics
Kernel machines
Sparsity
Low rank
Active learning
…
What is missing in ML?
Reasoning
Humans can learn from just a few examples
because humans can reason
Let’s start from a toy problem
“Make things as simple as possible but no simpler”
— Albert Einstein
Last Letter Concatenation
Rule: Take the last letter of each word, and then concatenate them
Input → Output
"Elon Musk" → "nk"
"Bill Gates" → "ls"
"Barack Obama" → ?
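For reference, the target rule is a one-liner in code; the point of this toy task is sample efficiency, not difficulty. A minimal sketch:

```python
def last_letter_concat(name: str) -> str:
    """Take the last letter of each word and concatenate them."""
    return "".join(word[-1] for word in name.split())

assert last_letter_concat("Elon Musk") == "nk"
assert last_letter_concat("Bill Gates") == "ls"
print(last_letter_concat("Barack Obama"))  # -> "ka"
```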
Solve it by ML? Tons of labeled data needed!
(Diagram: an encoder-decoder model trained to map "Bill Gates" to "ls".)
Would you still refer to ML as AI when it requires vast amounts of labeled data to learn
such a “simple” task?
Let’s see how this problem can
be solved by using large
language models (LLMs)!
(Diagram: an LLM completing the prefix "AI is the" with the next word "future".)
What are Large Language Models (LLMs)?
An LLM is a transformer model trained to predict the next word,
e.g., "AI is the ___" → "future".
It is trained on many sentences, e.g., all the text on the Internet.
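Conceptually, generation is just repeated next-word prediction. A minimal sketch, where `next_token_probs` stands in for a trained transformer (the toy model below is an assumption for illustration, not a real LLM):

```python
def generate(tokens, next_token_probs, max_new_tokens=10):
    """Autoregressive greedy decoding: repeatedly append the most
    likely next token, as a trained transformer would."""
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)          # hypothetical: token -> probability
        tokens.append(max(probs, key=probs.get))  # pick the argmax token
    return tokens

# A toy "model" that has memorized one sentence:
toy = lambda ts: {"future": 1.0} if ts[-3:] == ["AI", "is", "the"] else {".": 1.0}
print(generate(["AI", "is", "the"], toy, max_new_tokens=1))
# -> ['AI', 'is', 'the', 'future']
```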
You can think of training LLMs
as training parrots to mimic
human languages
Few-shot prompting for last-letter concatenation

Input to the LLM:
Q: "Elon Musk"
A: "nk"
Q: "Bill Gates"
A: "ls"
Q: "Barack Obama"
A:

LLM output: "ck" (wrong; the correct answer is "ka")
Brown et al., 2020. Language models are few-shot learners. arXiv:2005.14165.
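Few-shot prompting is just string assembly: the "learning" happens entirely in the prompt, with no parameter updates. A minimal sketch, where `call_llm` is a hypothetical stand-in for any text-completion API:

```python
def build_few_shot_prompt(examples, query):
    """Format input/output demonstrations, followed by the new query."""
    lines = [f'Q: "{q}"\nA: "{a}"' for q, a in examples]
    lines.append(f'Q: "{query}"\nA:')
    return "\n".join(lines)

prompt = build_few_shot_prompt([("Elon Musk", "nk"), ("Bill Gates", "ls")],
                               "Barack Obama")
# answer = call_llm(prompt)  # hypothetical completion API; as the slide
#                            # shows, the model answers "ck" here (wrong)
```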
Let’s add “reasoning process” before “answer”
Q: “Elon Musk”
A: the last letter of "Elon" is "n". the last letter of "Musk" is "k". Concatenating "n", "k"
leads to "nk". so the output is "nk".
Q: “Bill Gates”
A: the last letter of "Bill" is "l". the last letter of "Gates" is "s". Concatenating "l", "s" leads
to "ls". so the output is "ls".
Q: “Barack Obama"
A:
Let’s add “reasoning process” before “answer”
Q: “Elon Musk”
A: the last letter of "Elon" is "n". the last letter of "Musk" is "k". Concatenating "n", "k"
leads to "nk". so the output is "nk".
Q: “Bill Gates”
A: the last letter of "Bill" is "l". the last letter of "Gates" is "s". Concatenating "l", "s" leads
to "ls". so the output is "ls".
Q: “Barack Obama"
A: the last letter of "Barack" is "k". the last letter of "Obama" is "a". Concatenating "k", "a"
leads to "ka". so the output is "ka".
One demonstration is enough, just as for humans
Q: “Elon Musk”
A: the last letter of "Elon" is "n". the last letter of "Musk" is "k". Concatenating "n", "k"
leads to "nk". so the output is "nk".
Q: “Barack Obama"
A: the last letter of "Barack" is "k". the last letter of "Obama" is "a". Concatenating "k", "a"
leads to "ka". so the output is "ka".
100% accuracy with only one demonstration example
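The same prompt assembly, now with the reasoning process spelled out in the single demonstration. A sketch, with `call_llm` again a hypothetical completion API:

```python
COT_DEMO = (
    'Q: "Elon Musk"\n'
    'A: the last letter of "Elon" is "n". the last letter of "Musk" is "k". '
    'Concatenating "n", "k" leads to "nk". so the output is "nk".\n'
)

def build_one_shot_cot_prompt(query):
    """One demonstration with explicit intermediate steps."""
    return COT_DEMO + f'Q: "{query}"\nA:'

# answer = call_llm(build_one_shot_cot_prompt("Barack Obama"))
# The completion now spells out the steps and ends with "ka".
```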
Key Idea: Derive the Final Answer through
Intermediate Steps
Ling et al. Program Induction by Rationale
Generation: Learning to Solve and Explain
Algebraic Word Problems. ACL 2017
Ling et al. (2017) at DeepMind pioneered the use of natural language rationales to
solve math problems, "... deriv[ing] the final answer through a series of small
steps", training a sequence-to-sequence model from scratch.
GSM8K: <Problem, Intermediate Steps, Answer>
Cobbe et al. Training
Verifiers to Solve Math Word
Problems. arXiv:2110.14168
[cs.LG]. 2021
Following Ling et al. (2017), Cobbe et al. (2021) at OpenAI built a much
larger math word problem dataset (GSM8K) with natural language rationales,
and used it to finetune GPT-3.
Problem: Ali is a dean of a private school where he teaches one class.
John is also a dean of a public school. John has two classes in his
school. Each class has 1/8 the capacity of Ali’s class which has the
capacity of 120 students. What is the combined capacity of both schools?
Solution: Ali’s class has a capacity of 120 students. Each of John’s
classes has a capacity of 120/8 = 15 students. The total capacity of
John’s two classes is 15 students * 2 classes = 30 students. The
combined capacity of the two schools is 120 students + 30 students =
150 students.
Final answer: 150
Show Your Work: Scratchpads for Intermediate
Computation with Language Models
Nye et al. Show Your Work: Scratchpads
for Intermediate Computation with
Language Models. arXiv:2112.00114
[cs.LG], 2021
Chain-of-Thought (CoT) Prompting
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou.
Chain-of-thought prompting elicits reasoning in large language models. NeurIPS 2022
Training with intermediate steps (Ling et al 2017)
Finetuning with intermediate steps (Cobbe et al 2021, Nye et al 2021)
Prompting with intermediate steps (Nye et al 2021, Wei et al 2022)
This is what really matters!
Regardless of training, fine-tuning, or
prompting, when provided with examples
that include intermediate steps, LLMs will
generate responses that also include
intermediate steps
In addition to intermediate steps, is it
helpful to introduce reasoning strategies
in demonstration examples?
Least-to-Most Prompting
Enables easy-to-hard generalization by decomposition
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale
Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, Ed Chi. Least-to-Most Prompting
Enables Complex Reasoning in Large Language Models. ICLR 2023.
Elsa has 3 apples. Anna has 2 more apples than Elsa. How many
apples do they have together?
Let's break down this problem: 1. How many apples does Anna
have? 2. How many apples do Elsa and Anna have together?
1. Anna has 2 more apples than Elsa. So Anna has 2 + 3 = 5 apples.
2. Anna has 5 apples. Elsa and Anna have 3 + 5 = 8 apples together.
The answer is 8.
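A minimal sketch of the two-stage least-to-most procedure (decompose, then solve subproblems in order, feeding each answer back into the context), assuming a hypothetical `call_llm` completion function:

```python
def least_to_most(problem, call_llm):
    """Stage 1: decompose into subquestions, easiest first.
    Stage 2: solve them in order, accumulating solved subproblems."""
    decomposition = call_llm(
        f"{problem}\nBreak this problem down into subquestions, "
        "from easiest to hardest, one per line:")
    context, answer = problem, ""
    for subq in decomposition.splitlines():
        answer = call_llm(f"{context}\nQ: {subq}\nA:")
        context += f"\nQ: {subq}\nA: {answer}"  # earlier answers stay visible
    return answer
```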
SCAN (Compositional Generalization)
Using just 0.1% of the demonstration examples achieves perfect generalization
CFQ (Compositional Generalization): Text-to-Code
Andrew Drozdov, Nathanael Schärli, Ekin Akyürek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier
Bousquet, Denny Zhou. Compositional Semantic Parsing with Large Language Models. ICLR 2023.
Using just 1% of the data!
Why are intermediate steps helpful?
“There is nothing more practical than a good theory.”
— Kurt Lewin
Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of Thought Empowers
Transformers to Solve Inherently Serial Problems. ICLR 2024.
● A transformer that generates intermediate steps can solve any
inherently serial problem, as long as its depth exceeds a constant threshold
● A transformer that generates direct answers either requires enormous
depth or cannot solve the problem at all
Tons of practical implications of this theory
Generating more intermediate steps
Combining LLM with external tools / search
Diagnosing LLM failures and limitations
Even a path to AGI
…
How to trigger step-by-step
reasoning without using
demonstration examples?
Let’s think
step by step
Kojima, T., Gu, S.S., Reid, M., Matsuo, Y.
and Iwasawa, Y. Large language models
are zero-shot reasoners. NeurIPS 2022.
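The whole technique is an appended trigger phrase. A sketch of the paper's two-stage prompting (reason first, then extract the answer), with `call_llm` hypothetical:

```python
def zero_shot_cot(problem, call_llm):
    """Kojima et al.: no demonstrations, just a trigger phrase,
    then a second call to extract the final answer."""
    reasoning = call_llm(f"Q: {problem}\nA: Let's think step by step.")
    return call_llm(f"Q: {problem}\nA: Let's think step by step. {reasoning}\n"
                    "Therefore, the answer is")
```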
Hmm, zero-shot is cool but usually
significantly worse than few-shot …
LLMs as Analogical Reasoners
Michihiro Yasunaga, Xinyun Chen, Yujia Li, Panupong Pasupat, Jure Leskovec, Percy Liang,
Ed H. Chi, Denny Zhou. Large Language Models as Analogical Reasoners. ICLR 2024.
“A mathematician is a person who can find analogies
between theorems; a better mathematician is one who can
see analogies between proofs and the best mathematician
can notice analogies between theories. One can imagine
that the ultimate mathematician is one who can see
analogies between analogies.”
— Stefan Banach
What is the area of the square with the four vertices at (-2, 2), (2, -2), (-2, -6), and (-6,
-2)? Recall a related problem, and then solve this one.
The related problem is finding the distance between two points on a coordinate
plane. To find the distance between two points, we use the distance formula: d = √
((x2-x1)^2 + (y2-y1)^2).
In this problem, we have four points: (-2, 2), (2, -2), (-2, -6), and (-6, -2). Let's choose (-2, 2)
and (2, -2) to find the distance d between them: d = √((2-(-2))^2 + (-2-2)^2) = √(4^2 +
(-4)^2) = √(16 + 16) = √32. We know that the distance between two opposite sides of a
square is equal. So, the length of each side of the square is √32. To find the area of the
square, we square the length of one side: Area = (√32)^2 = 32.
Analogical reasoner vs zero-shot / few-shot CoT (benchmark results omitted)
Adaptively generate relevant
examples and knowledge,
rather than using a fixed set of
examples
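A minimal sketch of the analogical-prompting idea: the model generates its own relevant exemplar before solving, instead of being given a fixed, hand-written one. `call_llm` is hypothetical:

```python
def analogical_prompt(problem):
    """Ask the model to self-generate a related exemplar, then solve."""
    return (f"Problem: {problem}\n"
            "Recall a related problem, describe it and its solution, "
            "and then solve the problem above step by step.")

# solution = call_llm(analogical_prompt(
#     "What is the area of the square with vertices at "
#     "(-2, 2), (2, -2), (-2, -6), and (-6, -2)?"))
```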
Is it possible to trigger step-by-step
reasoning even without using any
prompt like “let’s think step by step”?
Chain-of-Thought Reasoning
without Prompting
Xuezhi Wang and Denny Zhou. Chain-of-Thought Reasoning Without Prompting.
arXiv preprint arXiv:2402.10200 (2024).
Chain-of-Thought Decoding
Question: I have 3 apples, my dad has 2 more apples than me, how many apples do we have in total?

Decoding at step 0 → full generation after continuing greedy decoding:
top-1: "5" → 5 apples.
top-2: "I" → I have 3 apples, my dad has 2 more apples than me, so he has 5 apples. 3+5=8.
top-3: "We" → We have 8 apples in total.
top-4: "You" → You have 3 apples, your dad has 2 more apples than you, so he has 5 apples. 3+5=8.
top-5: "The" → The answer is 5.
Chain-of-Thought Decoding
Question: Was Nicolas Cage born in an even or odd year?

Continuations from the top-6 first tokens, with the model's confidence in the final answer:
1. Nicolas Cage was born in an odd year. (0.117)
2. Even. (0.207)
3. Odd. (0.198)
4. 1964, an even year. (0.949)
5. He was born in an even year. (0.000)
6. Cage was born in 1964, an even year. (0.978)
Key observations:
1. Pre-trained LLMs already produce step-by-step reasoning among the
generations that start with the top-k first tokens
2. The final answer is decoded with higher confidence when a step-by-step
reasoning path is present
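A minimal sketch of CoT decoding under an assumed model interface; `topk_first_tokens`, `greedy_continue`, and `answer_confidence` are hypothetical names, not a real API:

```python
def cot_decode(prompt, model, k=5):
    """Branch on the top-k first tokens, continue each branch greedily,
    and keep the generation whose final answer is decoded most confidently."""
    best_score, best_text = float("-inf"), None
    for token in model.topk_first_tokens(prompt, k):   # hypothetical interface
        text = model.greedy_continue(prompt + token)   # hypothetical interface
        score = model.answer_confidence(text)          # e.g. probability margin
                                                       # on the answer tokens
        if score > best_score:
            best_score, best_text = score, text
    return best_text  # the most confident path often contains a CoT
```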
Greedy Decoding vs Chain-of-Thought Decoding (comparison figure omitted)
Generating intermediate steps
is helpful, but …
Any concerns about generating intermediate
steps instead of direct answers?
Always keep in mind that LLMs are probabilistic models that
generate next tokens. They are not humans.
What the LLM does in decoding: return the single most likely pair,
argmax_{path, answer} P(path, answer | problem).
What we want: the most likely final answer,
argmax_{answer} Σ_path P(path, answer | problem).
These two do not align!
One step further: how to compute the sum over paths? Sampling!
Self-Consistency
Greatly improves step-by-step reasoning
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou.
Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023.
[Question] Janet’s ducks lay 16 eggs per day. She eats three for breakfast
every morning and bakes muffins for her friends every day with four. She
sells the remainder for $2 per egg. How much does she make every day?
Sampled responses:
Response 1: She has 16 - 3 - 4 = 9 eggs left. So she makes $2 * 9 = $18 per day.
Response 2: This means she sells the remainder for $2 * (16 - 4 - 3) = $26 per day.
Response 3: She eats 3 for breakfast, so she has 16 - 3 = 13 left. Then she bakes
muffins, so she has 13 - 4 = 9 eggs left. So she has 9 eggs * $2 = $18.
Most frequent answer is: 18
(Not most frequent reasoning path!)
Crushed GSM8K SOTA with only 8 examples
More consistent, more likely to be correct
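A minimal sketch of self-consistency; `sample_llm` (one stochastic completion) and `extract_answer` (pulls the final answer out of a response) are assumed helpers, not a real API:

```python
from collections import Counter

def self_consistency(prompt, sample_llm, extract_answer, n=40):
    """Sample n reasoning paths at nonzero temperature and majority-vote
    on the final answers, marginalizing out the reasoning paths."""
    answers = [extract_answer(sample_llm(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Note that the vote is over final answers, not over reasoning paths, which is exactly the marginalization the slide calls for.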
[Q1] When the LLM outputs a direct answer without intermediate
steps, will you still sample several times, and then choose the
most common answer?
[Q2] Change self-consistency by letting the LLM generate multiple
responses, instead of sampling multiple times, and then choosing
the most common answer. Does this make sense?
How about free-form answers?
Universal Self-Consistency (USC)
Ask LLMs to self-select the most consistent answer
Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, Denny Zhou.
Universal Self-Consistency for Large Language Model Generation. arXiv:2311.17311 [cs.CL], 2023.
[Question] Where do people drink less coffee than they do in Mexico?
Response 1: ... Some examples include Japan, China and the United Kingdom.
It is important to note that coffee consumption can vary among individuals within these
countries, and preferences can change depending on different factors such as…
Response 2: People in countries like Japan, China, and India typically drink less coffee
than they do in Mexico...
Response 3: There are several countries where people generally drink less coffee
compared to Mexico. Some of these countries include:
1. Japan:...
2. China...
3. Saudi Arabia...
4. India...
...
The most consistent response: 2
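A minimal sketch of USC along these lines; `sample_llm` and `call_llm` are hypothetical, and the final line assumes the model complies with the requested format:

```python
def universal_self_consistency(prompt, sample_llm, call_llm, n=8):
    """For free-form answers a majority vote is ill-defined, so ask the
    LLM itself to pick the response most consistent with the others."""
    responses = [sample_llm(prompt) for _ in range(n)]
    listing = "\n\n".join(f"Response {i}: {r}"
                          for i, r in enumerate(responses, 1))
    choice = call_llm(f"{prompt}\n\n{listing}\n\n"
                      "Which response is most consistent with the others? "
                      "Answer with the response number only.")
    return responses[int(choice.strip()) - 1]  # assumes a well-formed reply
```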
Limitations
LLMs Can Be Easily Distracted
by Irrelevant Context
Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, and Denny Zhou. Large
Language Models Can Be Easily Distracted by Irrelevant Context. ICML 2023.
Humans may be easily distracted by irrelevant context
Psychology studies show that irrelevant information may significantly decrease
the problem-solving accuracy of some children and even adults
… inattentive children’s difficulties in problem solving are partially
due to an inability to inhibit irrelevant information …
Marzocchi, G.M., Lucangeli, D., De Meo, T., Fini, F. and Cornoldi, C., 2002. The disturbing
effect of irrelevant information on arithmetic problem solving in inattentive children.
Developmental neuropsychology, 21(1), pp.73-92.
Does this observation
hold for LLMs?
Lucy has $65 in the bank. She made a $15 deposit and then followed by a $4
withdrawal. Maria's monthly rent is $10. What is Lucy’s bank balance?
Lucy's bank balance is $65 + $15 - $4 - $10 = $66. The answer is $66. (wrong)
Adding irrelevant context to GSM8K problems leads to a 20+ point
performance drop
Lucy has $65 in the bank. She made a $15 deposit and then followed by a $4 withdrawal.
Maria's monthly rent is $10. What is Lucy’s bank balance? Ignore irrelevant context.
Lucy has $65 in the bank. After making a $15 deposit and then a $4 withdrawal, her bank
balance is $76. Maria's monthly rent is not relevant to this question, so it can be ignored. The
answer is $76.
Prompting LLMs to ignore irrelevant context can recover
some of the performance
We can simply add irrelevant sentences
The sky is blue. The sky is blue. The sky is blue. The sky is blue.
Lucy has $65 in the bank. The sky is blue. The grass is green. The
grass is green. She made a $15 deposit and then followed by a
$4 withdrawal. The sky is blue. The sky is blue. The sky is blue. The
sky is blue. The grass is green. The grass is green. The sky is blue.
The grass is green. What is Lucy’s bank balance?
LLMs Cannot Self-Correct
Reasoning Yet
Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, Denny Zhou. Large
Language Models Cannot Self-Correct Reasoning Yet. ICLR 2024.
Christina is planning a birthday party and needs .75 gift bags per invited guest,
because 1/4 of attendees don't show up. She invited 16 friends. Gift bags are $2
each. How much will she spend?
... 0.75 * 12 = 9 gift bags … Christina will spend a total of 2 * 9 = $8 ... (wrong)
Review your previous answer and find problems with your answer.
Upon reviewing my previous answer, I recognize a mistake. I incorrectly stated that
Christina needs 0.75 gift bags per invited guest. The given information states that
1/4 of the attendees won't show up…
Based on the problems you found, improve your answer.
... 16 * (3/4) = 12 gift bags … Christina will spend 12 * $2 = $24 … (Correct)
Terry eats 2 yogurts a day. They are currently on sale at 4 yogurts for $5.00. How
much does he spend on yogurt over 30 days?
... So, Terry spends $75 on yogurt over 30 days. (Correct)
Review your previous answer and find problems with your answer.
Upon reviewing my previous answer, I realized that I made a mistake in calculating
Terry's spending on yogurt over 30 days. I incorrectly stated that Terry spends
$2.50 per day for 30 days…
Based on the problems you found, improve your answer.
... the final answer is Terry spends $37.5 on yogurt over 30 days. (wrong)
While allowing LLMs to review
their generated responses can
help correct inaccurate answers,
it may also risk changing correct
answers into incorrect ones
Self-correction leads to worse results
Reported improvements require oracle answers
Oracle: let LLMs self-correct only when the answer is wrong
Multi-LLM debate? Worse than self-consistency!
Lesson learned: oracle feedback is needed
for LLMs to self-correct
Self-debugging naturally leverages unit tests as the oracle
Xinyun Chen, Maxwell Lin, Nathanael Schärli, Denny Zhou. Teaching Large Language Models to Self-Debug.
arXiv:2302.00093 [cs.CL], 2023. ICLR 2024.
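A minimal sketch of such a self-debugging loop; `call_llm` is hypothetical, and each unit test is assumed to return an error message on failure or None on success:

```python
def self_debug(task, unit_tests, call_llm, max_rounds=3):
    """Unit tests play the oracle role: only revise when a test fails."""
    code = call_llm(f"Write Python code for this task:\n{task}")
    for _ in range(max_rounds):
        failures = [msg for test in unit_tests
                    if (msg := test(code)) is not None]
        if not failures:
            return code  # oracle says: all tests pass
        code = call_llm(f"Task: {task}\nCode:\n{code}\nFailed tests:\n" +
                        "\n".join(failures) +
                        "\nExplain the bugs, then output the fixed code.")
    return code
```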
Premise Order Matters in LLM
Reasoning
Xinyun Chen, Ryan A Chi, Xuezhi Wang, Denny Zhou. Premise Order Matters in
Reasoning with Large Language Models. ICML 2024.
[Original GSM8K Problem] Thomas withdraws $1000 in 20 dollar bills
from the bank account. He loses 10 bills while getting home. After that, he
uses half of the remaining bills to pay for a bill. Thomas then triples his
money. He then converts all his bills to 5 dollar bills. How many 5 dollar
bills does he have?
[Reordered GSM8K Problem] Thomas withdraws $1000 in 20 dollar bills
from the bank account. After getting home, he uses half of the remaining
bills to pay for a bill. Thomas then triples his money. He then converts all
his bills to 5 dollar bills. He loses 10 bills while getting home. How many 5
dollar bills does he have?
About a 10-point drop in solve rate across all frontier LLMs
Response for the original problem (correct):
1. Thomas withdraws $1000 in $20 bills. $1000 / $20 = 50 bills
2. He loses 10 of those bills. 50 bills - 10 bills = 40 bills
3. He uses half of the remaining bills to pay for a bill. Half of 40 bills is 20 bills. 40 bills - 20 bills = 20 bills remaining
4. Thomas then triples his money. He has 20 bills of $20, which is 20 * $20 = $400. Tripling $400 gives us $400 * 3 = $1200.
5. He converts all his money to $5 bills. $1200 / $5 = 240 bills

Response for the reordered problem (wrong):
1. Thomas withdraws $1000 in $20 bills from the bank. $1000 / $20 = 50 bills
2. He uses half of the remaining bills to pay for a bill. Half of 50 bills is 25 bills. So, he uses 25 bills and has 25 bills left.
3. Thomas then triples his money. Since he had 25 bills of $20 each, he had 25 * $20 = $500 before tripling his money. Tripling $500 gives him $500 * 3 = $1500.
4. He then converts all his bills to $5 bills. $1500 / $5 = 300 bills
5. He lost 10 bills while getting home. 300 bills - 10 bills = 290 bills
Rules: If duve, then dax. If kere and riff, then tess. If feg and tess, then rabe. If wozz, then
riff. If rabe, then thoxe. If ag and riff and kere, then hepe. If thoxe, then vepe. If nife and hume
and ag, then dax. If feg and thoxe, then ire. If rebe and ene and heff, then kone. If ire and
dax, then wope. If tess and vepe, then nup. If rabe and vide and nife, then ag. If nup and ag
and vide, then duve. If zeck and hepe and dax, then riff. If nup, then hepe. If sene, then
hume. If hepe and tess and ag, then kere. If fum, then vide. If kere and wope, then fum. If jag
and kone, then thoxe. If fum, then wozz.
Facts: Alice is cote. Alice is kone. Alice is duve. Alice is ag. Alice is jag. Alice is tess. Alice is
riff. Alice is feg. Alice is vide.
Query: Is Alice wozz? In each step, indicate whether you are using a fact, or a rule.
Logical inference task: The rules are ordered according to their use in the
inference process, though not all rules are necessary for the query
Rules: If nup, then hepe. If kere and riff, then tess. If feg and tess, then rabe. If wozz, then
riff. If tess and vepe, then nup. If ag and riff and kere, then hepe. If feg and thoxe, then ire. If
nife and hume and ag, then dax. If ire and dax, then wope. If rebe and ene and heff, then
kone. If hepe and tess and ag, then kere. If rabe, then thoxe. If rabe and vide and nife, then
ag. If fum, then wozz. If zeck and hepe and dax, then riff. If kere and wope, then fum. If sene,
then hume. If thoxe, then vepe. If fum, then vide. If duve, then dax. If jag and kone, then
thoxe. If nup and ag and vide, then duve.
Facts: Alice is cote. Alice is kone. Alice is duve. Alice is ag. Alice is jag. Alice is tess. Alice is
riff. Alice is feg. Alice is vide.
Query: Is Alice wozz? In each step, indicate whether you are using a fact, or a rule.
Logical inference task: the rules relevant to the query are randomly ordered.
This causes a 30+ point performance drop across all frontier LLMs.
Summary
● Generating intermediate steps improves LLM performance
○ Training / finetuning / prompting with intermediate steps
○ Zero-shot, analogical reasoning, special decoding
● Self-consistency greatly improves step-by-step reasoning
● Limitations: irrelevant context, self-correction, premise order
What's next?
“If I were given one hour to save the planet, I would spend 59
minutes defining the problem and one minute resolving it.”
— Albert Einstein
1. Define the right problem to work on
2. Solve it from first principles
THE END
"The best way to predict the future is to invent it." — Alan Kay
