Standing on the shoulders of giants:
Learn from LL(1) to PEG parser the hard way
Kir Chou
1
2
https://www.youtube.com/watch?v=DZTLgVBxET4
About me
Presented at PyCon TW/JP since 2017
https://note35.github.io/about/
https://github.com/note35/Parser-Learning
3
Agenda
● Motivation
● What is the parser in CPython?
● Parser 101 - CFG
● Parser 101 - Traditional parser (LL(1) / LR(0))
● Parser 102 - PEG and PEG parser
● Parser 102 - Packrat parser
● CPython’s PEG parser
● Take away
4
Motivation
5
Motivation
What’s New In Python 3.9?
PEP 617, CPython now uses a new parser based on PEG;
“IIRC, I took a Compiler class in school…”
6
Motivation (Cont.)
School briefly taught us the concepts of a compiler’s frontend and backend.
School’s parser assignment used Bison + YACC.
And...
7
My motivation = Talk objectives
What is a PEG parser?
Why did Python use an LL(1) parser before?
Why did Guido choose a PEG parser?
What other parsers are there?
What are the differences between those parsers?
How are those parsers implemented?
8
What is the parser in CPython?
CPython DevGuide - Design of CPython’s Compiler
9
Compilation Steps
10
Source Code -> [Lexer] -> Tokens -> [Parser] -> Abstract Syntax Tree (AST) -> [Compiler] -> Bytecode -> [VM] -> Result
(figure label: Import)
11
https://docs.python.org/3/library/tokenize.html#examples
Lexer
12
https://docs.python.org/3/library/ast.html
Parser
13
https://docs.python.org/3/library/dis.html#dis.disassemble
Compiler
= print(2*3+4)
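The three stages above can be tried from Python itself. The following is a minimal sketch, assuming only the standard-library modules linked on these slides (tokenize, ast, dis); the variable names are illustrative and not from the talk.

```python
import ast
import dis
import io
import tokenize

source = "print(2*3+4)"

# Lexer: source code -> tokens
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string))

# Parser: source code -> abstract syntax tree (AST)
tree = ast.parse(source)
print(ast.dump(tree))

# Compiler: AST -> bytecode (dis prints the instructions)
code = compile(tree, filename="<demo>", mode="exec")
dis.dis(code)

# VM: execute the bytecode
exec(code)
```

The exact token and bytecode listings differ between Python versions, so no expected output is shown here.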
14
Source Code -> [Lexer] -> Tokens -> [Parser] -> Abstract Syntax Tree (AST) -> [Compiler] -> Bytecode -> [VM] -> Result
(figure label: Import)
Talk’s focus: the Parser!
Parser 101 - CFG
Uncode - GATE Computer Science - Compiler Design Lecture
15
Grammar
Context Free Grammar (CFG)
16
A -> B | a
Annotations on this rule:
->  Derivation (*some papers write <-)
A, B  Non-terminals
a  Terminal
|  AND (*supports ambiguous syntax)
Interpretation of this grammar:
“Both B and a can be derived from A”
What is “Context Free”?
17
The left-hand side of every rule contains exactly one non-terminal and nothing else.
Valid CFG example: S -> aSb
Invalid CFG example: xSy -> axSyb
Syntax Analysis: Parse Tree
Concrete Syntax Tree (CST)
An ordered, rooted tree that represents
the syntactic structure of a string
according to some context-free
grammar.
Abstract Syntax Tree (AST)
A tree representation of the abstract
syntactic structure of source code
written in a programming language.
18
CFG Simplification
1. Ambiguous -> Unambiguous
2. Nondeterministic -> Deterministic
3. Left recursion -> No left recursion
19
Ambiguous grammar: definition
A grammar is ambiguous when its rules can generate more than one parse tree for the same input.
20
E -> E + E | E * E | Num
[Figure: two different parse trees generated by this grammar for one expression containing + and *]
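To make the ambiguity concrete, here is a small sketch that counts how many parse trees the grammar E -> E + E | E * E | Num admits for a token sequence. The function name count_parse_trees and the token spelling "Num" are illustrative assumptions, not from the talk.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count the parse trees of `tokens` under E -> E + E | E * E | Num."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways tokens[lo:hi] can be derived from E.
        if hi - lo == 1:
            return 1 if tokens[lo] == "Num" else 0
        total = 0
        # Try every operator inside the span as the root of the tree.
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] in ("+", "*"):
                total += count(lo, mid) * count(mid + 1, hi)
        return total

    return count(0, len(tokens))


print(count_parse_trees(["Num", "+", "Num", "*", "Num"]))  # 2
```

Any count above 1 means the grammar is ambiguous for that input.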
Ambiguous -> Unambiguous
21
Step1
Rewrite the grammar
Before: E -> E + E | E * E | Num
After:
E -> E + T | T
T -> T * F | F
F -> Num
Step2
Make sure the grammar generates only one tree
[Figure: the single parse tree generated by the rewritten grammar, built from E, T, F and Num (N) nodes]
Non-deterministic -> Deterministic
A grammar is non-deterministic when alternatives of a rule share a common prefix.
22
Rewrite the grammar (left factoring)
Before: A -> ab | ac
After:
A -> aA’
A’ -> b | c
*A non-deterministic grammar can be rewritten into more than one deterministic grammar.
Left recursion -> No left recursion
A grammar is left-recursive when it contains direct or indirect left recursion.
23
Rewrite the grammar
Before:
E -> E + T | T
T -> T * F | F
F -> Num
After (None stands for the empty string):
E -> TE’
E’ -> +TE’ | None
T -> FT’
T’ -> *FT’ | None
F -> Num
Why it matters for a top-down parser: the E in the first E + T derives a second E + T,
the E in the second E + T derives a third E + T,
and so on recursively, without ever consuming input.
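The rewrite above follows a mechanical pattern: A -> A x | y becomes A -> y A’ and A’ -> x A’ | None. A minimal sketch of that transformation for direct left recursion follows; the function name and the list-of-lists grammar encoding are assumptions made for illustration.

```python
def eliminate_direct_left_recursion(name, alternatives):
    """Rewrite A -> A x | y into A -> y A', A' -> x A' | None.

    Each alternative is a list of symbols; the empty list stands for None.
    """
    # Tails of the left-recursive alternatives (the "x" parts).
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    # Alternatives that do not start with the rule itself (the "y" parts).
    others = [alt for alt in alternatives if not alt or alt[0] != name]
    if not recursive:
        return {name: alternatives}
    new_name = name + "'"
    return {
        name: [alt + [new_name] for alt in others],
        new_name: [alt + [new_name] for alt in recursive] + [[]],
    }


print(eliminate_direct_left_recursion("E", [["E", "+", "T"], ["T"]]))
```

Indirect left recursion needs an extra substitution pass that this sketch does not attempt.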
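Recap: CFG Simplification
24
Ambiguous
  Before: E -> E + E | E * E | Num
  After: E -> E + T | T ; T -> T * F | F ; F -> Num
Non-deterministic
  Before: A -> ab | ac
  After: A -> aA’ ; A’ -> b | c
Left Recursion
  Before: E -> E + T | T ; T -> T * F | F ; F -> Num
  After: E -> TE’ ; E’ -> +TE’ | None ; T -> FT’ ; T’ -> *FT’ | None ; F -> Num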
Parser 101 - Traditional parser
Uncode - GATE Computer Science - Compiler Design Lecture
25
Parser classification
26
Top-down type: [Figure: the parse tree is built from the root E down to the Num (N) leaves]
Bottom-up type: [Figure: the parse tree is built step by step from the Num (N) leaves up to the root E]
LL / LR Parser
LL(k) = Left-to-right scan, Leftmost derivation, k-token lookahead (k>0)
LR(k) = Left-to-right scan, Rightmost derivation (in reverse), k-token lookahead (k>=0)
27
*Both LL and LR parsers scan the input string from left to right.
Input String: 2 + 3 * 4
LL / LR Parser
28
*LL and LR parsers apply derivations in a different order.
[Figure: the same parse tree for the input, built in two different orders]
LL: + → *
LR: * → +
LL / LR Parser
29
Input String: 2 + 3 * 4
“I am a number token.
If I look ahead one token and see a + token,
what should I do next?”
Top Down - Recursive descent parser
30
LL(k) - Implementation
31
Input: 2 + 3 * 4
Grammar:
E -> TE’
E’ -> +TE’ | None
T -> FT’
T’ -> *FT’ | None
F -> Num
Step1
*Write a function for each non-terminal
Step2
*Parse from left to right
*Perform k-token lookahead
Step3
*Recursively parse the input string, starting from the first rule’s function parse_E()
Call structure: parse_E() calls parse_Ep(parse_T()); parse_T() calls parse_Tp(parse_F())
32
Grammar
E -> TE’
E’ -> +TE’ | None
T -> FT’
T’ -> *FT’ | None
F -> Num
*perform 1-lookahead
LL(1) - Example code
[Figure: example code with the derivation steps marked]
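A minimal sketch of a recursive descent LL(1) parser for this grammar is shown below. It is not the code from the slide: the class name, the lex helper and the choice to evaluate the expression while parsing are assumptions made for illustration. The parse_* functions mirror the non-terminals, with parse_Ep and parse_Tp standing for E’ and T’.

```python
import re


def lex(text):
    """Split the input into number and operator tokens, then append the end mark.

    Characters other than digits, + and * are silently skipped in this sketch.
    """
    return re.findall(r"\d+|[+*]", text) + ["$"]


class RecursiveDescentParser:
    def __init__(self, text):
        self.tokens = lex(text)
        self.pos = 0

    def peek(self):
        # 1-token lookahead: inspect the next token without consuming it.
        return self.tokens[self.pos]

    def advance(self):
        self.pos += 1

    def parse_E(self):            # E -> T E'
        return self.parse_Ep(self.parse_T())

    def parse_Ep(self, left):     # E' -> + T E' | None
        if self.peek() == "+":
            self.advance()
            return self.parse_Ep(left + self.parse_T())
        return left               # None alternative: consume nothing

    def parse_T(self):            # T -> F T'
        return self.parse_Tp(self.parse_F())

    def parse_Tp(self, left):     # T' -> * F T' | None
        if self.peek() == "*":
            self.advance()
            return self.parse_Tp(left * self.parse_F())
        return left

    def parse_F(self):            # F -> Num
        token = self.peek()
        if not token.isdigit():
            raise SyntaxError(f"expected a number, got {token!r}")
        self.advance()
        return int(token)


def evaluate(text):
    parser = RecursiveDescentParser(text)
    value = parser.parse_E()
    if parser.peek() != "$":
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value


print(evaluate("2 + 3 * 4"))  # 14
```

Each decision looks at exactly one token (peek), which is what makes the parser LL(1).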
Top Down - Non recursive descent parser
33
LL(1) - Parsing table
34
Step1
Build the first/follow table for each non-terminal
Note: $ marks the end of the input
Step2
Build the parsing table from the first/follow table
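Step1 can be sketched as a fixed-point computation. The code below is an illustrative assumption of how the first/follow sets for the talk's LL(1) grammar could be computed; the dictionary encoding of the grammar and all function names are hypothetical, and None (the empty string) is written as the string "None".

```python
EPS = "None"   # the slides write the empty string as None
END = "$"      # end mark

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["Num"]],
}


def first_of_sequence(symbols, first):
    """FIRST set of a sequence of grammar symbols."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in GRAMMAR else {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)  # every symbol can derive the empty string
    return result


def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:  # repeat until no set grows any more
        changed = False
        for nt, alternatives in GRAMMAR.items():
            for alt in alternatives:
                new = first_of_sequence(alt, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(first, start="E"):
    follow = {nt: set() for nt in GRAMMAR}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for nt, alternatives in GRAMMAR.items():
            for alt in alternatives:
                for i, sym in enumerate(alt):
                    if sym not in GRAMMAR:
                        continue  # only non-terminals have FOLLOW sets
                    rest = first_of_sequence(alt[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
print(FIRST)
print(FOLLOW)
```

The parsing table of Step2 then maps (non-terminal, lookahead token) to the alternative whose first set (or, for a None alternative, follow set) contains that token.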
LL(1) - Implementation
35
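Step3
Implement with a stack
(take a shift or reduce (derivation) action based on the parsing table; for LL parsers these actions are usually called match and predict)
[Figure: parse tree for the input]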
LL(1) - Example code
36
Grammar
E -> TE’
E’ -> +TE’ | None
T -> FT’
T’ -> *FT’ | None
F -> Num
[Figure: example code annotated with: Non-terminal stack, Shift, Reduce (Derivation)]
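A table-driven version can be sketched as follows. This is not the code from the slide: the hand-written TABLE below was derived from the first/follow sets of the grammar and, like the function name ll1_parse, is an assumption for illustration. The empty list stands for the None alternative.

```python
NON_TERMINALS = {"E", "E'", "T", "T'", "F"}

# Parsing table: (non-terminal on top of the stack, lookahead token) -> right-hand side
TABLE = {
    ("E", "Num"): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", "$"): [],
    ("T", "Num"): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"],
    ("T'", "+"): [],
    ("T'", "$"): [],
    ("F", "Num"): ["Num"],
}


def ll1_parse(tokens):
    """Return the list of derivations used, or raise SyntaxError."""
    tokens = list(tokens) + ["$"]
    stack = ["$", "E"]          # start symbol on top of the end mark
    pos = 0
    derivations = []
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]
        if top in NON_TERMINALS:
            rule = TABLE.get((top, lookahead))
            if rule is None:
                raise SyntaxError(f"no rule for {top} on {lookahead!r}")
            derivations.append((top, rule))
            stack.extend(reversed(rule))   # leftmost symbol ends up on top
        elif top == lookahead:
            pos += 1                       # terminal matches: consume the token
        else:
            raise SyntaxError(f"expected {top!r}, got {lookahead!r}")
    return derivations


for lhs, rhs in ll1_parse(["Num", "+", "Num", "*", "Num"]):
    print(lhs, "->", " ".join(rhs) or "None")
```

The stack replaces the call stack of the recursive descent version; the table replaces the if-statements on the lookahead token.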
Bottom Up - LR(0) parser
37
LR(0) - Deterministic finite automaton
38
Step1
Build a Deterministic Finite Automaton (DFA) of LR(0) item sets
Start state:
E’ -> .E --- (1)
E -> .E + T --- (2)
E -> .T --- (3)
T -> .T * Num --- (4)
T -> .Num --- (5)
Other states (reached through transitions on E, T, Num, + and *):
{ E’ -> E. ; E -> E. + T }
{ E -> T. ; T -> T. * Num }
{ T -> Num. }
{ E -> E + .T ; T -> .T * Num ; T -> .Num }
{ T -> T * .Num }
{ E -> E + T. ; T -> T. * Num }
{ T -> T * Num. }
The eight states are labeled S1 to S8 in the figure.
*Left recursion is supported
LR(0) - Parsing table
39
Step2
Build the parsing table
(a parser such as SLR(1) also requires the first/follow table)
[Figure: parsing table with Shift, Reduce (Derivation) and acc (accept) entries]
LR(0) - Implementation
40
Step3
Implement with a stack
(take a shift or reduce action based on the parsing table)
[Figure: parse tree for the input]
LR(0) - Example code
41
Grammar
E -> E + T | T
T -> T * F | F
F -> Num
[Figure: example code annotated with: Shift, Reduce (Derivation)]
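A shift/reduce loop can be sketched as below. Several assumptions apply: it uses the grammar from the DFA slide (T -> T * Num rather than T -> T * F), the states are numbered 0 to 7 in an order chosen here (not the S1 to S8 labels of the figure), and the hand-written table places reduce actions only under the follow set of the reduced non-terminal, which makes it an SLR(1) table. A pure LR(0) table would have a shift/reduce conflict in the states that contain both a completed item and T -> T. * Num.

```python
# Productions: number -> (left-hand side, right-hand-side length, semantic action)
PRODUCTIONS = {
    1: ("E", 3, lambda v: v[0] + v[2]),   # E -> E + T
    2: ("E", 1, lambda v: v[0]),          # E -> T
    3: ("T", 3, lambda v: v[0] * v[2]),   # T -> T * Num
    4: ("T", 1, lambda v: v[0]),          # T -> Num
}

ACTION = {
    (0, "Num"): ("shift", 3),
    (1, "+"): ("shift", 4), (1, "$"): ("accept", None),
    (2, "*"): ("shift", 5), (2, "+"): ("reduce", 2), (2, "$"): ("reduce", 2),
    (3, "+"): ("reduce", 4), (3, "*"): ("reduce", 4), (3, "$"): ("reduce", 4),
    (4, "Num"): ("shift", 3),
    (5, "Num"): ("shift", 7),
    (6, "*"): ("shift", 5), (6, "+"): ("reduce", 1), (6, "$"): ("reduce", 1),
    (7, "+"): ("reduce", 3), (7, "*"): ("reduce", 3), (7, "$"): ("reduce", 3),
}

GOTO = {(0, "E"): 1, (0, "T"): 2, (4, "T"): 6}


def lr_evaluate(tokens):
    """Evaluate a token list such as [2, "+", 3, "*", 4] with a shift/reduce loop."""
    tokens = list(tokens) + ["$"]
    states, values, pos = [0], [], 0
    while True:
        token = tokens[pos]
        kind = "Num" if isinstance(token, int) else token
        action = ACTION.get((states[-1], kind))
        if action is None:
            raise SyntaxError(f"unexpected token {token!r}")
        op, arg = action
        if op == "shift":
            states.append(arg)
            values.append(token)
            pos += 1
        elif op == "reduce":
            lhs, size, semantic_action = PRODUCTIONS[arg]
            args = values[-size:]
            del values[-size:]
            del states[-size:]
            values.append(semantic_action(args))
            states.append(GOTO[(states[-1], lhs)])
        else:  # accept
            return values[0]


print(lr_evaluate([2, "+", 3, "*", 4]))  # 14
```

Note that the left-recursive grammar is used as-is; no rewrite is needed for a bottom-up parser.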
Parser 102 - PEG and PEG parser
42
Grammar
Parsing Expression Grammar (PEG)
43
A -> B | a
Annotations on this rule:
->  Derivation (*some papers write <-)
A, B  Non-terminals
a  Terminal
|  OR (if / elif / ...) (*disallows ambiguous syntax)
*Difference from traditional CFG
A tries A -> B first.
Only if A -> B fails does A try A -> a.
*Introduced in 2002 (Packrat Parsing: Simple, Powerful, Lazy, Linear Time)
*Support for Regular Expression (EBNF grammar) appears in another paper
Example of difference
44
Grammar1: A -> a b | a
Grammar2: A -> a | a b
● An LL/LR parser fails to complete when the input grammar is ambiguous.
● A PEG parser tries the alternatives in order and commits to the first one that matches. In Grammar2, the later alternative a b can never succeed, because a always matches first.
“A PEG parser generator will resolve unintended ambiguities earliest-match-first, which may be arbitrary and lead to surprising parses.” (source)
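The effect of ordered choice on Grammar1 and Grammar2 can be reproduced with a few lines. This is a minimal sketch with hypothetical helper names (match_sequence, peg_choice); it only handles sequences of terminal strings.

```python
def match_sequence(symbols, text, pos):
    """Match the terminals in order; return the new position or None."""
    for sym in symbols:
        if not text.startswith(sym, pos):
            return None
        pos += len(sym)
    return pos


def peg_choice(alternatives, text):
    """Ordered choice: commit to the first alternative that matches."""
    for alt in alternatives:
        end = match_sequence(alt, text, 0)
        if end is not None:
            return end
    return None


grammar1 = [["a", "b"], ["a"]]   # A -> a b | a
grammar2 = [["a"], ["a", "b"]]   # A -> a | a b

print(peg_choice(grammar1, "ab"))  # 2: the whole input is consumed
print(peg_choice(grammar2, "ab"))  # 1: "a" wins and "b" is left over
```

With Grammar2 a parser that requires the whole input to be consumed rejects "ab", even though the second alternative describes it exactly.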
PEG Parser
PEG parser means “a parser generated from a PEG”.
A PEG parser can be a Packrat parser, or another traditional parser with a k-lookahead limitation. In most cases, PEG parser means Packrat parser.
45
[Figure: diagram with the labels CFG, EBNF grammar, PEG, Packrat parser, Traditional parser and PEG Parser]
Parser 102 - Packrat parser
46
Type of Packrat parser
47
[Figure: the parse tree is built top-down, from the root E to the Num (N) leaves]
A Packrat parser is a top-down parser.
Packrat Parsing - Implementation
48
Input: 2 + 3 * 4
Grammar (left recursion is supported):
E -> E + T | T   (parse_E() and parse_T())
T -> T * F | F   (parse_T() and parse_F())
F -> Num
Step1
*Write a function for each non-terminal (PEG rule)
Step2
*Parse from left to right
*Perform infinite lookahead + memoization
*The idea of memoization was introduced in 1970
Step3
*Recursively parse the input string, starting from the first rule’s function parse_E()
Packrat Parsing - Example code
49
Grammar
E -> E + T | T
T -> T * F | F
F -> Num
Derivation
Memoization
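As a starting point, the sketch below shows packrat-style memoization without left recursion. To keep it simple it assumes a right-recursive variant of the grammar (E <- T + E / T, T <- F * T / F, F <- Num), which is a substitution made here and not the grammar on the slide; the class and method names are also illustrative. Every rule result is cached under the key (rule name, position), so backtracking to the second alternative reuses the earlier work.

```python
class PackratParser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.memo = {}      # (rule name, position) -> (value, next position) or None
        self.calls = 0      # how many rule bodies were actually executed

    def memoized(self, rule, pos):
        key = (rule.__name__, pos)
        if key not in self.memo:
            self.calls += 1
            self.memo[key] = rule(pos)
        return self.memo[key]

    def at(self, pos, token):
        return pos < len(self.tokens) and self.tokens[pos] == token

    def parse_E(self, pos):
        # Alternative 1: T + E
        left = self.memoized(self.parse_T, pos)
        if left is not None and self.at(left[1], "+"):
            right = self.memoized(self.parse_E, left[1] + 1)
            if right is not None:
                return left[0] + right[0], right[1]
        # Alternative 2: T (answered from the memo table, not parsed again)
        return self.memoized(self.parse_T, pos)

    def parse_T(self, pos):
        # Alternative 1: F * T
        left = self.memoized(self.parse_F, pos)
        if left is not None and self.at(left[1], "*"):
            right = self.memoized(self.parse_T, left[1] + 1)
            if right is not None:
                return left[0] * right[0], right[1]
        # Alternative 2: F
        return self.memoized(self.parse_F, pos)

    def parse_F(self, pos):
        if pos < len(self.tokens) and isinstance(self.tokens[pos], int):
            return self.tokens[pos], pos + 1
        return None

    def parse(self):
        result = self.memoized(self.parse_E, 0)
        if result is None or result[1] != len(self.tokens):
            raise SyntaxError("invalid expression")
        return result[0]


parser = PackratParser([2, "+", 3, "*", 4])
print(parser.parse())  # 14
```

Because each (rule, position) pair is computed at most once, the work is bounded by the number of rules times the input length, at the cost of keeping the memo table in memory.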
Packrat - what is memoization?
50
LeetCode 509. Fibonacci Number
fib(0) = 0
fib(1) = 1
fib(2) = fib(1) + fib(0) = 1
fib(3) = fib(2) + fib(1) = fib(1) + fib(0) + fib(1) = 2
...
[Figure: call tree of fib(4): 4 calls 3 and 2; 3 calls 2 and 1; each 2 calls 1 and 0]
If n = 4, we calculate fib(2) and fib(0) twice, fib(1) three times, and fib(4) and fib(3) once.
Time Complexity: O(2^n)
Packrat - what is memoization? (Cont.)
51
LeetCode 509. Fibonacci Number
If n = 4, we calculate fib(4), fib(3), fib(2), fib(1) and fib(0) once each.
Time Complexity: O(2^n) => O(n)
Space Complexity: O(1) => O(n) (extra space for the memo table, not counting the call stack)
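The call counts quoted above can be checked directly. A minimal sketch, assuming functools.lru_cache as the memo table (any dictionary keyed by n would do); the names fib_naive, fib_memo and calls are illustrative.

```python
from collections import Counter
from functools import lru_cache

calls = Counter()   # how often the naive version is called for each n


def fib_naive(n):
    calls[n] += 1
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)


@lru_cache(maxsize=None)
def fib_memo(n):
    # Each distinct n is computed once; later calls hit the cache.
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)


print(fib_naive(4))          # 3
print(sorted(calls.items())) # [(0, 2), (1, 3), (2, 2), (3, 1), (4, 1)]
print(fib_memo(4))           # 3
```

A Packrat parser applies the same idea with (rule, input position) as the cache key instead of n.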
Left recursion in Packrat parser
52
Approach 1
if (count of operator) < (count function call):
return False
Approach 2
reverse the call stack (adopted in CPython!)
Source: Guido's Medium (Left-recursive PEG Grammars)
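The sketch below illustrates one way to memoize a left-recursive rule: the memo entry is first primed with a failure so that the recursive call bottoms out, and the rule is then re-run as long as each run yields a longer parse. It is written in the spirit of the approach described in the cited article, but it is not CPython's code; the class name, helper names and the tuple-based tree format are assumptions for illustration.

```python
class LeftRecursivePackratParser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.memo = {}   # (rule name, position) -> (tree, next position) or None

    def memoize_left_rec(self, rule, pos):
        key = (rule.__name__, pos)
        if key in self.memo:
            return self.memo[key]
        # Prime the memo with a failure so the left-recursive call terminates.
        self.memo[key] = best = None
        while True:
            result = rule(pos)
            # Stop as soon as a run no longer produces a longer parse.
            if result is None or (best is not None and result[1] <= best[1]):
                break
            self.memo[key] = best = result
        return best

    def at(self, pos, token):
        return pos < len(self.tokens) and self.tokens[pos] == token

    def parse_E(self, pos):
        # E -> E + T
        left = self.memoize_left_rec(self.parse_E, pos)
        if left is not None and self.at(left[1], "+"):
            right = self.memoize_left_rec(self.parse_T, left[1] + 1)
            if right is not None:
                return ("+", left[0], right[0]), right[1]
        # E -> T
        return self.memoize_left_rec(self.parse_T, pos)

    def parse_T(self, pos):
        # T -> T * F
        left = self.memoize_left_rec(self.parse_T, pos)
        if left is not None and self.at(left[1], "*"):
            right = self.parse_F(left[1] + 1)
            if right is not None:
                return ("*", left[0], right[0]), right[1]
        # T -> F
        return self.parse_F(pos)

    def parse_F(self, pos):
        # F -> Num
        if pos < len(self.tokens) and isinstance(self.tokens[pos], int):
            return self.tokens[pos], pos + 1
        return None

    def parse(self):
        result = self.memoize_left_rec(self.parse_E, 0)
        if result is None or result[1] != len(self.tokens):
            raise SyntaxError("invalid expression")
        return result[0]


print(LeftRecursivePackratParser([1, "+", 2, "+", 3]).parse())
# ('+', ('+', 1, 2), 3)
```

The resulting tree is left-associative, which is what the left-recursive grammar expresses and what a right-recursive rewrite would lose.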
53
Normal Memoization
54
Left-recursion Memoization
*performs infinite lookahead
Traditional parser V.S Packrat parser
55
Traditional parser vs Packrat parser
56
Scan
  Packrat: Left-to-right (*right-to-left memo)
  Traditional: Left-to-right
Left recursion
  Packrat: Supported (*not supported in the first paper)
  Traditional: LL needs the grammar to be rewritten
Ambiguity
  Packrat: Disallowed (determinism)
  Traditional: Allowed
Space complexity
  Packrat: O(Code Size) (space consumption)
  Traditional: O(Depth of Parse Tree)
Worst time complexity
  Packrat: Super-linear time (statelessness) *because of features like typedef in C
  Traditional: Exponential time
Capability
  Packrat: Basically covers all traditional cases (infinite lookahead)
  Traditional: No left recursion or ambiguity for LL; both LL and LR have k-lookahead limitations (e.g. dangling else)
Red text in the slide: 3 highlighted characteristics of the Packrat parser.
57
New rules in Python 3.10 based on PEG:
Parenthesized context managers
PEP 622/634/635/636 - Structural Pattern Matching
CPython’s PEG parser
58
CPython Parser - Before/After
CPython 3.8 and earlier use an LL(1) parser written by Guido 30 years ago.
That parser needs separate steps to generate a CST and to convert the CST to an AST.
CPython 3.9 uses a PEG (Packrat) parser (infinite lookahead).
PEG rules support left recursion.
No more CST-to-AST step - source
CPython 3.10 drops LL(1) parser support.
59
This answers “Why PEG?”
CPython Parser - Workflow
60
Inputs:
Meta Grammar: Tools/peg_generator/pegen/metagrammar.gram
Grammar: Grammar/python.gram
Token: Grammar/Tokens
Generator: pegen (PEG Parser), Tools/peg_generator/
Outputs: my_parser.py, my_parser.c
*CPython contains a PEG parser generator written in Python 3.8+ (because of the walrus operator)
Input: Meta Grammar Example
Syntax Directed Translation (SDT)
61
[Figure: meta grammar excerpt annotated with: rule, non-terminal, return type, PEG rule divider, PEG rule, action (Python code), parser header (Python code)]
Output: Generated PEG Parser
(Partial code)
62
Recap: Benefit / Performance
Benefit
Grammar is more flexible: from LL(1) to LL(∞) (infinite lookahead)
Modern hardware can afford Packrat’s memory consumption
Skip intermediate parse tree (CST) construction
Performance
Within 10% of LL(1) parser both in speed and memory consumption (PEP 617)
63
Take away
64
Recap
● Parser 101 (Compiler class in school)
○ CFG
○ Traditional Parser
■ Top-down: LL(1)
■ Bottom-up: LR(0)
● Parser 102
○ PEG
○ Packrat Parser
● CPython
○ Parser in CPython
○ CPython’s PEG parser
65
66
Q. How to verify my understanding?
A. Get your hands dirty!
You can implement traditional parsers such as LL(1) and LR(0), and a Packrat parser, from scratch!
Leetcode: 227. Basic Calculator II
Need an answer? note35/Parser-Learning
Q & A
67
Appendix
68
Related Articles
Guido van Rossum
PEG Parsing Series Overview
Bryan Ford
Packrat Parsing: Simple, Powerful, Lazy, Linear Time
Parsing Expression Grammars: A Recognition-Based Syntactic Foundation
69
Related Talks
Guido van Rossum @ North Bay Python 2019
Writing a PEG parser for fun and profit
Pablo Galindo and Lysandros Nikolaou @ Podcast.__init__
The Journey To Replace Python's Parser And What It Means For The Future
Emily Morehouse-Valcarcel @ PyCon 2018
The AST and Me
Alex Gaynor @ PyCon 2013
So you want to write an interpreter?
70
Thanks for listening!
71
