Virtual Machine
for Regular Expressions
Alexander Yakushev
@unlog1c
JEEConf 2018
Theory
Stephen Cole Kleene - Inventor of regular expressions
^ This guy
What's a regular expression?
A text-matching automaton.
/^a+b?(cd*|e)$/
"aaaaabcddd" ✓
"aabbdd" ❌
Regular expression as FSM
^a+b?(cd*|e)$
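The FSM diagram from the slide is not recoverable here, so below is a minimal sketch of one possible deterministic automaton for the same expression. The state names and the `dfa_matches` helper are made up for illustration; they are not from the talk.

```python
# Hand-built DFA for ^a+b?(cd*|e)$ ; state names are arbitrary labels.
TRANSITIONS = {
    ("start", "a"): "seen_a",
    ("seen_a", "a"): "seen_a",   # a+
    ("seen_a", "b"): "seen_b",   # b?
    ("seen_a", "c"): "seen_c",
    ("seen_a", "e"): "seen_e",
    ("seen_b", "c"): "seen_c",
    ("seen_b", "e"): "seen_e",
    ("seen_c", "d"): "seen_c",   # d*
}
ACCEPTING = {"seen_c", "seen_e"}


def dfa_matches(text):
    """Walk the transition table one character at a time, never backing up."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:        # no edge for this character: reject
            return False
    return state in ACCEPTING
```

Each input character is inspected exactly once, which is where the linear runtime of the FSM approaches comes from.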
Why implement regular
expressions yourself?
All modern programming languages provide
regular expressions in their core libraries.
But those are only character-level regexps.
At Grammarly, we need to define token-level rules
that are describable with regex semantics.
<person> = (<honorific>? <first-name-dict>+ <last-name-dict>) |
(<determiner>? <profession-dict>)
Ways of implementing regex engines
1. Backtracking
○ Runtime: exponential in the worst case
Backtracking implementation (by Rob Pike)
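/* Forward declaration: match() and matchstar() call each other. */
int matchstar(char c, char *regexp, char *text);

/* match: return 1 if regexp matches at the beginning of text */
int match(char *regexp, char *text) {
    if (regexp[0] == '\0')          /* end of pattern: success */
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return match(regexp+1, text+1);
    return 0;
}

/* matchstar: try c*regexp at the beginning of text */
int matchstar(char c, char *regexp, char *text) {
    do {    /* a * matches zero or more instances of c */
        if (match(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}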
Backtracking implementation
● Good: simple and short.
○ Fits into about 15 lines!
● Bad: exponential complexity in the worst case.
How to kill a snake
$ python
>>> import re
>>> s = "a" * 50
>>> re.match("(a|aa)*b", s)
# crickets… chirp chirp
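To see why this hangs, here is a small model of what a naive backtracker does with `(a|aa)*b` on a string of `n` letters `a`. It is only a sketch: `count_attempts` is a hypothetical helper, and it models the search tree, not CPython's actual engine.

```python
def count_attempts(n):
    """Count loop entries a naive backtracker makes for (a|aa)*b on "a" * n."""
    calls = 0

    def loop(pos):
        nonlocal calls
        calls += 1
        # Try each alternative of (a|aa) and recurse into the star again.
        for width in (1, 2):
            if pos + width <= n and loop(pos + width):
                return True
        # Leaving the loop requires a "b", which never appears in the input.
        return False

    loop(0)
    return calls


for n in (10, 20, 30):
    print(n, count_attempts(n))
```

The count follows a Fibonacci-style recurrence, so every extra character multiplies the work by roughly 1.6.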
Ways of implementing regex engines
1. Backtracking
○ Runtime: exponential in the worst case
2. Full FSM unroll (static NFA->DFA)
○ Compilation: exponential time and memory in the worst case
○ Runtime: linear, O(n)
3. Dynamic FSM unroll (“lazy” DFA construction)
○ Runtime: O(nm), linear in the input length n for a regex of size m
Dynamic FSM unroll
1. Google RE2*
2. Rust’s regex library†
3. Virtual machine approach‡
* github.com/google/re2
† github.com/rust-lang/regex
‡ swtch.com/~rsc/regexp/regexp2.html
Virtual machines
Crude example of an x86 machine
Machine code (actually, assembly)
   MOV CX, 10
   MOV AX, [BP]
L: CMP CX, 0
   JZ E
   SHL AX, 1
   DEC CX
   JMP L
E: RET
Registers
AX 123
BX 234
CX 9
...
IP 0
Crude example of an x86 machine (same code, one register set per thread)
Reg  Thread 1  Thread 2  Thread 3
AX   123       321       879
BX   234       432       567
CX   9         0         4
...
IP   0         7         3
Crude example of a virtual machine (same code)
Reg  Thread 1  Thread 2
AX   123       321
BX   234       432
CX   9         0
...
IP   0         7
Examples of virtual machines
1. VirtualBox/VMware/KVM
2. Java Virtual Machine
3. Domain-specific VMs
TrexVM (Token RegEX Virtual Machine)
● Consumes input sequence token by token.
○ Never goes back to previous tokens.
● IP (instruction pointer) tracks the currently executed instruction.
● Instructions:
○ CMP x
Compare current token to x, increment IP if equal, fail if not.
○ JUMP label
Unconditionally set IP to the point designated by label.
○ FORK label
Increment IP and spawn additional thread that jumps to label.
TrexVM (Token RegEX Virtual Machine)
● All threads are executed in lockstep.
○ Execute all CMP instructions simultaneously, against the same current token.
○ If a thread's IP points at something other than CMP (JUMP or FORK), keep executing it without consuming input until it reaches a CMP.
● Execution continues until one thread reaches the end of the
program (successful match) or all threads are dead (failed
match).
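The rules above can be sketched in a few lines. This is an illustrative reconstruction, not the talk's code: instructions are `(opcode, argument)` tuples, labels are assumed to be already resolved to list indices, and `add_thread` and `run` are made-up names.

```python
def add_thread(program, ip, threads, seen):
    """Follow JUMP/FORK without consuming input until the thread rests on a CMP
    (or on the end of the program). `seen` keeps one thread per IP per step."""
    if ip in seen:
        return
    seen.add(ip)
    if ip == len(program):          # ran off the end: this thread has matched
        threads.append(ip)
        return
    op, arg = program[ip]
    if op == "JUMP":
        add_thread(program, arg, threads, seen)
    elif op == "FORK":
        add_thread(program, ip + 1, threads, seen)   # original thread moves on
        add_thread(program, arg, threads, seen)      # new thread jumps to label
    else:                           # CMP: wait here for the next token
        threads.append(ip)


def run(program, tokens):
    """Return True as soon as some thread reaches the end of the program."""
    threads = []
    add_thread(program, 0, threads, set())
    for token in tokens:
        if len(program) in threads:
            return True
        next_threads, seen = [], set()
        for ip in threads:          # lockstep: every CMP sees the same token
            _, expected = program[ip]
            if token == expected:
                add_thread(program, ip + 1, next_threads, seen)
        threads = next_threads
        if not threads:             # all threads are dead
            return False
    return len(program) in threads


# L1: CMP a / FORK L1 / CMP b, with L1 resolved to index 0.
A_PLUS_B = [("CMP", "a"), ("FORK", 0), ("CMP", "b")]
print(run(A_PLUS_B, "aaab"))
```

Because tokens are only compared with `==`, the same loop works for a list of word tokens as well as for a string of characters.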
TrexVM sample run
Program
L1: CMP a
    FORK L1
    CMP b
Input string: aaab
1. Thread 1 starts at L1: CMP a, before the first token.
2. "a": thread 1 passes CMP a and reaches FORK L1. Thread 1 moves on to CMP b; new thread 2 jumps to L1.
3. "a": thread 1 fails CMP b and dies. Thread 2 passes CMP a and forks: thread 2 moves on to CMP b, thread 3 jumps to L1.
4. "a": thread 2 fails CMP b and dies. Thread 3 passes CMP a and forks: thread 3 moves on to CMP b, thread 4 jumps to L1.
5. "b": thread 4 fails CMP a and dies. Thread 3 passes CMP b and reaches the end of the program.
Success!
Regex: a+b
TrexVM
Guess the regular expression:
    FORK L1
    CMP a
L1: FORK L2
    CMP b
    JUMP L1
L2: CMP c
Regex: a?b*c
TrexVM: next iteration
Added support for match groups.
Each thread now has a register bank (beyond just IP register).
New instructions:
● SAVEL group — save current position in input as beginning of group.
● SAVER group — save current position in input as ending of group.
● FORKSTAY label — FORK which gives the staying thread a higher
priority.
● FORKJUMP label — FORK which gives the jumping thread a higher priority.
Matching stops when the thread with the highest priority succeeds.
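A minimal sketch of how priorities and group registers might fit together, in the style of the VM described in the referenced swtch.com article. Assumptions for illustration only: labels are list indices, a CMP argument is a set of accepted tokens, registers are a dict of `(left, right)` positions, and the thread list is kept in priority order.

```python
def add_thread(program, ip, pos, regs, threads, seen):
    """Run non-consuming instructions; threads are appended in priority order."""
    if ip in seen:                      # a higher-priority thread already got here
        return
    seen.add(ip)
    if ip == len(program):
        threads.append((ip, regs))
        return
    op, arg = program[ip]
    if op == "JUMP":
        add_thread(program, arg, pos, regs, threads, seen)
    elif op == "FORKSTAY":              # staying thread first
        add_thread(program, ip + 1, pos, regs, threads, seen)
        add_thread(program, arg, pos, regs, threads, seen)
    elif op == "FORKJUMP":              # jumping thread first
        add_thread(program, arg, pos, regs, threads, seen)
        add_thread(program, ip + 1, pos, regs, threads, seen)
    elif op == "SAVEL":                 # copy registers so threads stay independent
        add_thread(program, ip + 1, pos, {**regs, arg: (pos, None)}, threads, seen)
    elif op == "SAVER":
        left, _ = regs[arg]
        add_thread(program, ip + 1, pos, {**regs, arg: (left, pos)}, threads, seen)
    else:                               # CMP: wait for the next token
        threads.append((ip, regs))


def run(program, tokens):
    """Return the group registers of the best match, or None."""
    threads, matched = [], None
    add_thread(program, 0, 0, {}, threads, set())
    for pos, token in enumerate(tokens):
        next_threads, seen = [], set()
        for ip, regs in threads:
            if ip == len(program):      # finished: drop lower-priority threads
                matched = regs
                break
            _, accepted = program[ip]
            if token in accepted:
                add_thread(program, ip + 1, pos + 1, regs, next_threads, seen)
        threads = next_threads
        if not threads:
            break
    for ip, regs in threads:            # highest-priority finished thread wins
        if ip == len(program):
            matched = regs
            break
    return matched


AB = {"a", "b"}
GREEDY = [("SAVEL", "fst"), ("CMP", AB), ("FORKJUMP", 1), ("SAVER", "fst"),
          ("SAVEL", "snd"), ("CMP", {"b"}), ("SAVER", "snd")]
LAZY = [("SAVEL", "fst"), ("CMP", AB), ("FORKSTAY", 1), ("SAVER", "fst"),
        ("SAVEL", "snd"), ("FORKSTAY", 7), ("CMP", {"b"}), ("SAVER", "snd")]
print(run(GREEDY, "aab"), run(LAZY, "ab"))
```

The two programs are the ones used in the sample runs that follow; the only difference between greedy and lazy behaviour is which side of the fork is added to the thread list first.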
TrexVM1.1 sample run
Program
    SAVEL fst
L1: CMP [ab]
    FORKJUMP L1
    SAVER fst
    SAVEL snd
    CMP b
    SAVER snd
Input string: aab
Thread registers (L and R positions per group; the slides number threads by priority, T1 highest)
1. Start, position 0: T1 executes SAVEL fst (fst: L=0) and waits at CMP [ab].
2. "a" consumed, position 1: T1 passes CMP [ab] and hits FORKJUMP L1. T1 jumps back to L1 (fst: L=0). T2 continues with SAVER fst and SAVEL snd (fst: L=0 R=1, snd: L=1), then waits at CMP b.
3. "a" consumed, position 2: T2 fails CMP b and dies. T1 passes CMP [ab] and forks again: T1 is back at L1 (fst: L=0); the new T2 has fst: L=0 R=2, snd: L=2 and waits at CMP b.
4. "b" consumed, position 3: both threads pass their CMP. T1 forks into T1 at L1 (fst: L=0) and T2 at CMP b (fst: L=0 R=3, snd: L=3). The former T2, now shown as T3, executes SAVER snd (fst: L=0 R=2, snd: L=2 R=3) and reaches the end of the program.
5. End of input: T1 and T2 are still waiting at a CMP and die; T3 remains.
Success! Match groups: fst: aa snd: b
Regex: ([ab]+)(b)
Another run
TrexVM1.1 second run
Program
    SAVEL fst
L1: CMP [ab]
    FORKSTAY L1
    SAVER fst
    SAVEL snd
    FORKSTAY L2
    CMP b
L2: SAVER snd
Input string: ab
Thread registers (L and R positions per group; T1 has the highest priority)
1. Start, position 0: T1 executes SAVEL fst (fst: L=0) and waits at CMP [ab].
2. "a" consumed, position 1: T1 passes CMP [ab] and hits FORKSTAY L1. T1 stays on the straight path; T2 jumps back to L1 (fst: L=0).
3. T1 executes SAVER fst and SAVEL snd (fst: L=0 R=1, snd: L=1) and hits FORKSTAY L2. T1 stays and waits at CMP b; a new T2 with the same registers jumps to L2; the thread waiting at L1 is now T3.
4. T2 executes SAVER snd (fst: L=0 R=1, snd: L=1 R=1) and reaches the end of the program, a match with an empty snd. T3 has lower priority than a thread that has already matched, so it is dropped. T1 has higher priority, so the run continues.
5. "b" consumed, position 2: T1 passes CMP b and executes SAVER snd (fst: L=0 R=1, snd: L=1 R=2). It reaches the end of the program and supersedes T2.
Success! Match groups: fst: a snd: b
Regex: ([ab]+?)(b?)
Implementation
Concatenation
(cat regex1 regex2 … regexN)
    #INCLUDE regex1
    #INCLUDE regex2
    ...
    #INCLUDE regexN
Quantifiers
(+ regex)
    #LABEL _LOOP
    #INCLUDE regex
    FORKJUMP _LOOP
(+? regex)
    #LABEL _LOOP
    #INCLUDE regex
    FORKSTAY _LOOP
(* regex)
    #LABEL _START
    FORKSTAY _END
    #INCLUDE regex
    JUMP _START
    #LABEL _END
(*? regex)
    #LABEL _START
    FORKJUMP _END
    #INCLUDE regex
    JUMP _START
    #LABEL _END
(? regex)
    FORKSTAY _SKIP
    #INCLUDE regex
    #LABEL _SKIP
(?? regex)
    FORKJUMP _SKIP
    #INCLUDE regex
    #LABEL _SKIP
Alternation
(| regex1 regex2)
    FORKSTAY _ALT
    #INCLUDE regex1
    JUMP _END
    #LABEL _ALT
    #INCLUDE regex2
    #LABEL _END
Match groups
(as group regex)
    SAVEL name
    #INCLUDE regex
    SAVER name
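The templates above translate almost mechanically into a recursive compiler. The following is a sketch under assumptions, not the talk's Clojure code: expressions are nested Python tuples, anything that is not a tuple becomes a CMP argument, and labels are replaced by absolute instruction indices computed from a hypothetical `base` offset.

```python
def compile_rx(rx, base=0):
    """Compile a nested-tuple expression into a flat instruction list.
    `base` is the index at which the emitted code will start."""
    if not isinstance(rx, tuple):            # leaf: any CMP argument
        return [("CMP", rx)]
    op, *args = rx
    if op == "cat":
        code = []
        for sub in args:
            code += compile_rx(sub, base + len(code))
        return code
    if op in ("+", "+?"):
        fork = "FORKJUMP" if op == "+" else "FORKSTAY"
        return compile_rx(args[0], base) + [(fork, base)]
    if op in ("*", "*?"):
        fork = "FORKSTAY" if op == "*" else "FORKJUMP"
        body = compile_rx(args[0], base + 1)
        end = base + 1 + len(body) + 1       # index just past the JUMP
        return [(fork, end)] + body + [("JUMP", base)]
    if op in ("?", "??"):
        fork = "FORKSTAY" if op == "?" else "FORKJUMP"
        body = compile_rx(args[0], base + 1)
        return [(fork, base + 1 + len(body))] + body
    if op == "|":
        left = compile_rx(args[0], base + 1)
        alt = base + 1 + len(left) + 1       # index where regex2 starts
        right = compile_rx(args[1], alt)
        return [("FORKSTAY", alt)] + left + [("JUMP", alt + len(right))] + right
    if op == "as":
        name, sub = args
        return [("SAVEL", name)] + compile_rx(sub, base + 1) + [("SAVER", name)]
    raise ValueError(f"unknown operator: {op!r}")


for instruction in compile_rx(("cat", ("as", "fst", ("+", "[ab]")), ("as", "snd", "b"))):
    print(instruction)
```

Compiling `([ab]+)(b)` this way reproduces the seven-instruction program from the first TrexVM1.1 sample run, with `L1` resolved to index 1.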
Possible CMP arguments
● String (check if token is equal)
● Char-level regex (check if token matches)
● Hashset/map (check if it contains the token)
● Arbitrary predicate (check if token satisfies)
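One way to picture these four kinds of CMP argument is a small dispatch function. This is a hedged sketch: `token_matches` is a made-up name, and using a full match for the char-level regex case is an assumption, since the slide only says "matches".

```python
import re


def token_matches(arg, token):
    """Decide whether a single token satisfies a CMP argument."""
    if isinstance(arg, str):                      # string: plain equality
        return token == arg
    if isinstance(arg, re.Pattern):               # char-level regex on the token
        return arg.fullmatch(token) is not None   # assumption: whole token must match
    if isinstance(arg, (set, frozenset, dict)):   # hashset/map: membership
        return token in arg
    if callable(arg):                             # arbitrary predicate
        return bool(arg(token))
    raise TypeError(f"unsupported CMP argument: {arg!r}")


print(token_matches({"Street", "Drive"}, "Drive"))
```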
TrexVM syntax
Now we can write regular expressions like this:
(as :person
  (| (cat (? is-honorific) (+ first-name-dict) last-name-dict)
     (cat (? is-determiner) profession-dict)))
Extra features
Negative and positive (nested) lookaheads
(cat (+ name-dict) (?! is-stop-word))
Implemented as a sub-virtual machine inside each thread.
Extra features
Start/end of sequence anchors (^ and $)
(cat < is-number (? is-word) >)
Extra features
Cheap anchor-free matching
(cat (*? .) regex…)
Extra features
Loop detection
(rx-find (cat (* (| a (cat a a))) b)
         (repeat 300 a))
=> nil
Extra features
Composability
(def numbered-street-name
  (+ (| is-ordinal-number street-name-dict)))
(def street
  (| (cat (? #"\d+") numbered-street-name street-marker-dict)
     (cat #"\d+" numbered-street-name #{"Count" "Drive"})))
Shortcomings
● No look-behinds
● No backreferences
● Can’t find all matches (only first match)
● Overkill complexity for trivial cases
Under the hood
Naive implementation
● 300 lines of Clojure
● Immutable VM and Thread objects
● Very concise and debuggable code
● Slow!
○ Real-world regex scan takes ~1ms
per sentence
Optimizations
Inline caching of compiled regexes.
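Bad Java regexp usage
// String.matches() recompiles the pattern on every call.
for (String sentence : sentences) {
    sentence.matches("^((?i)[rea]lly* hard+ regex)$");
}
Good Java regexp usage
import java.util.regex.Pattern;
// Compile once outside the loop, then reuse the Pattern.
Pattern p = Pattern.compile("^((?i)[rea]lly* hard+ regex)$");
for (String sentence : sentences) {
    p.matcher(sentence).matches();
}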
Clojure: the power of macros
(for [sentence-tokens all-sentences]
(find (trex (+ (? company-name)) ...)
sentence-tokens))
Optimizations
Inline caching of compiled regexes.
3x performance improvement.
Optimizations
● Started rewriting parts of the
implementation into Java.
● Made some objects mutable.
● Made everything mutable.
● Made some things immutable again.
Final version
● 300 lines of Java
○ VM completely rewritten in Java.
● 300 lines of Clojure
○ Function definitions, compiler, API (find, matches, …)
● Mix of mutable and immutable objects with copy-on-write
fields.
● 20x faster than the initial version.
○ The same regex now takes ~50μs per sentence (down from ~1ms).
Future work
● Improve the performance for trivial cases.
○ More static analysis and optimization in
regex compilation phase.
● JIT compiler and branch prediction
○ Leverage runtime knowledge about
often-failed CMPs in the regex.
○ Make it vulnerable to Meltdown/Spectre.
● Investigate if look-behinds are possible
(through hacks and reduced perf)
Conclusions
● Old papers contain great ideas
● Knowledge from university can be useful
● Make it work, then make it fast
References
● Original paper:
https://swtch.com/~rsc/regexp/regexp2.html
● Open-source implementation (in Clojure):
https://github.com/cgrand/seqexp
● Synacor Challenge
https://challenge.synacor.com
.+the end$

More Related Content

PDF
How it's made: C++ compilers (GCC)
PDF
Integrating R with C++: Rcpp, RInside and RProtoBuf
PDF
Java/Scala Lab: Руслан Шевченко - Implementation of CSP (Communication Sequen...
PDF
Rcpp
PPTX
Streams for the Web
PDF
Making our Future better
PDF
Csp scala wixmeetup2016
PDF
How it's made: C++ compilers (GCC)
Integrating R with C++: Rcpp, RInside and RProtoBuf
Java/Scala Lab: Руслан Шевченко - Implementation of CSP (Communication Sequen...
Rcpp
Streams for the Web
Making our Future better
Csp scala wixmeetup2016

What's hot (20)

PDF
Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)
PPT
Introduction to gdb
PDF
Performance evaluation with Arm HPC tools for SVE
ODP
C Under Linux
PDF
Porting and Optimization of Numerical Libraries for ARM SVE
PPTX
Berkeley Packet Filters
ODP
ocelot
PDF
Arm tools and roadmap for SVE compiler support
PPTX
Staring into the eBPF Abyss
PDF
Javascript Secrets - Front in Floripa 2015
PDF
Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
PPTX
Gnu debugger
PDF
Low pause GC in HotSpot
PDF
Code GPU with CUDA - Optimizing memory and control flow
PDF
LTO plugin
PDF
Debugging node in prod
PDF
不深不淺,帶你認識 LLVM (Found LLVM in your life)
PDF
Knowing your Garbage Collector / Python Madrid
PDF
Introduction to RevKit
DOCX
Exercice.docx
Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)
Introduction to gdb
Performance evaluation with Arm HPC tools for SVE
C Under Linux
Porting and Optimization of Numerical Libraries for ARM SVE
Berkeley Packet Filters
ocelot
Arm tools and roadmap for SVE compiler support
Staring into the eBPF Abyss
Javascript Secrets - Front in Floripa 2015
Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
Gnu debugger
Low pause GC in HotSpot
Code GPU with CUDA - Optimizing memory and control flow
LTO plugin
Debugging node in prod
不深不淺,帶你認識 LLVM (Found LLVM in your life)
Knowing your Garbage Collector / Python Madrid
Introduction to RevKit
Exercice.docx
Ad

Similar to Virtual Machine for Regular Expressions (20)

PPTX
Advanced procedures in assembly language Full chapter ppt
PDF
Lecture 3 RE NFA DFA
PPTX
Computer Architecture Assignment Help
PDF
lec15_x86procedure_4up.pdf
PPTX
07 140430-ipp-languages used in llvm during compilation
PDF
Implement an MPI program to perform matrix-matrix multiplication AB .pdf
PDF
Exploitation Crash Course
PDF
R/C++ talk at earl 2014
PDF
CAMP-V: Ultimate RISC-V Bootcamp sponsored by MERL and RISC-V
PDF
System Hacking Tutorial #2 - Buffer Overflow - Overwrite EIP
PDF
Continuation Passing Style and Macros in Clojure - Jan 2012
PDF
Make ARM Shellcode Great Again - HITB2018PEK
PPTX
Node.js - Advanced Basics
PDF
Assembly class
PPT
other-architectures.ppt
PDF
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
PPT
Assembly language programming_fundamentals 8086
PDF
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
PPT
chapt_5+6AssemblyLanguagecompleteclear.ppt
PDF
Visual Studio를 이용한 어셈블리어 학습 part 2
Advanced procedures in assembly language Full chapter ppt
Lecture 3 RE NFA DFA
Computer Architecture Assignment Help
lec15_x86procedure_4up.pdf
07 140430-ipp-languages used in llvm during compilation
Implement an MPI program to perform matrix-matrix multiplication AB .pdf
Exploitation Crash Course
R/C++ talk at earl 2014
CAMP-V: Ultimate RISC-V Bootcamp sponsored by MERL and RISC-V
System Hacking Tutorial #2 - Buffer Overflow - Overwrite EIP
Continuation Passing Style and Macros in Clojure - Jan 2012
Make ARM Shellcode Great Again - HITB2018PEK
Node.js - Advanced Basics
Assembly class
other-architectures.ppt
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Assembly language programming_fundamentals 8086
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
chapt_5+6AssemblyLanguagecompleteclear.ppt
Visual Studio를 이용한 어셈블리어 학습 part 2
Ad

Recently uploaded (20)

PDF
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
PPTX
Transform Your Business with a Software ERP System
PPTX
CHAPTER 2 - PM Management and IT Context
PDF
How to Migrate SBCGlobal Email to Yahoo Easily
PDF
Navsoft: AI-Powered Business Solutions & Custom Software Development
PDF
Which alternative to Crystal Reports is best for small or large businesses.pdf
PPTX
Reimagine Home Health with the Power of Agentic AI​
PDF
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
PDF
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
PPTX
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
PDF
top salesforce developer skills in 2025.pdf
PPTX
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
PDF
medical staffing services at VALiNTRY
PPTX
ai tools demonstartion for schools and inter college
PPTX
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
PDF
Wondershare Filmora 15 Crack With Activation Key [2025
PDF
2025 Textile ERP Trends: SAP, Odoo & Oracle
PDF
Nekopoi APK 2025 free lastest update
PDF
Odoo Companies in India – Driving Business Transformation.pdf
PDF
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
Transform Your Business with a Software ERP System
CHAPTER 2 - PM Management and IT Context
How to Migrate SBCGlobal Email to Yahoo Easily
Navsoft: AI-Powered Business Solutions & Custom Software Development
Which alternative to Crystal Reports is best for small or large businesses.pdf
Reimagine Home Health with the Power of Agentic AI​
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
top salesforce developer skills in 2025.pdf
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
medical staffing services at VALiNTRY
ai tools demonstartion for schools and inter college
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
Wondershare Filmora 15 Crack With Activation Key [2025
2025 Textile ERP Trends: SAP, Odoo & Oracle
Nekopoi APK 2025 free lastest update
Odoo Companies in India – Driving Business Transformation.pdf
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool

Virtual Machine for Regular Expressions

  • 1. Virtual Machine for Regular Expressions Alexander Yakushev @unlog1c JEEConf 2018
  • 3. Stephen Cole Kleene - Inventor of regular expressions ^ This guy
  • 4. What's a regular expression? A text-matching automaton. /^a+b?(cd*|e)$/ "aaaaabcddd" ✓ "aabbdd" ❌
  • 5. Regular expression as FSM ^a+b?(cd*|e)$
  • 6. Regular expression as FSM ^a+b?(cd*|e)$
  • 7. Why implement regular expressions yourself? All modern programming languages provide regular expressions with core library.
  • 8. Why implement regular expressions yourself? All modern programming languages provide regular expressions with core library. But those are only character-level regexps.
  • 9. At Grammarly, we need to define token-level rules that are describable with regex semantics. Why implement regular expressions yourself? <person> = (<honorific>? <first-name-dict>+ <last-name-dict>) | (<determiner>? <profession-dict>)
  • 10. Ways of implementing regex engines 1. Backtracking ○ Runtime: exponential
  • 11. Backtracking implementation (by Rob Pike) int match(char *regexp, char *text) { if (regexp[0] == '0') return 1; if (regexp[1] == '*') return matchstar(regexp[0], regexp+2, text); if (*text!='0' && (regexp[0]=='.' || regexp[0]==*text)) return match(regexp+1, text+1); return 0; } int matchstar(char c, char *regexp, char *text) { do { if (match(regexp, text)) return 1; } while (*text != '0' && (*text++ == c || c == '.')); return 0; }
  • 12. Backtracking implementation ● Good: simple and short. ○ Fit into 15 lines! ● Bad: exponential complexity.
  • 13. How to kill a snake $ python >>> import re >>> s = "a" * 50 >>> re.match("(a|aa)*b", s) # crickets… chirp chirp
  • 14. Ways of implementing regex engines 1. Backtracking ○ Runtime: exponential 2. Full FSM unroll (static NFA->DFA) ○ Compilation: exponential time and memory ○ Runtime: linear O(n)
  • 15. Ways of implementing regex engines 1. Backtracking ○ Runtime: exponential 2. Full FSM unroll (static NFA->DFA) ○ Compilation: exponential time and memory ○ Runtime: linear O(n) 3. Dynamic FSM unroll (“lazy” DFA construction) ○ Runtime: linear O(nm)
  • 16. Dynamic FSM unroll 1. Google RE2* 2. Rust’s regex library† 3. Virtual machine approach‡ * github.com/google/re2 † github.com/rust-lang/regex ‡ swtch.com/~rsc/regexp/regexp2.html
  • 18. MOV CX, 10 MOV AX, [BP] L: CMP CX, 0 JZ E SHL AX, 1 DEC CX JMP L E: RET Machine code (actually, assembly) AX 123 BX 234 CX 9 ... IP 0 Registers Crude example of an x86 machine
  • 19. MOV CX, 10 MOV AX, [BP] L: CMP CX, 0 JZ E SHL AX, 1 DEC CX JMP L E: RET Machine code (actually, assembly) AX 123 BX 234 CX 9 ... IP 0 Thread 1 Registers AX 321 BX 432 CX 0 ... IP 7 Thread 2 Registers AX 879 BX 567 CX 4 ... IP 3 Thread 3 Registers Crude example of an x86 machine
  • 20. MOV CX, 10 MOV AX, [BP] L: CMP CX, 0 JZ E SHL AX, 1 DEC CX JMP L E: RET Machine code (actually, assembly) AX 123 BX 234 CX 9 ... IP 0 Thread 1 Registers Thread 2 Registers AX 321 BX 432 CX 0 ... IP 7 Crude example of a virtual machine
  • 21. Examples of virtual machines 1. VirtualBox/VMWare/KVM 2. Java Virtual Machine 3. Domain-specific VMs
  • 22. TrexVM (Token RegEX Virtual Machine) ● Consumes input sequence token by token. ○ Never goes back to previous tokens. ● IP (instruction pointer) tracks the currently executed instruction. ● Instructions: ○ CMP x Compare current token to x, increment IP if equal, fail if not. ○ JUMP label Unconditionally set IP to the point designated by label. ○ FORK label Increment IP and spawn additional thread that jumps to label.
  • 23. TrexVM (Token RegEX Virtual Machine) ● All threads are executed in a lock-step. ○ Execute all CMP instructions simultaneously. ○ If some threads point not to CMP, statically unroll them. ● Execution continues until one thread reaches the end of the program (successful match) or all threads are dead (failed match)
  • 24. TrexVM sample run aaab Input string ↑ 1→ L1: CMP a FORK L1 CMP b Program
  • 25. TrexVM sample run aaab Input string ↑1→ L1: CMP a FORK L1 CMP b Program
  • 26. TrexVM sample run aaab Input string ↑ 1→ 2→ L1: CMP a FORK L1 CMP b Program
  • 27. TrexVM sample run aaab Input string ↑2→ L1: CMP a FORK L1 CMP b Program
  • 28. TrexVM sample run aaab Input string ↑ 2→ 3→ L1: CMP a FORK L1 CMP b Program
  • 29. TrexVM sample run aaab Input string ↑3→ L1: CMP a FORK L1 CMP b Program
  • 30. TrexVM sample run L1: CMP a FORK L1 CMP b Program aaab Input string ↑ 3→ 4→
  • 31. TrexVM sample run aaab Input string ↑ 3→ Success! L1: CMP a FORK L1 CMP b Program
  • 32. TrexVM sample run aaab Input string ↑ 3→ Success! Regex: a+b L1: CMP a FORK L1 CMP b Program
  • 33. TrexVM FORK L1 CMP a L1: FORK L2 CMP b JUMP L1 L2: CMP c Guess the regular expression:
  • 34. TrexVM FORK L1 CMP a L1: FORK L2 CMP b JUMP L1 L2: CMP c Regex: a?b*c Guess the regular expression:
  • 35. TrexVM: next iteration Added support for match groups. Each thread now has a register bank (beyond just IP register). New instructions: ● SAVEL group — save current position in input as beginning of group. ● SAVER group — save current position in input as ending of group. ● FORKSTAY label — FORK which gives the staying thread a higher priority. ● FORKJUMP label — FORK which gives the jumping thread a higher priority. Matching stops when the thread with the highest priority succeeds.
  • 36. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst snd T1 1→
  • 37. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→
  • 38. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→
  • 39. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→ 2→ Reg L R fst 0 snd T2
  • 40. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→ 2→ Reg L R fst 0 1 snd T2
  • 41. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→ 2→ Reg L R fst 0 1 snd 1 T2
  • 42. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→
  • 43. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→ 2→ Reg L R fst 0 snd T2
  • 44. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→ Reg L R fst 0 2 snd T2 2→
  • 45. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→ Reg L R fst 0 2 snd 2 T2 2→
  • 46. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 Reg L R fst 0 2 snd 2 T2 2→ 1→
  • 47. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 Reg L R fst 0 snd T2 1→ 2→ 3→ Reg L R fst 0 2 snd 2 3 T3
  • 48. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 Reg L R fst 0 3 snd T2 1→ 3→ Reg L R fst 0 2 snd 2 3 T3 2→
  • 49. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 Reg L R fst 0 3 snd 3 T2 1→ 3→ Reg L R fst 0 2 snd 2 3 T3 2→
  • 50. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers 3→ Reg L R fst 0 2 snd 2 3 T3 Success! Match groups: fst: aa snd: b
  • 51. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers 3→ Reg L R fst 0 2 snd 2 3 T3 Success! Match groups: fst: aa snd: b Regex: ([ab]+)(b)
  • 53. TrexVM1.1 second run. Program: SAVEL fst; L1: CMP [ab]; FORKSTAY L1; SAVER fst; SAVEL snd; FORKSTAY L2; CMP b; L2: SAVER snd. Input string: ab. Thread registers: T1: fst and snd unset.
  • 54-55. Second run, continued. Thread registers: T1: fst L=0.
  • 56. Second run, continued. Thread registers: T1: fst L=0. T2: fst L=0.
  • 57. Second run, continued. Thread registers: T1: fst L=0 R=1. T2: fst L=0.
  • 58. Second run, continued. Thread registers: T1: fst L=0 R=1, snd L=1. T2: fst L=0.
  • 59. Second run, continued. Thread registers: T1: fst L=0 R=1, snd L=1. T2: fst L=0 R=1, snd L=1. T3: fst L=0.
  • 60-61. Second run, continued. Thread registers: T1: fst L=0 R=1, snd L=1. T2: fst L=0 R=1, snd L=1 R=1.
  • 62. Second run, continued. Thread registers: T1: fst L=0 R=1, snd L=1 R=2.
  • 63-64. Second run, continued. Thread registers: T1: fst L=0 R=1, snd L=1 R=2. Success! Match groups: fst: a, snd: b. Regex: ([ab]+?)(b?)
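The two sample runs can be reproduced with a small thread-list interpreter. The following is a minimal sketch, not the TrexVM implementation: it assumes that FORKJUMP gives priority to the jump target and FORKSTAY to the next instruction (which is what the greedy and lazy runs above suggest), that a match is anchored at the start of the input, and that labels are stored as no-op `("LABEL", name)` instructions. The names `run`, `groups`, `in_ab` and `is_b` are hypothetical.

```python
def run(program, tokens):
    """Return {group: (left, right)} for the highest-priority match, or None."""
    labels = {arg: pc for pc, (op, arg) in enumerate(program) if op == "LABEL"}
    end = len(program)

    def add_thread(threads, seen, pc, regs, pos):
        # Follow every non-consuming instruction now; stop at CMP or program end.
        if pc in seen:
            return  # a higher-priority thread already reached this instruction
        seen.add(pc)
        if pc == end:
            threads.append((pc, regs))
            return
        op, arg = program[pc]
        if op == "LABEL":
            add_thread(threads, seen, pc + 1, regs, pos)
        elif op == "JUMP":
            add_thread(threads, seen, labels[arg], regs, pos)
        elif op == "FORKJUMP":  # assumed: prefer the jump target
            add_thread(threads, seen, labels[arg], regs, pos)
            add_thread(threads, seen, pc + 1, regs, pos)
        elif op == "FORKSTAY":  # assumed: prefer the next instruction
            add_thread(threads, seen, pc + 1, regs, pos)
            add_thread(threads, seen, labels[arg], regs, pos)
        elif op == "SAVEL":
            add_thread(threads, seen, pc + 1, {**regs, arg: (pos, None)}, pos)
        elif op == "SAVER":
            add_thread(threads, seen, pc + 1, {**regs, arg: (regs[arg][0], pos)}, pos)
        else:  # CMP: the thread waits here for the next token
            threads.append((pc, regs))

    current = []
    add_thread(current, set(), 0, {}, 0)
    best = None
    for pos in range(len(tokens) + 1):
        following, seen = [], set()
        for pc, regs in current:
            if pc == end:
                best = regs
                break  # threads with lower priority than a match are dropped
            predicate = program[pc][1]
            if pos < len(tokens) and predicate(tokens[pos]):
                add_thread(following, seen, pc + 1, regs, pos + 1)
        current = following
        if not current:
            break
    return best


def groups(regs, tokens):
    """Turn saved (left, right) registers into the matched slices."""
    return {name: tokens[left:right] for name, (left, right) in regs.items()}


in_ab = lambda tok: tok in ("a", "b")
is_b = lambda tok: tok == "b"

# ([ab]+)(b)
greedy = [("SAVEL", "fst"), ("LABEL", "L1"), ("CMP", in_ab), ("FORKJUMP", "L1"),
          ("SAVER", "fst"), ("SAVEL", "snd"), ("CMP", is_b), ("SAVER", "snd")]

# ([ab]+?)(b?)
lazy = [("SAVEL", "fst"), ("LABEL", "L1"), ("CMP", in_ab), ("FORKSTAY", "L1"),
        ("SAVER", "fst"), ("SAVEL", "snd"), ("FORKSTAY", "L2"), ("CMP", is_b),
        ("LABEL", "L2"), ("SAVER", "snd")]

print(groups(run(greedy, "aab"), "aab"))
print(groups(run(lazy, "ab"), "ab"))
```

Each thread carries its own copy of the registers, so the winning thread's L/R values directly give the match groups, as in the register tables on the slides.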
  • 66. Concatenation. (cat regex1 regex2 … regexN) expands to: #INCLUDE regex1; #INCLUDE regex2; ...; #INCLUDE regexN
  • 67. Quantifiers. (+ regex) expands to: #LABEL _LOOP; #INCLUDE regex; FORKJUMP _LOOP. (+? regex) expands to: #LABEL _LOOP; #INCLUDE regex; FORKSTAY _LOOP
  • 68. Quantifiers. (* regex) expands to: #LABEL _START; FORKSTAY _END; #INCLUDE regex; JUMP _START; #LABEL _END. (*? regex) expands to: #LABEL _START; FORKJUMP _END; #INCLUDE regex; JUMP _START; #LABEL _END
  • 69. Quantifiers. (? regex) expands to: FORKSTAY _SKIP; #INCLUDE regex; #LABEL _SKIP. (?? regex) expands to: FORKJUMP _SKIP; #INCLUDE regex; #LABEL _SKIP
  • 70. Alternation. (| regex1 regex2) expands to: FORKSTAY _ALT; #INCLUDE regex1; JUMP _END; #LABEL _ALT; #INCLUDE regex2; #LABEL _END
  • 71. Match groups. (as group regex) expands to: SAVEL name; #INCLUDE regex; SAVER name
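The templates on slides 66-71 map naturally onto a recursive compiler. Below is a minimal sketch under a few assumptions of mine: regexes are written as nested Python tuples such as `("+", x)`, anything that is not a tuple becomes a CMP argument, and label names get a numeric suffix to keep them unique. The function name `compile_rx` and the tuple encoding are hypothetical, not the TrexVM API.

```python
import itertools


def compile_rx(rx, ids=None):
    """Expand a nested-tuple regex into a flat list of (opcode, argument) pairs."""
    ids = itertools.count() if ids is None else ids
    if not isinstance(rx, tuple):
        return [("CMP", rx)]  # leaf: compare one token against rx
    op, *args = rx
    n = next(ids)  # suffix that keeps this node's labels unique
    if op == "cat":
        return [ins for part in args for ins in compile_rx(part, ids)]
    if op in ("+", "+?"):
        fork = "FORKJUMP" if op == "+" else "FORKSTAY"
        loop = f"_LOOP{n}"
        return [("LABEL", loop), *compile_rx(args[0], ids), (fork, loop)]
    if op in ("*", "*?"):
        fork = "FORKSTAY" if op == "*" else "FORKJUMP"
        start, end = f"_START{n}", f"_END{n}"
        return [("LABEL", start), (fork, end), *compile_rx(args[0], ids),
                ("JUMP", start), ("LABEL", end)]
    if op in ("?", "??"):
        fork = "FORKSTAY" if op == "?" else "FORKJUMP"
        skip = f"_SKIP{n}"
        return [(fork, skip), *compile_rx(args[0], ids), ("LABEL", skip)]
    if op == "|":
        alt, end = f"_ALT{n}", f"_END{n}"
        return [("FORKSTAY", alt), *compile_rx(args[0], ids), ("JUMP", end),
                ("LABEL", alt), *compile_rx(args[1], ids), ("LABEL", end)]
    if op == "as":
        name, body = args
        return [("SAVEL", name), *compile_rx(body, ids), ("SAVER", name)]
    raise ValueError(f"unknown operator: {op!r}")


# ([ab]+)(b) from the first sample run, with "[ab]" standing in for the CMP argument
for instruction in compile_rx(("cat", ("as", "fst", ("+", "[ab]")), ("as", "snd", "b"))):
    print(instruction)
```

Because each operator only emits forks, jumps and labels around its already-compiled children, the program length grows linearly with the size of the regex.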
  • 72. Possible CMP arguments ● String (check if token is equal) ● Char-level regex (check if token matches) ● Hashset/map (check if it contains the token) ● Arbitrary predicate (check if token satisfies)
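One way to support the four kinds of CMP argument is a single dispatch function. This is only a sketch: the name `token_matches` is hypothetical, and using a full match (rather than a search) for the char-level regex case is my assumption.

```python
import re


def token_matches(arg, token):
    """Decide whether a token satisfies a CMP argument of any supported kind."""
    if isinstance(arg, str):
        return token == arg                       # string: equality
    if isinstance(arg, re.Pattern):
        return arg.fullmatch(token) is not None   # char-level regex (assumed full match)
    if isinstance(arg, (set, frozenset, dict)):
        return token in arg                       # hashset/map: membership
    if callable(arg):
        return bool(arg(token))                   # arbitrary predicate
    raise TypeError(f"unsupported CMP argument: {arg!r}")


print(token_matches("Drive", "Drive"))
print(token_matches(re.compile(r"\d+"), "221"))
print(token_matches({"Mr", "Mrs", "Dr"}, "Dr"))
print(token_matches(str.istitle, "london"))
```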
  • 73. TrexVM syntax. Now we can write regular expressions like this: (as :person (| (cat (? is-honorific) (+ first-name-dict) last-name-dict) (cat (? is-determiner) profession-dict)))
  • 74-75. Extra features. Negative and positive (nested) lookaheads: (cat (+ name-dict) (?! is-stop-word)). Implemented as a sub-virtual machine inside each thread.
  • 76. Start/end of sequence anchors (^ and $) (cat < is-number (? is-word) >) Extra features
  • 77. Extra features (cat (*? .) regex…) Cheap anchor-free matching
  • 78. Loop detection (rx-find (cat (* (| a (cat a a))) b) (repeat 300 a)) => nil Extra features
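A sketch of why the thread-list approach gets this case cheaply, assuming loop detection means "never add the same instruction twice within one input step". The program below is a hand expansion of (cat (* (| a (cat a a))) b) with numeric jump targets instead of labels; the function name `find_prefix` and the returned thread count are hypothetical additions for illustration, and captures are left out.

```python
def find_prefix(program, tokens):
    """Return (matched, widest): whether a prefix matches, and the largest thread set seen."""
    end = len(program)

    def expand(pcs):
        # Follow JUMP/FORK instructions, visiting each instruction at most once per step.
        seen, todo = set(), list(pcs)
        while todo:
            pc = todo.pop()
            if pc in seen:
                continue  # loop detection: this instruction was already reached
            seen.add(pc)
            if pc < end:
                op, arg = program[pc]
                if op == "JUMP":
                    todo.append(arg)
                elif op in ("FORKSTAY", "FORKJUMP"):
                    todo += [pc + 1, arg]
        return {pc for pc in seen if pc == end or program[pc][0] == "CMP"}

    threads = expand([0])
    widest = len(threads)
    for tok in tokens:
        if end in threads:
            return True, widest
        threads = expand(pc + 1 for pc in threads if pc < end and program[pc][1] == tok)
        widest = max(widest, len(threads))
        if not threads:
            break
    return end in threads, widest


# (cat (* (| a (cat a a))) b)
program = [
    ("FORKSTAY", 7),  # 0: _START, enter the loop or leave to _END
    ("FORKSTAY", 4),  # 1: first alternative or _ALT
    ("CMP", "a"),     # 2
    ("JUMP", 6),      # 3
    ("CMP", "a"),     # 4: _ALT
    ("CMP", "a"),     # 5
    ("JUMP", 0),      # 6: back to _START
    ("CMP", "b"),     # 7: _END
]

matched, widest = find_prefix(program, ["a"] * 300)
print(matched, widest <= len(program))
```

Since the thread set can never hold more entries than the program has instructions, the work per token stays bounded, unlike the backtracking example from the beginning of the talk. The same check also stops empty loops such as (* (? a)) from spinning forever.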
  • 79. Extra features. Composability: (def numbered-street-name (+ (| is-ordinal-number street-name-dict))) (def street (| (cat (? #"\d+") numbered-street-name street-marker-dict) (cat #"\d+" numbered-street-name #{"Count" "Drive"})))
  • 80. Shortcomings ● No look-behinds ● No backreferences ● Can’t find all matches (only first match) ● Overkill complexity for trivial cases
  • 83. Naive implementation ● 300 lines of Clojure ● Immutable VM and Thread objects ● Very concise and debuggable code ● Slow! ○ Real-world regex scan takes ~1ms per sentence
  • 84. Optimizations Inline caching of compiled regexes.
  • 85. Bad Java regexp usage (String.matches recompiles the pattern on every call): for (String sentence : sentences) { sentence.matches("^(?i)([rea]lly* hard+ regex)$"); }
  • 86. Good Java regexp usage (compile once, reuse the Pattern): Pattern p = Pattern.compile("^(?i)([rea]lly* hard+ regex)$"); for (String sentence : sentences) { p.matcher(sentence).matches(); }
  • 87. Clojure: the power of macros (for [sentence-tokens all-sentences] (find (trex (+ (? company-name)) ...) sentence-tokens))
  • 88. Optimizations Inline caching of compiled regexes. 3x performance improvement.
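The same idea can be sketched outside of Clojure with a memoized compile step, so that a regex written inline in a loop is still compiled only once. This is an analogy rather than the macro-based mechanism from the talk: `cached_pattern` and `count_matches` are hypothetical names, and Python's character-level `re` stands in for the token-level engine.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_pattern(source):
    """Compile a pattern on first use and reuse the compiled object afterwards."""
    return re.compile(source)


def count_matches(sentences):
    # The pattern is written inline, but compilation happens only on the first call.
    return sum(
        1 for sentence in sentences
        if cached_pattern(r"(?i)rea+lly hard regex").fullmatch(sentence)
    )
```

Calling `cached_pattern.cache_info()` after a run shows one miss (the single compilation) and one hit per remaining sentence.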
  • 92. Optimizations ● Started rewriting parts of the implementation into Java. ● Made some objects mutable. ● Made everything mutable. ● Made some things immutable again.
  • 95. Final version ● 300 lines of Java ○ VM completely rewritten in Java. ● 300 lines of Clojure ○ Function definitions, compiler, API (find, matches, …) ● Mix of mutable and immutable objects with copy-on-write fields. ● 20x faster than the initial version. ○ The earlier real-world regex scan now takes 50μs per sentence (down from ~1ms).
  • 99. Future work ● Improve the performance for trivial cases. ○ More static analysis and optimization in regex compilation phase. ● JIT compiler and branch prediction ○ Leverage runtime knowledge about often-failed CMPs in the regex. ○ Make it vulnerable to Meltdown/Spectre. ● Investigate if look-behinds are possible (through hacks and reduced perf)
  • 100. Conclusions ● Old papers contain great ideas ● Knowledge from university can be useful ● Make it work, then make it fast
  • 101. References ● Original paper: https://swtch.com/~rsc/regexp/regexp2.html ● Open-source implementation (in Clojure): https://github.com/cgrand/seqexp ● Synacor Challenge: https://challenge.synacor.com