Virtual Machine for Regular Expressions

Alexander Yakushev
@unlog1c
JEEConf 2018

Stephen Cole Kleene - Inventor of regular expressions
^ This guy

What's a regular expression?
A text-matching automaton.
/^a+b?(cd*|e)$/
"aaaaabcddd" ✓
"aabbdd" ❌

Regular expression as FSM
^a+b?(cd*|e)$
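The pattern and the two sample strings from the slides above can be checked with any character-level engine. A minimal sketch using Python's built-in re module (the choice of Python and the helper name accepts are assumptions made for illustration only):

```python
import re

# The pattern from the slide; ^ and $ anchor it to the whole string.
PATTERN = re.compile(r"^a+b?(cd*|e)$")

def accepts(text):
    """Return True if the whole text matches the slide's pattern."""
    return PATTERN.match(text) is not None

print(accepts("aaaaabcddd"))  # True
print(accepts("aabbdd"))      # False
```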

Why implement regular
expressions yourself?
All modern programming languages provide
regular expressions in their core library.
But those are only character-level regexps.

At Grammarly, we need to define token-level rules
that are describable with regex semantics.
<person> = (<honorific>? <first-name-dict>+ <last-name-dict>) |
(<determiner>? <profession-dict>)

Ways of implementing regex engines
1. Backtracking
○ Runtime: exponential in the worst case

Backtracking implementation (by Rob Pike)
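int matchstar(char c, char *regexp, char *text);  /* forward declaration */

/* match: does regexp match at the beginning of text? */
int match(char *regexp, char *text) {
    if (regexp[0] == '\0')            /* end of pattern: success */
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return match(regexp+1, text+1);
    return 0;
}

/* matchstar: match c*regexp at the beginning of text */
int matchstar(char c, char *regexp, char *text) {
    do {                              /* a * matches zero or more instances */
        if (match(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}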

Backtracking implementation
● Good: simple and short.
○ Fits in about 15 lines!
● Bad: exponential worst-case complexity.

How to kill a snake
$ python
>>> import re
>>> s = "a" * 50
>>> re.match("(a|aa)*b", s)
# crickets… chirp chirp
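To see why this input hurts a backtracking engine, the sketch below counts the recursive calls a naive backtracker makes for (a|aa)*b on a string of n letters "a". It is a hand-written toy matcher for this one pattern (the function name is hypothetical), not CPython's actual re implementation; it only illustrates the growth rate.

```python
def count_backtracking_calls(n):
    """Count calls a naive backtracker makes for (a|aa)*b on 'a' * n."""
    text = "a" * n
    calls = 0

    def match_from(pos):
        nonlocal calls
        calls += 1
        # Alternative 1: consume a single "a" and loop again.
        if text.startswith("a", pos) and match_from(pos + 1):
            return True
        # Alternative 2: consume "aa" and loop again.
        if text.startswith("aa", pos) and match_from(pos + 2):
            return True
        # Leave the loop: the pattern now requires "b".
        return text.startswith("b", pos)

    assert match_from(0) is False  # there is no "b", so the match fails
    return calls

for n in (5, 10, 20):
    print(n, count_backtracking_calls(n))
# 5 20
# 10 232
# 20 28656
```

The counts follow the Fibonacci sequence, so each extra "a" multiplies the work by roughly 1.6; at n = 50 the toy matcher would need tens of billions of calls.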

1. Backtracking
2. Full FSM unroll (static NFA->DFA)
○ Compilation: exponential time and memory
○ Runtime: linear O(n)
3. Dynamic FSM unroll (“lazy” DFA construction)
○ Runtime: O(nm), linear in the input length n (m is the regex size)

Dynamic FSM unroll
1. Google RE2*
2. Rust’s regex library†
3. Virtual machine approach‡
* github.com/google/re2
† github.com/rust-lang/regex
‡ swtch.com/~rsc/regexp/regexp2.html

Crude example of an x86 machine
Machine code (actually, assembly)
    MOV CX, 10
    MOV AX, [BP]
L:  CMP CX, 0
    JZ E
    SHL AX, 1
    DEC CX
    JMP L
E:  RET
Registers
AX   123
BX   234
CX   9
...
IP   0

Crude example of an x86 machine
(same program as on the previous slide, one register set per thread)
     Thread 1   Thread 2   Thread 3
AX   123        321        879
BX   234        432        567
CX   9          0          4
...
IP   0          7          3

Crude example of a virtual machine
(same program as on the previous slide, one register set per thread)
     Thread 1   Thread 2
AX   123        321
BX   234        432
CX   9          0
...
IP   0          7

Examples of virtual machines
1. VirtualBox/VMware/KVM
2. Java Virtual Machine
3. Domain-specific VMs

TrexVM (Token RegEX Virtual Machine)
● Consumes input sequence token by token.
○ Never goes back to previous tokens.
● IP (instruction pointer) tracks the currently executed instruction.
● Instructions:
○ CMP x
Compare current token to x, increment IP if equal, fail if not.
○ JUMP label
Unconditionally set IP to the point designated by label.
○ FORK label
Increment IP and spawn an additional thread that jumps to label.

TrexVM (Token RegEX Virtual Machine)
● All threads are executed in lock-step.
○ Execute all CMP instructions simultaneously, against the same token.
○ If a thread points to a non-CMP instruction (JUMP or FORK), keep
executing it until it reaches a CMP.
● Execution continues until one thread reaches the end of the
program (successful match) or all threads are dead (failed
match).
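The lock-step execution described above can be sketched in a few lines of Python. This is a minimal illustration, not the TrexVM source: the tuple encoding of instructions, the use of integer instruction indexes instead of labels, and the name run_trexvm are all assumptions.

```python
def run_trexvm(program, tokens):
    """Return True if some thread reaches the end of the program."""
    end = len(program)

    def add_thread(ip, threads, seen):
        # Follow JUMP and FORK immediately so every stored thread
        # rests on a CMP (or on the end of the program).
        if ip in seen:          # this instruction is already covered
            return
        seen.add(ip)
        if ip == end:
            threads.append(ip)
            return
        op, arg = program[ip]
        if op == "JUMP":
            add_thread(arg, threads, seen)
        elif op == "FORK":
            add_thread(ip + 1, threads, seen)   # staying thread
            add_thread(arg, threads, seen)      # jumping thread
        else:                                   # CMP
            threads.append(ip)

    threads = []
    add_thread(0, threads, set())
    for token in tokens:        # one pass, never goes back
        if end in threads:
            return True         # a thread already finished the program
        next_threads, seen = [], set()
        for ip in threads:      # all CMPs see the same token
            if program[ip][1] == token:
                add_thread(ip + 1, next_threads, seen)
        threads = next_threads
        if not threads:
            return False        # all threads are dead
    return end in threads

# a+b, written with instruction indexes in place of labels
A_PLUS_B = [("CMP", "a"), ("FORK", 0), ("CMP", "b")]
print(run_trexvm(A_PLUS_B, "aaab"))  # True
print(run_trexvm(A_PLUS_B, "aaa"))   # False
```

Because a thread is identified only by its IP, at most one thread per instruction is alive at each token, which is where the O(nm) bound comes from.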

TrexVM sample run
aaab
Input string
↑
1→ L1: CMP a
FORK L1
CMP b
Program

TrexVM sample run
aaab
Input string
↑1→
L1: CMP a
FORK L1
CMP b
Program

TrexVM sample run
aaab
Input string
↑
1→
2→ L1: CMP a
FORK L1
CMP b
Program

TrexVM sample run
aaab
Input string
↑2→
L1: CMP a
FORK L1
CMP b
Program

TrexVM sample run
aaab
Input string
↑
2→
3→ L1: CMP a
FORK L1
CMP b
Program

TrexVM sample run
aaab
Input string
↑3→
L1: CMP a
FORK L1
CMP b
Program

TrexVM sample run
L1: CMP a
FORK L1
CMP b
Program
aaab
Input string
↑
3→
4→

TrexVM sample run
aaab
Input string
↑
3→ Success!
Regex: a+b
L1: CMP a
FORK L1
CMP b
Program

TrexVM
Guess the regular expression:
    FORK L1
    CMP a
L1: FORK L2
    CMP b
    JUMP L1
L2: CMP c
Regex: a?b*c

TrexVM: next iteration
Added support for match groups.
Each thread now has a register bank (beyond just the IP register).
New instructions:
● SAVEL group: save the current input position as the beginning of the group.
● SAVER group: save the current input position as the end of the group.
● FORKSTAY label: FORK that gives the staying thread a higher priority.
● FORKJUMP label: FORK that gives the jumping thread a higher priority.
Matching stops when the thread with the highest priority succeeds.
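A sketch of how these instructions might fit together, in Python. It assumes the thread list is kept in priority order and that CMP takes a set of accepted tokens; the tuple encoding, integer jump targets, and the name run_with_groups are illustrative choices, not the TrexVM source.

```python
def run_with_groups(program, tokens):
    """Return {group: (left, right)} for the best match, or None."""
    end = len(program)

    def add(ip, regs, pos, out, seen):
        # Expand non-CMP instructions; earlier entries in `out`
        # have higher priority. Assumes SAVEL precedes SAVER.
        if ip in seen:
            return
        seen.add(ip)
        if ip == end:
            out.append((ip, regs))
            return
        op, arg = program[ip]
        if op == "JUMP":
            add(arg, regs, pos, out, seen)
        elif op == "FORKSTAY":              # staying thread first
            add(ip + 1, regs, pos, out, seen)
            add(arg, regs, pos, out, seen)
        elif op == "FORKJUMP":              # jumping thread first
            add(arg, regs, pos, out, seen)
            add(ip + 1, regs, pos, out, seen)
        elif op == "SAVEL":
            add(ip + 1, {**regs, arg: (pos, None)}, pos, out, seen)
        elif op == "SAVER":
            add(ip + 1, {**regs, arg: (regs[arg][0], pos)}, pos, out, seen)
        else:                               # CMP: wait for the next token
            out.append((ip, regs))

    threads, best = [], None
    add(0, {}, 0, threads, set())
    for pos in range(len(tokens) + 1):
        next_threads, seen = [], set()
        for ip, regs in threads:
            if ip == end:
                best = regs     # success: drop all lower-priority threads
                break
            if pos < len(tokens) and tokens[pos] in program[ip][1]:
                add(ip + 1, regs, pos + 1, next_threads, seen)
        threads = next_threads
        if not threads:
            break
    return best

# ([ab]+)(b) with instruction indexes in place of labels
GREEDY = [("SAVEL", "fst"), ("CMP", {"a", "b"}), ("FORKJUMP", 1),
          ("SAVER", "fst"), ("SAVEL", "snd"), ("CMP", {"b"}),
          ("SAVER", "snd")]
print(run_with_groups(GREEDY, "aab"))
# {'fst': (0, 2), 'snd': (2, 3)}
```

Each thread carries its own copy of the group registers, so forks never overwrite each other's saved positions.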

SAVEL fst
L1: CMP [ab]
FORKJUMP L1
SAVER fst
SAVEL snd
CMP b
SAVER snd
Program
aab
Input string
↑
TrexVM1.1 sample run
Thread registers
Reg L R
fst
snd
T1
1→

SAVEL fst
L1: CMP [ab]
FORKJUMP L1
SAVER fst
SAVEL snd
CMP b
SAVER snd
Program
aab
Input string
↑
Thread registers
Reg L R
fst 0
snd
T1
1→

SAVEL fst
L1: CMP [ab]
FORKJUMP L1
SAVER fst
SAVEL snd
CMP b
SAVER snd
Program
aab
Input string
↑
Thread registers
Reg L R
fst 0
snd
T1
1→
2→
Reg L R
fst 0
snd
T2

SAVEL fst
L1: CMP [ab]
FORKJUMP L1
SAVER fst
SAVEL snd
CMP b
SAVER snd
Program
aab
Input string
↑
Thread registers
Reg L R
fst 0
snd
T1
1→
2→
Reg L R
fst 0 1
snd
T2

SAVEL fst
L1: CMP [ab]
FORKJUMP L1
SAVER fst
SAVEL snd
CMP b
SAVER snd
Program
aab
Input string
↑
Thread registers
Reg L R
fst 0
snd
T1
1→
2→
Reg L R
fst 0 1
snd 1
T2

SAVEL fst
L1: CMP [ab]
FORKJUMP L1
SAVER fst
SAVEL snd
CMP b
SAVER snd
Program
aab
Input string
↑
Thread registers
Reg L R
fst 0
snd
T1
1→
Reg L R
fst 0 2
snd
T2
2→

SAVEL fst
L1: CMP [ab]
FORKJUMP L1
SAVER fst
SAVEL snd
CMP b
SAVER snd
Program
aab
Input string
↑
Thread registers
Reg L R
fst 0
snd
T1
1→
Reg L R
fst 0 2
snd 2
T2
2→

SAVEL fst
L1: CMP [ab]
FORKJUMP L1
SAVER fst
SAVEL snd
CMP b
SAVER snd
Program
aab
Input string
↑
Thread registers
Reg L R
fst 0
snd
T1
Reg L R
fst 0 2
snd 2
T2
2→
1→

SAVEL fst
L1: CMP [ab]
FORKJUMP L1
SAVER fst
SAVEL snd
CMP b
SAVER snd
Program
aab
Input string
↑
Thread registers
Reg L R
fst 0
snd
T1
Reg L R
fst 0
snd
T2
1→
2→
3→
Reg L R
fst 0 2
snd 2 3
T3

SAVEL fst
L1: CMP [ab]
FORKJUMP L1
SAVER fst
SAVEL snd
CMP b
SAVER snd
Program
aab
Input string
↑
Thread registers
Reg L R
fst 0
snd
T1
Reg L R
fst 0 3
snd
T2
1→
3→
Reg L R
fst 0 2
snd 2 3
T3
2→

SAVEL fst
L1: CMP [ab]
FORKJUMP L1
SAVER fst
SAVEL snd
CMP b
SAVER snd
Program
aab
Input string
↑
Thread registers
Reg L R
fst 0
snd
T1
Reg L R
fst 0 3
snd 3
T2
1→
3→
Reg L R
fst 0 2
snd 2 3
T3
2→

SAVEL fst
L1: CMP [ab]
FORKJUMP L1
SAVER fst
SAVEL snd
CMP b
SAVER snd
Program
aab
Input string
↑
Thread registers
3→
Reg L R
fst 0 2
snd 2 3
T3
Success! Match groups: fst: aa snd: b
Regex: ([ab]+)(b)

SAVEL fst
L1: CMP [ab]
FORKSTAY L1
SAVER fst
SAVEL snd
FORKSTAY L2
CMP b
L2: SAVER snd
Program
ab
Input string
↑
TrexVM1.1 second run
Thread registers
Reg L R
fst
snd
T1
1→

SAVEL fst
L1: CMP [ab]
FORKSTAY L1
SAVER fst
SAVEL snd
FORKSTAY L2
CMP b
L2: SAVER snd
Program
ab
Input string
↑
Thread registers
Reg L R
fst 0
snd
T1
1→

SAVEL fst
L1: CMP [ab]
FORKSTAY L1
SAVER fst
SAVEL snd
FORKSTAY L2
CMP b
L2: SAVER snd
Program
ab
Input string
Thread registers
Reg L R
fst 0
snd
T1
1→
2→ ↑
Reg L R
fst 0
snd
T2

SAVEL fst
L1: CMP [ab]
FORKSTAY L1
SAVER fst
SAVEL snd
FORKSTAY L2
CMP b
L2: SAVER snd
Program
ab
Input string
Thread registers
Reg L R
fst 0 1
snd
T1
1→
2→ ↑
Reg L R
fst 0
snd
T2

SAVEL fst
L1: CMP [ab]
FORKSTAY L1
SAVER fst
SAVEL snd
FORKSTAY L2
CMP b
L2: SAVER snd
Program
ab
Input string
Thread registers
Reg L R
fst 0 1
snd 1
T1
1→
2→ ↑
Reg L R
fst 0
snd
T2

SAVEL fst
L1: CMP [ab]
FORKSTAY L1
SAVER fst
SAVEL snd
FORKSTAY L2
CMP b
L2: SAVER snd
Program
ab
Input string
Thread registers
Reg L R
fst 0 1
snd 1
T1
1→
2→
↑
Reg L R
fst 0 1
snd 1
T2
3→
Reg L R
fst 0
snd
T3

SAVEL fst
L1: CMP [ab]
FORKSTAY L1
SAVER fst
SAVEL snd
FORKSTAY L2
CMP b
L2: SAVER snd
Program
ab
Input string
Thread registers
Reg L R
fst 0 1
snd 1
T1
1→
2→
↑
Reg L R
fst 0 1
snd 1 1
T2

SAVEL fst
L1: CMP [ab]
FORKSTAY L1
SAVER fst
SAVEL snd
FORKSTAY L2
CMP b
L2: SAVER snd
Program
ab
Input string
Thread registers
Reg L R
fst 0 1
snd 1 2
T1
1→
↑

SAVEL fst
L1: CMP [ab]
FORKSTAY L1
SAVER fst
SAVEL snd
FORKSTAY L2
CMP b
L2: SAVER snd
Program
ab
Input string
Thread registers
Reg L R
fst 0 1
snd 1 2
T1
1→
↑
Success! Match groups: fst: a snd: b
Regex: ([ab]+?)(b?)

Concatenation
(cat regex1 regex2 … regexN) compiles to:
#INCLUDE regex1
#INCLUDE regex2
...
#INCLUDE regexN

Quantifiers
(+ regex) compiles to:
#LABEL _LOOP
#INCLUDE regex
FORKJUMP _LOOP
(+? regex) compiles to:
#LABEL _LOOP
#INCLUDE regex
FORKSTAY _LOOP

Quantifiers
(* regex) compiles to:
#LABEL _START
FORKSTAY _END
#INCLUDE regex
JUMP _START
#LABEL _END
(*? regex) compiles to:
#LABEL _START
FORKJUMP _END
#INCLUDE regex
JUMP _START
#LABEL _END

Quantifiers
(? regex) compiles to:
FORKSTAY _SKIP
#INCLUDE regex
#LABEL _SKIP
(?? regex) compiles to:
FORKJUMP _SKIP
#INCLUDE regex
#LABEL _SKIP

Alternation
(| regex1 regex2) compiles to:
FORKSTAY _ALT
#INCLUDE regex1
JUMP _END
#LABEL _ALT
#INCLUDE regex2
#LABEL _END

Match groups
(as group regex) compiles to:
SAVEL name
#INCLUDE regex
SAVER name
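The compilation templates from the last few slides can be turned into a small recursive compiler. The sketch below is written in Python under several assumptions: regexes are nested tuples such as ("+", "a"), plain strings are token literals, and labels are resolved to absolute instruction indexes while compiling. The name compile_rx is hypothetical.

```python
def compile_rx(node, base=0):
    """Compile a nested-tuple regex into a list of VM instructions.

    `base` is the absolute address of the first emitted instruction.
    """
    if not isinstance(node, tuple):          # token literal
        return [("CMP", node)]
    op, *args = node
    if op == "cat":
        code = []
        for sub in args:
            code += compile_rx(sub, base + len(code))
        return code
    if op in ("+", "+?"):
        body = compile_rx(args[0], base)
        fork = "FORKJUMP" if op == "+" else "FORKSTAY"
        return body + [(fork, base)]         # loop back to _LOOP
    if op in ("*", "*?"):
        body = compile_rx(args[0], base + 1)
        fork = "FORKSTAY" if op == "*" else "FORKJUMP"
        end = base + len(body) + 2           # address of _END
        return [(fork, end)] + body + [("JUMP", base)]
    if op in ("?", "??"):
        body = compile_rx(args[0], base + 1)
        fork = "FORKSTAY" if op == "?" else "FORKJUMP"
        return [(fork, base + 1 + len(body))] + body
    if op == "|":
        left = compile_rx(args[0], base + 1)
        alt = base + len(left) + 2           # address of _ALT
        right = compile_rx(args[1], alt)
        return ([("FORKSTAY", alt)] + left
                + [("JUMP", alt + len(right))] + right)
    if op == "as":
        name, sub = args
        body = compile_rx(sub, base + 1)
        return [("SAVEL", name)] + body + [("SAVER", name)]
    raise ValueError(f"unknown operator: {op!r}")

# a?b*c from the earlier quiz slide
for instruction in compile_rx(("cat", ("?", "a"), ("*", "b"), "c")):
    print(instruction)
# ('FORKSTAY', 2)
# ('CMP', 'a')
# ('FORKSTAY', 5)
# ('CMP', 'b')
# ('JUMP', 2)
# ('CMP', 'c')
```

The output has the same shape as the quiz program, with FORKSTAY standing in for the plain FORK of the first VM version.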

Possible CMP arguments
● String (check if token is equal)
● Char-level regex (check if token matches)
● Hashset/map (check if it contains the token)
● Arbitrary predicate (check if token satisfies)
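One way to support all four kinds of CMP argument is to dispatch on the argument's type. The Python sketch below is an illustration only; the function name token_matches and the exact dispatch order are assumptions, and the real implementation (in Clojure and Java) may do this differently.

```python
import re

def token_matches(arg, token):
    """Decide whether a token satisfies a CMP argument."""
    if isinstance(arg, str):                    # string: equality
        return token == arg
    if isinstance(arg, re.Pattern):             # char-level regex
        return arg.fullmatch(token) is not None
    if isinstance(arg, (set, frozenset, dict)): # hashset/map: membership
        return token in arg
    if callable(arg):                           # arbitrary predicate
        return bool(arg(token))
    raise TypeError(f"unsupported CMP argument: {arg!r}")

print(token_matches("Dr.", "Dr."))                      # True
print(token_matches(re.compile(r"\d+"), "221"))         # True
print(token_matches({"Court", "Drive"}, "Street"))      # False
print(token_matches(str.istitle, "Baker"))              # True
```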

TrexVM syntax
Now we can write regular expressions like this:
(as :person
    (| (cat (? is-honorific) (+ first-name-dict) last-name-dict)
       (cat (? is-determiner) profession-dict)))

Extra features
Negative and positive (nested) lookaheads
(cat (+ name-dict) (?! is-stop-word))
Implemented as a sub-virtual machine inside each thread.

Extra features
Start/end of sequence anchors (^ and $)
(cat < is-number (? is-word) >)

Extra features
Cheap anchor-free matching
(cat (*? .) regex…)

Extra features
Loop detection
(rx-find (cat (* (| a (cat a a))) b)
         (repeat 300 a))
=> nil

Extra features
Composability
(def numbered-street-name
  (+ (| is-ordinal-number street-name-dict)))
(def street
  (| (cat (? #"\d+") numbered-street-name street-marker-dict)
     (cat #"\d+" numbered-street-name #{"Count" "Drive"})))

Shortcomings
● No look-behinds
● No backreferences
● Can’t find all matches (only first match)
● Overkill complexity for trivial cases

Naive implementation
● 300 lines of Clojure
● Immutable VM and Thread objects
● Very concise and debuggable code
● Slow!
○ Real-world regex scan takes ~1ms per sentence

Optimizations
Inline caching of compiled regexes.

Bad Java regexp usage
for (String sentence : sentences) {
    // String.matches recompiles the pattern on every iteration.
    sentence.matches("^(?i)([rea]lly* hard+ regex)$");
}

Good Java regexp usage
import java.util.regex.Pattern;

// Compile once, then reuse the Pattern for every sentence.
Pattern p = Pattern.compile("^(?i)([rea]lly* hard+ regex)$");
for (String sentence : sentences) {
    p.matcher(sentence).matches();
}

Clojure: the power of macros
(for [sentence-tokens all-sentences]
(find (trex (+ (? company-name)) ...)
sentence-tokens))

Optimizations
Inline caching of compiled regexes.
3x performance improvement.
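The effect of caching compiled regexes can be approximated at runtime with memoization. The Python sketch below is only an analogy: the talk's version is a Clojure macro that caches at the call site, while this one uses functools.lru_cache, and compile_rule is a hypothetical stand-in for the real compiler.

```python
from functools import lru_cache

compile_count = 0  # how many times real compilation ran

@lru_cache(maxsize=None)
def compile_rule(rule):
    """Stand-in for an expensive regex compiler (hypothetical)."""
    global compile_count
    compile_count += 1
    return ("compiled", rule)

def count_matching(rule, sentences):
    """Look the rule up once per sentence; it is compiled only once."""
    hits = 0
    for tokens in sentences:
        compiled = compile_rule(rule)   # cache hit after the first call
        if compiled[1][1] in tokens:    # toy "match": rule token is present
            hits += 1
    return hits

sentences = [["Acme", "hired", "Bob"], ["Bob", "left"], ["Acme", "grew"]]
print(count_matching(("+", "Acme"), sentences))  # 2
print(compile_count)                             # 1
```

The rule has to be hashable (a tuple here) for the cache lookup to work.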

Optimizations
● Started rewriting parts of the
implementation into Java.

Optimizations
● Made some objects mutable.

Optimizations
● Made everything mutable.
● Made some things immutable again.

Final version
● 300 lines of Java
○ VM completely rewritten in Java.
○ Function definitions, compiler, API (find, matches, …)
● Mix of mutable and immutable objects with copy-on-write fields.
● 20x faster than the initial version.
○ The same regex now takes 50μs per sentence.

Future work
● Improve the performance for trivial cases.
○ More static analysis and optimization in
regex compilation phase.

Future work
● JIT compiler and branch prediction
○ Leverage runtime knowledge about frequently failing CMPs in the regex.

Future work
○ Make it vulnerable to Meltdown/Spectre.
● Investigate if look-behinds are possible
(through hacks and reduced perf)

Conclusions
● Old papers contain great ideas
● Knowledge from university can be useful
● Make it work, then make it fast

References
● Original paper:
https://swtch.com/~rsc/regexp/regexp2.html
● Open-source implementation (in Clojure):
https://github.com/cgrand/seqexp
● Synacor Challenge:
https://challenge.synacor.com
