A couple of months ago, I came across a video where the prolific streamer and Netflix dev known online as ThePrimeagen live-coded a simple lexer for an interpreter inspired by the book Crafting Interpreters by Robert Nystrom. If you’ve ever looked into what CPython actually does with your source code, you’ll know that it parses the code into an abstract syntax tree, converts the tree into bytecode instructions (i.e. those *.pyc files usually found in a __pycache__ directory), and runs them on a virtual machine. But before it can do any of that, it has to “lex” the raw text of the file into meaningful tokens like def, True, etc. I was struck by the elegance of ThePrimeagen’s implementation in Rust, and so I set myself a task: how close can I get to the Rust version using Python?1
(Sort of) Emulating Rust Enums in Python
Even a superficial glance at the Rust lexer will give you an appreciation for Rust enums and pattern matching. In Rust, enums can contain data, so enumerating each of the token types that the lexer recognizes is straightforward:
pub enum Token {
    Ident(String),
    Int(String),
    Illegal,
    Eof,
    Assign,
    Bang,
    Dash,
    ...
    Return,
    True,
    False,
}
In Python, it’s similarly eas– uh, oh.
Python enums just don’t work this way. Enum subclass members are expected to be singletons, so we’ll have to use a different abstraction. After some finagling, I landed on namedtuples as a suitable alternative.
from collections import namedtuple

class Token(TokenBase):
    Ident = namedtuple("Ident", ["value"])
    Int = namedtuple("Int", ["value"])
    Illegal = namedtuple("Illegal", [])
    Eof = namedtuple("Eof", [])
    Assign = namedtuple("Assign", [])
    Bang = namedtuple("Bang", [])
    Dash = namedtuple("Dash", [])
    ...
    Return = namedtuple("Return", [])
    True_ = namedtuple("True_", [])
    False_ = namedtuple("False_", [])
Not nearly as pretty, but the advantage is that each member of Token is now a dynamically generated NamedTuple that only has a value attribute if the token type requires it. So identifiers like value in let value = 5; are captured in Token.Ident("value"), but return tokens (Token.Return()) don’t carry around extra baggage.
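To see the difference, here’s a quick REPL check of my own (assuming the Token class above and the TokenBase it inherits from, shown below):
>>> tok = Token.Ident("value")
>>> tok.value
'value'
>>> Token.Return()._fields
()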
What’s TokenBase, you ask? Well, if you instantiate a member of the Token class without the base class, you’ll get something like:
>>> Token.Ident("val")
Ident(value='val')
>>> Token.Dash()
Dash()
Close, but not quite as nice as the way that tokens are displayed in the Rust version: Token.Ident("val"). TokenBase customizes the __repr__s for each of the namedtuples to look more like Rust. (It also modifies __eq__ and __ne__ so that empty namedtuples with different names do not compare equal!) Here’s the ugliness to make that work:
class TokenBase:
    def __init_subclass__(cls) -> None:
        def _token_repr(self):
            name = self.__class__.__name__
            return (
                f"Token.{name}({self.value!r})"
                if hasattr(self, "value")
                else f"Token.{name}"
            )

        def _token_eq(self, other):
            return repr(self) == repr(other)

        def _token_ne(self, other):
            return repr(self) != repr(other)

        # Patch every namedtuple class attribute on the subclass so that its
        # instances print and compare like Rust enum variants.
        for name, method in vars(cls).items():
            if name.startswith("__"):
                continue
            method.__repr__ = _token_repr
            method.__eq__ = _token_eq
            method.__ne__ = _token_ne
The trick is that by using __init_subclass__, the __repr__, __eq__, and __ne__ implementations of the NamedTuple class attributes on Token will be overwritten when the class is imported. This way, Token.Ident(...) will be modified without ever instantiating Token.
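If __init_subclass__ is new to you, the key point is that it runs as soon as a subclass body finishes executing, with no instance required. A minimal standalone sketch (not part of the lexer):
class Base:
    def __init_subclass__(cls) -> None:
        # Runs once, at class-definition time, for every subclass of Base.
        print(f"configuring {cls.__name__}")

class Child(Base):  # prints "configuring Child" the moment this class is defined
    pass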
The end result is that instantiating members of Token will create NamedTuple instances that behave a little bit more like Rust enums.
>>> Token.Ident("val")
Token.Ident('val')
>>> Token.Dash()
Token.Dash
>>> Token.Dash() != Token.Return()
True
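For contrast, here’s the default behaviour that the __eq__ override guards against: empty namedtuples are all just empty tuples under the hood, so they compare equal regardless of their names.
>>> from collections import namedtuple
>>> namedtuple("Plain", [])() == namedtuple("Other", [])()
True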
Shamelessly Copying (Most of) the Lexer
Though Rust is a substantially more complex language than Python and has been designed with completely different tradeoffs in mind, both languages provide similar procedural syntax. What follows is a more or less direct port of the Lexer struct and its methods to peek or advance to the next character in the input text.
from dataclasses import dataclass, field

@dataclass
class Lexer:
    text: str
    position: int = 0
    read_position: int = 0
    ch: str = field(init=False, default="")

    def __post_init__(self):
        self.read_char()

    def __iter__(self):
        while (token := self.next_token()) != Token.Eof():
            yield token
        yield token

    def next_token(self):
        ...

    def peek(self):
        # Look at the next character without consuming it ("" at end of input).
        if self.read_position < len(self.text):
            return self.text[self.read_position]
        return ""

    def read_char(self):
        # Advance the cursor by one character; ch becomes "" at end of input.
        if self.read_position >= len(self.text):
            self.ch = ""
        else:
            self.ch = self.text[self.read_position]
        self.position = self.read_position
        self.read_position += 1

    def skip_whitespace(self):
        while self.ch.isspace():
            self.read_char()

    def read_ident(self):
        pos = self.position
        while self.peek().isalpha() or self.peek() == "_":
            self.read_char()
        return self.text[pos : self.read_position]

    def read_int(self):
        pos = self.position
        while self.peek().isdigit():
            self.read_char()
        return self.text[pos : self.read_position]
The Lexer starts at position 0 and steps through the text each time next_token() is called. If the next token contains multiple characters or skippable whitespace, next_token() will advance the Lexer position to the end of the last token. The read_position will then correspond to the start of the next token (or whitespace).
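To make the bookkeeping concrete, here’s a short trace of my own on a toy input (assuming the full Token and Lexer definitions, including the next_token() shown later):
>>> lexer = Lexer("x == 1")
>>> lexer.next_token()
Token.Ident('x')
>>> lexer.position, lexer.read_position
(1, 2)
>>> lexer.next_token()
Token.Equal
>>> lexer.next_token()
Token.Int('1')
After the identifier is emitted, position sits on the whitespace just past x and read_position points at the first =.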
The read_ident and read_int functions repeatedly peek at the next character in the input text, instructing the lexer to keep reading until it reaches the end of the identifier or integer. The functions record the start and end positions so that they can slice into the input string to return a multi-character token.
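For example, the slicing on its own looks like this (another quick check of my own):
>>> Lexer("foo_bar(").read_ident()
'foo_bar'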
Python makes it easy to create custom iterables, so I added an __iter__ method which makes for token in lexer: print(token) work with no extra sweat!
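Here’s what that looks like end to end (a sketch assuming the complete next_token() implemented below):
>>> for token in Lexer("let five = 5;"):
...     print(token)
...
Token.Let
Token.Ident('five')
Token.Assign
Token.Int('5')
Token.Semicolon
Token.Eof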
Matching match statements
Things get interesting again when trying to implement next_token(). The Rust implementation uses a single match statement to identify each of the tokens defined in the lexer. In the Python version, matching on individual characters ports over directly:
def next_token(self):
    self.skip_whitespace()
    match self.ch:
        case "{":
            tok = Token.LSquirly()
        case "}":
            tok = Token.RSquirly()
        case "(":
            tok = Token.Lparen()
        case ")":
            tok = Token.Rparen()
        ...
In some cases, such as Token.Bang and Token.NotEqual, the lexer peeks at the next character to see which token to emit. This ports over nicely as well:
case "!":
if self.peek() == "=":
self.read_char()
tok = Token.NotEqual()
else:
tok = Token.Bang()
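A quick sanity check of that one-character lookahead (my own example, run against the finished lexer):
>>> list(Lexer("!x != y"))
[Token.Bang, Token.Ident('x'), Token.NotEqual, Token.Ident('y'), Token.Eof]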
The case for finding identifiers is particularly elegant in the Rust version: the following excerpt matches any alphabetic character or underscore to find identifiers, and then uses a nested match statement to pick out reserved keywords.
match self.ch {
    ...,
    b'a'..=b'z' | b'A'..=b'Z' | b'_' => {
        let ident = self.read_ident();
        return Ok(match ident.as_str() {
            "fn" => Token::Function,
            "let" => Token::Let,
            "if" => Token::If,
            "false" => Token::False,
            "true" => Token::True,
            "return" => Token::Return,
            "else" => Token::Else,
            _ => Token::Ident(ident),
        });
    },
    ...
}
The ..= is a convenient syntax for creating inclusive ranges that in this case span either the uppercase or lowercase alphabet. Python does not have a direct equivalent, but according to PEP 636, it does have match conditionals! Here’s what the equivalent looks like in Python:
case t if t.isalpha() or t == "_":
    match self.read_ident():
        case "fn":
            tok = Token.Function()
        ...
        case _ as val:
            tok = Token.Ident(val)
case t if t.isdigit():
    tok = Token.Int(self.read_int())
Incidentally, this version supports unicode alphabetic characters as well!
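For instance (a tiny check of my own; non-ASCII identifiers aren’t part of the original test suite):
>>> list(Lexer("let π = 3;"))
[Token.Let, Token.Ident('π'), Token.Assign, Token.Int('3'), Token.Semicolon, Token.Eof]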
Putting it all together, we get:
def next_token(self):
    self.skip_whitespace()
    match self.ch:
        case "{":
            tok = Token.LSquirly()
        case "}":
            tok = Token.RSquirly()
        case "(":
            tok = Token.Lparen()
        case ")":
            tok = Token.Rparen()
        case ",":
            tok = Token.Comma()
        case ";":
            tok = Token.Semicolon()
        case "+":
            tok = Token.Plus()
        case "-":
            tok = Token.Dash()
        case "!":
            # Two-character tokens need one character of lookahead.
            if self.peek() == "=":
                self.read_char()
                tok = Token.NotEqual()
            else:
                tok = Token.Bang()
        case ">":
            tok = Token.GreaterThan()
        case "<":
            tok = Token.LessThan()
        case "*":
            tok = Token.Asterisk()
        case "/":
            tok = Token.ForwardSlash()
        case "=":
            if self.peek() == "=":
                self.read_char()
                tok = Token.Equal()
            else:
                tok = Token.Assign()
        case t if t.isalpha() or t == "_":
            # Keywords are picked out of identifiers with a nested match.
            match self.read_ident():
                case "fn":
                    tok = Token.Function()
                case "let":
                    tok = Token.Let()
                case "if":
                    tok = Token.If()
                case "false":
                    tok = Token.False_()
                case "true":
                    tok = Token.True_()
                case "return":
                    tok = Token.Return()
                case "else":
                    tok = Token.Else()
                case _ as val:
                    tok = Token.Ident(val)
        case t if t.isdigit():
            tok = Token.Int(self.read_int())
        case "":
            tok = Token.Eof()
        case _:
            tok = Token.Illegal()
    self.read_char()
    return tok
Not bad!
Checking our work
The last task is to reimplement the tests. Here’s the smaller of the two tests in the reference, taking advantage of the __iter__ implementation that we added to the lexer:
def test_next_small():
    text = "=+(){},;"
    lexer = Lexer(text)
    tokens = [
        Token.Assign(),
        Token.Plus(),
        Token.Lparen(),
        Token.Rparen(),
        Token.LSquirly(),
        Token.RSquirly(),
        Token.Comma(),
        Token.Semicolon(),
    ]
    for exp, res in zip(tokens, lexer):
        print(f"expected: {exp}, received {res}")
        assert exp == res
>>> test_next_small()
expected: Token.Assign, received Token.Assign
expected: Token.Plus, received Token.Plus
expected: Token.Lparen, received Token.Lparen
expected: Token.Rparen, received Token.Rparen
expected: Token.LSquirly, received Token.LSquirly
expected: Token.RSquirly, received Token.RSquirly
expected: Token.Comma, received Token.Comma
expected: Token.Semicolon, received Token.Semicolon
As usual, the full source for this post is available here.
1. Many contributors submitted equivalent lexers in a variety of languages for comparison, including one in Python that even includes a parser.