RLox - Interpreting Lox with Rust

5/26/2020

Lox is a dynamically-typed scripting language designed by Bob Nystrom for use in his book, Crafting Interpreters. The book is an excellent introduction to programming language design and implementation. It's free and extremely well-written. In it, Nystrom walks readers through building two Lox interpreters. The first, JLox, is a Tree-Walk interpreter written in Java. The second, CLox, is a more performant Single-Pass Bytecode Compiler + Stack-based VM written in C.

After working my way through the book, I decided to implement a third version of the interpreter in Rust. My implementation, imaginatively named RLox, is a hybrid of JLox and CLox. The parser builds an AST, then a compiler turns that AST into bytecode that is executed on a Stack-based VM similar to that of CLox. In the following sections, I will provide a brief overview of RLox, with a focus on the places where it differs the most from RLox and CLox. Along the way, I'll highlight some things I enjoyed about programming in Rust.

If you just want to check out the RLox source code, it is available here.

The Scanner

pub struct Scanner<'a> {
/// An Iterator over the source
characters: Chars<'a>,

/// The source that makes up the current Token
current: String,

/// The index in source where current begins
current_start_index: usize,
}

The RLox scanner, like the CLox scanner, does scanning 'on-demand' instead of loading every token from a source file into a collection. It is implemented as a single struct that exposes a new() and a next() method. The new() method accepts a &str and initializes a new Scanner struct that owns an iterator over the characters in the source &str. The next() method then uses that iterator to return the next token in the source file. Instead of returning a special end-of-file token, next() returns None once there are no more tokens.

Having a next() implementation means that Scanner has Rust's Iterator trait. That means Rust provides implementations for a whole bunch of other potentially useful methods for free. For example, if we want to store all of the tokens in a collection instead of a scanning on-demand, we could use the collect() method on Scanner, without having to implement it.

pub struct Token {
pub span: Span,
pub kind: Kind,
}

pub struct Span {
pub start: usize,
pub end: usize,
}

pub enum Kind {
LeftBrace,
NumberLiteral(f64),
...
}

A Token is represented by a Kind and a Span. Kind is an enum with values like Kind::LeftBrace, Kind::LessEqual, and Kind::NumberLiteral. Rust enums variants can each hold their own state, so Kind::NumberLiteral holds the numeric value of the literal it represents. Span represents a range in the source file with a start and end index. My Span is a simplified version of the Rust Standard library Span type. Each Token is tagged with a Span so that later stages in the interpreter pipeline can output errors that indicate where in the source file the problem originated. RLox can output nicer errors than CLox or JLox, which only track the line number for each token.

The Parser

pub struct Parser<'a> {
scanner: Peekable<Scanner<'a>>,
}

The RLox Parser is a struct that owns a Peekable<Scanner>. The Peekable part is a Rust type that wraps any iterator and adds a peek() method to it. This is all the state we need to implement a recursive descent parser for the Lox's LL(1) grammar. Like the JLox parser, the RLox parser takes a source string and returns an abstract syntax tree (AST) representing the program described by the source string. There is no compilation in the parser module.

pub struct SpannedAstNode {
pub span: Span,
pub node: Option<AstNode>,
}

pub enum AstNode {
Unary {
operator: Token,
expression: Box<SpannedAstNode>,
},
Binary {
left: Box<SpannedAstNode>,
operator: Token,
right: Box<SpannedAstNode>,
},
...
}

A SpannedAstNode is a lot like a Token - it owns some data and a Span that tells us where that data came from. In this case, though, the data is a node in an AST. Rust's Box<T> type is like a pointer to T. They're necessary here so that each AstNode variant is a fixed size, known at compile time.

As in most recursive descent parsers, each production in the Lox Grammar gets its own method in the RLox parser. Each of these methods return a Result<SpannedAstNode, ParsingError>. If there is a syntax error, then the Err variant is returned with a relevant message and Span. In most of the parsing methods, these errors are propagated (using the convenient ? operator) up the call chain. The top-level parsing method, which recognizes a sequence of declaration productions, handles the errors by adding them to a list of all detected errors, calling a synchronize method, and then attempting to parse more of the program so that any remaining errors can be detected as well.

The Compiler

impl Compiler {
fn compile_node(
&mut self,
bin: &mut Executable,
spanned_node: &SpannedAstNode,
) -> Result<(), CompilerError> {
...
}
}

If the source is syntactically valid, then the full AST is passed to the Compiler. Most of the work is done in the compile_node() method, which converts the AST into executable OpCodes, which it appends to the provided Executable. There are a few intermediate methods that wraps the Executable in an ObjFunction and an ObjClosure to simplify later stages of the interpreter pipeline.

pub enum OpCode {
Constant(usize),
Return,
Add,
...
}

Unlike CLox OpCodes, RLox OpCodes are not true bytecodes - they are Rust enum variants, each of which are ~9 Bytes. I sacrificed some memory usage and portability to embedded devices in favor of simplicity.

Also unlike CLox, RLox does not aspire to be a single-pass compiler. This generally makes compilation easier, because more information is available during compilation. This is most apparent when compiling a for-loop. CLox generates some intricate jumps when compiling a for-loop because it must compile statements as it sees them. RLox has full access to every statement making up the for-loop declaration at compile time and can compile them each individually, whenever it wants. This results in Bytecode that is more straightforward and identical to the bytecode generated by an equivalent while loop (cf. the following loops each result in the same bytecode).

{
for (var i = 0; i < 5; i = i + 1) {
print i;
}

var i = 0;
while (i < 5) {
print i;
i = i + 1;
}
}

Index OpCode Arguments
------------------------------------
00000 Constant 0[Number(0)]
00001 GetLocal 0
00002 Constant 1[Number(5)]
00003 Less
00004 JumpIfFalse 14
00005 Pop
00006 GetLocal 0
00007 Print
00008 GetLocal 0
00009 Constant 2[Number(1)]
00010 Add
00011 SetLocal 0
00012 Pop
00013 Jump 1
00014 Pop
00015 Pop

The VM

pub struct VM {
ip: usize,
base: usize,
stack: Vec<Value>,
globals: HashMap<String, Value>,
}

The RLox VM is shown above. ip is an index into an Executable list of OpCodes, base is an index into the stack of Values that points to the first Value related to the current function, and globals holds all of the currently defined global variables.

impl VM {
pub fn execute<W: Write>(
&mut self,
closure: &ObjClosure,
output_stream: &mut W,
) -> Result<(), RuntimeError> {
...
}
}

The most important method on VM is execute(), which accepts a closure to execute and a Write stream to which any program output is written. The output stream is accepted as a parameter to enable end-to-end language tests that make assertions about program output. The body of execute() is simply a while loop that reads an OpCode from the provided ObjClosure (which holds an ObjFunction, which holds an Executable), matches it to some action, and takes that action.

Memory

pub enum Value {
Number(f64),
Bool(bool),
Nil,
Function(Rc<ObjFunction>),
Closure(Rc<ObjClosure>),
String(Rc<ObjString>),
Class(Rc<ObjClass>),
Instance(Rc<ObjInstance>),
BoundMethod(Rc<ObjBoundMethod>),
}

The VM is stack-based. It has no registers and all of the operations are done on Values that live on the stack. Value is defined as yet another Rust enum. Values hold either an immediate value (Number, Bool, Nil), or a reference-counted pointer to some memory that lives on the heap. The Rc pointers provide garbage collection (unfortunately susceptible to reference cycles) and enable cloning Values whenever necessary.

Any ObjX type is a struct that lives on the heap. Each of these types implements the fmt::Debug and fmt::Display traits so that they can be printed. Some of these types, such as ObjClosure and ObjInstance require mutabilty. Since the VM can only access them through Rc pointers, they include RefCell type fields so that they can be borrowed mutably at runtime. Unfortunately, this means that we get slightly weaker compile-time guarantees.

Conclusion

RLox was my first Rust project, and I learned a lot about the language while working with it. It's certainly frustrating at times, but I attribute that more to my own inexperience than to Rust itself. After implementing a hashtable and a dynamically sized array from scratch while implemented CLox, Rust's standard library types were a breath of fresh air. The trait system is extremely powerful, once you get used to it. And I never had to deal with a segfault.

I've become really interested in language design and implementation in the last year or so, and building these three interpreters was a valuable experience. Again, I strongly recommend Crafting Interpreters, it's approachable, well-written, and includes just enough thoery to learn something new while doing something practical. Lox itself is a straightforward language, but the inclusion of classes and closures made it a challenge beyond the simple imperative languages often found in university classes or other online tutorials.

If you want to check out the RLox source code, it is available here.