Differences
This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision | ||
creating:expressions [2025/04/07 00:23] – Wrote tutorial mirroring chapter 5 ahelwer | creating:expressions [2025/05/13 15:58] (current) – Fixed lookahead in set literal parsing ahelwer | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== | + | ======= Parsing |
- | This tutorial page covers the next three chapters in //Crafting Interpreters//: | + | This tutorial page covers the next two chapters in //Crafting Interpreters//: |
- [[https:// | - [[https:// | ||
- [[https:// | - [[https:// | ||
- | - [[https:// | ||
Same as the book, we //could// build a parser for our entire minimal TLA⁺ language subset before moving on to interpreting it, but that would be boring! | Same as the book, we //could// build a parser for our entire minimal TLA⁺ language subset before moving on to interpreting it, but that would be boring! | ||
Line 15: | Line 14: | ||
First read the section of the book, then read the corresponding commentary & modifications given by this tutorial. | First read the section of the book, then read the corresponding commentary & modifications given by this tutorial. | ||
- | ===== Chapter 5: Representing Code ===== | + | ====== Chapter 5: Representing Code ====== |
[[https:// | [[https:// | ||
However, since this tutorial is intended to be self-contained code-wise, all necessary code is reproduced. | However, since this tutorial is intended to be self-contained code-wise, all necessary code is reproduced. | ||
- | ==== Section 5.1: Context-Free Grammars ==== | + | ===== Section 5.1: Context-Free Grammars |
Everything before [[https:// | Everything before [[https:// | ||
The ambiguous first-draft grammar for Lox expressions can be adapted to TLA⁺: | The ambiguous first-draft grammar for Lox expressions can be adapted to TLA⁺: | ||
- | < | + | < |
expression | expression | ||
| unary | | unary | ||
| binary | | binary | ||
| ternary | | ternary | ||
- | | + | |
| grouping ; | | grouping ; | ||
Line 37: | Line 36: | ||
binary | binary | ||
ternary | ternary | ||
- | set | + | variadic |
- | operator | + | operator |
</ | </ | ||
There are a few interesting differences. | There are a few interesting differences. | ||
The '' | The '' | ||
The '' | The '' | ||
- | There is also the '' | ||
The operators are changed to the set of operators defined in our TLA⁺ implementation. | The operators are changed to the set of operators defined in our TLA⁺ implementation. | ||
These are all the expressions we can use without introducing the concept of identifiers referring to something else. | These are all the expressions we can use without introducing the concept of identifiers referring to something else. | ||
- | ==== Section 5.2: Implementing Syntax Trees ==== | + | There is also the '' |
+ | This one is so named because it's where we'll put operators accepting varying numbers of parameters. | ||
+ | It's kind of weird to think of the finite set literal '' | ||
+ | The only difference between '' | ||
+ | This is the perspective of a language implementer. | ||
+ | |||
+ | Note that we //could// also include the conjunction ''/ | ||
+ | |||
+ | ===== Section 5.2: Implementing Syntax Trees ===== | ||
While some programmers might have an aversion to generating code, the approach taken in the book is actually very convenient - as you will discover if you take some time to prototype & play around with different class representations of the parse tree! | While some programmers might have an aversion to generating code, the approach taken in the book is actually very convenient - as you will discover if you take some time to prototype & play around with different class representations of the parse tree! | ||
Line 53: | Line 59: | ||
<code java [highlight_lines_extra=" | <code java [highlight_lines_extra=" | ||
- | package | + | package tool; |
import java.io.IOException; | import java.io.IOException; | ||
Line 73: | Line 79: | ||
" | " | ||
" | " | ||
- | "Set | + | "Variadic |
)); | )); | ||
} | } | ||
Line 88: | Line 94: | ||
PrintWriter writer = new PrintWriter(path, | PrintWriter writer = new PrintWriter(path, | ||
- | writer.println(" | + | writer.println(" |
writer.println(); | writer.println(); | ||
writer.println(" | writer.println(" | ||
Line 138: | Line 144: | ||
- | ==== Section 5.3: Working with Trees ==== | + | ===== Section 5.3: Working with Trees ===== |
[[https:// | [[https:// | ||
Line 195: | Line 201: | ||
Our generated syntax tree node types now support the visitor pattern! | Our generated syntax tree node types now support the visitor pattern! | ||
- | ==== Section 5.4: A (Not Very) Pretty Printer ==== | + | ===== Section 5.4: A (Not Very) Pretty Printer |
[[https:// | [[https:// | ||
Line 201: | Line 207: | ||
<code java [highlight_lines_extra=" | <code java [highlight_lines_extra=" | ||
- | package | + | package tla; |
class AstPrinter implements Expr.Visitor< | class AstPrinter implements Expr.Visitor< | ||
Line 242: | Line 248: | ||
@Override | @Override | ||
- | public string | + | public string |
- | return parenthesize(" | + | return parenthesize(expr.operator.lexeme, |
+ | | ||
} | } | ||
</ | </ | ||
Line 265: | Line 272: | ||
It isn't necessary to define a '' | It isn't necessary to define a '' | ||
+ | |||
+ | ====== Chapter 6: Parsing Expressions ====== | ||
+ | |||
+ | In [[https:// | ||
+ | |||
+ | ===== Section 6.1: Ambiguity and the Parsing Game ===== | ||
+ | |||
+ | First, as in [[https:// | ||
+ | The way that precedence works in TLA⁺ is different from Lox, and indeed different from most other languages! | ||
+ | One side-quote from this section of the book reads: | ||
+ | |||
+ | >While not common these days, some languages specify that certain pairs of operators have no relative precedence. That makes it a syntax error to mix those operators in an expression without using explicit grouping. | ||
+ | |||
+ | > | ||
+ | |||
+ | TLA⁺ has both of these features! | ||
+ | Instead of operators occupying a slot in some hierarchy of precedence, each operator has a precedence //range//. | ||
+ | When the precedence ranges of two operators overlap it is a parse error. | ||
+ | For example, the '' | ||
+ | < | ||
+ | ENABLED x' | ||
+ | </ | ||
+ | Users must add parentheses to indicate their desired grouping in the expression. | ||
+ | Similarly, some operators like '' | ||
+ | Both of these factors combine to make our operator parsing code quite a bit different from that given in the book. | ||
+ | Worry not, it can still be made terse and understandable! | ||
+ | |||
+ | ===== Section 6.2: Recursive Descent Parsing ===== | ||
+ | |||
+ | [[https:// | ||
+ | Recursive descent can seem a bit magical even if you're used to reasoning about recursive functions! | ||
+ | Unfortunately TLA⁺ requires us to push this magic even further. | ||
+ | We'll take it step by step. | ||
+ | |||
+ | We start with the same basic '' | ||
+ | We also add an import for '' | ||
+ | |||
+ | <code java [highlight_lines_extra=" | ||
+ | package tla; | ||
+ | |||
+ | import java.util.List; | ||
+ | import java.util.ArrayList; | ||
+ | |||
+ | import static tla.TokenType.*; | ||
+ | |||
+ | class Parser { | ||
+ | private final List< | ||
+ | private int current = 0; | ||
+ | |||
+ | Parser(List< | ||
+ | this.tokens = tokens; | ||
+ | } | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Now we hit our first major difference. | ||
+ | In Lox, precedence is given by a small hierarchy of named rules like '' | ||
+ | TLA⁺ is more complicated than that. | ||
+ | The full language has around 100 operators spanning precedences 1 to 15! | ||
+ | If we wanted to match the book's style we'd have to write a tower of recursive functions like: | ||
+ | <code java> | ||
+ | private Expr operatorExpressionPrec1() { ... } | ||
+ | private Expr operatorExpressionPrec2() { ... } | ||
+ | private Expr operatorExpressionPrec3() { ... } | ||
+ | ... | ||
+ | private Expr operatorExpressionPrec15() { ... } | ||
+ | </ | ||
+ | where each '' | ||
+ | Life is too short for this. | ||
+ | Instead, we'll adopt a technique alluded to in the text: | ||
+ | >If you wanted to do some clever Java 8, you could create a helper method for parsing a left-associative series of binary operators given a list of token types, and an operand method handle to simplify this redundant code. | ||
+ | |||
+ | Here's the skeleton of our operator parsing function; the trick is to make the precedence a parameter to the function instead of a component of the name. | ||
+ | Add this code after the '' | ||
+ | <code java> | ||
+ | private Expr expression() { | ||
+ | return operatorExpression(1); | ||
+ | } | ||
+ | |||
+ | private Expr operatorExpression(int prec) { | ||
+ | if (prec == 16) return primary(); | ||
+ | |||
+ | Expr expr = operatorExpression(prec + 1); | ||
+ | |||
+ | return expr; | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Before filling out '' | ||
+ | |||
+ | <code java> | ||
+ | private boolean match(TokenType... types) { | ||
+ | for (TokenType type : types) { | ||
+ | if (check(type)) { | ||
+ | advance(); | ||
+ | return true; | ||
+ | } | ||
+ | } | ||
+ | |||
+ | return false; | ||
+ | } | ||
+ | |||
+ | private boolean check(TokenType type) { | ||
+ | if (isAtEnd()) return false; | ||
+ | return peek().type == type; | ||
+ | } | ||
+ | |||
+ | private Token advance() { | ||
+ | if (!isAtEnd()) current++; | ||
+ | return previous(); | ||
+ | } | ||
+ | |||
+ | private boolean isAtEnd() { | ||
+ | return peek().type == EOF; | ||
+ | } | ||
+ | |||
+ | private Token peek() { | ||
+ | return tokens.get(current); | ||
+ | } | ||
+ | |||
+ | private Token previous() { | ||
+ | return tokens.get(current - 1); | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Now we have to define a table of operators with their details. | ||
+ | First use the handy Java 17 [[https:// | ||
+ | Put it at the top of the '' | ||
+ | |||
+ | <code java [highlight_lines_extra=" | ||
+ | class Parser { | ||
+ | private static enum Fix { PREFIX, INFIX, POSTFIX } | ||
+ | private static record Operator(Fix fix, TokenType token, | ||
+ | boolean assoc, int lowPrec, int highPrec) {} | ||
+ | |||
+ | private final List< | ||
+ | </ | ||
+ | |||
+ | You can find operator attributes on page 271 of // | ||
+ | We use a small subset of these operators. | ||
+ | Record their attributes in a table in '' | ||
+ | |||
+ | <code java> | ||
+ | private static final Operator[] operators = new Operator[] { | ||
+ | new Operator(Fix.PREFIX, | ||
+ | new Operator(Fix.PREFIX, | ||
+ | new Operator(Fix.PREFIX, | ||
+ | new Operator(Fix.INFIX, | ||
+ | new Operator(Fix.INFIX, | ||
+ | new Operator(Fix.INFIX, | ||
+ | new Operator(Fix.INFIX, | ||
+ | new Operator(Fix.INFIX, | ||
+ | new Operator(Fix.INFIX, | ||
+ | new Operator(Fix.POSTFIX, | ||
+ | }; | ||
+ | </ | ||
+ | |||
+ | Here's something a bit odd. | ||
+ | In TLA⁺, the infix minus operator (subtraction) has higher precedence at 11-11 than the infix plus operator (addition) at 10-10! | ||
+ | In grade school you probably learned acronyms like PEMDAS or BEDMAS to remember the order of operations in arithmetic. | ||
+ | Really, you learned parsing rules! | ||
+ | Now when writing our own parsing algorithm we come to understand that the order of these operations is not inscribed in the bedrock of mathematical truth but instead is a simple convention of mathematical notation. | ||
+ | TLA⁺ subverts this by parsing the expression '' | ||
+ | While this design decision is unusual, it is [[https:// | ||
+ | |||
+ | We need one more helper method before we can start working on '' | ||
+ | Add this code above '' | ||
+ | <code java> | ||
+ | private Operator matchOp(Fix fix, int prec) { | ||
+ | for (Operator op : operators) { | ||
+ | if (op.fix == fix && op.lowPrec == prec) { | ||
+ | if (match(op.token)) return op; | ||
+ | } | ||
+ | } | ||
+ | |||
+ | return null; | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Okay, we're all set for the main event! | ||
+ | Here is how we modify '' | ||
+ | You can see a strong resemblance to the infix operator parsing code given in the book (new lines highlighted): | ||
+ | <code java [highlight_lines_extra=" | ||
+ | private Expr operatorExpression(int prec) { | ||
+ | if (prec == 16) return primary(); | ||
+ | |||
+ | Operator op; | ||
+ | |||
+ | Expr expr = operatorExpression(prec + 1); | ||
+ | while ((op = matchOp(Fix.INFIX, | ||
+ | Token operator = previous(); | ||
+ | Expr right = operatorExpression(op.highPrec + 1); | ||
+ | expr = new Expr.Binary(expr, | ||
+ | } | ||
+ | |||
+ | return expr; | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | We use Java's combined conditional-assignment atop the '' | ||
+ | The interior of the '' | ||
+ | It takes a bit of thinking to understand why this works for parsing expressions according to the precedence rules we want, but it does. | ||
+ | Take a minute to ponder it. | ||
+ | If you still don't get it, wait until we have a fully-functional expression parser then play around with this line to see how it changes the behavior. | ||
+ | |||
+ | Our code implements precedence ranges, but assumes all infix operators are associative. | ||
+ | We need to modify the loop to return immediately if the infix operator is not associative: | ||
+ | <code java [highlight_lines_extra=" | ||
+ | while ((op = matchOp(Fix.INFIX, | ||
+ | Token operator = previous(); | ||
+ | Expr right = operatorExpression(op.highPrec + 1); | ||
+ | expr = new Expr.Binary(expr, | ||
+ | if (!op.assoc) return expr; | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | And that's how we parse infix operators! | ||
+ | That was the most complicated case, so let's move on to prefix operators. | ||
+ | Again our code resembles the prefix operator parsing logic given in the book. | ||
+ | Add this snippet near the top of '' | ||
+ | |||
+ | <code java [highlight_lines_extra=" | ||
+ | private Expr operatorExpression(int prec) { | ||
+ | if (prec == 16) return primary(); | ||
+ | |||
+ | Operator op; | ||
+ | if ((op = matchOp(Fix.PREFIX, | ||
+ | Token opToken = previous(); | ||
+ | Expr expr = operatorExpression( | ||
+ | op.assoc ? op.lowPrec : op.highPrec + 1); | ||
+ | return new Expr.Unary(opToken, | ||
+ | } | ||
+ | |||
+ | Expr expr = operatorExpression(prec + 1); | ||
+ | </ | ||
+ | |||
+ | Of note is the expression '' | ||
+ | To understand this expression, it is helpful to consider an example. | ||
+ | The negative prefix operator '' | ||
+ | In order to do that after consuming the first '' | ||
+ | Then we will consume the second '' | ||
+ | In contrast, the prefix operator '' | ||
+ | In that case we want to reject the expression '' | ||
+ | Thus the second '' | ||
+ | |||
+ | All that remains is to parse postfix operators. | ||
+ | These are not covered in the book, but they pretty much resemble infix operators without matching an expression on the right-hand side of the operator. | ||
+ | Add the highlighted code below the '' | ||
+ | |||
+ | <code java [highlight_lines_extra=" | ||
+ | } | ||
+ | |||
+ | while ((op = matchOp(Fix.POSTFIX, | ||
+ | Token opToken = previous(); | ||
+ | expr = new Expr.Unary(opToken, | ||
+ | if (!op.assoc) break; | ||
+ | } | ||
+ | |||
+ | return expr | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | And that completes our operator parsing logic! | ||
+ | All that remains is our '' | ||
+ | It is fairly similar to that given in the book, with some omissions since our minimal TLA⁺ subset does not contain strings or null values. | ||
+ | Add this code after '' | ||
+ | |||
+ | <code java> | ||
+ | private Expr primary() { | ||
+ | if (match(FALSE)) return new Expr.Literal(false); | ||
+ | if (match(TRUE)) return new Expr.Literal(true); | ||
+ | |||
+ | if (match(NUMBER)) { | ||
+ | return new Expr.Literal(previous().literal); | ||
+ | } | ||
+ | |||
+ | if (match(LEFT_PAREN)) { | ||
+ | Expr expr = expression(); | ||
+ | consume(RIGHT_PAREN, | ||
+ | return new Expr.Grouping(expr); | ||
+ | } | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | We have two final additions to '' | ||
+ | Here's how we parse '' | ||
+ | |||
+ | <code java [highlight_lines_extra=" | ||
+ | return new Expr.Grouping(expr); | ||
+ | } | ||
+ | |||
+ | if (match(IF)) { | ||
+ | Token operator = previous(); | ||
+ | Expr condition = expression(); | ||
+ | consume(THEN, | ||
+ | Expr yes = expression(); | ||
+ | consume(ELSE, | ||
+ | Expr no = expression(); | ||
+ | return new Expr.Ternary(operator, | ||
+ | } | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | And here's how we parse the set construction operator. | ||
+ | Again add the highlighted code to the bottom of '' | ||
+ | |||
+ | <code java [highlight_lines_extra=" | ||
+ | return new Expr.Ternary(operator, | ||
+ | } | ||
+ | |||
+ | if (match(LEFT_BRACE)) { | ||
+ | Token operator = previous(); | ||
+ | List< | ||
+ | if (!check(RIGHT_BRACE)) { | ||
+ | do { | ||
+ | elements.add(expression()); | ||
+ | } while (match(COMMA)); | ||
+ | } | ||
+ | consume(RIGHT_BRACE, | ||
+ | return new Expr.Variadic(operator, | ||
+ | } | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | This will handle expressions like '' | ||
+ | |||
+ | The '' | ||
+ | |||
+ | ===== Section 6.3: Syntax Errors ===== | ||
+ | |||
+ | We don't need to modify most of the error reporting code given in [[https:// | ||
+ | Here it is; add the '' | ||
+ | |||
+ | <code java> | ||
+ | private Token consume(TokenType type, String message) { | ||
+ | if (check(type)) return advance(); | ||
+ | |||
+ | throw error(peek(), | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Then add the '' | ||
+ | |||
+ | <code java> | ||
+ | private ParseError error(Token token, String message) { | ||
+ | Lox.error(token, | ||
+ | return new ParseError(); | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | In the '' | ||
+ | |||
+ | <code java> | ||
+ | static void error(Token token, String message) { | ||
+ | if (token.type == TokenType.EOF) { | ||
+ | report(token.line, | ||
+ | } else { | ||
+ | report(token.line, | ||
+ | } | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Add the '' | ||
+ | |||
+ | <code java [highlight_lines_extra=" | ||
+ | class Parser { | ||
+ | private static enum Fix { PREFIX, INFIX, POSTFIX } | ||
+ | private static record Operator(Fix fix, TokenType token, | ||
+ | boolean assoc, int lowPrec, int highPrec) {} | ||
+ | private static class ParseError extends RuntimeException {} | ||
+ | |||
+ | private final List< | ||
+ | </ | ||
+ | |||
+ | One piece of code we //do// need to modify is the '' | ||
+ | TLA⁺ doesn' | ||
+ | |||
+ | <code java> | ||
+ | private void synchronize() { | ||
+ | advance(); | ||
+ | |||
+ | while(!isAtEnd()) { | ||
+ | if (previous().type == EQUAL_EQUAL) return; | ||
+ | |||
+ | advance(); | ||
+ | } | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Fully-featured TLA⁺ parsers generally look ahead to identify the start of a new operator definition (not trivial!) and insert a synthetic zero-width token for the parser to consume. | ||
+ | For an example, see SANY's very complicated '' | ||
+ | We won't be doing that here, but it's a good technique to know about; looking ahead then inserting synthetic tokens to disambiguate syntax is a common method when parsing complicated & ambiguous languages like TLA⁺. | ||
+ | |||
+ | ===== Section 6.4: Wiring up the Parser ===== | ||
+ | |||
+ | In [[https:// | ||
+ | In the '' | ||
+ | |||
+ | <code java [highlight_lines_extra=" | ||
+ | List< | ||
+ | |||
+ | Parser parser = new Parser(tokens); | ||
+ | Expr expression = parser.parse(); | ||
+ | |||
+ | // Stop if there was a syntax error. | ||
+ | if (hadError) return; | ||
+ | |||
+ | System.out.println(new AstPrinter().print(expression)); | ||
+ | } | ||
+ | </ | ||
+ | |||
+ | Fire up the interpreter and parse a TLA⁺ expression! | ||
+ | < | ||
+ | > {1 + 2, IF TRUE THEN 3 ELSE 4, {}} | ||
+ | ({ (+ 1 2) (IF true 3 4) ({)) | ||
+ | </ | ||
+ | If you got out of sync, you can find a snapshot of the expected state of the code in [[https:// | ||
+ | Next tutorial: [[creating: | ||
+ | |||
+ | ====== Challenges ====== | ||
+ | |||
+ | Here are some optional challenges to flesh out your TLA⁺ interpreter, | ||
+ | |||
+ | [[creating: | ||