Parsing TLA⁺ Syntax
TLA⁺ is a large, complicated, and ambiguous language that takes a huge amount of work to parse correctly & completely. That work is made tractable by a comprehensive test corpus exercising the language's nooks & crannies, but here we will make life easy on ourselves. This tutorial only uses a minimal subset of TLA⁺ syntax, just enough to handle the generalized DieHard spec with some inconvenience. While that may seem limiting, this tutorial tries to focus on the difficult & interesting parts of bringing TLA⁺ to life instead of more mundane drudgework like handling all hundred-some user-definable operator symbols. You are encouraged to extend this minimal core as you wish; language tooling is best developed incrementally! Slowly filling in the details of this rough language sketch has a satisfying meditative aspect.
Here is what our minimal language subset includes:
- Named operator definitions like `op == ...` and `op(x, y, z) == ...`
- Single-line comments like `\* comment text`
- Declaration of `VARIABLES`
- Finite set literals, like `{1, 2, 3, 4, 5}`
- Quantification over sets with `\E` and `\A`
- Function construction and application, like `[e \in S |-> op(e)]` and `f[e]`
- The `ENABLED` and logical negation (`~`) prefix operators
- Some infix operators: `\in`, `=`, `+`, `-`, `..`, and `<`
- The variable-priming suffix operator
- Vertically-aligned conjunction & disjunction lists
- The `IF`/`THEN`/`ELSE` construct
- Whole natural numbers like `0`, `5`, `10`, etc.
- Boolean values `TRUE` and `FALSE`
Notably, we do not use the familiar `---- MODULE Name ----`/`====` encapsulation.
Files simply consist of a series of unit-level TLA⁺ definitions.
We also do not leave room for a separate model configuration file; all `CONSTANT` values must instead be hardcoded, and well-known names like `Init`, `Next`, `TypeOK`, and `Inv` identify important definitions.
As outlined above, you are free to add missing features (or even your own invented features!) as you wish.
Preparation
Read part I (the first three chapters) of the free online textbook Crafting Interpreters by Robert Nystrom. We will be closely following the material in this book, modifying it for our purposes. The first two chapters are a nice introduction and overview of language implementation generally. Chapter three specifies a toy language called Lox, to be used as an object of study for the remainder of the book. Our minimal TLA⁺ subset has some similarity to Lox with regard to expressions involving integers and booleans, but also differences - you can skip the section on closures (unless you want to implement higher-order operator parameters yourself) and the section on classes. What's important is that it's similar enough for all the fundamentals to still apply!
Code snippets are shown often in this tutorial. When a snippet is similar to code given in the book with the exception of a few lines, the differing lines will be highlighted. Highlighting is also used to identify new code being added to a large section of code given previously, with surrounding lines included for context. Sometimes the code given here is so different from that provided by the book that highlighting is simply omitted to reduce visual noise. A parenthetical will specify the meaning of highlighted code when ambiguous.
Section 4.1: The Interpreter Framework
Part II: A Tree-Walk Interpreter is where we start to merge with and modify the given code.
Everything in section 4.1 can be left unchanged from what's given, although of course it makes sense to change some of the names from "Lox" to "TlaPlus" or similar.
You should thus have followed along and arrived at something similar to the below file `TlaPlus.java`:
```java
package com.craftinginterpreters.tla;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class TlaPlus {
  static boolean hadError = false;

  public static void main(String[] args) throws IOException {
    if (args.length > 1) {
      System.out.println("Usage: tlaplus [script]");
      System.exit(64);
    } else if (args.length == 1) {
      runFile(args[0]);
    } else {
      runPrompt();
    }
  }

  private static void runFile(String path) throws IOException {
    byte[] bytes = Files.readAllBytes(Paths.get(path));
    run(new String(bytes, Charset.defaultCharset()));

    // Indicate an error in the exit code.
    if (hadError) System.exit(65);
  }

  private static void runPrompt() throws IOException {
    InputStreamReader input = new InputStreamReader(System.in);
    BufferedReader reader = new BufferedReader(input);

    for (;;) {
      System.out.print("> ");
      String line = reader.readLine();
      if (line == null) break;
      run(line);
      hadError = false;
    }
  }

  private static void run(String source) {
    Scanner scanner = new Scanner(source);
    List<Token> tokens = scanner.scanTokens();

    // For now, just print the tokens.
    for (Token token : tokens) {
      System.out.println(token);
    }
  }

  static void error(int line, String message) {
    report(line, "", message);
  }

  private static void report(int line, String where, String message) {
    System.err.println(
        "[line " + line + "] Error" + where + ": " + message);
    hadError = true;
  }
}
```
Section 4.2: Lexemes and Tokens
The `TokenType` class in section 4.2 is our first major departure from the book.
Instead of Lox tokens, we use the atomic components of our minimal TLA⁺ language subset defined above.
Adapting the snippet in section 4.2.1 we get:
```java
package com.craftinginterpreters.tla;

enum TokenType {
  // Single-character tokens.
  LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE,
  LEFT_BRACKET, RIGHT_BRACKET, COMMA, COLON,
  MINUS, PLUS, LESS_THAN, NEGATION, PRIME,

  // Short fixed-length tokens.
  EQUAL, EQUAL_EQUAL, AND, OR, DOT_DOT, ALL_MAP_TO,

  // Literals.
  IDENTIFIER, NAT_NUMBER, TRUE, FALSE,

  // Keywords.
  VARIABLES, ENABLED, IF, THEN, ELSE,

  // Symbols.
  IN, EXISTS, FOR_ALL,

  EOF
}
```
There is a very minor design decision here, of the type encountered innumerable times when writing a parser.
TLA⁺ includes boolean values `TRUE` and `FALSE`, but we don't need to give them their own token types; they can be parsed as ordinary `IDENTIFIER` tokens then later resolved to the built-in operators `TRUE` and `FALSE`.
However, since we are writing our interpreter in Java and Java has primitive boolean literals, it is convenient to add specialized tokens for `TRUE` and `FALSE` so we can resolve them to primitive Java values on the spot.
The existing TLA⁺ tools do not take this approach, and instead later resolve the identifiers `TRUE` and `FALSE` from a set of built-in operators.
This sort of works-either-way design dilemma occurs often, and we will see it again when trying to decide whether to disallow a snippet of invalid TLA⁺ at the syntactic or semantic level.
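To make the first option concrete, here is a small standalone sketch of resolving the boolean keywords to primitive Java values at scan time. The class and method names below are our own invention for illustration, not code from the book or the tutorial:

```java
// Sketch of "resolve on the spot": map a boolean lexeme directly to a
// primitive Java value while scanning, instead of resolving the name
// later against a set of built-in operators.
public class BooleanResolution {
  // Returns the Java literal for a boolean keyword, or null if the
  // lexeme should instead become an ordinary IDENTIFIER token.
  static Object literalFor(String lexeme) {
    switch (lexeme) {
      case "TRUE":  return Boolean.TRUE;
      case "FALSE": return Boolean.FALSE;
      default:      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(literalFor("TRUE"));  // true
    System.out.println(literalFor("Init"));  // null
  }
}
```

Either design works; this one simply front-loads the decision into the scanner so the later evaluation stages never see `TRUE` and `FALSE` as names to look up.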
In section 4.2.3, we make one very critical addition to the information about each token tracked in the `Token` class: the token's start column.
This will come in useful when parsing vertically-aligned conjunction & disjunction lists.
```java
package com.craftinginterpreters.tla;

class Token {
  final TokenType type;
  final String lexeme;
  final Object literal;
  final int line;
  final int column;

  Token(TokenType type, String lexeme, Object literal, int line, int column) {
    this.type = type;
    this.lexeme = lexeme;
    this.literal = literal;
    this.line = line;
    this.column = column;
  }

  public String toString() {
    return type + " " + lexeme + " " + literal;
  }
}
```
Section 4.4: The Scanner Class
We now move on to the very important `Scanner` class in section 4.4.
Our first modification to the code given in the book is to track the column in addition to the line, mirroring our addition to the `Token` class:
```java
package com.craftinginterpreters.tla;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import static com.craftinginterpreters.tla.TokenType.*;

class Scanner {
  private final String source;
  private final List<Token> tokens = new ArrayList<>();
  private int start = 0;
  private int current = 0;
  private int line = 1;
  private int column = 0;

  Scanner(String source) {
    this.source = source;
  }

  List<Token> scanTokens() {
    while (!isAtEnd()) {
      // We are at the beginning of the next lexeme.
      start = current;
      scanToken();
    }

    tokens.add(new Token(EOF, "", null, line, column));
    return tokens;
  }

  private boolean isAtEnd() {
    return current >= source.length();
  }
}
```
Section 4.5: Recognizing Lexemes
Section 4.5 finally gets around to actually handling text input!
Of course, our unambiguous single-character lexing code in `scanToken` has some differences (including the removal of Lox tokens `DOT`, `SEMICOLON`, and `STAR`).
Note also that we increment the column position in `advance()`, and there is some index arithmetic necessary when adding a token, as we must calculate the token's start column:
```java
  private void scanToken() {
    char c = advance();
    switch (c) {
      case '(': addToken(LEFT_PAREN); break;
      case ')': addToken(RIGHT_PAREN); break;
      case '{': addToken(LEFT_BRACE); break;
      case '}': addToken(RIGHT_BRACE); break;
      case '[': addToken(LEFT_BRACKET); break;
      case ']': addToken(RIGHT_BRACKET); break;
      case ',': addToken(COMMA); break;
      case ':': addToken(COLON); break;
      case '-': addToken(MINUS); break;
      case '+': addToken(PLUS); break;
      case '<': addToken(LESS_THAN); break;
      case '~': addToken(NEGATION); break;
      case '\'': addToken(PRIME); break;
      default:
        TlaPlus.error(line, "Unexpected character.");
        break;
    }
  }

  private char advance() {
    column++;
    return source.charAt(current++);
  }

  private void addToken(TokenType type) {
    addToken(type, null);
  }

  private void addToken(TokenType type, Object literal) {
    String text = source.substring(start, current);
    int start_column = column - (current - start);
    tokens.add(new Token(type, text, literal, line, start_column));
  }
```
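To convince yourself the start-column arithmetic is right: once a token has been fully scanned, `column` sits just past the token's last character, and `current - start` is the token's length, so subtracting recovers where it began. A tiny standalone check (the class and method names here are our own, for illustration only):

```java
// Standalone check of the start-column arithmetic used in addToken():
// after scanning, column points one past the token's last character,
// and (current - start) is the token's length in characters.
public class ColumnArithmetic {
  static int startColumn(int column, int start, int current) {
    return column - (current - start);
  }

  public static void main(String[] args) {
    // Scanning "==" in the line "op == 1": the token occupies source
    // indices 3..4, and column has advanced to 5 once both characters
    // are consumed, so the token started at (0-based) column 3.
    System.out.println(startColumn(5, 3, 5));  // 3
  }
}
```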
Moving on to lexing short fixed-length tokens, the differences widen considerably; we want to lex `=`, `==`, `/\`, `\/`, `..`, and `|->` (new code highlighted):
```java
      case '\'': addToken(PRIME); break;
      case '=': addToken(match('=') ? EQUAL_EQUAL : EQUAL); break;
      case '\\':
        if (match('/')) {
          addToken(OR);
        } else {
          TlaPlus.error(line, "Unexpected character.");
        }
        break;
      case '/': if (consume('\\')) addToken(AND); break;
      case '.': if (consume('.')) addToken(DOT_DOT); break;
      case '|': if (consume('-') && consume('>')) addToken(ALL_MAP_TO); break;
      default:
        TlaPlus.error(line, "Unexpected character.");
        break;
```
Two helper methods are used: the first, `match(char)`, is given in the text.
We have to modify it slightly by updating the value of `column` if `current` advances.
The second helper method is of our own invention.
`consume(char)` is similar to `match(char)` but logs an error if the expected character is not found.
You might be wondering why we didn't write the `\` case using `consume(char)`; that's because we will very shortly be adding to it.
```java
  private boolean match(char expected) {
    if (isAtEnd()) return false;
    if (source.charAt(current) != expected) return false;

    column++;
    current++;
    return true;
  }

  private boolean consume(char expected) {
    if (match(expected)) return true;
    else TlaPlus.error(line, "Unexpected character.");
    return false;
  }
```
Section 4.6: Longer Lexemes
In section 4.6 we learn how to handle lexemes of longer or unlimited length.
Let's begin with single-line comments, which in TLA⁺ start with `\*` (new code highlighted):
```java
      case '\\':
        if (match('/')) {
          addToken(OR);
        } else if (match('*')) {
          // A comment goes until the end of the line.
          while (peek() != '\n' && !isAtEnd()) advance();
        } else {
          TlaPlus.error(line, "Unexpected character.");
        }
        break;
```
This uses the `peek()` helper given in the book:
```java
  private char peek() {
    if (isAtEnd()) return '\0';
    return source.charAt(current);
  }
```
Also add the code to skip whitespace:
```java
        break;
      case ' ':
      case '\r':
      case '\t':
        // Ignore whitespace.
        break;

      case '\n':
        line++;
        break;

      default:
        TlaPlus.error(line, "Unexpected character.");
        break;
```
Critically, we have to set `column` to zero when processing a newline.
This is what makes column tracking work!
```java
      case '\n':
        column = 0;
        line++;
        break;
```
We aren't supporting strings (unless you want to go the extra mile) so next up is numbers:
```java
      default:
        if (isDigit(c)) {
          number();
        } else {
          TlaPlus.error(line, "Unexpected character.");
        }
        break;
```
This uses the `isDigit(char)` and `number()` helpers.
We're only handling whole natural numbers, no decimals, so our `number()` method is much simpler than the one from the book:
```java
  private boolean isDigit(char c) {
    return c >= '0' && c <= '9';
  }

  private void number() {
    while (isDigit(peek())) advance();
    addToken(NAT_NUMBER,
        Integer.parseInt(source.substring(start, current)));
  }
```
Section 4.7: Reserved Words and Identifiers
In section 4.7 we learn to lex general identifiers; start by modifying the `default` case of the `scanToken()` method:
```java
      default:
        if (isDigit(c)) {
          number();
        } else if (isAlpha(c)) {
          identifier();
        } else {
          TlaPlus.error(line, "Unexpected character.");
        }
```
Then include these helper methods:
```java
  private void identifier() {
    while (isAlphaNumeric(peek())) advance();
    addToken(IDENTIFIER);
  }

  private boolean isAlpha(char c) {
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
            c == '_';
  }

  private boolean isAlphaNumeric(char c) {
    return isAlpha(c) || isDigit(c);
  }
```
Note that this is close to TLA⁺ identifier rules, but not quite correct.
TLA⁺ identifiers can be any sequence of alphanumeric ASCII characters including underscore, as long as at least one character in the sequence is alphabetical (`[a-zA-Z]`).
You are encouraged to modify your code to match this constraint, but for simplicity we'll stick with the close-enough identifier logic given in the book.
Fixing it requires modifying the number lexing logic from the previous section.
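If you do want to attempt the fix, one approach is to lex an entire maximal run of alphanumeric/underscore characters first and classify it afterwards: all digits means a number, at least one letter means an identifier, and anything else (like `1_2`) is an error. The following is a standalone sketch of that classification under our own naming, not code from the book:

```java
// Sketch of TLA+-correct classification of a maximal [0-9A-Za-z_] run:
// NAT_NUMBER if all digits, IDENTIFIER if it contains at least one
// letter, otherwise invalid. Class and method names are our own.
public class TlaIdentifierRules {
  enum Kind { NAT_NUMBER, IDENTIFIER, INVALID }

  static Kind classify(String run) {
    boolean hasLetter = false;
    boolean allDigits = !run.isEmpty();
    for (char c : run.toCharArray()) {
      boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
      boolean digit = c >= '0' && c <= '9';
      if (letter) hasLetter = true;
      if (!digit) allDigits = false;
      if (!letter && !digit && c != '_') return Kind.INVALID;
    }
    if (allDigits) return Kind.NAT_NUMBER;
    return hasLetter ? Kind.IDENTIFIER : Kind.INVALID;
  }

  public static void main(String[] args) {
    System.out.println(classify("123"));    // NAT_NUMBER
    System.out.println(classify("x1_y2"));  // IDENTIFIER
    System.out.println(classify("1abc"));   // IDENTIFIER (valid in TLA+!)
    System.out.println(classify("1_2"));    // INVALID: no letter
  }
}
```

Note the perhaps-surprising consequence that a name like `1abc` is a legal TLA⁺ identifier, which is exactly why the number and identifier lexing paths have to be merged.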
Some valid identifiers are actually keywords and should be properly tokenized as such. So, define a keyword map:
```java
  private static final Map<String, TokenType> keywords;

  static {
    keywords = new HashMap<>();
    keywords.put("ELSE",      ELSE);
    keywords.put("ENABLED",   ENABLED);
    keywords.put("FALSE",     FALSE);
    keywords.put("IF",        IF);
    keywords.put("THEN",      THEN);
    keywords.put("TRUE",      TRUE);
    keywords.put("VARIABLE",  VARIABLES);
    keywords.put("VARIABLES", VARIABLES);
  }
```
Then update our implementation of `identifier()` to first check whether an identifier is a keyword:
```java
  private void identifier() {
    while (isAlphaNumeric(peek())) advance();

    String text = source.substring(start, current);
    TokenType type = keywords.get(text);
    if (type == null) type = IDENTIFIER;
    addToken(type);
  }
```
We're nearly done!
All we need to do is repeat the keyword map method for the LaTeX-style symbols `\in`, `\E`, and `\A`.
Start by adding yet another branch to the `\` case (it's getting quite crowded!):
```java
      case '\\':
        if (match('/')) {
          addToken(OR);
        } else if (match('*')) {
          // A comment goes until the end of the line.
          while (peek() != '\n' && !isAtEnd()) advance();
        } else if (isAlpha(peek())) {
          symbol();
        } else {
          TlaPlus.error(line, "Unexpected character.");
        }
        break;
```
Then define the symbol map and `symbol()` helper:
```java
  private static final Map<String, TokenType> symbols;

  static {
    symbols = new HashMap<>();
    symbols.put("\\in",     IN);
    symbols.put("\\E",      EXISTS);
    symbols.put("\\exists", EXISTS);
    symbols.put("\\A",      FOR_ALL);
    symbols.put("\\forall", FOR_ALL);
  }

  private void symbol() {
    while (isAlpha(peek())) advance();

    String text = source.substring(start, current);
    TokenType type = symbols.get(text);
    if (type == null) TlaPlus.error(line, "Unexpected character.");
    else addToken(type);
  }
```
We are now done lexing our minimal TLA⁺ subset. Give it a whirl! Running your Java program should bring up the REPL, where you can enter input to see how it is lexed; for example:
```
> \A e \in S : P(e)
FOR_ALL \A null
IDENTIFIER e null
IN \in null
IDENTIFIER S null
COLON : null
IDENTIFIER P null
LEFT_PAREN ( null
IDENTIFIER e null
RIGHT_PAREN ) null
EOF  null
```
Isn't it amazing how quickly this is coming together? The simplicity of the required code is one of the great wonders of language implementation.