creating:syntax

This is an old revision of the document!


PHP's gd library is missing or unable to create PNG images

Parsing TLA⁺ Syntax

TLA⁺ is a large, complicated, and ambiguous language that takes a huge amount of work to parse correctly & completely. This is assisted by a comprehensive test corpus exercising the language's nooks & crannies, but here we will make life easy on ourselves. This tutorial only uses a minimal subset of TLA⁺ syntax, just enough to handle the classic DieHard spec. While that may seem limiting, this tutorial tries to focus on the difficult & interesting parts of parsing the language instead of more mundane drudgework like handling all hundred-some user-definable operator symbols. You are encouraged to extend this minimal core as you wish; language tooling is best developed incrementally! Slowly filling in the details of this rough language sketch has a satisfying meditative aspect.

Here is what our minimal language subset includes:

  • The ---- MODULE Name ---- header and ==== footer
  • Named parameterized operator definitions like op == ... and op(x, y, z) == ...
  • Comments, both single-line like \* comment text and nestable multi-line like (* comment text *)
  • Declaration of VARIABLES
  • Finite set literals, like {1, 2, 3, 4, 5}
  • The ENABLED, negative (-x), and logical negation (~) prefix operators
  • Some infix operators: \in, =, +, -, and <
  • The variable-priming suffix operator
  • Parentheses to control expression grouping
  • Vertically-aligned conjunction & disjunction lists
  • The IF/THEN/ELSE construct
  • Natural numbers like 0, 5, 10, etc.
  • Boolean values TRUE and FALSE

Notably, the following will not be covered in this tutorial:

  • PlusCal
  • Declaring CONSTANTS
  • Nesting modules or importing other modules through EXTENDS or INSTANCE
  • Set map & set filter syntax, like {e \in S : P(e)}
  • Universal & existential quantification, like \E e \in S : P(e)
  • The CHOOSE operator
  • Functions, like [x \in Nat |-> 2*x] or the function EXCEPT syntax
  • Sequences, like <<1, 2, 3, 4, 5>>
  • The LET/IN construct
  • Temporal- or action-level operators like [], <>, [A]_v, or WF_v(A)
  • Higher-order operator parameters like op(F(_), x) == ...
  • The TLA⁺ proof language

As outlined above, you are free to add these missing features (or others) as you wish.

Read part I (the first three chapters) of free online textbook Crafting Interpreters by Robert Nystrom. We will be closely following the material in this book, modifying it to our uses. The first two chapters are a nice introduction and overview of language implementation generally. Chapter three specifies a toy language called Lox, to be used as an object of study for the remainder of the book. Our minimal TLA⁺ subset has some similarity to Lox with regard to expressions involving integers and booleans, but also differences - you can skip the section on closures (unless you want to implement higher-order operator parameters yourself) and the section on classes. What's important is that it's similar enough for all the fundamentals to still apply!

Part II (A Tree-Walk Interpreter) is where we start to merge with and modify the given code. Everything before section 4.2 can be left unchanged from what's given, although of course it makes sense to change some of the names from "Lox" to "TlaPlus" or similar. You should thus have followed along and arrived at the following file TlaPlus.java:

package com.craftinginterpreters.tla;
 
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
 
public class TlaPlus {
  static boolean hadError = false;
 
  public static void main(String[] args) throws IOException {
    if (args.length > 1) {
      System.out.println("Usage: jlox [script]");
      System.exit(64);
    } else if (args.length == 1) {
      runFile(args[0]);
    } else {
      runPrompt();
    }
  }
 
  private static void runFile(String path) throws IOException {
    byte[] bytes = Files.readAllBytes(Paths.get(path));
    run(new String(bytes, Charset.defaultCharset()));
 
    // Indicate an error in the exit code.
    if (hadError) System.exit(65);
  }
 
  private static void runPrompt() throws IOException {
    InputStreamReader input = new InputStreamReader(System.in);
    BufferedReader reader = new BufferedReader(input);
 
    for (;;) {
      System.out.print("> ");
      String line = reader.readLine();
      if (line == null) break;
      run(line);
      hadError = false;
    }
  }
 
  private static void run(String source) {
    Scanner scanner = new Scanner(source);
    List<Token> tokens = scanner.scanTokens();
 
    // For now, just print the tokens.
    for (Token token : tokens) {
      System.out.println(token);
    }
  }
 
  static void error(int line, String message) {
    report(line, "", message);
  }
 
  private static void report(int line, String where,
                             String message) {
    System.err.println(
        "[line " + line + "] Error" + where + ": " + message);
    hadError = true;
  }
}

The TokenType class is our first major departure from the book. We instead use the atomic components of our minimal TLA⁺ language subset defined above. Adapting the snippet in section 4.2.1:

package com.craftinginterpreters.tla;
 
enum TokenType {
  // Single-character tokens.
  LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE,
  COMMA, MINUS, PLUS, EQUALS, LESS_THAN, NEGATION, PRIME
 
  // One or two character tokens.
  AND, OR, DEF_EQ, IN
 
  // Literals.
  IDENTIFIER, NUMBER,
 
  // Keywords.
  VARIABLES, ENABLED, IF, THEN, ELSE,
  SINGLE_LINE, MODULE, DOUBLE_LINE
 
  EOF
}

There is a very minor design decision here, of the type encountered innumerable times when writing a parser. Our language includes boolean values TRUE and FALSE, but there is no reason to give them their own token types; they can easily be parsed as ordinary IDENTIFIER token types then later resolved to the built-in operators TRUE and FALSE. This is the approach we take here, and is also the approach taken by the existing tools. This sort of works-either-way design dilemma occurs often, and we will see it again when trying to decide whether to disallow a snippet of invalid TLA⁺ at the syntactic or semantic level.

In section 4.2.3, we make one very critical addition to the information about each token tracked in the Token class: the token's start column. This will come in useful when parsing vertically-aligned conjunction & disjunction lists.

package com.craftinginterpreters.tla;
 
class Token {
  final TokenType type;
  final String lexeme;
  final Object literal;
  final int line;
  final int column;
 
  Token(TokenType type, String lexeme, Object literal, int line, int column) {
    this.type = type;
    this.lexeme = lexeme;
    this.literal = literal;
    this.line = line;
    this.column = column;
  }
 
  public String toString() {
    return type + " " + lexeme + " " + literal;
  }
}

We now move on to section 4.4, the very important Scanner class. Here we again make several logical modifications. The first is to track the column in addition to the line, mirroring our addition to the Token class:

  private final List<Token> tokens = new ArrayList<>();
  private int start = 0;
  private int current = 0;
  private int line = 1;
  private int column = 0;
  • creating/syntax.1743106733.txt.gz
  • Last modified: 2025/03/27 20:18
  • by ahelwer