
Scanning TLA⁺ Tokens

This page corresponds to chapter 4 of Crafting Interpreters by Robert Nystrom, henceforth referred to as "the book". We will learn how to chunk up TLA⁺ source file string input into a series of tokens suitable for parsing. For each section in the chapter, first read the section in the book and then read the corresponding section tutorial on this page to see how to adapt the concepts to TLA⁺.

Almost everything in section 4.1 can be left unchanged from what's given, although of course it makes sense to change some of the names from "Lox" to "TlaPlus" or similar. We do make one small functional modification: TLA⁺ source files are assumed to be encoded in UTF-8, a variable-width ASCII-compatible encoding, so we specify that when performing the file read. Here's our main file, TlaPlus.java:

  package com.craftinginterpreters.tla;

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.List;

  public class TlaPlus {
    static boolean hadError = false;

    public static void main(String[] args) throws IOException {
      if (args.length > 1) {
        System.out.println("Usage: tlaplus [script]");
        System.exit(64);
      } else if (args.length == 1) {
        runFile(args[0]);
      } else {
        runPrompt();
      }
    }

    private static void runFile(String path) throws IOException {
      byte[] bytes = Files.readAllBytes(Paths.get(path));
      run(new String(bytes, StandardCharsets.UTF_8));

      // Indicate an error in the exit code.
      if (hadError) System.exit(65);
    }

    private static void runPrompt() throws IOException {
      InputStreamReader input = new InputStreamReader(System.in);
      BufferedReader reader = new BufferedReader(input);

      for (;;) {
        System.out.print("> ");
        String line = reader.readLine();
        if (line == null) break;
        run(line);
        hadError = false;
      }
    }

    private static void run(String source) {
      Scanner scanner = new Scanner(source);
      List<Token> tokens = scanner.scanTokens();

      // For now, just print the tokens.
      for (Token token : tokens) {
        System.out.println(token);
      }
    }

    static void error(int line, String message) {
      report(line, "", message);
    }

    private static void report(int line, String where,
                               String message) {
      System.err.println(
          "[line " + line + "] Error" + where + ": " + message);
      hadError = true;
    }
  }

The TokenType class in section 4.2 is our first major departure from the book. Instead of Lox tokens, we use the atomic components of our minimal TLA⁺ language subset. Adapting the snippet in section 4.2.1 we get:

  package com.craftinginterpreters.tla;

  enum TokenType {
    // Single-character tokens.
    LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE,
    LEFT_BRACKET, RIGHT_BRACKET, COMMA, COLON,
    MINUS, PLUS, LESS_THAN, NEGATION, PRIME,

    // Short fixed-length tokens.
    EQUAL, EQUAL_EQUAL, AND, OR, DOT_DOT, ALL_MAP_TO,

    // Literals.
    IDENTIFIER, NAT_NUMBER, TRUE, FALSE,

    // Keywords.
    VARIABLES, ENABLED, IF, THEN, ELSE,

    // Symbols.
    IN, EXISTS, FOR_ALL,

    EOF
  }
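For reference, the less self-evident names map to lexemes as follows: NEGATION is ~, PRIME is ', LESS_THAN is <, AND is /\, OR is \/, DOT_DOT is .., ALL_MAP_TO is |->, and EQUAL_EQUAL is the definition operator ==. We'll lex each of these below.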

There is a very minor design decision here, of the type encountered innumerable times when writing a parser. TLA⁺ includes boolean values TRUE and FALSE, but we don't need to give them their own token types; they can be parsed as ordinary IDENTIFIER tokens then later resolved to the built-in operators TRUE and FALSE. However, since we are writing our interpreter in Java and Java has primitive boolean literals, it is convenient to add specialized tokens for TRUE and FALSE so we can resolve them to primitive Java values on the spot. The existing TLA⁺ tools do not take this approach, and instead later resolve the identifiers TRUE and FALSE from a set of built-in operators. This sort of works-either-way design dilemma occurs often, and we will see it again when trying to decide whether to disallow a snippet of invalid TLA⁺ at the syntactic or semantic level.

In section 4.2.3, we make one very critical addition to the information about each token tracked in the Token class: the token's start column. This will come in useful when parsing vertically-aligned conjunction & disjunction lists.
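For example, in a fragment like the following (assuming Advance and Reset are defined elsewhere), the parser can only tell that the two \/ bullets form a nested disjunction list, distinct from the outer conjunction list, by comparing their start columns:

  Next ==
    /\ x < 10
    /\ \/ Advance
       \/ Reset

Here is our Token class with the column field added: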

  package com.craftinginterpreters.tla;

  class Token {
    final TokenType type;
    final String lexeme;
    final Object literal;
    final int line;
    final int column;

    Token(TokenType type, String lexeme, Object literal, int line, int column) {
      this.type = type;
      this.lexeme = lexeme;
      this.literal = literal;
      this.line = line;
      this.column = column;
    }

    public String toString() {
      return type + " " + lexeme + " " + literal;
    }
  }

Nothing in section 4.3 requires modification, so we can move on to the very important Scanner class in section 4.4. Our first modification to the code given in the book is to track the column in addition to the line, mirroring our addition to the Token class:

  package com.craftinginterpreters.tla;

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  import static com.craftinginterpreters.tla.TokenType.*;

  class Scanner {
    private final String source;
    private final List<Token> tokens = new ArrayList<>();
    private int start = 0;
    private int current = 0;
    private int line = 1;
    private int column = 0;

    Scanner(String source) {
      this.source = source;
    }

    List<Token> scanTokens() {
      while (!isAtEnd()) {
        // We are at the beginning of the next lexeme.
        start = current;
        scanToken();
      }

      tokens.add(new Token(EOF, "", null, line, column));
      return tokens;
    }

    private boolean isAtEnd() {
      return current >= source.length();
    }
  }

Section 4.5 finally gets around to actually handling text input! Of course, our unambiguous single-character lexing code in scanToken has some differences (including the removal of Lox tokens DOT, SEMICOLON, and STAR). Note also that we increment the column position in advance(), and there is some index arithmetic necessary when adding a token as we must calculate the token's start column:

  private void scanToken() {
    char c = advance();
    switch (c) {
      case '(': addToken(LEFT_PAREN); break;
      case ')': addToken(RIGHT_PAREN); break;
      case '{': addToken(LEFT_BRACE); break;
      case '}': addToken(RIGHT_BRACE); break;
      case '[': addToken(LEFT_BRACKET); break;
      case ']': addToken(RIGHT_BRACKET); break;
      case ',': addToken(COMMA); break;
      case ':': addToken(COLON); break;
      case '-': addToken(MINUS); break;
      case '+': addToken(PLUS); break;
      case '<': addToken(LESS_THAN); break;
      case '~': addToken(NEGATION); break;
      case '\'': addToken(PRIME); break;
      default:
        TlaPlus.error(line, "Unexpected character.");
        break;
    }
  }

  private char advance() {
    column++;
    return source.charAt(current++);
  }

  private void addToken(TokenType type) {
    addToken(type, null);
  }

  private void addToken(TokenType type, Object literal) {
    String text = source.substring(start, current);
    int start_column = column - (current - start);
    tokens.add(new Token(type, text, literal, line, start_column));
  }
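A quick worked example of that arithmetic: on the input <~, scanning < leaves start = 0, current = 1, and column = 1, so its start column is 1 - (1 - 0) = 0; scanning ~ then gives 2 - (2 - 1) = 1. So our columns are zero-indexed, which is fine for conjunction list parsing since it only ever compares token columns against each other.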

Moving on to lexing short fixed-length tokens, the differences widen considerably; we want to lex =, ==, /\, \/, .., and |->:

      case '\'': addToken(PRIME); break;
      case '=':
        addToken(match('=') ? EQUAL_EQUAL : EQUAL);
        break;
      case '\\':
        if (match('/')) {
          addToken(OR);
        } else {
          TlaPlus.error(line, "Unexpected character.");
        }
        break;
      case '/':
        if (consume('\\')) addToken(AND);
        break;
      case '.':
        if (consume('.')) addToken(DOT_DOT);
        break;
      case '|':
        if (consume('-') && consume('>')) addToken(ALL_MAP_TO);
        break;
      default:
        TlaPlus.error(line, "Unexpected character.");
        break;

Two helper methods are used here. The first, match(char), is given in the book; we modify it slightly to update the value of column whenever current advances. The second, consume(char), is our own invention: it is similar to match(char) but logs an error if the expected character is not found. You might be wondering why we didn't write the \ case using consume(char); that's because we will very shortly be adding to it.

  private boolean match(char expected) {
    if (isAtEnd()) return false;
    if (source.charAt(current) != expected) return false;
 
    column++;
    current++;
    return true;
  }
 
  private boolean consume(char expected) {
    if (match(expected)) return true;
    else TlaPlus.error(line, "Unexpected character.");
    return false;
  }
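Note that consume(char) reports the error but does not halt scanning: a lone . in the input fails consume('.') and logs "Unexpected character." without adding a token, while .. lexes to DOT_DOT as intended.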

In section 4.6 we learn how to handle lexemes of longer or unlimited length. Let's begin with single-line comments, which in TLA⁺ start with \*:

      case '\\':
        if (match('/')) {
          addToken(OR);
        } else if (match('*')) {
          // A comment goes until the end of the line.
          while (peek() != '\n' && !isAtEnd()) advance();
        } else {
          TlaPlus.error(line, "Unexpected character.");
        }
        break;

This uses the peek() helper given in the book:

  private char peek() {
    if (isAtEnd()) return '\0';
    return source.charAt(current);
  }

Also add the code to skip whitespace:

        break;

      case ' ':
      case '\r':
      case '\t':
        // Ignore whitespace.
        break;

      case '\n':
        line++;
        break;

      default:
        TlaPlus.error(line, "Unexpected character.");
        break;

Critically, we have to set column to zero when processing a newline. This is what makes column tracking work!

      case '\n':
        column = 0;
        line++;
        break;

We aren't supporting strings (unless you want to go the extra mile), so next up is numbers:

      default:
        if (isDigit(c)) {
          number();
        } else {
          TlaPlus.error(line, "Unexpected character.");
        }
        break;

This uses the isDigit(char) and number() helpers. We're only handling whole natural numbers, no decimals, so our number() method is much simpler than the one from the book:

  private boolean isDigit(char c) {
    return c >= '0' && c <= '9';
  }
 
  private void number() {
    while (isDigit(peek())) advance();
    addToken(NAT_NUMBER,
        Integer.parseInt(source.substring(start, current)));
  }
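One caveat: Integer.parseInt throws NumberFormatException for literals exceeding 2^31 - 1, but TLA⁺ naturals are unbounded. If you expect large numbers, you could store the literal as new java.math.BigInteger(source.substring(start, current)) instead, at the cost of handling BigInteger values throughout the rest of the interpreter.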

In section 4.7 we learn to lex general identifiers; start by modifying the default case of the scanToken() method:

      default:
        if (isDigit(c)) {
          number();
        } else if (isAlpha(c)) {
          identifier();
        } else {
          TlaPlus.error(line, "Unexpected character.");
        }
        break;

Then include these helper methods:

  private void identifier() {
    while (isAlphaNumeric(peek())) advance();
 
    addToken(IDENTIFIER);
  }
 
  private boolean isAlpha(char c) {
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
            c == '_';
  }
 
  private boolean isAlphaNumeric(char c) {
    return isAlpha(c) || isDigit(c);
  }

Note that this is close to TLA⁺ identifier rules, but not quite correct. TLA⁺ identifiers can be any sequence of alphanumeric ASCII characters including underscore, as long as at least one character in the sequence is alphabetical ([a-zA-Z]). You are encouraged to modify your code to match this constraint, but for simplicity we'll stick with the close-enough identifier logic given in the book. Fixing it requires modifying the number lexing logic from the previous section.
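If you'd like to tackle that fix now, here is a minimal sketch of one approach (our own addition, not from the book): divert number() to identifier() when the digit run continues with a letter or underscore. This is still not a complete fix, since it happily accepts letter-free lexemes like 1_2; a full implementation must also check that at least one character is alphabetical.

  private void number() {
    while (isDigit(peek())) advance();

    // "123abc" should lex as one identifier, not NAT_NUMBER then IDENTIFIER.
    // start is unchanged, so identifier() picks up the digits already read.
    if (isAlpha(peek())) {
      identifier();
      return;
    }

    addToken(NAT_NUMBER,
        Integer.parseInt(source.substring(start, current)));
  }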

Some valid identifiers are actually keywords and should be properly tokenized as such. So, define a keyword map:

  private static final Map<String, TokenType> keywords;
 
  static {
    keywords = new HashMap<>();
    keywords.put("ELSE",       ELSE);
    keywords.put("ENABLED",    ELSE);
    keywords.put("FALSE",      FALSE);
    keywords.put("IF",         IF);
    keywords.put("THEN",       THEN);
    keywords.put("TRUE",       TRUE);
    keywords.put("VARIABLE",   VARIABLES);
    keywords.put("VARIABLES",  VARIABLES);
  }
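Both VARIABLE and VARIABLES are valid TLA⁺ keywords (they are synonyms), which is why the two spellings map to the same token type.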

Then update our implementation of identifier() to first check whether an identifier is a keyword:

  private void identifier() {
    while (isAlphaNumeric(peek())) advance();
 
    String text = source.substring(start, current);
    TokenType type = keywords.get(text);
    if (type == null) type = IDENTIFIER;
    addToken(type);
  }
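Recall our earlier design decision to give TRUE and FALSE their own token types so we can resolve them to primitive Java values on the spot. One way to do that (a sketch of ours, not from the book) is to attach the boolean as the token's literal right here:

  private void identifier() {
    while (isAlphaNumeric(peek())) advance();

    String text = source.substring(start, current);
    TokenType type = keywords.get(text);
    if (type == null) type = IDENTIFIER;

    // Attach the Java boolean as the literal so later stages can use it directly.
    if (type == TRUE) addToken(type, true);
    else if (type == FALSE) addToken(type, false);
    else addToken(type);
  }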

We're nearly done! All we need to do is repeat the keyword map method for the LaTeX-style symbols \in, \E, and \A. Start by adding yet another branch to the \ case (it's getting quite crowded!):

      case '\\':
        if (match('/')) {
          addToken(OR);
        } else if (match('*')) {
          // A comment goes until the end of the line.
          while (peek() != '\n' && !isAtEnd()) advance();
        } else if (isAlpha(peek())) {
          symbol();
        } else {
          TlaPlus.error(line, "Unexpected character.");
        }
        break;

Then define the symbol map and symbol() helper; while we're at it, we can also throw in a few alternative spellings of the same symbols:

  private static final Map<String, TokenType> symbols;
 
  static {
    symbols = new HashMap<>();
    symbols.put("\\land",     AND);
    symbols.put("\\E",        EXISTS);
    symbols.put("\\exists",   EXISTS);
    symbols.put("\\A",        FOR_ALL);
    symbols.put("\\forall",   FOR_ALL);
    symbols.put("\\in",       IN);
    symbols.put("\\lnot",     NEGATION);
    symbols.put("\\neg",      NEGATION);
    symbols.put("\\lor",      OR);
  }
 
  private void symbol() {
    while (isAlpha(peek())) advance();
 
    String text = source.substring(start, current);
    TokenType type = symbols.get(text);
    if (type == null) TlaPlus.error(line, "Unexpected character.");
    else addToken(type);
  }
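Note that start still points at the backslash when symbol() runs, so the lexeme looked up in symbols includes it; that's why each map key is written with a leading (escaped) backslash.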

We are now done lexing our minimal TLA⁺ subset. Give it a whirl! Running your Java program should bring up the REPL, where you can enter input to see how it is lexed; for example:

> \A e \in S : P(e)
FOR_ALL \A null
IDENTIFIER e null
IN \in null
IDENTIFIER S null
COLON : null
IDENTIFIER P null
LEFT_PAREN ( null
IDENTIFIER e null
RIGHT_PAREN ) null
EOF  null
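The null at the end of each line is the token's literal field, which toString() prints unconditionally; only NAT_NUMBER tokens (plus TRUE and FALSE, if you adopted the sketch above) actually carry one.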

Isn't it amazing how quickly this is coming together? The simplicity of the required code is one of the great wonders of language implementation. If you got lost somewhere along the way, you can find a snapshot of the code on this page here. Next we learn how to collect our tokens into a parse tree! Continue the tutorial at Parsing TLA⁺ Syntax.

Here are some optional challenges to flesh out your TLA⁺ scanner, roughly ranked from simplest to most difficult. You should save a copy of your code before attempting these.

  1. Our error reporting functionality only reports the line on which the error occurs, even though we now also track the column. Modify the error reporting functions to pipe through and print out the column location of the error (a sketch of one approach follows this list).
  2. Implement token recognition for the ---- MODULE Name ----/==== header and footer. The ---- and ==== tokens must be of length four or greater. It can be tricky to gracefully integrate their logic with the existing MINUS and EQUAL/EQUAL_EQUAL case blocks.
  3. Modify number() and identifier() to properly implement TLA⁺ identifiers, which can consist of any string of alphanumeric or underscore characters as long as at least one character is alphabetical. This corresponds to the regex [a-zA-Z0-9_]*[a-zA-Z][a-zA-Z0-9_]*.
  4. Add support for nestable block comments like (* text (* text *) text *). This requires a much deeper modification of the scanner than might first be apparent. Currently, our lexing grammar is regular: it does not need to store any unbounded state to lex everything. However, to properly handle block comments you'll need to add another class field like int block_comment_nest_level = -1 and increment/decrement it as you encounter (* and *) tokens. In technical terms, this addition makes the lexing grammar context-free instead of regular.
  5. Similar to nested block comments, add support for extramodular text & nested modules. TLA⁺ tools are supposed to ignore all text outside of modules, treating it the same as comments. Lexing TLA⁺ tokens should only start after reading ahead and detecting a ---- MODULE sequence. Then, after detecting termination of the module with ====, the scanner should revert to ignoring the text. Supporting nested modules complicates this further, since you'll need to keep track of the module nesting level to know when you can start ignoring text again.
  6. Add Unicode support. Java represents Unicode codepoints as an int rather than a char, so you'll be iterating over an array of ints instead of the characters of a string. Character literals can still be directly compared against ints, so our switch statement should be nearly unchanged. Look at the Java 8 String.codePoints() method. Add support for Unicode symbol variants like ∧, ∨, ¬, ∈, ∃, and ∀. Our code reads files assuming the UTF-8 encoding, so that's already sorted.
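For challenge 1, here's a minimal sketch of the reporting end (ours, not from the book); you would also update every TlaPlus.error call site in the Scanner to pass along the column:

  static void error(int line, int column, String message) {
    report(line, column, "", message);
  }

  private static void report(int line, int column, String where,
                             String message) {
    System.err.println("[line " + line + ", column " + column +
        "] Error" + where + ": " + message);
    hadError = true;
  }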