====== Create your own TLA⁺ tools ======
  
Building your own simplified parsing, model-checking, and even proof-checking tools for TLA⁺ is the best way to prepare yourself for serious work on the existing TLA⁺ tools.
This process will give you a strong understanding of how & why the tools are written the way they are.
Even though your solution will not match their workings line-for-line, you'll learn to recognize the tricky parts and can compare & contrast your own approach with the existing implementation.
You'll also be able to rapidly prototype new capabilities in your own codebase, an experience that will guide you as you interface with the complexity of an industrial-strength implementation.
While writing your own TLA⁺ tools may seem like a tall order, it is a proven strategy for smoothing out the otherwise-near-vertical learning curve of contributing to tooling for any formal language.
Do not think you have to fully flesh out your implementation before you can begin working on the existing TLA⁺ tools!
The purpose of this exercise is to acquire specific knowledge; often you will expand your implementation in tandem with your work on the existing tools, each informing the other.
Your implementation will grow with you over time.
  
This tutorial series strives to build a culture of demystification, where knowledge of TLA⁺ inner workings is not arcane but commonplace.
The tools do not use complicated algorithms; their workings can all be understood by any software engineer.
Language tooling (for any formal language) is also a somewhat blessed area of software development in that specifications of expected program behavior are relatively complete and unambiguous.
Thus, with enough living knowledge (and comprehensive test batteries) we can confidently make changes to code long after its original authors have moved on; the challenges of legacy code and technical debt can be made tractable.
  
Approaching TLA⁺ from the perspective of an implementer will also bring you better knowledge of how to use TLA⁺ itself.
Developing an intuition of what TLC is doing when it model-checks your spec will enable you to write specifications that make better use of its capabilities.
And if you reach the edge of those capabilities, well, you're well-placed to extend TLC to handle it!
  
===== Table of Contents =====

  - [[creating:start|Introduction (this page)]]
  - [[creating:scanning|Scanning TLA⁺ Tokens]]
  - [[creating:expressions|Parsing Constant TLA⁺ Expressions]]
  - [[creating:evaluation|Evaluating Constant TLA⁺ Expressions]]
  - [[creating:statements|Handling TLA⁺ Statements]]
  - [[creating:jlists|Conjunction & Disjunction Lists]]
  - [[creating:operators|Functions, Operators, and Parameters]]
  - [[creating:actions|Variables, States, and Actions]]
  - [[creating:safety|Model-Checking Safety Properties]]
  
===== Overview =====
  
This tutorial uses Java.
There are two good reasons for this:
  - At time of writing, [[codebase:start|the most mature TLA⁺ tools]] are written in Java, and this tutorial prepares you to work on them.
  - The high-quality free online textbook //[[https://craftinginterpreters.com/contents.html|Crafting Interpreters]]// by Robert Nystrom uses Java, and this tutorial is a simple TLA⁺-themed wrapper around its concepts. That book states its own reasons for using Java, which you can read.
  
You will need to install the [[https://adoptium.net/|Java Development Kit]] appropriate for your system.
This tutorial uses the Java 17 [[https://openjdk.org/jeps/395|records]] feature for terse definition of dataclasses, so you must install JDK 17 or higher.
While all code artifacts can be produced entirely by reading this tutorial, if you get out of sync you can find working implementations in [[https://github.com/tlaplus/devkit|this git repository]].
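
As a quick illustration of why records are handy here, this sketch defines a hypothetical ''Token'' dataclass (illustrative only, not the tutorial's final definition) in a single line:

<code java>
// A record declares an immutable dataclass: the constructor, accessors,
// equals/hashCode, and toString are all generated automatically.
record Token(String lexeme, int line, int column) {}

public class RecordDemo {
  public static void main(String[] args) {
    Token t = new Token("VARIABLES", 1, 0);
    // Generated accessor methods share the field names.
    System.out.println(t.lexeme() + " at " + t.line() + ":" + t.column());
  }
}
</code>

The equivalent pre-record class would need a hand-written constructor and field declarations, as seen in //Crafting Interpreters//.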
  
The tutorial has you construct the TLA⁺ tools from the bottom up.
First you'll learn to parse its syntax, then how to check its semantics, then how to interpret its expressions, and finally how to check the safety of a simple specification.
Future expansions will cover liveness checking and, possibly, proof checking.
A minimal subset of the TLA⁺ language is used; users are encouraged to extend their tools to handle additional TLA⁺ syntax as desired.
  
===== The Language =====
  
TLA⁺ is a large, complicated, and ambiguous language that takes a huge amount of work to parse correctly & completely.
This is assisted by [[https://github.com/tlaplus/rfcs/tree/2a772d9dd11acec5d7dedf30abfab91a49de48b8/language_standard/tests/tlaplus_syntax|a comprehensive test corpus]] exercising the language's nooks & crannies, but here we will make life easy on ourselves.
This tutorial only uses a minimal subset of TLA⁺ syntax, enough to handle [[https://github.com/tlaplus/Examples/blob/37236893f14527b4fc9f3963891eb2316c3de62e/specifications/DieHard/DieHarder.tla|the generalized DieHard spec]] with some inconvenience.
While that may seem limiting, this tutorial tries to focus on the difficult & interesting parts of bringing TLA⁺ to life instead of more mundane drudgework like handling all hundred-some user-definable operator symbols.
You are encouraged to extend this minimal core as you wish; language tooling is best developed incrementally!
Slowly filling in the details of this rough language sketch has a satisfying meditative aspect.
  
Here is what our minimal language subset includes:
  * Whole natural numbers like ''0'', ''5'', ''10'', etc.
  * Boolean values ''TRUE'' and ''FALSE''
  * Finite set literals, like ''{1, 2, 3, 4, 5}''
  * Parentheses for expression grouping
  * ''IF''/''THEN''/''ELSE'' expressions
  * Single-line comments like ''\* comment text''
  * The ''ENABLED'', negative (''-''), and logical negation (''~'') prefix operators
  * Some infix operators: ''\in'', ''='', ''+'', ''-'', ''..'', ''/\'', ''\/'', and ''<''
  * The variable-priming suffix operator
  * Quantification over sets with ''\E'' and ''\A''
  * Function construction and application, like ''[e \in S |-> op(e)]'' and ''f[e]''
  * Named operator definitions like ''op == ...'' and ''op(x, y, z) == ...''
  * Declaration of ''VARIABLES''
  * Vertically-aligned conjunction & disjunction lists
  
Notably, we do not use the familiar ''---- MODULE Name ----''/''===='' encapsulation.
Files simply consist of a series of unit-level TLA⁺ definitions.
We also do not leave room for a separate model configuration file; all ''CONSTANT'' values must instead be hardcoded, and well-known names like ''Init'', ''Next'', ''TypeOK'', and ''Inv'' identify important definitions.
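
To make the subset and these conventions concrete, here is a small hypothetical counter spec of our own invention (a sketch for illustration, not taken from the linked DieHard example) written entirely in this minimal language:

<code>
\* A complete file in our subset: no MODULE header, no config file.
VARIABLES x

Init == x = 0

TypeOK == x \in 0 .. 5

\* Increment x until it reaches 5.
Next == /\ x < 5
        /\ x' = x + 1
</code>

Note the well-known names ''Init'', ''Next'', and ''TypeOK'', the vertically-aligned conjunction list, and the priming of ''x'' in the action.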
  
As outlined above, you are free to add missing features (or even your own invented features!) as you wish.
Each chapter ends with a set of challenges that often involve adding these missing features.
  
===== Getting Started =====
  
Read part I (the first three chapters) of the free online textbook //[[https://craftinginterpreters.com/contents.html|Crafting Interpreters]]// by Robert Nystrom.
We will be closely following the material in this book, modifying it to our uses.
The first two chapters are a nice introduction and overview of language implementation generally.
Chapter three specifies a toy language called Lox, to be used as an object of study for the remainder of the book.
Our minimal TLA⁺ subset has some similarity to Lox with regard to expressions involving integers and booleans, but also differences - you can skip the section on closures (unless you want to implement higher-order operator parameters yourself) and the section on classes.
What's important is that it's similar enough for all the fundamentals to still apply!
  
Code snippets are shown often in this tutorial.
When a snippet is similar to code given in the book with the exception of a few lines, the differing lines will be highlighted.
For example:
<code java [highlight_lines_extra="2"]>
  private char advance() {
    column++;
    return source.charAt(current++);
  }
</code>
  
Highlighting is also used to identify //new// code being added to a large section of code given previously, with line numbers and surrounding lines included for context.
For example:
<code java [enable_line_numbers="true",highlight_lines_extra="2",start_line_numbers_at=82]>
      case '\n':
        column = 0;
        line++;
        break;
</code>
  
Sometimes the code given here is so different from that provided by the book that highlighting is simply omitted to reduce visual noise.
A parenthetical will specify the meaning of highlighted code when ambiguous.
  
To begin the tutorial, start at page [[creating:scanning|Scanning TLA⁺ Tokens]].
  
[[creating:start#table_of_contents|Table of Contents]] | [[creating:scanning|Next Page >]]
  
  • creating/start.1744633632.txt.gz
  • Last modified: 2025/04/14 12:27
  • by ahelwer