Differences
This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision | ||
creating:start [2025/04/14 12:27] – Improved tutorial navigation links ahelwer | creating:start [2025/05/21 19:44] (current) – ToC: changed some titles ahelwer | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ======= Scanning | + | ====== |
- | This page corresponds | + | Building your own simplified parsing, model-checking, |
- | We will learn how to chunk up TLA⁺ | + | This process will give you a strong understanding |
- | For each section in the chapter, first read the section in the book and then read the corresponding section tutorial on this page to see how to adapt the concepts to TLA⁺. | + | Even though your solution |
+ | You'll also be able to rapidly prototype new capabilities in your own codebase, an experience that will guide you as you interface with the complexity of an industrial-strength implementation. | ||
+ | While writing your own TLA⁺ | ||
+ | Do not think you have to fully flesh out your implementation before you can begin working on the existing | ||
+ | The purpose of this exercise is to acquire specific knowledge; often you will expand your implementation in tandem with your work on the existing tools, each informing the other. | ||
+ | Your implementation will grow with you over time. | ||
- | ====== Section 4.1: The Interpreter Framework ====== | + | This tutorial series strives to build a culture of demystification, |
+ | The tools do not use complicated algorithms; their workings can all be understood by any software engineer. | ||
+ | Language tooling (for any formal language) is also a somewhat blessed area of software development in that specifications of expected program behavior are relatively complete and unambiguous. | ||
+ | Thus, with enough living knowledge (and comprehensive test batteries) we can confidently make changes to code long after its original authors have moved on; the challenges of legacy code and technical debt can be made tractable. | ||
- | Almost everything in [[https:// | + | Approaching TLA⁺ from the perspective of an implementer will also bring you better knowledge of how to use TLA⁺ itself. |
- | We do make one small functional modification: | + | Developing an intuition of what TLC is doing when it model-checks your spec will enable you to write specifications that make better use of its capabilities. |
- | Here's our main file, '' | + | And, if you reach the edge of those capabilities |
- | <code java [enable_line_numbers=" | + | ===== Table of Contents ===== |
- | package tla; | + | |
- | import java.io.BufferedReader; | + | - [[creating: |
- | import java.io.IOException; | + | - [[creating: |
- | import java.io.InputStreamReader; | + | - [[creating: |
- | import java.nio.charset.StandardCharsets; | + | - [[creating: |
- | import java.nio.file.Files; | + | - [[creating: |
- | import java.nio.file.Paths; | + | - [[creating: |
- | import java.util.List; | + | - [[creating: |
+ | - [[creating: | ||
+ | - [[creating: | ||
- | public class TlaPlus { | + | ===== Overview ===== |
- | static boolean hadError | + | |
- | public static void main(String[] args) throws IOException { | + | This tutorial uses Java. |
- | if (args.length > 1) { | + | There are two good reasons for this: |
- | | + | - At time of writing, |
- | System.exit(64); | + | - The high-quality free online textbook //[[https:// |
- | } else if (args.length == 1) { | + | |
- | runFile(args[0]); | + | |
- | } else { | + | |
- | runPrompt(); | + | |
- | } | + | |
- | } | + | |
- | private static void runFile(String path) throws IOException { | + | You will need to install the [[https:// |
- | byte[] bytes = Files.readAllBytes(Paths.get(path)); | + | This tutorial uses the Java 17 [[https:// |
- | | + | While all code artifacts can be produced entirely by reading this tutorial, if you get out of sync you can find working implementations in [[https:// |
- | // Indicate an error in the exit code. | + | The tutorial has you construct |
- | if (hadError) System.exit(65); | + | First you'll learn to parse its syntax, then how to check its semantics, then how to interpret its expressions, |
- | } | + | Future expansions will cover liveness checking and, possibly, proof checking. |
+ | A minimal subset of the TLA⁺ language is used; users are encouraged to extend their tools to handle additional TLA⁺ syntax as desired. | ||
- | private static void runPrompt() throws IOException { | + | ===== The Language ===== |
- | InputStreamReader input = new InputStreamReader(System.in); | + | |
- | BufferedReader reader | + | |
- | for (;;) { | + | TLA⁺ is a large, complicated, |
- | | + | This is assisted by [[https:// |
- | | + | This tutorial only uses a minimal subset of TLA⁺ syntax, enough to handle [[https:// |
- | if (line == null) break; | + | While that may seem limiting, this tutorial tries to focus on the difficult & interesting parts of bringing TLA⁺ to life instead of more mundane drudgework like handling all hundred-some user-definable operator symbols. |
- | | + | You are encouraged to extend this minimal core as you wish; language tooling is best developed incrementally! |
- | | + | Slowly filling in the details of this rough language sketch has a satisfying meditative aspect. |
- | } | + | |
- | } | + | |
- | private static void run(String source) | + | Here is what our minimal language subset includes: |
- | | + | * Whole natural numbers like '' |
- | List<Token> tokens | + | * Boolean values '' |
+ | * Finite set literals, like '' | ||
+ | * Parentheses for expression grouping | ||
+ | * '' | ||
+ | * Single-line comments like '' | ||
+ | * The '' | ||
+ | * Some infix operators: '' | ||
+ | * The variable-priming suffix operator | ||
+ | * Quantification over sets with '' | ||
+ | * Function construction and application, | ||
+ | * Named operator definitions like '' | ||
+ | * Declaration of '' | ||
+ | * Vertically-aligned conjunction & disjunction lists | ||
- | // For now, just print the tokens. | + | Notably, we do not use the familiar '' |
- | for (Token token : tokens) { | + | Files simply consist of a series of unit-level TLA⁺ definitions. |
- | System.out.println(token); | + | We also do not leave room for a separate model configuration file; all '' |
- | } | + | |
- | } | + | |
- | static void error(int line, String message) { | + | As outlined above, you are free to add missing features |
- | report(line, "", | + | Each chapter ends with a set of challenges that often involve adding these missing features. |
- | } | + | |
- | private static void report(int line, String where, | + | ===== Getting Started ===== |
- | | + | |
- | System.err.println( | + | |
- | "[line " + line + "] Error" + where + ": " + message); | + | |
- | hadError | + | |
- | } | + | |
- | } | + | |
- | </ | + | |
- | ====== Section 4.2: Lexemes and Tokens ====== | + | Read part I (the first three chapters) of free online textbook |
- | + | We will be closely following | |
- | The '' | + | The first two chapters are a nice introduction |
- | Instead of Lox tokens, we use the atomic components of our minimal TLA⁺ language subset. | + | Chapter three specifies a toy language called Lox, to be used as an object |
- | Adapting the snippet in [[https:// | + | Our minimal TLA⁺ subset has some similarity |
- | + | What's important | |
- | <code java [enable_line_numbers=" | + | |
- | package tla; | + | |
- | + | ||
- | enum TokenType { | + | |
- | | + | |
- | | + | |
- | LEFT_BRACKET, | + | |
- | MINUS, PLUS, LESS_THAN, NOT, PRIME, | + | |
- | + | ||
- | // Short fixed-length tokens. | + | |
- | EQUAL, EQUAL_EQUAL, | + | |
- | + | ||
- | // Literals. | + | |
- | IDENTIFIER, NUMBER, TRUE, FALSE, | + | |
- | + | ||
- | // Keywords. | + | |
- | VARIABLES, ENABLED, IF, THEN, ELSE, | + | |
- | + | ||
- | // Symbols. | + | |
- | IN, EXISTS, FOR_ALL, | + | |
- | + | ||
- | EOF | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | There is a very minor design decision here, of the type encountered innumerable times when writing a parser. | + | |
- | TLA⁺ includes boolean values '' | + | |
- | However, since we are writing our interpreter in Java and Java has primitive boolean literals, it is convenient | + | |
- | The existing TLA⁺ tools do not take this approach, | + | |
- | This sort of works-either-way design dilemma occurs often, and we will see it again when trying | + | |
- | + | ||
- | In [[https:// | + | |
- | This will come in useful when parsing vertically-aligned conjunction & disjunction lists. | + | |
- | + | ||
- | <code java [enable_line_numbers=" | + | |
- | package tla; | + | |
- | + | ||
- | class Token { | + | |
- | final TokenType type; | + | |
- | final String lexeme; | + | |
- | final Object literal; | + | |
- | final int line; | + | |
- | final int column; | + | |
- | + | ||
- | Token(TokenType type, String lexeme, Object literal, int line, int column) { | + | |
- | this.type = type; | + | |
- | this.lexeme = lexeme; | + | |
- | this.literal = literal; | + | |
- | this.line = line; | + | |
- | this.column = column; | + | |
- | } | + | |
- | + | ||
- | public String toString() { | + | |
- | return type + " " + lexeme + " " + literal; | + | |
- | } | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | ====== Section 4.4: The Scanner Class ====== | + | |
- | + | ||
- | Nothing in section 4.3 requires modification, | + | |
- | Our first modification | + | |
- | + | ||
- | <code java [enable_line_numbers=" | + | |
- | package tla; | + | |
- | + | ||
- | import java.util.ArrayList; | + | |
- | import java.util.HashMap; | + | |
- | import java.util.List; | + | |
- | import java.util.Map; | + | |
- | + | ||
- | import static tla.TokenType.*; | + | |
- | + | ||
- | class Scanner { | + | |
- | private final String source; | + | |
- | private final List< | + | |
- | private int start = 0; | + | |
- | private int current = 0; | + | |
- | private int line = 1; | + | |
- | private int column = 0; | + | |
- | + | ||
- | Scanner(String source) { | + | |
- | this.source = source; | + | |
- | } | + | |
- | + | ||
- | List< | + | |
- | while (!isAtEnd()) { | + | |
- | // We are at the beginning of the next lexeme. | + | |
- | start = current; | + | |
- | scanToken(); | + | |
- | } | + | |
- | + | ||
- | tokens.add(new Token(EOF, "", | + | |
- | return tokens; | + | |
- | } | + | |
- | + | ||
- | private boolean isAtEnd() { | + | |
- | return current >= source.length(); | + | |
- | } | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | ====== Section 4.5: Recognizing Lexemes ====== | + | |
- | + | ||
- | [[https:// | + | |
- | Of course, our unambiguous single-character lexing code in '' | + | |
- | Note also that we increment the column position in '' | + | |
- | + | ||
- | <code java [enable_line_numbers=" | + | |
- | private void scanToken() { | + | |
- | char c = advance(); | + | |
- | switch (c) { | + | |
- | case ' | + | |
- | case ' | + | |
- | case ' | + | |
- | case ' | + | |
- | case ' | + | |
- | case ' | + | |
- | case ',': | + | |
- | case ':': | + | |
- | case ' | + | |
- | case ' | + | |
- | case '<': | + | |
- | case ' | + | |
- | case ' | + | |
- | default: | + | |
- | TlaPlus.error(line, | + | |
- | break; | + | |
- | } | + | |
- | } | + | |
+ | Code snippets are shown often in this tutorial. | ||
+ | When a snippet is similar to code given in the book with the exception of a few lines, the differing lines will be highlighted. | ||
+ | For example: | ||
+ | <code java [highlight_lines_extra=" | ||
private char advance() { | private char advance() { | ||
column++; | column++; | ||
return source.charAt(current++); | return source.charAt(current++); | ||
} | } | ||
- | |||
- | private void addToken(TokenType type) { | ||
- | addToken(type, | ||
- | } | ||
- | |||
- | private void addToken(TokenType type, Object literal) { | ||
- | String text = source.substring(start, | ||
- | int start_column = column - (current - start); | ||
- | tokens.add(new Token(type, text, literal, line, start_column)); | ||
- | } | ||
- | } | ||
</ | </ | ||
- | Moving on to lexing short fixed-length tokens, the differences widen considerably; | + | Highlighting is also used to identify |
- | + | For example: | |
- | <code java [enable_line_numbers=" | + | |
- | case ' | + | |
- | case ' | + | |
- | addToken(match(' | + | |
- | break; | + | |
- | case ' | + | |
- | if (match(' | + | |
- | addToken(OR); | + | |
- | } else { | + | |
- | TlaPlus.error(line, | + | |
- | } | + | |
- | break; | + | |
- | case '/': | + | |
- | if (consume(' | + | |
- | break; | + | |
- | case ' | + | |
- | if (consume(' | + | |
- | break; | + | |
- | case ' | + | |
- | if (consume(' | + | |
- | break; | + | |
- | default: | + | |
- | TlaPlus.error(line, | + | |
- | break; | + | |
- | </code> | + | |
- | + | ||
- | Two helper methods are used: the first, '' | + | |
- | We have to modify it slightly by updating the value of '' | + | |
- | The second helper method is of our own invention. | + | |
- | '' | + | |
- | You might be wondering why we didn't write the '' | + | |
- | + | ||
- | <code java [highlight_lines_extra=" | + | |
- | private boolean match(char expected) { | + | |
- | if (isAtEnd()) return false; | + | |
- | if (source.charAt(current) != expected) return false; | + | |
- | + | ||
- | column++; | + | |
- | current++; | + | |
- | return true; | + | |
- | } | + | |
- | + | ||
- | private boolean consume(char expected) { | + | |
- | if (match(expected)) return true; | + | |
- | else TlaPlus.error(line, | + | |
- | return false; | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | ====== Section 4.6: Longer Lexemes ====== | + | |
- | + | ||
- | In [[https:// | + | |
- | Let's begin with single-line comments, which in TLA⁺ start with '' | + | |
- | + | ||
- | <code java [enable_line_numbers=" | + | |
- | case ' | + | |
- | if (match('/' | + | |
- | addToken(OR); | + | |
- | } else if (match(' | + | |
- | // A comment goes until the end of the line. | + | |
- | while (peek() != ' | + | |
- | } else { | + | |
- | TlaPlus.error(line, | + | |
- | } | + | |
- | break; | + | |
- | </ | + | |
- | + | ||
- | This uses the '' | + | |
- | + | ||
- | <code java> | + | |
- | private char peek() { | + | |
- | if (isAtEnd()) return ' | + | |
- | return source.charAt(current); | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | Also add the code to skip whitespace: | + | |
- | + | ||
- | <code java [enable_line_numbers=" | + | |
- | break; | + | |
- | + | ||
- | case ' ': | + | |
- | case ' | + | |
- | case ' | + | |
- | // Ignore whitespace. | + | |
- | break; | + | |
- | + | ||
- | case ' | + | |
- | line++; | + | |
- | break; | + | |
- | + | ||
- | default: | + | |
- | TlaPlus.error(line, | + | |
- | break; | + | |
- | </ | + | |
- | + | ||
- | Critically, we have to set '' | + | |
- | This is what makes column tracking work! | + | |
<code java [enable_line_numbers=" | <code java [enable_line_numbers=" | ||
case ' | case ' | ||
Line 342: | Line 107: | ||
</ | </ | ||
- | We aren't supporting strings (unless you want to go the extra mile) so next up is numbers: | + | Sometimes |
- | + | A parenthetical will specify | |
- | <code java [enable_line_numbers=" | + | |
- | default: | + | |
- | if (isDigit(c)) { | + | |
- | number(); | + | |
- | } else { | + | |
- | TlaPlus.error(line, | + | |
- | } | + | |
- | break; | + | |
- | </ | + | |
- | + | ||
- | This uses the '' | + | |
- | We're only handling whole natural numbers, no decimals, | + | |
- | + | ||
- | <code java [highlight_lines_extra=" | + | |
- | private boolean isDigit(char c) { | + | |
- | return c >= ' | + | |
- | } | + | |
- | + | ||
- | private void number() { | + | |
- | while (isDigit(peek())) advance(); | + | |
- | addToken(NUMBER, | + | |
- | Integer.parseInt(source.substring(start, | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | ====== Section 4.7: Reserved Words and Identifiers ====== | + | |
- | + | ||
- | In [[https:// | + | |
- | + | ||
- | <code java [enable_line_numbers=" | + | |
- | default: | + | |
- | if (isDigit(c)) { | + | |
- | number(); | + | |
- | } else if (isAlpha(c)) { | + | |
- | identifier(); | + | |
- | } else { | + | |
- | TlaPlus.error(line, | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | Then include these helper methods: | + | |
- | + | ||
- | <code java> | + | |
- | private void identifier() { | + | |
- | while (isAlphaNumeric(peek())) advance(); | + | |
- | + | ||
- | addToken(IDENTIFIER); | + | |
- | } | + | |
- | + | ||
- | private boolean isAlpha(char c) { | + | |
- | return (c >= ' | + | |
- | (c >= ' | + | |
- | c == ' | + | |
- | } | + | |
- | + | ||
- | private boolean isAlphaNumeric(char c) { | + | |
- | return isAlpha(c) || isDigit(c); | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | Note that this is // | + | |
- | TLA⁺ identifiers can be any sequence of alphanumeric ASCII characters including underscore, as long as at least one character in the sequence is alphabetical ('' | + | |
- | You are encouraged to modify your code to match this constraint, but for simplicity we'll stick with the close-enough identifier logic given in the book. | + | |
- | Fixing it requires modifying the number lexing logic from the previous section. | + | |
- | + | ||
- | Some valid identifiers are actually keywords and should be properly tokenized as such. | + | |
- | So, define a keyword map: | + | |
- | + | ||
- | <code java> | + | |
- | private static final Map< | + | |
- | + | ||
- | static { | + | |
- | keywords = new HashMap<> | + | |
- | keywords.put(" | + | |
- | keywords.put(" | + | |
- | keywords.put(" | + | |
- | keywords.put(" | + | |
- | keywords.put(" | + | |
- | keywords.put(" | + | |
- | keywords.put(" | + | |
- | keywords.put(" | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | Then update our implementation of '' | + | |
- | + | ||
- | <code java [highlight_lines_extra=" | + | |
- | private void identifier() { | + | |
- | while (isAlphaNumeric(peek())) advance(); | + | |
- | + | ||
- | String text = source.substring(start, | + | |
- | TokenType type = keywords.get(text); | + | |
- | if (type == null) type = IDENTIFIER; | + | |
- | addToken(type); | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | We're nearly done! | + | |
- | All we need to do is repeat the keyword map method for the LaTeX-style symbols '' | + | |
- | Start by adding yet another branch to the '' | + | |
- | + | ||
- | <code java [enable_line_numbers=" | + | |
- | case ' | + | |
- | if (match('/' | + | |
- | addToken(OR); | + | |
- | } else if (match(' | + | |
- | // A comment goes until the end of the line. | + | |
- | while (peek() != ' | + | |
- | } else if (isAlpha(peek())) { | + | |
- | symbol(); | + | |
- | } else { | + | |
- | TlaPlus.error(line, | + | |
- | } | + | |
- | break; | + | |
- | </code> | + | |
- | + | ||
- | Then define the symbol map and '' | + | |
- | + | ||
- | <code java> | + | |
- | private static final Map< | + | |
- | + | ||
- | static { | + | |
- | symbols = new HashMap<> | + | |
- | symbols.put(" | + | |
- | symbols.put(" | + | |
- | symbols.put(" | + | |
- | symbols.put(" | + | |
- | symbols.put(" | + | |
- | symbols.put(" | + | |
- | symbols.put(" | + | |
- | symbols.put(" | + | |
- | symbols.put(" | + | |
- | } | + | |
- | + | ||
- | private void symbol() { | + | |
- | while (isAlpha(peek())) advance(); | + | |
- | + | ||
- | String text = source.substring(start, | + | |
- | TokenType type = symbols.get(text); | + | |
- | if (type == null) TlaPlus.error(line, | + | |
- | else addToken(type); | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | We are now done lexing our minimal TLA⁺ subset. | + | |
- | Give it a whirl! | + | |
- | Running your Java program should bring up the REPL, where you can enter input to see how it is lexed; for example: | + | |
- | + | ||
- | < | + | |
- | > \A e \in S : P(e) | + | |
- | FOR_ALL \A null | + | |
- | IDENTIFIER e null | + | |
- | IN \in null | + | |
- | IDENTIFIER S null | + | |
- | COLON : null | + | |
- | IDENTIFIER P null | + | |
- | LEFT_PAREN ( null | + | |
- | IDENTIFIER e null | + | |
- | RIGHT_PAREN ) null | + | |
- | EOF null | + | |
- | </ | + | |
- | + | ||
- | Isn't it amazing how quickly this is coming together? | + | |
- | The simplicity of the required code is one of the great wonders of language implementation. | + | |
- | If you got lost somewhere along the way, you can find a snapshot of the code on this page [[https:// | + | |
- | Next we learn how to collect our tokens into a parse tree! | + | |
- | Continue the tutorial at [[creating: | + | |
- | + | ||
- | ====== Challenges ====== | + | |
- | + | ||
- | Here are some optional challenges to flesh out your TLA⁺ scanner, roughly ranked from simplest to most difficult. | + | |
- | You should save a copy of your code before attempting these. | + | |
- | - Our error reporting functionality only reports the line on which the error occurs, even though we now also track the column. Modify the error reporting functions to pipe through and print out the column location of the error. | + | |
- | - Implement token recognition for the '' | + | |
- | - Modify '' | + | |
- | - Add support for nestable block comments like '' | + | |
- | - Similar to nested block comments, add support for extramodular text & nested modules. TLA⁺ files are properly supposed to ignore all text outside of modules, treating it the same as comments. Lexing TLA⁺ tokens should only start after reading ahead and detecting a '' | + | |
- | - Add Unicode support. Instead of using the '' | + | |
- | [[https:// | + | To begin the tutorial, start at page [[creating:scanning|Scanning TLA⁺ Tokens]]. |
+ | [[creating: | ||