creating:scanning

Differences

This shows you the differences between two versions of the page.

creating:scanning [2025/03/29 14:17] – Moved syntax page to scanning page ahelwer
creating:scanning [2025/03/29 23:50] (current) – Read files assuming the UTF-8 encoding. ahelwer
Line 7: Line 7:
 ==== Section 4.1: The Interpreter Framework ====

-Everything in [[https://craftinginterpreters.com/scanning.html#the-interpreter-framework|section 4.1]] can be left unchanged from what's given, although of course it makes sense to change some of the names from "Lox" to "TlaPlus" or similar.
-You should thus have followed along and arrived at something similar to the below file ''TlaPlus.java'':
+Almost everything in [[https://craftinginterpreters.com/scanning.html#the-interpreter-framework|section 4.1]] can be left unchanged from what's given, although of course it makes sense to change some of the names from "Lox" to "TlaPlus" or similar.
+We do make one small functional modification: TLA⁺ source files are assumed to be encoded in UTF-8, a variable-width ASCII-compatible encoding, so we specify that when performing the file read.
+Here's our main file, ''TlaPlus.java'':

-<code java [enable_line_numbers="true",highlight_lines_extra="1,11"]>
+<code java [enable_line_numbers="true",highlight_lines_extra="1,6,11,27"]>
 package com.craftinginterpreters.tla;
  
Line 16: Line 17:
 import java.io.IOException;
 import java.io.InputStreamReader;
-import java.nio.charset.Charset;
+import java.nio.charset.StandardCharsets;
 import java.nio.file.Files;
 import java.nio.file.Paths;
Line 37: Line 38:
   private static void runFile(String path) throws IOException {
     byte[] bytes = Files.readAllBytes(Paths.get(path));
-    run(new String(bytes, Charset.defaultCharset()));
+    run(new String(bytes, StandardCharsets.UTF_8));

     // Indicate an error in the exit code.
Line 82: Line 83:
  
 The ''TokenType'' class in [[https://craftinginterpreters.com/scanning.html#lexemes-and-tokens|section 4.2]] is our first major departure from the book.
-Instead of Lox tokens, we use the atomic components of our minimal TLA⁺ language subset defined above.
+Instead of Lox tokens, we use the atomic components of our minimal TLA⁺ language subset.
 Adapting the snippet in [[https://craftinginterpreters.com/scanning.html#token-type|section 4.2.1]] we get:
  
Line 145: Line 146:
 ==== Section 4.4: The Scanner Class ====

-We now move on to the very important ''Scanner'' class in [[https://craftinginterpreters.com/scanning.html#the-scanner-class|section 4.4]].
+Nothing in section 4.3 requires modification, so we can move on to the very important ''Scanner'' class in [[https://craftinginterpreters.com/scanning.html#the-scanner-class|section 4.4]].
 Our first modification to the code given in the book is to track the column in addition to the line, mirroring our addition to the ''Token'' class:
  
Line 356: Line 357:
 We're only handling whole natural numbers, no decimals, so our ''number()'' method is much simpler than the one from the book:

-<code java>
+<code java [highlight_lines_extra="7,8"]>
   private boolean isDigit(char c) {
     return c >= '0' && c <= '9';
Line 459: Line 460:
 </code>

-Then define the symbol map and ''symbol()'' helper:
+Then define the symbol map and ''symbol()'' helper - we can also throw in a few token symbol variants, why not:

 <code java>
Line 466: Line 467:
   static {
     symbols = new HashMap<>();
-    symbols.put("\\in",       IN);
+    symbols.put("\\land",     AND);
     symbols.put("\\E",        EXISTS);
     symbols.put("\\exists",   EXISTS);
     symbols.put("\\A",        FOR_ALL);
     symbols.put("\\forall",   FOR_ALL);
+    symbols.put("\\in",       IN);
+    symbols.put("\\lnot",     NEGATION);
+    symbols.put("\\neg",      NEGATION);
+    symbols.put("\\lor",      OR);
   }
  
Line 503: Line 508:
 Isn't it amazing how quickly this is coming together?
 The simplicity of the required code is one of the great wonders of language implementation.
 +If you got lost somewhere along the way, you can find a snapshot of the code on this page [[https://github.com/tlaplus-community/tlaplus-creator/tree/main/scanning|here]].
 +Next we learn how to collect our tokens into a parse tree!
 +Continue the tutorial at [[creating:syntax|Parsing TLA⁺ Syntax]].
 +
 +===== Challenges =====
 +
 +Here are some optional challenges to flesh out your TLA⁺ scanner, roughly ranked from simplest to most difficult.
 +You should save a copy of your code before attempting these.
 +  - Our error reporting functionality only reports the line on which the error occurs, even though we now also track the column. Modify the error reporting functions to pipe through and print out the column location of the error.
 +  - Implement token recognition for the ''---- MODULE Name ----''/''===='' header and footer. The ''----'' and ''===='' tokens must be of length four or greater. It can be tricky to gracefully integrate their logic with the existing ''MINUS'' and ''EQUAL''/''EQUAL_EQUAL'' case blocks.
 +  - Modify ''number()'' and ''identifier()'' to properly implement TLA⁺ identifiers, which can consist of any string of alphanumeric or underscore characters as long as at least one character is alphabetical. This corresponds to the regex ''[a-zA-Z0-9_]*[a-zA-Z][a-zA-Z0-9_]*''.
 +  - Add support for nestable block comments like ''(* text (* text *) text *)''. This requires a much deeper modification of the scanner than might first be apparent. Currently, our lexing grammar is //regular//: it does not need to store any unbounded state to lex everything. However, to properly handle block comments you'll need to add another class field like ''int block_comment_nest_level = -1'' and increment/decrement it as you encounter ''(*'' and ''*)'' tokens. In technical terms this addition makes the lexing grammar context-free instead of regular.
 +  - Similar to nested block comments, add support for extramodular text & nested modules. TLA⁺ files are properly supposed to ignore all text outside of modules, treating it the same as comments. Lexing TLA⁺ tokens should only start after reading ahead and detecting a ''---- MODULE'' sequence. Then, after detecting termination of the module with ''===='', the scanner should revert to ignoring the text. Supporting nested modules complicates this further, since you'll need to keep track of the module nesting level to know when you can start ignoring text again.
 +  - Add Unicode support. Instead of using the ''char'' type, Java represents Unicode codepoints as an ''int''. So, you'll be iterating over an array of ''int''s instead of the characters of a string. Character literals can still be directly compared against ''int''s; our ''case'' statement should be nearly unchanged. Look at the Java 8 [[https://docs.oracle.com/en/java/javase/23/docs/api/java.base/java/lang/CharSequence.html#codePoints()|string.codePoints() method]]. Add support for Unicode symbol variants like ''≜'', ''∈'', ''∧'', ''∨'', ''∃'', and ''∀''. Our code reads files assuming the UTF-8 encoding so that's already sorted.
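As a starting point for the nested block comment challenge, here is a minimal sketch of the depth-counting idea in isolation. It is not the tutorial's ''Scanner'' code: the class name ''BlockCommentSketch'' and the helper ''skipBlockComment'' are hypothetical, and the sketch keeps the nesting depth in a local variable rather than the class field suggested above. It assumes the text at ''start'' begins with ''(*''.

```java
class BlockCommentSketch {
  /**
   * Hypothetical helper: returns the index just past the block comment
   * that opens at `start`, or -1 if the comment is never closed.
   * Assumes source.startsWith("(*", start).
   */
  static int skipBlockComment(String source, int start) {
    int depth = 0; // Number of comments currently open.
    int i = start;
    while (i < source.length()) {
      if (source.startsWith("(*", i)) {
        depth++; // Entering a (possibly nested) comment.
        i += 2;
      } else if (source.startsWith("*)", i)) {
        depth--; // Leaving the innermost open comment.
        i += 2;
        if (depth == 0) return i; // Outermost comment is closed.
      } else {
        i++; // Ordinary comment text; ignore it.
      }
    }
    return -1; // Reached end of input with a comment still open.
  }

  public static void main(String[] args) {
    String source = "(* outer (* inner *) still outer *) x";
    int end = skipBlockComment(source, 0);
    System.out.println(source.substring(end));
  }
}
```

The counter is exactly the unbounded state mentioned in the challenge: the scanner cannot know whether a ''*)'' ends the comment without remembering how many ''(*'' came before it. In a real scanner you would also want to advance the line and column counters while skipping, and report an error in the unterminated case.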
  
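For the Unicode challenge, the following sketch shows one way the pieces might fit together: decode the file bytes as UTF-8, iterate over ''int'' code points instead of ''char''s, and match Unicode symbols in a ''switch''. The class and method names are hypothetical, and token types are returned as plain strings (using the names from the symbol map) to keep the example self-contained; a real implementation would return ''TokenType'' values. Symbols are written as ''\u'' escapes so the file compiles regardless of the compiler's source encoding.

```java
import java.nio.charset.StandardCharsets;

class CodePointSketch {
  // Decode UTF-8 bytes, then split into code points rather than chars.
  static int[] toCodePoints(byte[] utf8Bytes) {
    String source = new String(utf8Bytes, StandardCharsets.UTF_8);
    return source.codePoints().toArray();
  }

  // char literals are valid case labels when switching on an int.
  static String describe(int c) {
    switch (c) {
      case '\u2208': return "IN";      // U+2208 ELEMENT OF
      case '\u2227': return "AND";     // U+2227 LOGICAL AND
      case '\u2228': return "OR";      // U+2228 LOGICAL OR
      case '\u2203': return "EXISTS";  // U+2203 THERE EXISTS
      case '\u2200': return "FOR_ALL"; // U+2200 FOR ALL
      default:       return "UNKNOWN";
    }
  }

  public static void main(String[] args) {
    byte[] bytes = "x \u2208 S".getBytes(StandardCharsets.UTF_8);
    for (int c : toCodePoints(bytes)) {
      System.out.println(Integer.toHexString(c) + " " + describe(c));
    }
  }
}
```

The difference between ''char''s and code points only shows up for characters outside the Basic Multilingual Plane, which occupy two ''char''s in a Java string but a single entry in the ''int'' array.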
  • Last modified: 2025/03/29 14:17
  • by ahelwer